fix: make DB migrations idempotent and stamp alembic after init_database#852
Draft
sayakmaity wants to merge 3 commits into
Draft
fix: make DB migrations idempotent and stamp alembic after init_database#852sayakmaity wants to merge 3 commits into
sayakmaity wants to merge 3 commits into
Conversation
…database - Initial revision adopts pre-existing schemas (skips initial.sql when hosted_model_inference.endpoints already exists) instead of failing on DuplicateSchema. - All add-column revisions guard add_column/drop_column with inspector existence checks, so they no-op on schemas created via create_all. - init_database entrypoint stamps 'alembic head' after a successful create_all so initialized databases are never left unstamped, and no longer prints the full connection URL (host/database only).
Bump backoffLimit 0 -> 2 on both database hook Jobs now that the migrations are idempotent and safe to retry; bump chart to 0.2.8.
Unconditionally running `alembic stamp head` in the init entrypoint regressed the normal upgrade path: a database already stamped at the previous head would be fast-forwarded to the new head by the init job (hook weight -2) before the migration job (weight -1) ran, so pending add-column revisions were silently skipped forever and the app failed at runtime with UndefinedColumn. Stamp only when this run initialized the schema from scratch: - skip if public.alembic_version_model_engine already records a revision (migrations manage this DB; let `alembic upgrade head` run) - skip if hosted_model_inference.endpoints existed before create_all (legacy unstamped create_all DB, possibly from an older image missing columns; leave it unstamped so the adopting initial revision and guarded add-column revisions bring it to head) Verified against a throwaway postgres:13: fresh init stamps head; stamped-at-previous-head DB is left alone and the migration job applies the pending revision; at-head re-run no-ops; legacy unstamped schema is adopted by the migration job.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The chart ships two pre-install/pre-upgrade hook Jobs for the database — an init job (
db.runDbInitScript, hook weight -2, runsinit_databasewhich doesCREATE SCHEMA IF NOT EXISTS+ SQLAlchemyBase.metadata.create_all) and a migration job (db.runDbMigrationScript, hook weight -1, runsalembic upgrade head). As written they cannot be safely enabled together, and each has failure modes on its own:public.alembic_version_model_engine). The next time the migration job runs, alembic replays from base: the initial revision executesinitial.sql, whose bareCREATE SCHEMA hosted_model_inferencefails withDuplicateSchema, and the whole helm operation fails.create_allcreates tables from the current models, i.e. including every column the add-column migrations would add. Stamping only the initial revision then leaves 7 migrations that all fail withDuplicateColumn.backoffLimit: 0on both Jobs means any transient pod failure (node drain, image pull hiccup, brief DB unavailability) fails the entire install/upgrade with no retry.What changed
Migrations (
model-engine/model_engine_server/db/migrations/alembic/versions/)fa3267c80731:upgrade()now inspects the bind first — ifhosted_model_inference.endpointsalready exists, it logs an adoption message and returns without executinginitial.sql. Existing databases are adopted instead of erroring.b574e9711e35,f55525c81eb5,e580182d6bfd,221aa19d3f32,62da4f8b3403,a1b2c3d4e5f6,c4d5e6f7a8b9): everyop.add_columnis guarded by an inspector column-existence check (small self-contained_has_columnhelper per revision file, no app imports), anddowngrade()drops are mirror-guarded. Multi-column revisions guard each column independently.model-engine/model_engine_server/entrypoints/init_database.pyinit_database_and_engine, the entrypoint runsalembic stamp head(subprocess from the migrations directory, same invocation style asrun_database_migration.sh;env.pyresolves the DB URL itself fromML_INFRA_DATABASE_URLor cloud secrets, both inherited by the subprocess) — but only when this run initialized the schema from scratch. Stamping is skipped in two cases:public.alembic_version_model_enginealready records a revision. Migrations already manage this database; stamping would fast-forward the version table past unapplied revisions (e.g. a new add-column revision shipped in this image) and the migration job would then no-op, silently skipping that migration — and every future one — leaving the app to fail at runtime withUndefinedColumn. Instead the migration job'salembic upgrade headapplies the pending revisions as before.hosted_model_inference.endpointsexisted beforecreate_allran (recorded on the first attempt that can connect, before any retry).create_allonly creates missing tables and never adds columns to existing ones, so a pre-existing unstamped schema (e.g. created by an older image'screate_all) may be missing columns; it is left unstamped so the adopting initial revision plus the guarded add-column revisions bring it to head.model-engine/model_engine_server/db/migrations/READMEcharts/model-enginebackoffLimit: 0→2on bothdatabase_init_job.yamlanddatabase_migration_job.yaml(migrations run in transactions and are now idempotent, so retry is safe).0.2.7→0.2.8. Novalues.yamldefault changes.Behavior matrix
DuplicateSchema, helm failsalembic upgrade headapplies all revisions (worked)create_all, unstampedDuplicateSchemacreate_all, unstampedDuplicateSchemaDuplicateColumnRollout notes
init_database.py), not the chart. A cluster must run an image containing this commit beforerunDbMigrationScriptcan be enabled unconditionally against databases created by the init path; chart0.2.8only adds retry headroom (backoffLimit: 2).runDbInitScripton clusters that already use migrations does not interfere with pending migrations on upgrade.alembic downgradeeither.Testing
python3 -m py_compileon all edited Python files — pass.black==24.8.0(repo config),ruff==0.6.8,isort==5.13.2oninit_database.py— clean.helm lint charts/model-engine -f values_circleci.yaml— pass (lint with only default values fails on unrelated pre-existing templates, e.g.service_template_config_map.yaml, onmaintoo).helm template ... --set db.runDbInitScript=true --set db.runDbMigrationScript=true --show-onlyfor both Job templates — both render withbackoffLimit: 2.postgres:13container (alembic 1.8.1 / SQLAlchemy 2.0.48, matchingrequirements.txt), running the actualpython -m model_engine_server.entrypoints.init_databaseentrypoint andrun_database_migration.shwithML_INFRA_DATABASE_URL:create_alland stampsc4d5e6f7a8b9 (head); a subsequentrun_database_migration.shis a no-op.a1b2c3d4e5f6) with thestatus_reasonrevision pending: init entrypoint logs the skip and leaves the version row untouched;run_database_migration.shthen appliesa1b2c3d4e5f6 → c4d5e6f7a8b9and the column appears.status_reasoncolumn dropped, version table dropped): init entrypoint skips stamping; migration prints the adoption message, adds the missing column, ends at head.alembic upgrade headapplies all 8 revisions; re-run is a no-op. Stamped-at-initial DB with only 2 of the 7 columns present: upgrade skips existing columns and adds the missing ones.alembic downgrade basewith one column pre-dropped: guarded drops succeed.env.py) — the latter is exercised only via theML_INFRA_DATABASE_URLpath locally; the cloud path relies onenv.pyresolving the URL identically forstampandupgrade, which is unchanged code. Also not exercised:alembicmust be onPATHfor the stamp subprocess — true in the Dockerfile images (pip-installed console script), verified locally by putting the venv bin onPATH.