Skip to content

fix: make DB migrations idempotent and stamp alembic after init_database#852

Draft
sayakmaity wants to merge 3 commits into
mainfrom
sayakmaity/model-engine-idempotent-db-migrations
Draft

fix: make DB migrations idempotent and stamp alembic after init_database#852
sayakmaity wants to merge 3 commits into
mainfrom
sayakmaity/model-engine-idempotent-db-migrations

Conversation

@sayakmaity

@sayakmaity sayakmaity commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

The chart ships two pre-install/pre-upgrade hook Jobs for the database — an init job (db.runDbInitScript, hook weight -2, runs init_database which does CREATE SCHEMA IF NOT EXISTS + SQLAlchemy Base.metadata.create_all) and a migration job (db.runDbMigrationScript, hook weight -1, runs alembic upgrade head). As written they cannot be safely enabled together, and each has failure modes on its own:

  1. Init never stamps alembic. A database created by the init job has the full schema but no alembic version row (public.alembic_version_model_engine). The next time the migration job runs, alembic replays from base: the initial revision executes initial.sql, whose bare CREATE SCHEMA hosted_model_inference fails with DuplicateSchema, and the whole helm operation fails.
  2. Both flags on = broken first install. On a fresh database the init job creates everything (unstamped) at weight -2, then the migration job at weight -1 replays from base and hits the same collision. So the flags are mutually exclusive in practice, silently.
  3. Even a stamped-at-initial database still breaks. create_all creates tables from the current models, i.e. including every column the add-column migrations would add. Stamping only the initial revision then leaves 7 migrations that all fail with DuplicateColumn.
  4. backoffLimit: 0 on both Jobs means any transient pod failure (node drain, image pull hiccup, brief DB unavailability) fails the entire install/upgrade with no retry.

What changed

Migrations (model-engine/model_engine_server/db/migrations/alembic/versions/)

  • Initial revision fa3267c80731: upgrade() now inspects the bind first — if hosted_model_inference.endpoints already exists, it logs an adoption message and returns without executing initial.sql. Existing databases are adopted instead of erroring.
  • All 7 add-column revisions (b574e9711e35, f55525c81eb5, e580182d6bfd, 221aa19d3f32, 62da4f8b3403, a1b2c3d4e5f6, c4d5e6f7a8b9): every op.add_column is guarded by an inspector column-existence check (small self-contained _has_column helper per revision file, no app imports), and downgrade() drops are mirror-guarded. Multi-column revisions guard each column independently.

model-engine/model_engine_server/entrypoints/init_database.py

  • After a successful init_database_and_engine, the entrypoint runs alembic stamp head (subprocess from the migrations directory, same invocation style as run_database_migration.sh; env.py resolves the DB URL itself from ML_INFRA_DATABASE_URL or cloud secrets, both inherited by the subprocess) — but only when this run initialized the schema from scratch. Stamping is skipped in two cases:
    • public.alembic_version_model_engine already records a revision. Migrations already manage this database; stamping would fast-forward the version table past unapplied revisions (e.g. a new add-column revision shipped in this image) and the migration job would then no-op, silently skipping that migration — and every future one — leaving the app to fail at runtime with UndefinedColumn. Instead the migration job's alembic upgrade head applies the pending revisions as before.
    • hosted_model_inference.endpoints existed before create_all ran (recorded on the first attempt that can connect, before any retry). create_all only creates missing tables and never adds columns to existing ones, so a pre-existing unstamped schema (e.g. created by an older image's create_all) may be missing columns; it is left unstamped so the adopting initial revision plus the guarded add-column revisions bring it to head.
  • The final success log now prints only host/database instead of the full connection URL — routine log hygiene.

model-engine/model_engine_server/db/migrations/README

  • Documents the new self-adopting/idempotent behavior, the conditional-stamp rules, and when manual stamping is still needed.

charts/model-engine

  • backoffLimit: 02 on both database_init_job.yaml and database_migration_job.yaml (migrations run in transactions and are now idempotent, so retry is safe).
  • Chart version 0.2.70.2.8. No values.yaml default changes.

Behavior matrix

DB state Flags Before After
Fresh (empty) init + migration on Init creates schema unstamped; migration replays from base → DuplicateSchema, helm fails Init creates schema and stamps head; migration is a no-op at head
Fresh (empty) migration only alembic upgrade head applies all revisions (worked) Same, unchanged
Stamped at previous head, new revision in image init + migration on (routine upgrade) Init no-ops, migration applies N→head (worked) Init detects the existing revision and skips stamping; migration applies N→head. Same outcome, preserved
Created via init/create_all, unstamped migration turned on later Replay from base → DuplicateSchema Initial revision adopts the schema, guarded revisions no-op, DB ends stamped at head
Created via older image's create_all, unstamped init + migration on Replay from base → DuplicateSchema Init sees the pre-existing schema and skips stamping; migration adopts it and guarded revisions add any missing columns, ends at head
Stamped at initial only, columns already present migration on First add-column revision → DuplicateColumn Guards skip existing columns, missing ones are added, ends at head
Already stamped at head migration on No-op (worked) Same, unchanged

Rollout notes

  • The behavioral fix lives in the image (revision files + init_database.py), not the chart. A cluster must run an image containing this commit before runDbMigrationScript can be enabled unconditionally against databases created by the init path; chart 0.2.8 only adds retry headroom (backoffLimit: 2).
  • The init job never stamps a database that already has an alembic revision, so enabling runDbInitScript on clusters that already use migrations does not interfere with pending migrations on upgrade.
  • Retry interaction: if an init pod creates the schema but dies before stamping, the retry pod sees the schema as pre-existing and leaves it unstamped — the migration job then adopts it to head (fail-safe direction; nothing is ever over-stamped).
  • No values defaults changed; both Jobs remain opt-in via the existing flags.
  • Downgrades are guarded the same way, so a partially-converged schema won't fail alembic downgrade either.

Testing

  • python3 -m py_compile on all edited Python files — pass. black==24.8.0 (repo config), ruff==0.6.8, isort==5.13.2 on init_database.py — clean.
  • helm lint charts/model-engine -f values_circleci.yaml — pass (lint with only default values fails on unrelated pre-existing templates, e.g. service_template_config_map.yaml, on main too).
  • helm template ... --set db.runDbInitScript=true --set db.runDbMigrationScript=true --show-only for both Job templates — both render with backoffLimit: 2.
  • Empirical verification against a throwaway postgres:13 container (alembic 1.8.1 / SQLAlchemy 2.0.48, matching requirements.txt), running the actual python -m model_engine_server.entrypoints.init_database entrypoint and run_database_migration.sh with ML_INFRA_DATABASE_URL:
    • Fresh DB: init entrypoint runs create_all and stamps c4d5e6f7a8b9 (head); a subsequent run_database_migration.sh is a no-op.
    • Stamped at previous head (a1b2c3d4e5f6) with the status_reason revision pending: init entrypoint logs the skip and leaves the version row untouched; run_database_migration.sh then applies a1b2c3d4e5f6 → c4d5e6f7a8b9 and the column appears.
    • Re-run of the init entrypoint on an at-head DB: skip message, version unchanged.
    • Legacy simulation (schema present, status_reason column dropped, version table dropped): init entrypoint skips stamping; migration prints the adoption message, adds the missing column, ends at head.
    • Fresh DB: plain alembic upgrade head applies all 8 revisions; re-run is a no-op. Stamped-at-initial DB with only 2 of the 7 columns present: upgrade skips existing columns and adds the missing ones. alembic downgrade base with one column pre-dropped: guarded drops succeed.
  • Not run: the full test suite, the FIPS image build, and cloud-secret URL resolution (AWS/GCP/Azure paths in env.py) — the latter is exercised only via the ML_INFRA_DATABASE_URL path locally; the cloud path relies on env.py resolving the URL identically for stamp and upgrade, which is unchanged code. Also not exercised: alembic must be on PATH for the stamp subprocess — true in the Dockerfile images (pip-installed console script), verified locally by putting the venv bin on PATH.

…database

- Initial revision adopts pre-existing schemas (skips initial.sql when
  hosted_model_inference.endpoints already exists) instead of failing on
  DuplicateSchema.
- All add-column revisions guard add_column/drop_column with inspector
  existence checks, so they no-op on schemas created via create_all.
- init_database entrypoint stamps 'alembic head' after a successful
  create_all so initialized databases are never left unstamped, and no
  longer prints the full connection URL (host/database only).
Bump backoffLimit 0 -> 2 on both database hook Jobs now that the
migrations are idempotent and safe to retry; bump chart to 0.2.8.
Unconditionally running `alembic stamp head` in the init entrypoint
regressed the normal upgrade path: a database already stamped at the
previous head would be fast-forwarded to the new head by the init job
(hook weight -2) before the migration job (weight -1) ran, so pending
add-column revisions were silently skipped forever and the app failed
at runtime with UndefinedColumn.

Stamp only when this run initialized the schema from scratch:
- skip if public.alembic_version_model_engine already records a
  revision (migrations manage this DB; let `alembic upgrade head` run)
- skip if hosted_model_inference.endpoints existed before create_all
  (legacy unstamped create_all DB, possibly from an older image missing
  columns; leave it unstamped so the adopting initial revision and
  guarded add-column revisions bring it to head)

Verified against a throwaway postgres:13: fresh init stamps head;
stamped-at-previous-head DB is left alone and the migration job applies
the pending revision; at-head re-run no-ops; legacy unstamped schema is
adopted by the migration job.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant