
fix(memos-local-plugin): two-phase migration to prevent crash-loop on large databases #1789

Open
chiefmojo wants to merge 1 commit into MemTensor:main from chiefmojo:fix/large-db-migration-crash

Conversation

@chiefmojo

Problem

Migration 007 (namespace-visibility) in v2.0.2–v2.0.5 runs UPDATE … SET share_scope='private' plus CREATE INDEX on the traces table inside a single db.tx(). On databases larger than ~500MB, this exceeds the host gateway's kill timeout. SQLite rolls back the entire transaction — including the schema_migrations INSERT — so migration 007 is never recorded and the bridge restarts into the same hang forever.

Small databases (~43MB) complete within the timeout, which is why this only manifests on larger installs. Tested against a 687MB database with ~98,000 traces.
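
The failure mode above can be reproduced in miniature. The following is a sketch using Python's standard `sqlite3` module rather than the plugin's own `db.tx()` wrapper; the `version` column name and the toy `traces` layout are assumptions made for illustration. It shows that when the schema change, the backfill, and the bookkeeping INSERT share one transaction, aborting that transaction leaves no trace of migration 7.

```python
import sqlite3


def applied_versions(conn):
    """Return recorded migration versions in ascending order."""
    rows = conn.execute("SELECT version FROM schema_migrations ORDER BY version")
    return [row[0] for row in rows]


def migrate_in_one_transaction(conn, killed_before_commit):
    """Toy migration 7: schema change, backfill and bookkeeping in ONE transaction."""
    conn.execute("BEGIN")
    conn.execute("ALTER TABLE traces ADD COLUMN share_scope TEXT")
    conn.execute("UPDATE traces SET share_scope = 'private'")
    conn.execute("INSERT INTO schema_migrations (version) VALUES (7)")
    if killed_before_commit:
        # Stand-in for the gateway kill. A real kill never issues ROLLBACK;
        # SQLite reaches the same state via journal recovery on the next open.
        conn.execute("ROLLBACK")
        return
    conn.execute("COMMIT")


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT/ROLLBACK statements above are the only transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")  # assumed layout
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, body TEXT)")  # assumed layout
conn.executemany(
    "INSERT INTO schema_migrations (version) VALUES (?)",
    [(v,) for v in range(1, 7)],
)
conn.executemany("INSERT INTO traces (body) VALUES (?)", [("t",)] * 10)

migrate_in_one_transaction(conn, killed_before_commit=True)

versions_after_kill = applied_versions(conn)
columns_after_kill = [row[1] for row in conn.execute("PRAGMA table_info(traces)")]
print(versions_after_kill, columns_after_kill)
```

Because the version row is gone after the rollback, the next boot sees migration 7 as pending and starts the same long transaction again.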

Fix

Two-phase migration:

  • Phase 1 (inside transaction, milliseconds): ADD COLUMN only on all 12 namespace tables, plus DROP INDEX uq_skills_name. The schema_migrations record for v7 commits here.
  • Phase 2 (after migration loop, outside any transaction): Batched UPDATE in 2,000-row chunks (each its own implicit transaction) for share_scope backfill, then CREATE INDEX IF NOT EXISTS for all 18 owner/share indexes. ensureNamespaceColumns is called unconditionally on every boot so new tables in the namespace list get their columns.
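
The two phases can be sketched as follows. This is an illustration in Python's `sqlite3`, not the plugin's code: `ensure_share_scope_column` is a hypothetical analogue of `ensureNamespaceColumns`, the index name `idx_<table>_share_scope` is invented, and the sketch assumes the new column is nullable so that NULL marks a row as not yet backfilled.

```python
import sqlite3

BATCH_SIZE = 2000  # chunk size quoted in the description


def ensure_share_scope_column(conn, table):
    """Add share_scope if missing; safe to call on every boot.

    Table names are trusted constants here, so f-string interpolation is fine.
    """
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "share_scope" in columns:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN share_scope TEXT")
    return True


def phase1(conn, tables):
    """Cheap schema-only work plus the migration record, in one transaction."""
    conn.execute("BEGIN")
    for table in tables:
        ensure_share_scope_column(conn, table)
    conn.execute("INSERT INTO schema_migrations (version) VALUES (7)")
    conn.execute("COMMIT")


def backfill_share_scope(conn, table, batch_size=BATCH_SIZE):
    """Backfill in chunks; in autocommit mode each UPDATE commits on its own."""
    batches = 0
    while True:
        cur = conn.execute(
            f"UPDATE {table} SET share_scope = 'private' "
            f"WHERE rowid IN (SELECT rowid FROM {table} "
            f"WHERE share_scope IS NULL LIMIT ?)",
            (batch_size,),
        )
        if cur.rowcount == 0:
            return batches
        batches += 1


def phase2(conn, tables):
    """Run outside any transaction: batched backfill, then idempotent indexes."""
    counts = {table: backfill_share_scope(conn, table) for table in tables}
    for table in tables:
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_share_scope "
            f"ON {table} (share_scope)"
        )
    return counts


conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO traces (body) VALUES (?)", [("t",)] * 5000)

tables = ["traces"]
phase1(conn, tables)
batch_counts = phase2(conn, tables)
unfilled = conn.execute(
    "SELECT COUNT(*) FROM traces WHERE share_scope IS NULL"
).fetchone()[0]
index_names = {
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}
print(batch_counts, unfilled, sorted(index_names))
```

Keeping the slow work out of the recorded transaction means no single statement has to finish inside the gateway's timeout; each chunk commits independently.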

Restart safety: If the bridge is killed during Phase 2, the v7 schema_migrations record survives (Phase 1 committed). Next boot skips Phase 1 entirely and resumes Phase 2 where it left off. The crash-loop is broken.
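
A small simulation of that restart behaviour, again in Python's `sqlite3` and under the same assumptions (a `version` column, a nullable `share_scope` column, a made-up index name). The `max_batches` parameter is a hypothetical knob that imitates a kill partway through Phase 2; it is not something the plugin exposes.

```python
import os
import sqlite3
import tempfile


def seed(path, rows):
    """Create a database that has migrations 1-6 applied and some traces."""
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO schema_migrations (version) VALUES (?)",
        [(v,) for v in range(1, 7)],
    )
    conn.executemany("INSERT INTO traces (body) VALUES (?)", [("t",)] * rows)
    conn.execute("COMMIT")
    conn.close()


def boot(path, max_batches=None, batch_size=2000):
    """One simulated boot. Returns (ran_phase1, rows_still_unfilled)."""
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        applied = {r[0] for r in conn.execute("SELECT version FROM schema_migrations")}
        ran_phase1 = 7 not in applied
        if ran_phase1:
            # Phase 1: schema change and migration record commit together.
            conn.execute("BEGIN")
            conn.execute("ALTER TABLE traces ADD COLUMN share_scope TEXT")
            conn.execute("INSERT INTO schema_migrations (version) VALUES (7)")
            conn.execute("COMMIT")
        # Phase 2: each UPDATE autocommits, so finished chunks survive a kill.
        batches = 0
        while max_batches is None or batches < max_batches:
            cur = conn.execute(
                "UPDATE traces SET share_scope = 'private' WHERE rowid IN "
                "(SELECT rowid FROM traces WHERE share_scope IS NULL LIMIT ?)",
                (batch_size,),
            )
            if cur.rowcount == 0:
                conn.execute(
                    "CREATE INDEX IF NOT EXISTS idx_traces_share_scope "
                    "ON traces (share_scope)"
                )
                break
            batches += 1
        remaining = conn.execute(
            "SELECT COUNT(*) FROM traces WHERE share_scope IS NULL"
        ).fetchone()[0]
        return ran_phase1, remaining
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "memos.db")
    seed(db_path, rows=5000)
    first_boot = boot(db_path, max_batches=1)  # "killed" after one chunk
    second_boot = boot(db_path)                # resumes with the remaining rows

print(first_boot, second_boot)
```

The `WHERE share_scope IS NULL` filter is what makes the resume work without a separate progress marker: rows already backfilled are simply never selected again.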

Verification

  • Confirmed the DB on the affected instance had exactly migrations 1–6 applied — migration 007 was never committed despite hundreds of boot attempts.
  • After the fix, Phase 1 committed in 2.4 seconds. Phase 2 backfill and index creation complete across bridge restarts.
  • Bridge boots cleanly with all 18 indexes created and pipeline.ready.
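
Checks like these can be scripted. The helper below is a hypothetical sketch in Python's `sqlite3`; it assumes `schema_migrations` has a `version` column and uses a made-up index name, so both would need adjusting to the real schema.

```python
import sqlite3


def migration_status(conn, expected_indexes):
    """Return (applied migration versions, expected indexes that are missing)."""
    versions = [
        row[0]
        for row in conn.execute("SELECT version FROM schema_migrations ORDER BY version")
    ]
    present = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    }
    missing = sorted(set(expected_indexes) - present)
    return versions, missing


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE traces (id INTEGER PRIMARY KEY, share_scope TEXT)")
conn.executemany(
    "INSERT INTO schema_migrations (version) VALUES (?)",
    [(v,) for v in range(1, 7)],
)

expected = ["idx_traces_share_scope"]  # hypothetical index name
status_before = migration_status(conn, expected)

# Simulate a completed migration 7 and its index.
conn.execute("INSERT INTO schema_migrations (version) VALUES (7)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_traces_share_scope ON traces (share_scope)")
status_after = migration_status(conn, expected)

print(status_before, status_after)
```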

Related

Closes #1787

… large databases

Migration 007 (namespace-visibility) runs UPDATE ... SET share_scope
and CREATE INDEX on the traces table inside a single db.tx(). On
databases larger than ~500MB, this exceeds the host gateway kill
timeout, SQLite rolls back the entire transaction (including the
schema_migrations INSERT), and the bridge restarts into the same
hang forever.

This splits the migration into two phases:
  Phase 1 (inside transaction, ms): ADD COLUMN only on 12 namespace
    tables plus DROP INDEX uq_skills_name. The schema_migrations
    record commits here.
  Phase 2 (after migration loop, outside any transaction): Batched
    UPDATE in 2,000-row chunks (each its own implicit transaction)
    for share_scope backfill, then CREATE INDEX IF NOT EXISTS for
    all 18 owner/share indexes. Phase 2 also calls
    ensureNamespaceColumns unconditionally so new tables added to
    the namespace list get their columns on every boot.

Restart-safe: if the bridge is killed during Phase 2, the v7
schema_migrations record survives (Phase 1 committed). Next boot
skips Phase 1 entirely and resumes Phase 2 where it left off.

Closes MemTensor#1787

Development

Successfully merging this pull request may close these issues.

v2.0.2+ regression: bootstrapMemoryCoreFull() hangs with 100% CPU on databases >500MB
