feat(migrate-prod): prod → multiplayer migration (ETL + plan + runbook)#79
Draft
cevian wants to merge 10 commits into
One-time, in-place migration from the old org/engine/role + RLS model (deployed at server/v0.2.5) to the new auth/core/space model (PR #71), all within the single existing database.
PROD_MIGRATION_PLAN.md captures the full old→new mapping, decisions (in-place, reuse slugs, rename-aside), the phased run procedure, and the "verify against live prod" checklist (no DB access yet; drafted off code).
packages/migrate-prod implements it, reusing the new code's own provisioning (migrateAuth/migrateCore/provisionSpace) and core SQL functions rather than re-implementing DDL:
- Phase A: provision auth+core beside the live accounts schema; migrate identities → auth.users + core.principal (id preserved), oauth links, and live sessions (token_hash copied verbatim; same sha256 scheme).
- Phase B (per engine, one txn): rename old me_<slug> aside, provision a fresh one, build the roster + tree-access grants from org membership / superuser / tree_owner / tree_grant / role_membership, same-DB copy memories (carrying embeddings).
- Phase C: explicit dropLegacy/dropAccounts teardown.
Not migrated (by constraint): api keys (argon2, unrecoverable; agents re-issue), oauth tokens, device-flow rows. Grant {actions}→level mapping is intentionally lossy/over-permissive (documented).
Tested end-to-end against a real Postgres (simple + complex scenarios: multi-member org, RBAC role→group, explicit grants, dangling identity, invitations, deleted/orphan engines) plus unit tests for the mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
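To illustrate why a {actions}→level mapping ends up lossy and over-permissive, here is a minimal sketch. The action names, level names, and their ordering are hypothetical; the real vocabularies are not shown in this PR. The idea is only that a set of fine-grained actions gets collapsed into the lowest coarse level that covers all of them, which may permit more than was originally granted.

```typescript
// Hypothetical vocabularies: the PR does not list the real actions or levels.
type Action = "read" | "create" | "update" | "delete";
type Level = "reader" | "writer" | "owner";

// Assumed lowest level that permits each action.
const MIN_LEVEL: Record<Action, Level> = {
  read: "reader",
  create: "writer",
  update: "writer",
  delete: "owner",
};

// Assumed ordering of levels, lowest to highest.
const RANK: Record<Level, number> = { reader: 0, writer: 1, owner: 2 };

// Collapse a set of actions into the smallest level that covers every one.
// Returns null when there is nothing to grant.
function actionsToLevel(actions: readonly Action[]): Level | null {
  if (actions.length === 0) return null;
  let best: Level = "reader";
  for (const action of actions) {
    const needed = MIN_LEVEL[action];
    if (RANK[needed] > RANK[best]) best = needed;
  }
  return best;
}
```

Under these assumptions a grant of only `create` becomes `writer`, which also allows `update`: the original action set cannot be recovered from the level (lossy), and the level permits more than the grant did (over-permissive).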
Operational step-by-step for executing the migration: pre-flight checklist (privileges, backup, §9 verification), rollback plan, the maintenance-window mode (recommended) vs per-engine zero-downtime mode, Phase-C teardown, and reconciliation/verification SQL.
Grounds the steps in how prod actually deploys/migrates: the new server auto-migrates idempotently on boot (so run the ETL first to avoid the helm --wait --atomic crashloop), and connects via DATABASE_URL with a temporary ENGINE_DATABASE_URL fallback (so the single-DB cutover needs no chart connection change).
Flags the one hard break (API-key re-issue) and the cross-schema privilege requirement for the ETL connection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prod actually runs two separate databases (DB_ACCOUNTS + DB_SHARD), not one,
so the migration now targets a brand-new third database instead of an in-place
schema swap.
- ETL takes three connections {accounts, shard, target}; sources are read-only.
- No rename-aside / collision handling (sources and target are different DBs);
each engine slug is reused verbatim as its space slug.
- Memories are copied cross-database by streaming (cursor over DB_SHARD →
batched insert into the target) instead of insert…select; meta re-sent via
sql.json to dodge the postgres.js text-in-jsonb double-encoding footgun.
- Removed the dropLegacy/dropAccounts teardown helpers — sources are never
modified, so rollback is just repointing the app at the old databases and
decommissioning them is out of band.
- run.ts reads DB_ACCOUNTS / DB_SHARD / DATABASE_URL(target).
- Plan + runbook rewritten for the three-DB topology (fresh target, cross-DB
copy, repoint-to-sources rollback, chart DB-secret repoint, cross-DB
verification queries).
Tests updated: the integration test stands in one physical DB for all three
connections (source schemas carry a distinct prefix so they don't collide with
the target). typecheck + lint clean; 13 integration + 5 unit pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
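The double-encoding hazard mentioned above can be shown without a database. This is a sketch of the general failure mode only: when a jsonb value has been read as text and that text is JSON-encoded a second time, the stored value is a JSON string rather than an object. How postgres.js behaves for a given parameter is as described in the commit message; nothing here calls the driver.

```typescript
// A jsonb value as it arrives when the source column is read as text.
const metaText = '{"source":"import","tags":["a","b"]}';

// Footgun: encoding the text again produces a JSON *string* that merely
// contains JSON, so the value that comes back is a string, not an object.
const doubleEncoded = JSON.stringify(metaText);
const storedWrong: unknown = JSON.parse(doubleEncoded);

// Fix: parse to a real value first so it is encoded exactly once.
const metaValue: unknown = JSON.parse(metaText);
const encodedOnce = JSON.stringify(metaValue);
const storedRight: unknown = JSON.parse(encodedOnce);

console.log(typeof storedWrong); // "string"
console.log(typeof storedRight); // "object"
```

Passing the parsed value through the driver's JSON helper (`sql.json` in the commit above) corresponds to the "encode once" path.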
6a06e0c to
3bce9e1
Added survey.ts: a READ-ONLY §9 verification tool (DB_ACCOUNTS + DB_SHARD)
that checks DDL drift and surveys the data shape. Run against live prod:
- two distinct physical clusters confirmed → cross-DB ETL is correct
- no DDL drift; 32 identities, 34 active engines, 62,111 memories, 0 orphans
- 33/34 orgs single-owner; 1 multi-member org; 1 RBAC role
- 6 service users (login, no identity), all confirmed to be each owner's
own coding agents (claude/codex/sidekick/…)
Service-user handling (decision): map each to a kind='a' agent owned by the
engine's org owner, joined to the space, with its grants re-created (clamped
under the owner's owner@root). Dangling identities (none in prod) still drop
with a warning. Memory copy kept per-row (cursor fetched in batches).
Fixture + test cover the new service-user→agent path; plan updated with the
§9 results and the decision (§4.1, §10). typecheck + lint clean; 14
integration + 5 unit pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the per-engine zero-downtime mode (Mode B) and the modes section; the runbook is now a single linear maintenance-window cutover. Fold in the §9 survey facts: humans keep working (sessions migrate) so only agents (former service users) need re-issued keys; expected reconciliation numbers (~32 users, 34 active engines, ~62k memories, 0 orphans, 0 skipped/warnings); note the row-by-row copy runtime. Fix the plan's stale runbook cross-refs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The target needs only an empty database (+ creatable/installed extensions + a schema-creating role), not a pre-migrated one. The ETL runs migrateAuth/migrateCore + provisionSpace itself.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add MigrateOptions.engineSlugs (run.ts: MIGRATE_ENGINES env) to restrict Phase B to a subset of engines — Phase A still migrates all identities. Lets you smoke test the ETL against a throwaway target with the real (read-only) prod sources: a fast few-engine pass first, then a full rehearsal. Requested slugs that aren't active engines are reported in skippedEngines. Runbook §0 documents the rehearsal procedure (incl. target reset SQL for re-runs). Tests cover the filter + the not-found-slug case. typecheck + lint clean; 16 integration + 5 unit pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
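A minimal sketch of the subset filter described above. The comma-separated format for MIGRATE_ENGINES and both function names are assumptions made for illustration; only `engineSlugs` and `skippedEngines` come from the commit message.

```typescript
interface EngineSelection {
  engines: string[];        // slugs Phase B will process
  skippedEngines: string[]; // requested slugs that are not active engines
}

// Parse the env value; a comma-separated list is assumed here.
function parseEngineSlugs(raw: string | undefined): string[] | undefined {
  if (raw === undefined) return undefined;
  const slugs = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return slugs.length > 0 ? slugs : undefined;
}

// With no filter, every active engine is selected. With a filter, requested
// slugs are split into active ones and not-found ones (duplicates ignored).
function selectEngines(
  active: readonly string[],
  requested?: readonly string[],
): EngineSelection {
  if (requested === undefined) {
    return { engines: active.slice(), skippedEngines: [] };
  }
  const engines: string[] = [];
  const skippedEngines: string[] = [];
  for (const slug of requested) {
    if (engines.indexOf(slug) !== -1 || skippedEngines.indexOf(slug) !== -1) continue;
    if (active.indexOf(slug) !== -1) engines.push(slug);
    else skippedEngines.push(slug);
  }
  return { engines, skippedEngines };
}
```

For example, with hypothetical slugs, `selectEngines(["tiger-den", "small"], parseEngineSlugs("tiger-den, nope"))` selects `tiger-den` and reports `nope` as skipped.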
A subset-aware, read-only verifier across DB_ACCOUNTS/DB_SHARD/target. It checks:
- identity↔user counts + the auth.users == core.principal invariant
- oauth/session copy
- per-space memory counts (target vs source shard)
- ≥1 admin per space
- every member's effective build_tree_access non-empty
- the Tiger-Den access-parity spot-check (owner→owner@root, member→group grants)
Prints a ✓/✗ checklist, exits non-zero on failure. Runbook §5 points at it.
Verified the smoke-test target (Tiger Den + one small engine): all 12 checks passed (32 identities/users/principals, 18 sessions, memory counts match, and the role→group access resolves for the collaborator).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
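The checklist-plus-exit-code shape of such a verifier can be sketched as below. This is not the PR's verify.ts: the `Check` type and both function names are hypothetical, and the counts are assumed to have been fetched already so the sketch stays free of database calls.

```typescript
interface Check {
  name: string;
  ok: boolean;
  detail: string;
}

// Compare a source-side count with its target-side counterpart.
function countCheck(name: string, source: number, target: number): Check {
  return { name, ok: source === target, detail: `source=${source} target=${target}` };
}

// Render a ✓/✗ line per check plus a summary, and derive the exit code:
// 0 when everything passed, 1 otherwise.
function report(checks: readonly Check[]): { lines: string[]; exitCode: number } {
  const lines = checks.map((c) => `${c.ok ? "✓" : "✗"} ${c.name} (${c.detail})`);
  const failed = checks.filter((c) => !c.ok).length;
  lines.push(`${checks.length - failed}/${checks.length} checks passed`);
  return { lines, exitCode: failed === 0 ? 0 : 1 };
}

// Example using the identity counts reported above.
const result = report([
  countCheck("identities == auth.users", 32, 32),
  countCheck("auth.users == core.principal", 32, 32),
]);
for (const line of result.lines) console.log(line);
```

A runner would pass `result.exitCode` to the process exit call so that CI or a runbook step fails loudly on any ✗.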
Replace the per-row cross-DB insert with one batched insert per cursor fetch: read every column as text, then `insert … select … from unnest($1::text[], …)` with scalar casts in the projection (meta::jsonb, tree::ltree, temporal::tstzrange, embedding::halfvec). Validated the cast pattern locally (jsonb objects, halfvec, ltree incl. root, tstzrange, nulls all preserved). Cuts ~62k target round-trips to ~125 → the memory copy drops from tens of minutes to a few. Behavior unchanged: the row-level enqueue trigger still fires per inserted row (null-embedding rows enqueue), counts/embeddings/tree paths identical. Docs (plan §5, runbook) updated; 16 integration + unit suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
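The batching described above can be sketched as follows, under stated assumptions: the table name, the `id` and `content` columns, the `uuid` cast, and the batch size of 500 are all hypothetical (500 is simply a size consistent with "~62k round-trips to ~125"). Only the four casts in the projection come from the commit message. The sketch builds the column arrays and the statement text; it does not execute anything.

```typescript
// One source row with every column read as text (nulls preserved).
interface MemoryRowText {
  id: string;
  content: string;
  meta: string | null;
  tree: string | null;
  temporal: string | null;
  embedding: string | null;
}

// One statement per cursor fetch: text arrays in, scalar casts in the projection.
const INSERT_BATCH_SQL = `
  insert into memory (id, content, meta, tree, temporal, embedding)
  select id::uuid, content, meta::jsonb, tree::ltree,
         temporal::tstzrange, embedding::halfvec
  from unnest($1::text[], $2::text[], $3::text[],
              $4::text[], $5::text[], $6::text[])
    as t(id, content, meta, tree, temporal, embedding)`;

// Pivot rows into one array per column, in the parameter order above.
function toColumnArrays(rows: readonly MemoryRowText[]): (string | null)[][] {
  return [
    rows.map((r) => r.id),
    rows.map((r) => r.content),
    rows.map((r) => r.meta),
    rows.map((r) => r.tree),
    rows.map((r) => r.temporal),
    rows.map((r) => r.embedding),
  ];
}

// Number of insert statements needed for a given row count and batch size.
function roundTrips(totalRows: number, batchSize: number): number {
  return Math.ceil(totalRows / batchSize);
}

console.log(roundTrips(62111, 500)); // 125
```

Because each batch is still a plain `insert`, row-level triggers on the target fire once per inserted row, which matches the "behavior unchanged" note in the commit message.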
Full rehearsal against a test target (real prod sources, read-only): 34 spaces, 62,111 memories, 0 skipped, 0 warnings; verify.ts 114/114 checks pass (counts reconcile per space incl. the 20,862-row one; service-user→agent confirmed). Wall-clock 14m36s but ~3% CPU — I/O-bound (~1.8 GB of halfvec over the WAN), so run the real cutover in-region. Runbook timing notes corrected from "a few minutes" to the measured number. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
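The I/O-bound reading of those numbers can be sanity-checked with a little arithmetic. The sketch assumes "~1.8 GB" means roughly 1.8 × 10^9 bytes and that the transfer is spread evenly over the whole run, so the result is only a rough average.

```typescript
// Measured wall-clock time from the rehearsal: 14m36s.
const wallClockSeconds = 14 * 60 + 36; // 876

// Approximate payload (assumed decimal gigabytes).
const bytesTransferred = 1.8e9;

// Average throughput over the run.
const mbPerSecond = bytesTransferred / wallClockSeconds / 1e6;
const mbitPerSecond = mbPerSecond * 8;

console.log(mbPerSecond.toFixed(2));   // about 2.05 MB/s
console.log(mbitPerSecond.toFixed(1)); // about 16.4 Mbit/s
```

An average of roughly 2 MB/s at about 3% CPU is consistent with the link, not the ETL, being the limit, which is why running the cutover in-region should shorten it.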
One-time migration of production from the old org/engine/role + RLS model (deployed at server/v0.2.5, SHA a6cfabf) to the new auth/core/space model introduced in #71. Verified against live prod (read-only).
Topology
Production runs two separate clusters: DB_ACCOUNTS (identity) and DB_SHARD (memories, one me_<slug> schema per engine). The ETL writes a third, new database (auth + core + per-space me_<slug>). Sources are read-only; rollback = repoint the app at them.
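As a sketch of how the three roles map onto the environment variables named in this PR (DB_ACCOUNTS, DB_SHARD, DATABASE_URL), a runner might resolve them as below. The `MigrationConns` type and `readConns` function are hypothetical, and the environment is passed in as a plain object so the sketch does not depend on any runtime.

```typescript
// The three connection roles used by the ETL.
interface MigrationConns {
  accounts: string; // read-only source: identity
  shard: string;    // read-only source: memories
  target: string;   // the new database the ETL writes
}

// Resolve connection strings from an env-like map, failing fast on gaps.
function readConns(env: { [key: string]: string | undefined }): MigrationConns {
  const need = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`missing required environment variable ${name}`);
    }
    return value;
  };
  return {
    accounts: need("DB_ACCOUNTS"),
    shard: need("DB_SHARD"),
    target: need("DATABASE_URL"),
  };
}
```

The sketch deliberately does not reject identical URLs, since the integration test described in this PR points all three connections at one physical database.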
What's here
- PROD_MIGRATION_PLAN.md: design, full old→new mapping, decisions, run procedure, and the §9 survey results.
- PROD_MIGRATION_RUNBOOK.md: cutover (pre-flight, rollback, modes, decommission, cross-DB verification SQL).
- packages/migrate-prod: the ETL (migrateProdToMultiplayer(conns) over {accounts, shard, target}; run.ts runner) + survey.ts (the read-only §9 tool). Reuses the new code's own provisioning + core SQL functions.
Approach
Fresh target DB (no collisions; slugs reused verbatim). Provision auth/core, migrate identities; per engine create+provision the space, build roster/grants, and stream-copy memories cross-database (meta via sql.json). Run the ETL first, then point DATABASE_URL at the target (the app's idempotent boot migration becomes a no-op, dodging the helm --wait --atomic crashloop).
Key outcomes
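- Identities → auth.users + core.principal (id preserved); oauth links; sessions migrate (same sha256(token) bytea, so users stay logged in).
- tree_owner/tree_grant → grants (lossy, over-permissive, documented); RBAC roles → groups.
- Service users → agents (kind='a', owned by the engine's org owner; in practice the owner's coding agent), with grants preserved.
- API keys are not migrated (argon2, unrecoverable), so agents must re-issue keys: the one hard, user-visible break.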
§9 — verified against live prod (read-only)
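- No DDL drift; 32 identities, 34 active engines, 62,111 memories, 0 orphans, 0 pending invites, only 4 memories without embeddings.
- 33/34 orgs single-owner; 1 multi-member org; 1 RBAC role; 6 service users (all owners' coding agents) → mapped to agents.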
Tested
tsc + lint clean; 14 integration + 5 unit pass. The integration test stands in one physical DB for all three connections and covers the simple + complex scenarios (multi-member org, RBAC role→group, service-user→agent, grants, dangling identity, invitations, deleted/orphan engines).
🤖 Generated with Claude Code