458 changes: 458 additions & 0 deletions PROD_MIGRATION_PLAN.md

Large diffs are not rendered by default.

224 changes: 224 additions & 0 deletions PROD_MIGRATION_RUNBOOK.md
@@ -0,0 +1,224 @@
# Prod cutover runbook — old → multiplayer

Operational steps to cut production from the old org/engine/role model
(`server/v0.2.5`) to the new auth/core/space model, via a **maintenance window**:
stop the old app, run the ETL once into a new database, repoint + deploy the new
app, verify, re-issue agent keys. The **why** and the full mapping live in
`PROD_MIGRATION_PLAN.md`; the ETL is `packages/migrate-prod`. This doc is the
**how-to-operate**.

> ⚠️ **One hard, user-visible break: API keys do not migrate.** Old keys are
> argon2-hashed and engine-scoped. **Humans keep working** (sessions migrate —
> plan §2.3). Only **agents** need a re-issued key (`apiKey.create` → update
> `ME_API_KEY` where the agent runs). The agents are the old "service users"; run
> `survey.ts` for the current list (at the last §9 survey: 6 agents across 3
> owners). Plan the comms before starting.

The migration reads **two source databases** — `DB_ACCOUNTS` (identity) and
`DB_SHARD` (memories) — and writes a **third, new database**. The sources are
never modified.

Grounded facts this runbook relies on (verified in code):

- The new server **auto-migrates on boot** (`packages/server/start.ts`):
`bootstrapSpaceDatabase` → `migrateCore` → `migrateAuth` → `remigrateSpaces`
(re-runs `migrateSpace` for every `core.space`). All idempotent. ⇒ **run the
ETL first**, so the new app's boot migration is a no-op and we avoid the
`helm --wait --atomic` crashloop-then-rollback failure mode (CLAUDE.md footgun).
- The new server connects via `DATABASE_URL` (falling back to the legacy
`ENGINE_DATABASE_URL`, `start.ts:202`). The target is a **new** database, so the
chart's database secret **must be repointed** at it (the old `ENGINE_DATABASE_URL`
points at `DB_SHARD`).
- Prod is a k8s Deployment `memory-engine` in namespace `memory-engine`, deployed
by pushing a `server/v*` tag (→ `timescale/tiger-agents-deploy`). The new app
image must be built from a commit that includes PR #71 (the multiplayer model).

---

## 0. Rehearsal / smoke test (do this first, against a throwaway target)

The ETL never writes to the sources, so you can rehearse it against the **real** prod
sources (read-only) writing to a **throwaway target** — no window, no risk.

1. Provision an empty test target (extensions creatable — see §1) and put its URL
in `.env.local` as `DATABASE_URL` (alongside `DB_ACCOUNTS`/`DB_SHARD`; Bun
auto-loads `.env.local`).
2. **Fast first pass — a few engines.** `MIGRATE_ENGINES` limits Phase B to a
subset (Phase A still migrates all identities). Include the multi-member engine
to exercise the role→group + member-roster path:
```
MIGRATE_ENGINES=5ld4wito9c8o,<a-small-slug> bun packages/migrate-prod/run.ts
```
Validates connectivity to all three DBs, target extensions/privileges, the
control plane, and the complex path — in seconds.
3. **Full rehearsal.** Drop the `MIGRATE_ENGINES` var and run the whole thing:
```
bun packages/migrate-prod/run.ts
```
This copies all ~62k memories (batched inserts, ~500/statement) from prod
(read-only), putting read load on the source shard. A measured remote-host
rehearsal took **~15 min** — it's I/O-bound (≈1.8 GB of halfvec data over the
WAN + HNSW index builds, ~3% CPU), so running **in-region** is materially
faster. It validates timing and full correctness end-to-end.
4. **Verify** the target with the §5 queries (counts, ≥1 admin per space, access
spot-check). Confirm the report has `skippedEngines: []` and `warnings: []`.
5. **To re-run**, reset the target first — the ETL is not idempotent on a dirty
target (duplicate ids/slugs). Against the **test target only**:
```sql
do $$ declare s text; begin
  for s in
    select nspname from pg_namespace
    where nspname in ('auth','core') or nspname ~ '^me_[a-z0-9]{12}$'
  loop execute format('drop schema if exists %I cascade', s); end loop;
end $$;
```
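
To preview which schemas the reset block above would drop, its name filter can be mirrored outside SQL. This is a minimal sketch that assumes the same pattern as the `do` block; `isResetSchema` is a hypothetical helper and is not part of `packages/migrate-prod`.

```typescript
// Hypothetical helper mirroring the reset block's filter:
//   nspname in ('auth','core') or nspname ~ '^me_[a-z0-9]{12}$'
const FIXED_SCHEMAS = new Set(["auth", "core"]);
const SPACE_SCHEMA = /^me_[a-z0-9]{12}$/;

function isResetSchema(name: string): boolean {
  return FIXED_SCHEMAS.has(name) || SPACE_SCHEMA.test(name);
}

// Schema names here are examples; only the 12-character slug form matches.
const candidateSchemas = [
  "auth",
  "core",
  "public",
  "me_5ld4wito9c8o",
  "me_short",
  "accounts",
];
console.log(candidateSchemas.filter(isResetSchema));
// prints [ "auth", "core", "me_5ld4wito9c8o" ]
```

Listing the matches first (for example from `select nspname from pg_namespace`) is a cheap way to confirm you are connected to the test target before running the destructive block.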

---

## 1. Pre-flight (all must pass before you start)

- [ ] **Data verified (§9).** The DDL/data survey is done (plan §9.1). **Re-run
`survey.ts` at cutover** for fresh numbers — expect roughly: ~32 users, 34
active engines, ~62k memories, 0 orphans, every org with an owner.
- [ ] **Empty target database created** — just an empty database; the ETL
connects to it but never runs `CREATE DATABASE`. **Do not pre-migrate it:** the ETL runs
`migrateAuth`/`migrateCore` + per-engine `provisionSpace`, which create the
`auth`/`core`/`me_<slug>` schemas and run `create extension if not exists`
(`citext`, `ltree`, `vector`/pgvector, `pg_textsearch` in `public`). So
either pre-install those 4 extensions **or** make sure the ETL's role is
allowed to create them (often superuser-only on managed Postgres), and
confirm that the role can create schemas (it then owns them).
- [ ] **Connections + privileges.** The ETL opens three connections:
`DB_ACCOUNTS` (read `accounts.*`), `DB_SHARD` (read `me_<slug>.*`), and the
target (create + own `auth`/`core`/`me_<slug>`). Confirm each role has the
access it needs (see the Appendix privilege query).
- [ ] **ETL runnable against prod.** It's a separate workspace package
(`@memory.build/migrate-prod`), not in the server image. Either run it from a
maintenance host with the repo + bun and network access to all three
databases, or as a one-off in-cluster Job whose image includes it, with
`DB_ACCOUNTS`/`DB_SHARD`/`DATABASE_URL` from secrets.
- [ ] **Snapshots** of both source databases taken (the ETL doesn't modify them,
but snapshot before any prod operation).
- [ ] **New app image built** from a post-PR#71 commit (a `server/v*` tag), ready
to deploy but **not yet rolled out**; the chart DB-secret repoint (→ new DB)
prepared.
- [ ] **Agent re-issue plan** ready: the list of agents (former service users,
from `survey.ts`) and where each reads `ME_API_KEY`, plus the comms.

---

## 2. Rollback

The sources are read-only throughout, so rollback is trivial **until you
decommission them (§4)**:

1. Repoint the chart's DB secret back to `DB_SHARD`/`DB_ACCOUNTS` and redeploy the
old `server/v0.2.5` image (or scale the new app to 0 and the old app back up).
The old app sees its original, untouched databases.
2. Drop/abandon the new target database.

**Hard fallback:** restore a source snapshot (only needed if something external
mutated a source — the ETL never does). After §4, rollback = restore snapshots.

**Abort criteria during the run:** any ETL error that isn't a clearly-understood,
single-engine data anomaly; reconciliation counts that don't match (§5); the new
app failing its boot health check.

---

## 3. Cutover

1. **Announce** the window; freeze agent/CLI writes.
2. **Stop the old app** (halts writes to the sources):
```
kubectl -n memory-engine scale deploy/memory-engine --replicas=0
kubectl -n memory-engine rollout status deploy/memory-engine --timeout=120s
```
3. **Final snapshot** of the source databases (post-quiesce).
4. **Run the ETL** (three connections; target = the new DB):
```
# DATABASE_URL is the NEW target database, not one of the sources.
DB_ACCOUNTS="postgresql://…" \
DB_SHARD="postgresql://…" \
DATABASE_URL="postgresql://…" \
bun packages/migrate-prod/run.ts
```
The memory copy is batched (~62k rows, ~500/insert; the largest engine ~20.9k
in one transaction). A remote rehearsal measured **~15 min** (I/O-bound, ≈1.8 GB
over the WAN) — **run in-region to cut that down**. It prints a JSON report —
check against the §9 survey: `engines` ≈ 34, `skippedEngines` and `warnings` both empty
(no dangling users / nested roles in prod). Investigate any surprise before
proceeding.
5. **Verify** with the §5 reconciliation queries. **Gate:** counts match and every
space has ≥1 admin. If not → abort + rollback (§2).
6. **Repoint + deploy the new app.** Update the chart DB secret to the new
database, then push the new `server/v*` tag (→ tiger-agents-deploy). On boot it
runs the idempotent `migrateCore`/`migrateAuth`/`remigrateSpaces` (no-ops over
the ETL's work) and begins serving.
```
kubectl -n memory-engine rollout status deploy/memory-engine --timeout=300s
```
If boot crashloops on migration, the DB state is wrong — capture `kubectl logs`
**before** Helm rolls back, then rollback (§2).
7. **Smoke test:** a known user logs in (session should still be valid); lists
spaces; reads a memory; runs a search (BM25 + semantic). The worker should
drain `embedding_queue` for the few null-embedding memories.
8. **Re-issue agent API keys** and update each `ME_API_KEY`. Old keys are dead.
9. **Open the window.** Monitor logs/metrics + the embedding backfill for a soak.
10. **Decommission the old databases** — only after the soak (§4).
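
The report check in step 4 can be made mechanical. The sketch below assumes a report shape: only the field names `engines`, `skippedEngines`, and `warnings` appear in this runbook, so the `ReportLike` interface and the `reportProblems` helper are hypothetical, and `engines` is accepted either as a count or as a list because the runbook does not say which it is.

```typescript
// Assumed report shape: only the three field names are taken from the runbook.
interface ReportLike {
  engines: number | unknown[];
  skippedEngines: unknown[];
  warnings: unknown[];
}

// Returns a list of human-readable problems; an empty list means "proceed".
function reportProblems(report: ReportLike, expectedEngines: number): string[] {
  const problems: string[] = [];
  const engines = report.engines;
  const engineCount = typeof engines === "number" ? engines : engines.length;
  if (engineCount !== expectedEngines) {
    problems.push(`engines: got ${engineCount}, expected ${expectedEngines}`);
  }
  if (report.skippedEngines.length > 0) {
    problems.push(`skippedEngines: ${report.skippedEngines.length} entries`);
  }
  if (report.warnings.length > 0) {
    problems.push(`warnings: ${report.warnings.length} entries`);
  }
  return problems;
}

// expectedEngines should come from a fresh survey.ts run, not a hard-coded 34.
const sampleReport: ReportLike = { engines: 33, skippedEngines: ["x"], warnings: [] };
console.log(reportProblems(sampleReport, 34));
```

Returning every problem rather than failing on the first gives the operator the full picture before the abort-or-proceed decision.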

---

## 4. Decommission the old databases (after the soak)

There is no teardown SQL — the migration never modified the sources. Once cutover
is confirmed and the rollback window has closed, decommission `DB_ACCOUNTS` and
`DB_SHARD` per your infra (final snapshot, then drop/retire the instances). Keep
them intact until then so rollback (§2) stays a simple repoint.

Optional chart cleanup afterward: remove the now-unused `ACCOUNTS_DATABASE_URL`/
`ACCOUNTS_MASTER_KEY` and the pool-split values, leaving a single `DATABASE_URL`.

---

## 5. Appendix — verification queries

`verify.ts` runs all of the below automatically (read-only across the three DBs)
and prints a ✓/✗ checklist — `bun packages/migrate-prod/verify.ts`. It's
subset-aware (only checks spaces present in the target), so it works for a
rehearsal too. The raw queries follow for ad-hoc checks; run each `select`
against the connection named in its comment.

**Privilege pre-flight** (run as the **target** role, before starting):
```sql
select has_database_privilege(current_user, current_database(), 'create') as can_create_schema;
```
(And confirm the `DB_ACCOUNTS`/`DB_SHARD` roles can `select` the `accounts` /
`me_<slug>` schemas.)

**Reconciliation:**
```sql
-- identities (DB_ACCOUNTS) vs users + user principals (target) — all equal
select count(*) from accounts.identity; -- DB_ACCOUNTS
select count(*) from auth.users; -- target
select count(*) from core.principal where kind = 'u'; -- target

-- active engines (DB_ACCOUNTS) vs spaces (target) — equal (0 orphans in prod)
select count(*) from accounts.engine where status = 'active'; -- DB_ACCOUNTS
select count(*) from core.space; -- target

-- every space has >= 1 effective admin (target) — expect 0 rows
select s.slug, count(*) filter (where ps.admin) as admins
from core.space s
left join core.principal_space ps on ps.space_id = s.id
group by s.slug having count(*) filter (where ps.admin) = 0;

-- per engine: memory counts must match between DB_SHARD and the target
-- select count(*) from me_<slug>.memory; -- DB_SHARD (source)
-- select count(*) from me_<slug>.memory; -- target (new)
```
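
The pass/fail rule behind these queries can be written down as a single gate. This is a sketch under stated assumptions: the `Counts` shape and the `reconcile` function are hypothetical (the real checks live in `verify.ts`, whose internals are not shown here), and the numbers in the example are illustrative, not prod values.

```typescript
// Hypothetical container for the results of the reconciliation queries.
interface Counts {
  identities: number; // accounts.identity (DB_ACCOUNTS)
  users: number; // auth.users (target)
  userPrincipals: number; // core.principal where kind = 'u' (target)
  activeEngines: number; // accounts.engine where status = 'active' (DB_ACCOUNTS)
  spaces: number; // core.space (target)
  spacesWithoutAdmin: number; // rows returned by the admin query (target)
  memories: Record<string, { source: number; target: number }>; // per slug
}

// Returns the list of mismatches; the cutover gate passes only when it is empty.
function reconcile(c: Counts): string[] {
  const failures: string[] = [];
  if (c.identities !== c.users || c.users !== c.userPrincipals) {
    failures.push(
      `identity mismatch: ${c.identities} identities, ${c.users} users, ${c.userPrincipals} principals`,
    );
  }
  if (c.activeEngines !== c.spaces) {
    failures.push(`engine mismatch: ${c.activeEngines} engines, ${c.spaces} spaces`);
  }
  if (c.spacesWithoutAdmin !== 0) {
    failures.push(`${c.spacesWithoutAdmin} space(s) without an admin`);
  }
  for (const [slug, m] of Object.entries(c.memories)) {
    if (m.source !== m.target) {
      failures.push(`me_${slug}: ${m.source} source vs ${m.target} target memories`);
    }
  }
  return failures;
}

// Illustrative numbers only.
console.log(
  reconcile({
    identities: 32,
    users: 32,
    userPrincipals: 32,
    activeEngines: 34,
    spaces: 34,
    spacesWithoutAdmin: 0,
    memories: { "5ld4wito9c8o": { source: 100, target: 99 } },
  }),
);
```

For a subset rehearsal with `MIGRATE_ENGINES`, the engine-versus-space comparison would need to be restricted to the migrated slugs, which is what the subset-aware `verify.ts` is described as doing.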

**Spot-check access parity** (sample a few members): the new
`core.build_tree_access(member_id, space_id)` reachable set (target) should cover
what the old `me_<slug>.tree_access(user_id, action)` allowed (DB_SHARD). Pick a
known owner (expect `owner@root`) and the one collaborator in the multi-member
space (expect only their granted subtrees + `home`).
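
The "should cover" comparison can be sketched as a path-coverage check. This assumes both sides have already been reduced to lists of dot-separated `ltree` paths, and that the root is the empty path; neither the return shape of `core.build_tree_access` nor that of `me_<slug>.tree_access` is shown in this runbook, so treat `isCovered` and `uncoveredPaths` as hypothetical illustrations.

```typescript
// Assumption: a grant on path `a.b` covers `a.b` itself and every descendant
// such as `a.b.c`; a grant on the root (empty path) covers everything.
function isCovered(path: string, grants: readonly string[]): boolean {
  return grants.some(
    (g) => g === "" || path === g || path.startsWith(`${g}.`),
  );
}

// Old-model paths that the new grants fail to reach; expect an empty list.
function uncoveredPaths(
  oldPaths: readonly string[],
  newGrants: readonly string[],
): string[] {
  return oldPaths.filter((p) => !isCovered(p, newGrants));
}

// Example paths are made up for illustration.
console.log(uncoveredPaths(["home.notes", "projects.alpha"], ["home"]));
// prints [ "projects.alpha" ]
console.log(uncoveredPaths(["home.notes", "projects.alpha"], [""]));
// prints []
```

Because the action mapping is intentionally over-permissive, the new set may be wider than the old one; this check only flags access that was lost, not access that was gained.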
11 changes: 11 additions & 0 deletions bun.lock

Some generated files are not rendered by default.

26 changes: 26 additions & 0 deletions packages/migrate-prod/index.ts
@@ -0,0 +1,26 @@
/**
* One-time prod → multiplayer migration. See PROD_MIGRATION_PLAN.md.
*
* `migrateProdToMultiplayer(conns)` runs the whole ETL across the three
* databases (Phases A+B); `migrateControlPlane`/`migrateEngine` are the per-phase
* functions. The source databases are never modified, so there is no teardown
* SQL — rollback is repointing the app at the old databases.
*/

export { mapActionsToLevel, type OldAction, orgRoleIsAdmin } from "./mapping";
export {
  type Connections,
  type EngineReport,
  type MigrateOptions,
  type MigrationReport,
  migrateControlPlane,
  migrateEngine,
  migrateProdToMultiplayer,
} from "./migrate";
export {
  DEFAULT_CONFIG,
  type MigrationConfig,
  prefixed,
  sourceSpaceSchema,
  targetSpaceSchema,
} from "./schemas";
35 changes: 35 additions & 0 deletions packages/migrate-prod/mapping.test.ts
@@ -0,0 +1,35 @@
import { describe, expect, test } from "bun:test";
import { mapActionsToLevel, orgRoleIsAdmin } from "./mapping";

describe("mapActionsToLevel", () => {
  test("read-only → read (1)", () => {
    expect(mapActionsToLevel(["read"], false)).toBe(1);
  });

  test("any write action → write (2), additive over read", () => {
    expect(mapActionsToLevel(["create"], false)).toBe(2);
    expect(mapActionsToLevel(["update"], false)).toBe(2);
    expect(mapActionsToLevel(["delete"], false)).toBe(2);
    expect(mapActionsToLevel(["read", "create"], false)).toBe(2);
    expect(
      mapActionsToLevel(["read", "create", "update", "delete"], false),
    ).toBe(2);
  });

  test("with_grant_option (delegation) → owner (3), regardless of actions", () => {
    expect(mapActionsToLevel(["read"], true)).toBe(3);
    expect(mapActionsToLevel([], true)).toBe(3);
  });

  test("empty action set defaults to read (1)", () => {
    expect(mapActionsToLevel([], false)).toBe(1);
  });
});

describe("orgRoleIsAdmin", () => {
  test("owner and admin are admins; member is not", () => {
    expect(orgRoleIsAdmin("owner")).toBe(true);
    expect(orgRoleIsAdmin("admin")).toBe(true);
    expect(orgRoleIsAdmin("member")).toBe(false);
  });
});
36 changes: 36 additions & 0 deletions packages/migrate-prod/mapping.ts
@@ -0,0 +1,36 @@
import { ACCESS, type AccessLevel } from "@memory.build/engine/core";

/** Old engine grant actions (`me_<slug>.tree_grant.actions`). */
export type OldAction = "read" | "create" | "update" | "delete";

const WRITE_ACTIONS = new Set<string>(["create", "update", "delete"]);

/**
* Map an old `tree_grant` (its action set + `with_grant_option`) to a single new
* `tree_access` level. See PROD_MIGRATION_PLAN.md §4.3 — this is intentionally
* **lossy and over-permissive**:
* - `with_grant_option = true` (delegation) → owner (3)
* - any of {create, update, delete} present → write (2), which is additive and
* therefore also grants read, even if the old grant lacked `read`
* - read-only (or, defensively, an empty set) → read (1)
*
* The old 4-action granularity has no lossless new representation; a `{delete}`-
* only or `{read,create}` grant widens to full write. Flagged as a conscious
* choice; the §9 prod survey checks whether any such non-trivial grants exist.
*/
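export function mapActionsToLevel(
  actions: readonly string[],
  withGrantOption: boolean,
): AccessLevel {
  // Delegation wins over everything else, including an empty action set.
  if (withGrantOption) return ACCESS.owner;
  // Any write-type action widens to full write (which also implies read).
  if (actions.some((a) => WRITE_ACTIONS.has(a))) return ACCESS.write;
  return ACCESS.read;
}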

/** The org role drives both space-admin and whole-space ownership. */
export type OldOrgRole = "owner" | "admin" | "member";

/** Org owner/admin were engine superusers → space admin + owner@root. */
export function orgRoleIsAdmin(role: string): boolean {
  return role === "owner" || role === "admin";
}