docs: add always-on performance pass to the PR review guide (#4040)

matt-aitken · Matt Aitken · web-flow · commit 1d6f9002e90d · 2026-06-26T12:20:59.000+01:00
## Summary

Adds a dedicated, always-on Performance section to the repo PR review
guide (`.claude/REVIEW.md`), so every review weighs new work against
table size, hot paths, and how deep or wide the data it walks can get.

It names the tables to treat as huge (the `TaskRun` family in Postgres,
`task_events_v1`/`v2` in ClickHouse), the hot paths that warrant extra
scrutiny (trigger and batch trigger, dequeue, execution-snapshot writes,
OTEL ingestion, trace and run-list reads), the deep and wide shapes that
turn one run into a large tree or batch, and five named anti-patterns
with severities: per-level re-scans, dropping the partition-pruning
predicate, unbounded `IN` lists, sequential per-level round-trips, and
losing the single-query fast path.
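
To make two of the recommended alternatives concrete, here is a minimal TypeScript sketch: it chunks a large id list so that no single `IN (...)` grows unbounded, and it builds a span tree in memory from rows fetched by one bounded query instead of querying once per level. `SpanRow`, `SpanNode`, `chunk`, and `buildSpanTree` are hypothetical names invented for this illustration and are not code from the repo. The sketch assumes the rows have already been fetched with a bounded window.

```typescript
// Hypothetical row shape: only the fields needed to rebuild the tree.
interface SpanRow {
  spanId: string;
  parentSpanId: string | null;
}

interface SpanNode extends SpanRow {
  children: SpanNode[];
}

// Split a list of ids into chunks so no single IN (...) exceeds `size` entries.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build the span tree in memory from rows returned by ONE bounded query,
// rather than issuing one query per tree level.
function buildSpanTree(rows: readonly SpanRow[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const row of rows) {
    nodes.set(row.spanId, { ...row, children: [] });
  }

  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentSpanId !== null ? nodes.get(node.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // Parent is null, or it fell outside the fetched window.
      roots.push(node);
    }
  });
  return roots;
}
```

With a chunk size of 1000 (inside the guide's 1–2k ceiling), a 25k-id frontier becomes 25 bounded queries rather than one oversized `IN` list. The tree build is two linear passes over the fetched rows, so depth adds no extra round-trips.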

Co-authored-by: Matt Aitken <matt@trigger.dev>
diff --git a/.claude/REVIEW.md b/.claude/REVIEW.md
@@ -15,12 +15,44 @@ Reserve 🔴 for things that would page someone or block a rollback. In this cod
 - **Queue / concurrency correctness.** RunQueue, MarQS (V1, legacy), redis-worker — any change to enqueue / dequeue / locking semantics. Re-derive the invariant on paper before flagging or accepting.
 - **Missing index on a hot table.** New Prisma queries against `TaskRun`, `TaskRunExecutionSnapshot`, `JobRun`, `Project`, etc. must use an existing index. Check `internal-packages/database/prisma/schema.prisma` for the relevant `@@index` lines — don't guess and don't propose `EXPLAIN`.
 - **Recovery-path queries.** Any `TaskRun.findFirst` / `findMany` added to a schedule, run-recovery, or restart loop. Recovery fan-outs (Redis crash, restart storms) turn "rare indexed query" into a DB incident. 🔴 even if indexed.
-- **Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other multi-million-row tables. Use Redis or ClickHouse for counts.
+- **Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other tables that can reach billions of rows. Use Redis or ClickHouse for counts.
 - **Prod Redis blast-radius.** New code paths that `SCAN` with broad patterns (`*foo*`) on prod-shaped Redis, or `EVAL` Lua with `SCAN` loops inside. Both are 🔴.
 - **`@trigger.dev/core` direct import** from anywhere outside the SDK package. Always import from `@trigger.dev/sdk`. Core direct imports are 🔴 — they break the public API contract.
 - **Heavy execute-deps imported into request-handler bundles.** Specifically `chat.handover` and similar split-bundle entry points must not transitively import the agent task's execute path. Watch for new imports added at module top-level of route files.
 - **V1 engine code modified in a "V2 only" PR.** The `apps/webapp/app/v3/` directory contains both. If the PR description says V2-only but it touches `triggerTaskV1`, `cancelTaskRunV1`, `MarQS`, etc. — 🔴.
 
+## Performance (always review)
+
+Every PR gets a performance pass — not just the ones that look perf-sensitive. For each new query or unit of work, weigh three things: (a) the size of the table it hits, (b) whether it sits on a hot path, (c) whether the data it walks can be deep or wide (run trees, batches). The 🔴 bullets above on indexes, recovery-path queries, aggregations, and Redis `SCAN` are part of this pass — the rest below extends it.
+
+**Treat these tables as large — no scans, no `COUNT` / `GROUP BY`, no unbounded fetch:**
+
+- **Postgres — the `TaskRun` family:** `TaskRun`, `TaskRunExecutionSnapshot`, `Waitpoint`, `BatchTaskRun` and their join tables. Assume billions of rows.
+- **ClickHouse — `task_events_v1` / `task_events_v2`.** Partitioned by `toDate(inserted_at)`; `ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)`. Note `span_id` / `parent_span_id` are NOT in the sort key — span-id lookups can't skip granules, only `environment_id` + a `start_time` window can.
+
+**Hot paths — extra scrutiny on any added query or work:**
+
+- **Trigger + batch trigger** (`triggerTask.server.ts`, `batchTriggerV3.server.ts`) — see `apps/webapp/CLAUDE.md`; do not add DB queries to these.
+- **Dequeue / RunQueue** (`dequeueSystem.ts`, run-queue read/lock paths) — runs on every execution.
+- **Execution-snapshot creation in the run engine** — any engine function that writes a `TaskRunExecutionSnapshot` runs per state transition; a new query there multiplies by run volume.
+- **OTEL ingestion** (`otel.v1.traces.ts`, `otel.v1.logs.ts`) — write volume scales with customer span counts.
+- **Trace + run-list reads** (trace view, run list, span detail) — read paths over the large tables above.
+
+**Deep / wide shapes — one run can explode into a huge tree or batch; code that walks them is the trap:**
+
+- Trace span subtrees (deeply nested child runs → deep span trees).
+- Batch + parent/child fan-out (one run triggers thousands of children).
+- Waitpoint / run-dependency chains.
+- Tag / attribute many-to-many joins against the run/event tables.
+
+**Anti-patterns (severity):**
+
+- **Per-level fan-out that re-scans a large table once per tree depth** → 🔴. A BFS issuing one query per level (e.g. `parent_span_id IN {thisLevel}`) re-reads the same granules D times for a depth-D tree. Prefer one windowed query + an in-memory tree build.
+- **Dropping the partition-pruning predicate** — `inserted_at` for ClickHouse, the `createdAt` window for partitioned Postgres — to "widen" a lookup → 🔴. Without it the query scans every partition. Keep a bounded window even for ancestor / backfill lookups.
+- **Unbounded `IN (...)` built from a result set** (a BFS frontier, a batch's child ids) → 🟡. It can reach the row cap (`MAXIMUM_TRACE_SUMMARY_VIEW_COUNT` defaults to 25k). Cap or chunk to ≤1–2k ids per query.
+- **Sequential per-level round-trips** where one recursive or windowed query would do → 🟡. N levels = N round-trip latencies stacked.
+- **Replacing a single bounded query with a multi-query walk for _every_ call** (not just a rare fallback) → 🔴 on a hot read path, 🟡 elsewhere. Keep the cheap single-query path; branch into the expensive walk only when the cheap one comes up short.
+
 ## Always check
 
 - **Tests use testcontainers, not mocks.** Vitest with `redisTest` / `postgresTest` / `containerTest` from `@internal/testcontainers`. Any new `vi.mock(...)` on Redis, Postgres, BullMQ, or other infra is wrong here — 🔴 if added in production-path tests, 🟡 if isolated unit test.