You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs: add always-on performance pass to the PR review guide (#4040)
## Summary
Adds a dedicated, always-on Performance section to the repo PR review
guide (`.claude/REVIEW.md`), so every review weighs new work against
table size, hot paths, and how deep or wide the data it walks can get.
It names the tables to treat as huge (the `TaskRun` family in Postgres,
`task_events_v1`/`v2` in ClickHouse), the hot paths that warrant extra
scrutiny (trigger and batch trigger, dequeue, execution-snapshot writes,
OTEL ingestion, trace and run-list reads), the deep and wide shapes that
turn one run into a large tree or batch, and five named anti-patterns
with severities: per-level re-scans, dropping the partition-pruning
predicate, unbounded `IN` lists, sequential per-level round-trips, and
losing the single-query fast path.
Co-authored-by: Matt Aitken <matt@trigger.dev>
Copy file name to clipboardExpand all lines: .claude/REVIEW.md
+33-1Lines changed: 33 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -15,12 +15,44 @@ Reserve 🔴 for things that would page someone or block a rollback. In this cod
15
15
-**Queue / concurrency correctness.** RunQueue, MarQS (V1, legacy), redis-worker — any change to enqueue / dequeue / locking semantics. Re-derive the invariant on paper before flagging or accepting.
16
16
-**Missing index on a hot table.** New Prisma queries against `TaskRun`, `TaskRunExecutionSnapshot`, `JobRun`, `Project`, etc. must use an existing index. Check `internal-packages/database/prisma/schema.prisma` for the relevant `@@index` lines — don't guess and don't propose `EXPLAIN`.
17
17
-**Recovery-path queries.** Any `TaskRun.findFirst` / `findMany` added to a schedule, run-recovery, or restart loop. Recovery fan-outs (Redis crash, restart storms) turn "rare indexed query" into a DB incident. 🔴 even if indexed.
18
-
-**Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other multi-million-row tables. Use Redis or ClickHouse for counts.
18
+
-**Aggregations on hot tables.** No `COUNT` / `GROUP BY` on `TaskRun` or other tables that can reach billions of rows. Use Redis or ClickHouse for counts.
19
19
-**Prod Redis blast-radius.** New code paths that `SCAN` with broad patterns (`*foo*`) on prod-shaped Redis, or `EVAL` Lua with `SCAN` loops inside. Both are 🔴.
20
20
-**`@trigger.dev/core` direct import** from anywhere outside the SDK package. Always import from `@trigger.dev/sdk`. Core direct imports are 🔴 — they break the public API contract.
21
21
-**Heavy execute-deps imported into request-handler bundles.** Specifically `chat.handover` and similar split-bundle entry points must not transitively import the agent task's execute path. Watch for new imports added at module top-level of route files.
22
22
-**V1 engine code modified in a "V2 only" PR.** The `apps/webapp/app/v3/` directory contains both. If the PR description says V2-only but it touches `triggerTaskV1`, `cancelTaskRunV1`, `MarQS`, etc. — 🔴.
23
23
24
+
## Performance (always review)
25
+
26
+
Every PR gets a performance pass — not just the ones that look perf-sensitive. For each new query or unit of work, weigh three things: (a) the size of the table it hits, (b) whether it sits on a hot path, (c) whether the data it walks can be deep or wide (run trees, batches). The 🔴 bullets above on indexes, recovery-path queries, aggregations, and Redis `SCAN` are part of this pass — the rest below extends it.
27
+
28
+
**Treat these tables as large — no scans, no `COUNT` / `GROUP BY`, no unbounded fetch:**
29
+
30
+
-**Postgres — the `TaskRun` family:**`TaskRun`, `TaskRunExecutionSnapshot`, `Waitpoint`, `BatchTaskRun` and their join tables. Assume billions of rows.
31
+
-**ClickHouse — `task_events_v1` / `task_events_v2`.** Partitioned by `toDate(inserted_at)`; `ORDER BY (environment_id, toUnixTimestamp(start_time), trace_id)`. Note `span_id` / `parent_span_id` are NOT in the sort key — span-id lookups can't skip granules, only `environment_id` + a `start_time` window can.
32
+
33
+
**Hot paths — extra scrutiny on any added query or work:**
34
+
35
+
-**Trigger + batch trigger** (`triggerTask.server.ts`, `batchTriggerV3.server.ts`) — see `apps/webapp/CLAUDE.md`; do not add DB queries to these.
36
+
-**Dequeue / RunQueue** (`dequeueSystem.ts`, run-queue read/lock paths) — runs on every execution.
37
+
-**Execution-snapshot creation in the run engine** — any engine function that writes a `TaskRunExecutionSnapshot` runs per state transition; a new query there multiplies by run volume.
- Batch + parent/child fan-out (one run triggers thousands of children).
45
+
- Waitpoint / run-dependency chains.
46
+
- Tag / attribute many-to-many joins against the run/event tables.
47
+
48
+
**Anti-patterns (severity):**
49
+
50
+
-**Per-level fan-out that re-scans a large table once per tree depth** → 🔴. A BFS issuing one query per level (e.g. `parent_span_id IN {thisLevel}`) re-reads the same granules D times for a depth-D tree. Prefer one windowed query + an in-memory tree build.
51
+
-**Dropping the partition-pruning predicate** — `inserted_at` for ClickHouse, the `createdAt` window for partitioned Postgres — to "widen" a lookup → 🔴. Without it the query scans every partition. Keep a bounded window even for ancestor / backfill lookups.
52
+
-**Unbounded `IN (...)` built from a result set** (a BFS frontier, a batch's child ids) → 🟡. It can reach the row cap (`MAXIMUM_TRACE_SUMMARY_VIEW_COUNT` defaults to 25k). Cap or chunk to ≤1–2k ids per query.
53
+
-**Sequential per-level round-trips** where one recursive or windowed query would do → 🟡. N levels = N round-trip latencies stacked.
54
+
-**Replacing a single bounded query with a multi-query walk for _every_ call** (not just a rare fallback) → 🔴 on a hot read path, 🟡 elsewhere. Keep the cheap single-query path; branch into the expensive walk only when the cheap one comes up short.
55
+
24
56
## Always check
25
57
26
58
-**Tests use testcontainers, not mocks.** Vitest with `redisTest` / `postgresTest` / `containerTest` from `@internal/testcontainers`. Any new `vi.mock(...)` on Redis, Postgres, BullMQ, or other infra is wrong here — 🔴 if added in production-path tests, 🟡 if isolated unit test.
0 commit comments