Skip to content

Fix /internal/metrics ClickHouse OOM#1457

Open
BilalG1 wants to merge 3 commits into
devfrom
fix/metrics-clickhouse-oom
Open

Fix /internal/metrics ClickHouse OOM#1457
BilalG1 wants to merge 3 commits into
devfrom
fix/metrics-clickhouse-oom

Conversation

@BilalG1
Copy link
Copy Markdown
Collaborator

@BilalG1 BilalG1 commented May 20, 2026

Fixes Sentry STACK-BACKEND-16H — the /api/v1/internal/metrics endpoint was triggering the cluster's 10.8 GiB OvercommitTracker kill on tenants with months of $token-refresh history.

Root cause

Three queries in loadAnalyticsOverview plus loadUsersByCountry did GROUP BY user_id over the events table with no lower event_at bound, so their hash table working set scaled with cumulative-distinct-users-ever-seen instead of the 30-day metrics window.

Changes

  • Add 30-day event_at lower bound to loadUsersByCountry and to the analyticsUserJoin inner subquery (used by dailyEvents, totalVisitors, topReferrers).
  • New getClickhouseAdminClientForMetrics() factory in lib/clickhouse.tsx with connection-level safety net: per-query + per-user memory caps, external GROUP BY spill, and join_algorithm: 'grace_hash,parallel_hash,hash' (grace_hash measured to give 48% memory reduction at zero latency cost — see benchmark notes in the file).
  • Inline comment + concrete next steps for the long-term fix (option C: stamp is_anonymous at ingest on page-view/click events, then drop the join entirely).
  • Extend scripts/benchmark-internal-metrics.ts with the historical-seed knob and three new modes (BENCH_BACKFILL_COMPARE, BENCH_JOIN_ALGO_COMPARE, plus the existing BENCH_ROUTE_QUERIES updated) used to validate the choices above.

Benchmark — pre-PR vs post-PR

Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M events).

pre-PR post-PR delta
Sum peak memory 2.18 GiB 515 MiB 4.3× less
Max query duration 1293 ms 101 ms 12.8× faster
Sum CPU duration 5119 ms 394 ms 13× less work
Sum bytes read 3.87 GiB 929 MiB 4.3× less I/O

Per-query at 300k users:

  • analyticsOverview:dailyEvents 561 → 44 MiB (12.8× less)
  • analyticsOverview:totalVisitors 560 → 50 MiB (11.2× less)
  • analyticsOverview:topReferrers 546 → 50 MiB (10.9× less)
  • loadUsersByCountry 388 → 44 MiB (8.9× less)

Caveats

  • loadDailyActiveSplitFromClickhouse still scans all-history on its min(event_at) subquery. It can't be naively bounded — first_date is used to classify entities as new vs reactivated, and a 30d bound would silently mislabel old-but-active entities as "new." The new SETTINGS cap+spill it; the proper fix is option C (documented inline).
  • A user with a page-view but no $token-refresh in the last 30 days now falls through to coalesce(NULL, 0) and is classified non-anonymous. Token-refresh fires every few minutes per active session, so this is rare but not impossible (embedded SDKs that poll less frequently, sessions straddling the 30d boundary).
  • max_memory_usage_for_user: 9 GB trades "cluster-wide OvercommitTracker kill of a random query" for "clean per-user memory error attributed to the specific query." After our 30d bounds, no query is anywhere near 9 GB.

Test plan

  • pnpm typecheck passes
  • pnpm lint passes
  • pnpm test run apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts — 9/10 pass; the 1 failure (risk_scores snapshot drift) reproduces on clean dev and is unrelated
  • pnpm test run apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts apps/e2e/tests/backend/performance/metrics.test.ts — all passing tests pass; 10 pre-existing PRODUCT_DOES_NOT_EXIST setup failures reproduce on clean dev
  • Benchmark BENCH_ROUTE_QUERIES=1 at 300k users shows the deltas above

Summary by CodeRabbit

  • Chores
    • Improved internal metrics collection to use metrics-specific DB settings for more reliable, safer analytical reads.
    • Added guardrails to metrics queries to enforce time-window bounds and avoid unbounded scans.
    • Expanded benchmark modes (backfill and join-algo comparisons), extended perf seeding, and improved logging/retry behavior to capture more complete stats and reduce missing log rows.

Review Change Stack

Sentry STACK-BACKEND-16H — the endpoint was triggering the cluster's
10.8 GiB OvercommitTracker kill on tenants with months of $token-refresh
history. Three queries did GROUP BY user_id over the events table without
a lower event_at bound, so their working set scaled with cumulative-users-
ever-seen instead of with the 30-day metrics window.

Changes:
- Add 30-day event_at lower bound to loadUsersByCountry and to the
  analyticsUserJoin inner subquery (used by dailyEvents, totalVisitors,
  topReferrers).
- New getClickhouseAdminClientForMetrics factory in lib/clickhouse.tsx
  applying memory caps (max_memory_usage, max_memory_usage_for_user) +
  external GROUP BY spill + join_algorithm fallback list with grace_hash
  first.
- Document loadDailyActiveSplitFromClickhouse's unbounded min(event_at)
  subquery + the long-term fix (option C: stamp is_anonymous at ingest +
  ASOF backfill).
- Extend the benchmark script with BENCH_HISTORICAL_DAYS, BENCH_BACKFILL_COMPARE,
  and BENCH_JOIN_ALGO_COMPARE modes used to validate the choices above.

Measured at 300k users seeded across 365 days:
  Sum peak memory: 2.18 GiB -> 515 MiB (4.3x less)
  Max query duration: 1293 ms -> 101 ms (12.8x faster)
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
stack-auth-hosted-components Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-auth-mcp Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-auth-skills Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-backend Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-dashboard Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-demo Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-docs Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-preview-backend Ready Ready Preview, Comment May 20, 2026 10:47pm
stack-preview-dashboard Ready Ready Preview, Comment May 20, 2026 10:47pm

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00d4197c-58f3-4885-98ad-db2f0b552846

📥 Commits

Reviewing files that changed from the base of the PR and between adf8f32 and a24be20.

📒 Files selected for processing (2)
  • apps/backend/scripts/benchmark-internal-metrics.ts
  • apps/backend/src/lib/clickhouse.tsx

📝 Walkthrough

Walkthrough

Adds METRICS_CLICKHOUSE_SETTINGS and getClickhouseAdminClientForMetrics, applies the metrics client to internal metrics queries with explicit event_at time-window bounds, and extends the benchmark script with backfill and join-algorithm comparison modes that run using the metrics client and new query variants.

Changes

ClickHouse metrics client and route updates

Layer / File(s) Summary
ClickHouse metrics client factory
apps/backend/src/lib/clickhouse.tsx
createClickhouseClient now accepts optional clickhouse_settings and forwards them to the ClickHouse client. METRICS_CLICKHOUSE_SETTINGS defines memory/bytes caps and join algorithm preferences. getClickhouseAdminClientForMetrics() factory is exported to create admin clients preconfigured with these metrics-safe settings.
Internal metrics route client and time-window integration
apps/backend/src/app/api/latest/internal/metrics/route.tsx
All ClickHouse queries switch to getClickhouseAdminClientForMetrics. loadUsersByCountry accepts now and computes since/untilExclusive bounds passed to the query. Event filters in loadAnalyticsOverview, loadDailyActiveSplitFromClickhouse, and helper functions are tightened with explicit event_at bounds. Handler wiring threads now through the call chain.

Benchmark script metrics client and comparison modes

Layer / File(s) Summary
Benchmark metrics client integration
apps/backend/scripts/benchmark-internal-metrics.ts
Script imports getClickhouseAdminClientForMetrics. runRouteQuery and runAndCollect accept useMetricsClient flag to route execution through the metrics-configured client. Route query benchmarking invocations enable metrics client usage for production-realistic measurements.
Benchmark query variants and enhancements
apps/backend/scripts/benchmark-internal-metrics.ts
SQL for loadUsersByCountry and analyticsOverview joins adds lower event_at >= since bounds to subqueries. New ROUTE_QUERIES_AFTER variant enforces the 30-day metrics window. Data seeding gains BENCH_HISTORICAL_DAYS control. Stats retry delays increase. Option-D cleanup column dropping is added best-effort.
Backfill and join algorithm comparison benchmarks
apps/backend/scripts/benchmark-internal-metrics.ts
Backfill comparison mode creates/populates/drops option-D column and benchmarks options A vs B/C/E vs D. Join algorithm comparison mode runs analyticsOverview across multiple join algorithms in bounded and unbounded cases with error handling. main() conditionally seeds and runs these modes from env flags.

Sequence Diagram(s)

sequenceDiagram
  participant Handler as Metrics Handler
  participant LoadUsers as loadUsersByCountry
  participant LoadOverview as loadAnalyticsOverview
  participant CH as ClickHouse (metrics client)
  Handler->>Handler: compute now
  Handler->>LoadUsers: pass now
  LoadUsers->>LoadUsers: compute since,
  LoadUsers->>CH: query with event_at bounds and memory caps
  Handler->>LoadOverview: pass now
  LoadOverview->>CH: analyticsUserJoin with event_at >= since
  CH-->>Handler: results with resource limits applied
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A rabbit tuned the ClickHouse pipes,
With bounded times and capped delights,
It seeded rows and ran the test,
Compared the joins to find the best,
And danced on bytes beneath moonlight 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix /internal/metrics ClickHouse OOM' directly and clearly summarizes the main change—fixing a ClickHouse out-of-memory issue in the /internal/metrics endpoint, which aligns with the changeset's primary objective.
Description check ✅ Passed The PR description is comprehensive and thorough, covering root cause analysis, detailed changes across multiple files, extensive benchmark results, caveats, and a complete test plan. While the repository's template is minimal, the description substantially exceeds requirements.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/metrics-clickhouse-oom

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
apps/backend/scripts/benchmark-internal-metrics.ts (1)

1618-1625: 💤 Low value

Consider catching specific error types instead of bare catch.

The bare catch block here violates the coding guideline about never using try-catch-all. While the intent is clear (gracefully handle unsupported join algorithms), consider catching ClickHouseError specifically and logging unexpected errors for debugging.

Suggested improvement
+import { ClickHouseError } from "`@clickhouse/client`";
+
 async function tryRunAndReadStats(rq: RouteQuery, p: QueryParams, now: Date): Promise<QueryStats | null> {
   try {
     const qid = await runRouteQuery(rq, p, now);
     return await readStats(qid);
-  } catch {
+  } catch (e) {
+    // Expected: ClickHouse errors for unsupported join algorithms
+    if (!(e instanceof ClickHouseError)) {
+      console.error(`  unexpected error in ${rq.name}:`, e);
+    }
     return null;
   }
 }

As per coding guidelines: "NEVER try-catch-all" for TypeScript files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/scripts/benchmark-internal-metrics.ts` around lines 1618 - 1625,
The tryRunAndReadStats function currently uses a bare catch; change it to catch
ClickHouseError specifically when calling runRouteQuery/readStats (refer to
tryRunAndReadStats, runRouteQuery, readStats) and return null for that error
type, but rethrow or at least log unexpected errors—i.e., handle ClickHouseError
as the expected graceful path and ensure other exceptions are logged
(processLogger/errorLogger or throw) so they are not silently swallowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@apps/backend/scripts/benchmark-internal-metrics.ts`:
- Around line 1618-1625: The tryRunAndReadStats function currently uses a bare
catch; change it to catch ClickHouseError specifically when calling
runRouteQuery/readStats (refer to tryRunAndReadStats, runRouteQuery, readStats)
and return null for that error type, but rethrow or at least log unexpected
errors—i.e., handle ClickHouseError as the expected graceful path and ensure
other exceptions are logged (processLogger/errorLogger or throw) so they are not
silently swallowed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ccfd81bf-e670-4b38-abcb-238367df7188

📥 Commits

Reviewing files that changed from the base of the PR and between 104f347 and 9f70d0d.

📒 Files selected for processing (3)
  • apps/backend/scripts/benchmark-internal-metrics.ts
  • apps/backend/src/app/api/latest/internal/metrics/route.tsx
  • apps/backend/src/lib/clickhouse.tsx

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR fixes a ClickHouse OOM at /api/v1/internal/metrics by adding a 30-day event_at lower bound to loadUsersByCountry and the analyticsUserJoin inner subquery, and by introducing getClickhouseAdminClientForMetrics() with connection-level memory caps, external GROUP BY spill, and grace_hash join algorithm as a safety net. The benchmark results show a 4.3× memory reduction and 12.8× latency improvement at 300k users; loadDailyActiveSplitFromClickhouse remains unbounded (acknowledged, capped by the new settings) pending a longer-term backfill option.

  • Route fix (route.tsx): adds event_at >= {since:DateTime} to the unbounded GROUP BY user_id subqueries in loadUsersByCountry and analyticsUserJoin; all eight ClickHouse call sites now use getClickhouseAdminClientForMetrics().
  • Client factory (clickhouse.tsx): METRICS_CLICKHOUSE_SETTINGS exports max_memory_usage = 8 GB, max_memory_usage_for_user = 9 GB, max_bytes_before_external_group_by = 6 GB, and join_algorithm = grace_hash,parallel_hash,hash as a belt-and-suspenders guard.
  • Benchmark (benchmark-internal-metrics.ts): adds BENCH_BACKFILL_COMPARE and BENCH_JOIN_ALGO_COMPARE modes, a BENCH_HISTORICAL_DAYS historical-seed knob, and increases the readStats retry budget for large seeds.

Confidence Score: 4/5

Safe to merge; the 30-day bounds and connection-level settings directly address the OOM, and the post-fix memory profile is well within the cluster ceiling.

The two main production changes — bounded event_at scans and the metrics client factory — are straightforward, well-benchmarked, and don't touch any data-write paths. The max_bytes_before_external_group_by value (6 GB) is above the ClickHouse-recommended 50% of max_memory_usage, which could cause the merge phase to exceed the per-query limit if the spill path is ever exercised; and max_memory_usage_for_user is a user-level aggregate shared across all concurrent admin queries, not just the metrics fan-out.

apps/backend/src/lib/clickhouse.tsx — the METRICS_CLICKHOUSE_SETTINGS values, particularly max_bytes_before_external_group_by relative to max_memory_usage.

Important Files Changed

Filename Overview
apps/backend/src/lib/clickhouse.tsx Adds getClickhouseAdminClientForMetrics() with connection-level SETTINGS (memory caps, external GROUP BY spill, grace_hash); max_bytes_before_external_group_by is set at 75% of max_memory_usage rather than the recommended 50%
apps/backend/src/app/api/latest/internal/metrics/route.tsx All getClickhouseAdminClient() calls replaced with getClickhouseAdminClientForMetrics(); loadUsersByCountry and analyticsUserJoin gain 30-day event_at lower bounds; loadDailyActiveSplitFromClickhouse remains unbounded (documented caveat)
apps/backend/scripts/benchmark-internal-metrics.ts Adds BENCH_BACKFILL_COMPARE and BENCH_JOIN_ALGO_COMPARE benchmark modes; extends seed with BENCH_HISTORICAL_DAYS knob; adds undocumented BENCH_GRACE_HASH_COMPARE alias; header comment "Three modes" is now stale

Comments Outside Diff (1)

  1. apps/backend/scripts/benchmark-internal-metrics.ts, line 5-7 (link)

    P2 The header comment says "Three modes" and its env-knobs list is missing several knobs added in this PR: BENCH_BACKFILL_COMPARE, BENCH_JOIN_ALGO_COMPARE, BENCH_HISTORICAL_DAYS, and the undocumented BENCH_GRACE_HASH_COMPARE alias for BENCH_JOIN_ALGO_COMPARE. A developer running the script for the first time would not know these modes exist without reading deep into the file.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: apps/backend/scripts/benchmark-internal-metrics.ts
    Line: 5-7
    
    Comment:
    The header comment says "Three modes" and its env-knobs list is missing several knobs added in this PR: `BENCH_BACKFILL_COMPARE`, `BENCH_JOIN_ALGO_COMPARE`, `BENCH_HISTORICAL_DAYS`, and the undocumented `BENCH_GRACE_HASH_COMPARE` alias for `BENCH_JOIN_ALGO_COMPARE`. A developer running the script for the first time would not know these modes exist without reading deep into the file.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
apps/backend/src/lib/clickhouse.tsx:78-79
ClickHouse recommends setting `max_bytes_before_external_group_by` to roughly **half** of `max_memory_usage` to leave enough headroom for the post-spill merge phase. At 6 GB / 8 GB = 75%, if a GROUP BY hash table fills close to the 6 GB threshold, the merge-back step plus other in-flight query state (hash join buffer, result buffer) can push total query memory above the 8 GB per-query hard cap, converting a spill into an OOM error rather than a graceful fallback. Following the 50% guideline (4 GB) provides a safer margin, especially if the spill path is ever exercised by a very large tenant.

```suggestion
  max_bytes_before_external_group_by: "4000000000",
  max_memory_usage: "8000000000",
```

### Issue 2 of 3
apps/backend/src/lib/clickhouse.tsx:77-86
`max_memory_usage_for_user` is a ClickHouse **user-level** aggregate — it tracks total memory across **all concurrent queries** from the same ClickHouse user (`stackframe`), not just the 12 fan-out queries within a single `/internal/metrics` request. That means non-metrics admin operations that run under the same user (schema queries, event-batch writes, DAU lookups) share this 9 GB budget with any in-flight metrics request. Under heavy concurrent load, a metrics query could exhaust the user-level budget and cause an unrelated admin query to receive a "memory limit exceeded" error. The 30-day bounds make individual queries tiny (~50 MiB each), so this is unlikely in practice today, but worth documenting as a known constraint so future callers don't inadvertently tip the budget.

### Issue 3 of 3
apps/backend/scripts/benchmark-internal-metrics.ts:5-7
The header comment says "Three modes" and its env-knobs list is missing several knobs added in this PR: `BENCH_BACKFILL_COMPARE`, `BENCH_JOIN_ALGO_COMPARE`, `BENCH_HISTORICAL_DAYS`, and the undocumented `BENCH_GRACE_HASH_COMPARE` alias for `BENCH_JOIN_ALGO_COMPARE`. A developer running the script for the first time would not know these modes exist without reading deep into the file.

```suggestion
 * Five modes, selected via env flags (modes 1–2 run by default):
 *
 *   1. MAU equivalence matrix (default ON; set BENCH_SKIP_MATRIX=1 to skip)
```

Reviews (1): Last reviewed commit: "Fix /internal/metrics ClickHouse OOM" | Re-trigger Greptile

Comment thread apps/backend/src/lib/clickhouse.tsx Outdated
Comment thread apps/backend/src/lib/clickhouse.tsx
Comment thread apps/backend/src/lib/clickhouse.tsx Outdated
Comment thread apps/backend/src/lib/clickhouse.tsx Outdated
Comment thread apps/backend/scripts/benchmark-internal-metrics.ts Outdated
Comment thread apps/backend/src/lib/clickhouse.tsx Outdated
@BilalG1 BilalG1 requested a review from N2D4 May 20, 2026 22:38
@BilalG1 BilalG1 assigned N2D4 and unassigned BilalG1 May 20, 2026
… user-level cap

- Drop max_bytes_before_external_group_by from 6 GB to 4 GB to match
  ClickHouse's 50%-of-max_memory_usage recommendation, leaving headroom
  for the post-spill merge phase.
- Replace the bare catch in tryRunAndReadStats with one that logs the
  error (AGENTS.md: NEVER try-catch-all).
- Clarify the METRICS_CLICKHOUSE_SETTINGS comment: max_memory_usage_for_user
  is enforced per ClickHouse user, so the budget is shared with all other
  admin queries through the same connecting user.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants