Fix /internal/metrics ClickHouse OOM#1457
Conversation
Sentry STACK-BACKEND-16H — the endpoint was triggering the cluster's 10.8 GiB OvercommitTracker kill on tenants with months of $token-refresh history. Three queries did GROUP BY user_id over the events table without a lower event_at bound, so their working set scaled with cumulative-users- ever-seen instead of with the 30-day metrics window. Changes: - Add 30-day event_at lower bound to loadUsersByCountry and to the analyticsUserJoin inner subquery (used by dailyEvents, totalVisitors, topReferrers). - New getClickhouseAdminClientForMetrics factory in lib/clickhouse.tsx applying memory caps (max_memory_usage, max_memory_usage_for_user) + external GROUP BY spill + join_algorithm fallback list with grace_hash first. - Document loadDailyActiveSplitFromClickhouse's unbounded min(event_at) subquery + the long-term fix (option C: stamp is_anonymous at ingest + ASOF backfill). - Extend the benchmark script with BENCH_HISTORICAL_DAYS, BENCH_BACKFILL_COMPARE, and BENCH_JOIN_ALGO_COMPARE modes used to validate the choices above. Measured at 300k users seeded across 365 days: Sum peak memory: 2.18 GiB -> 515 MiB (4.3x less) Max query duration: 1293 ms -> 101 ms (12.8x faster)
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds METRICS_CLICKHOUSE_SETTINGS and getClickhouseAdminClientForMetrics, applies the metrics client to internal metrics queries with explicit event_at time-window bounds, and extends the benchmark script with backfill and join-algorithm comparison modes that run using the metrics client and new query variants. ChangesClickHouse metrics client and route updates
Benchmark script metrics client and comparison modes
Sequence Diagram(s)sequenceDiagram
participant Handler as Metrics Handler
participant LoadUsers as loadUsersByCountry
participant LoadOverview as loadAnalyticsOverview
participant CH as ClickHouse (metrics client)
Handler->>Handler: compute now
Handler->>LoadUsers: pass now
LoadUsers->>LoadUsers: compute since,
LoadUsers->>CH: query with event_at bounds and memory caps
Handler->>LoadOverview: pass now
LoadOverview->>CH: analyticsUserJoin with event_at >= since
CH-->>Handler: results with resource limits applied
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
apps/backend/scripts/benchmark-internal-metrics.ts (1)
1618-1625: 💤 Low valueConsider catching specific error types instead of bare catch.
The bare
catchblock here violates the coding guideline about never using try-catch-all. While the intent is clear (gracefully handle unsupported join algorithms), consider catchingClickHouseErrorspecifically and logging unexpected errors for debugging.Suggested improvement
+import { ClickHouseError } from "`@clickhouse/client`"; + async function tryRunAndReadStats(rq: RouteQuery, p: QueryParams, now: Date): Promise<QueryStats | null> { try { const qid = await runRouteQuery(rq, p, now); return await readStats(qid); - } catch { + } catch (e) { + // Expected: ClickHouse errors for unsupported join algorithms + if (!(e instanceof ClickHouseError)) { + console.error(` unexpected error in ${rq.name}:`, e); + } return null; } }As per coding guidelines: "NEVER try-catch-all" for TypeScript files.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/backend/scripts/benchmark-internal-metrics.ts` around lines 1618 - 1625, The tryRunAndReadStats function currently uses a bare catch; change it to catch ClickHouseError specifically when calling runRouteQuery/readStats (refer to tryRunAndReadStats, runRouteQuery, readStats) and return null for that error type, but rethrow or at least log unexpected errors—i.e., handle ClickHouseError as the expected graceful path and ensure other exceptions are logged (processLogger/errorLogger or throw) so they are not silently swallowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@apps/backend/scripts/benchmark-internal-metrics.ts`:
- Around line 1618-1625: The tryRunAndReadStats function currently uses a bare
catch; change it to catch ClickHouseError specifically when calling
runRouteQuery/readStats (refer to tryRunAndReadStats, runRouteQuery, readStats)
and return null for that error type, but rethrow or at least log unexpected
errors—i.e., handle ClickHouseError as the expected graceful path and ensure
other exceptions are logged (processLogger/errorLogger or throw) so they are not
silently swallowed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ccfd81bf-e670-4b38-abcb-238367df7188
📒 Files selected for processing (3)
apps/backend/scripts/benchmark-internal-metrics.tsapps/backend/src/app/api/latest/internal/metrics/route.tsxapps/backend/src/lib/clickhouse.tsx
Greptile SummaryThis PR fixes a ClickHouse OOM at
Confidence Score: 4/5Safe to merge; the 30-day bounds and connection-level settings directly address the OOM, and the post-fix memory profile is well within the cluster ceiling. The two main production changes — bounded event_at scans and the metrics client factory — are straightforward, well-benchmarked, and don't touch any data-write paths. The max_bytes_before_external_group_by value (6 GB) is above the ClickHouse-recommended 50% of max_memory_usage, which could cause the merge phase to exceed the per-query limit if the spill path is ever exercised; and max_memory_usage_for_user is a user-level aggregate shared across all concurrent admin queries, not just the metrics fan-out. apps/backend/src/lib/clickhouse.tsx — the METRICS_CLICKHOUSE_SETTINGS values, particularly max_bytes_before_external_group_by relative to max_memory_usage. Important Files Changed
|
… user-level cap - Drop max_bytes_before_external_group_by from 6 GB to 4 GB to match ClickHouse's 50%-of-max_memory_usage recommendation, leaving headroom for the post-spill merge phase. - Replace the bare catch in tryRunAndReadStats with one that logs the error (AGENTS.md: NEVER try-catch-all). - Clarify the METRICS_CLICKHOUSE_SETTINGS comment: max_memory_usage_for_user is enforced per ClickHouse user, so the budget is shared with all other admin queries through the same connecting user.
Fixes Sentry STACK-BACKEND-16H — the
/api/v1/internal/metricsendpoint was triggering the cluster's 10.8 GiB OvercommitTracker kill on tenants with months of$token-refreshhistory.Root cause
Three queries in
loadAnalyticsOverviewplusloadUsersByCountrydidGROUP BY user_idover the events table with no lowerevent_atbound, so their hash table working set scaled with cumulative-distinct-users-ever-seen instead of the 30-day metrics window.Changes
event_atlower bound toloadUsersByCountryand to theanalyticsUserJoininner subquery (used bydailyEvents,totalVisitors,topReferrers).getClickhouseAdminClientForMetrics()factory inlib/clickhouse.tsxwith connection-level safety net: per-query + per-user memory caps, external GROUP BY spill, andjoin_algorithm: 'grace_hash,parallel_hash,hash'(grace_hash measured to give 48% memory reduction at zero latency cost — see benchmark notes in the file).is_anonymousat ingest on page-view/click events, then drop the join entirely).scripts/benchmark-internal-metrics.tswith the historical-seed knob and three new modes (BENCH_BACKFILL_COMPARE,BENCH_JOIN_ALGO_COMPARE, plus the existingBENCH_ROUTE_QUERIESupdated) used to validate the choices above.Benchmark — pre-PR vs post-PR
Synthetic seed: 300k users × 9 events spread over 365 days (~2.7M events).
Per-query at 300k users:
analyticsOverview:dailyEvents561 → 44 MiB (12.8× less)analyticsOverview:totalVisitors560 → 50 MiB (11.2× less)analyticsOverview:topReferrers546 → 50 MiB (10.9× less)loadUsersByCountry388 → 44 MiB (8.9× less)Caveats
loadDailyActiveSplitFromClickhousestill scans all-history on itsmin(event_at)subquery. It can't be naively bounded —first_dateis used to classify entities as new vs reactivated, and a 30d bound would silently mislabel old-but-active entities as "new." The new SETTINGS cap+spill it; the proper fix is option C (documented inline).$token-refreshin the last 30 days now falls through tocoalesce(NULL, 0)and is classified non-anonymous. Token-refresh fires every few minutes per active session, so this is rare but not impossible (embedded SDKs that poll less frequently, sessions straddling the 30d boundary).max_memory_usage_for_user: 9 GBtrades "cluster-wide OvercommitTracker kill of a random query" for "clean per-user memory error attributed to the specific query." After our 30d bounds, no query is anywhere near 9 GB.Test plan
pnpm typecheckpassespnpm lintpassespnpm test run apps/e2e/tests/backend/endpoints/api/v1/internal-metrics.test.ts— 9/10 pass; the 1 failure (risk_scoressnapshot drift) reproduces on cleandevand is unrelatedpnpm test run apps/e2e/tests/backend/endpoints/api/v1/analytics-{events,events-batch,query}.test.ts apps/e2e/tests/backend/endpoints/api/v1/token-refresh-events.test.ts apps/e2e/tests/backend/performance/metrics.test.ts— all passing tests pass; 10 pre-existingPRODUCT_DOES_NOT_EXISTsetup failures reproduce on cleandevBENCH_ROUTE_QUERIES=1at 300k users shows the deltas aboveSummary by CodeRabbit