feat(mcp): add trace waterfall and breakdown tools by brandon-pereira · Pull Request #2334 · hyperdxio/hyperdx

brandon-pereira · 2026-05-22T22:31:41Z

What

Add two new MCP tools for trace investigation:

`hyperdx_trace_waterfall`

Fetch all spans in a single trace as a parent/child waterfall tree, pre-ordered for human-readable display.

Specific trace mode: pass traceId to fetch a known trace
Auto-pick mode: pass pickFilter + pickBy (slowest / first_error / most_recent) to find one matching trace
Includes correlated log rows when the trace source has a linked logSourceId
Respects maxSpans / maxLogs caps with truncation notes

`hyperdx_trace_top_time_consuming_operations`

Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app "Top Most Time Consuming Operations" chart.

Two-stage SQL: find distinct TraceIds matching the parent filter, then aggregate child spans
Supports minParentDurationMs to focus on slow parents
Returns ranked operations with totalTimeMs, calls, inParents, p50Ms, p99Ms, and shareOfTotalTime

Why

MCP agents investigating latency issues need a way to:

Inspect a single concrete trace tree without writing self-JOINs (waterfall)
Ask "where does the time go" across many slow traces of an operation (breakdown)

The existing builder tools (table/timeseries/search) can't express the TraceId subselect pattern needed for the breakdown.

Testing

Integration tests in packages/api/src/mcp/__tests__/trace.test.ts cover:

Schema serialization (expected input properties)
Error paths: non-existent source, non-trace source, invalid time range, bad SQL
Seeded data: waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN

Run with: make dev-int FILE=trace

Add two new MCP tools for trace investigation: - hyperdx_trace_waterfall: Fetch all spans in a single trace as a parent/child waterfall tree. Supports auto-pick by slowest, first error, or most recent trace. Includes correlated logs when the trace source has a linked logSourceId. - hyperdx_trace_top_time_consuming_operations: Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app 'Top Most Time Consuming Operations' chart. Both tools use source-configured expressions for attribute extraction, set readonly:1 on ClickHouse queries for defense-in-depth, and output JSON. Includes integration tests covering schema serialization, error paths, seeded data scenarios (waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN).

changeset-bot · 2026-05-22T22:31:45Z

🦋 Changeset detected

Latest commit: c1abd1f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@hyperdx/api	Patch
@hyperdx/app	Patch
@hyperdx/otel-collector	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

vercel · 2026-05-22T22:31:46Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	May 22, 2026 10:56pm

github-actions · 2026-05-22T22:32:02Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Large diff: 1007 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 4
Production lines changed: 1007 (+ 877 in test files, excluded from tier calculation)
Branch: brandon/mcp-trace-tools
Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-05-22T22:33:51Z

E2E Test Results

✅ All tests passed • 176 passed • 3 skipped • 1201s

Status	Count
✅ Passed	176
❌ Failed	0
⚠️ Flaky	6
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

…llisions otel_traces is not truncated between tests (commented out in clearClickhouseTables), so trace data accumulates across all test suites. Use unique trc-test-/wf-test- prefixed names to prevent collisions with other tests' data.

github-actions · 2026-05-22T22:50:38Z

Deep Review

Scope: PR #2334, base 937e043a. Two new MCP trace tools (~1900 LOC including tests).
Intent: Add hyperdx_trace_waterfall (single-trace tree with optional auto-pick + correlated logs) and hyperdx_trace_top_time_consuming_operations (aggregate child-op breakdown across matching parent traces).
Reviewer team (10): correctness, security, adversarial, performance, testing, maintainability, api-contract, reliability, project-standards, kieran-typescript.

No P0 or P1 findings — the diff is feature-additive with no regression path. The bar to clear is contract clarity and a handful of new failure modes the diff introduces.

🟡 P2 -- recommended

packages/api/src/mcp/tools/trace/breakdown.ts:136 -- the schema description advertises rows with total_time_ms/calls/in_parents/p50_ms/p99_ms but the response emits camelCase totalTimeMs/inParents/p50Ms/p99Ms plus an undocumented shareOfTotalTime, so an MCP agent following the schema gets undefined on every documented key.
- Fix: Rewrite the RETURNS paragraph in the tool description to list the camelCase keys actually emitted, including shareOfTotalTime.
- _{maintainability}
packages/api/src/mcp/tools/trace/waterfall.ts:479 -- totalDurationMs is Math.max(...spans.map(s => s.durationMs)), i.e. the longest single span, which is neither cumulative child time nor wall-clock trace duration, so a downstream agent reasoning about "total" gets a misleading number.
- Fix: Either rename the field to longestSpanDurationMs or replace the value with root.durationMs and call the field rootDurationMs.
- _{correctness, adversarial, maintainability}
packages/api/src/mcp/tools/trace/breakdown.ts:301 -- clickhouse_settings is built as { readonly: '1', ...source.querySettings }, so a source-admin querySetting with setting: 'readonly' silently overrides the safeguard the inline comment claims prevents DDL/DML via parentFilter; the same anti-pattern appears at waterfall.ts:431 and waterfall.ts:554.
- Fix: Spread source.querySettings first, then apply readonly: '1' last, in all three locations.
- _{correctness, security}
packages/api/src/mcp/tools/trace/waterfall.ts:301 -- when pickBy: 'first_error' is combined with a non-empty lucene pickFilter, the code appends the literal AND StatusCode:STATUS_CODE_ERROR, hardcoding the OTEL-default column name and ignoring source.statusCodeExpression, so any source whose status column is mapped differently silently matches nothing or fails parsing.
- Fix: Render the user's pickFilter alongside ${statusCodeExpr} = 'STATUS_CODE_ERROR' in SQL always, or reject first_error + lucene with a clear error.
- _{adversarial, maintainability}
packages/api/src/mcp/tools/trace/waterfall.ts:478 -- when the result spans form a cycle or every span's parent is in the result set, buildPreOrderTree produces roots=[] and an empty tree, then root = tree.find(...) ?? tree[0] is undefined, so the next line dereferences root.spanId and crashes with a TypeError surfaced as a generic MCP internal error.
- Fix: After building the tree, check tree.length === 0 and return a structured response with the raw spans plus a note indicating the parent chain could not be resolved.
- _adversarial
packages/api/src/mcp/tools/trace/waterfall.ts:382 -- the picker locates the TraceId column by find(([k]) => k !== 'span_count' && k !== '__hdx_time_bucket'), which silently selects the wrong column the moment the builder layer emits any new synthetic key.
- Fix: Route the TraceId expression through select with an explicit alias (e.g. trace_id) and read firstRow.trace_id deterministically.
- _{adversarial, kieran-typescript}
packages/api/src/mcp/tools/trace/breakdown.ts:56 -- startTime/endTime are marked REQUIRED in the description but the zod schema only enforces z.string() with no .optional(), while waterfall.ts:63 and the shared startTimeSchema/endTimeSchema in query/schemas.ts make them optional with a 15-minute default, leaving the two new tools and the existing query tools inconsistent.
- Fix: Import the shared startTimeSchema/endTimeSchema, or enforce required with .min(1) and update the runtime check.
- _{api-contract, maintainability}
packages/api/src/mcp/tools/trace/waterfall.ts:441 -- (result as { json: () => Promise<SpanRow[]> }).json()) ?? [] defeats the typed BaseResultSet.json<T>() generic, hides the real return type, and the ?? [] is unreachable; the same shape appears at waterfall.ts:567 and breakdown.ts:311 with a different cast shape.
- Fix: Replace each cast with await result.json<SpanRow[]>() / result.json<LogRow[]>() / result.json<{ data: Row[] }>() and drop the dead nullability fallbacks.
- _{kieran-typescript, maintainability}
packages/api/src/mcp/tools/trace/waterfall.ts:599 -- both the truncation note (success-path) and the misconfig/error note (failure-path) write to the same top-level logsNote key via separate spreads, so any future refactor that loosens the implicit mutual exclusion will silently drop one of the two messages.
- Fix: Rename the truncation field to logsTruncationNote (or set logsTruncated: true alongside logs) and reserve logsNote for the misconfig/error path.
- _{maintainability}
packages/api/src/mcp/tools/trace/waterfall.ts:296 -- the comment says "Compromise: tell user to express the error filter in pickFilter directly" but the very next line does the opposite by appending StatusCode:STATUS_CODE_ERROR itself.
- Fix: Delete the misleading half of the comment so future readers can trust the prose.
- _{maintainability}
packages/api/src/mcp/tools/trace/breakdown.ts:301 -- clickhouse_settings carries no max_execution_time, so a wide-window breakdown with maxParentTraces up to 1,000,000 can run for the default requestTimeout (1 hour per BaseClickhouseClient), pinning an API worker and a ClickHouse slot; same gap in waterfall.ts:431.
- Fix: Set a default max_execution_time (e.g. 60) before spreading source.querySettings so admin overrides can still raise it.
- _{reliability, adversarial}
packages/api/src/mcp/tools/trace/breakdown.ts:241 -- LIMIT {maxParentTraces:UInt32} silently truncates when the parent filter matches more traces than the cap, but no field in the response signals the truncation and no test exercises hitting the cap, so users get under-reported aggregates without any indicator.
- Fix: Compare the row count or add a parentTracesTruncated boolean to the response when the picker hit maxParentTraces, and add a test pinning the behavior.
- _testing
packages/api/src/mcp/__tests__/trace.test.ts -- the auto-pick suite seeds only one trace, so pickBy: 'slowest' vs pickBy: 'most_recent' cannot actually disambiguate ordering — both pass trivially.
- Fix: Seed at least two traces with different durations and start times, then assert slowest returns the longer-duration TraceId and most_recent returns the later-timestamp TraceId.
- _testing
packages/api/src/mcp/tools/trace/waterfall.ts:294 -- only the lucene-with-filter branch of first_error is exercised by tests; the SQL-with-filter and no-filter sub-branches at lines 294 and 303 (which mutate the SQL sent to ClickHouse) have no regression coverage.
- Fix: Add two tests: first_error with pickFilterLanguage: 'sql' and a non-empty pickFilter, and first_error with the default empty pickFilter.
- _testing

🔵 P3 nitpicks (14)

packages/api/src/mcp/tools/trace/breakdown.ts:97 -- durationDivisor is duplicated verbatim in waterfall.ts:159 minus its explanatory comment.
- Fix: Move to a shared tools/trace/helpers.ts with the precision comment attached.
- _{maintainability}
packages/api/src/mcp/tools/trace/breakdown.ts:229 -- input.minParentDurationMs != null is truthy for 0, so passing 0 appends a tautological AND Duration >= 0.
- Fix: Either guard with > 0 or use min(1) on the schema so 0 rejects at parse time.
- _{correctness, testing}
packages/api/src/mcp/tools/trace/waterfall.ts:138 -- a result containing duplicate spanId values causes each duplicate to be visited as its own root, multiplying child output combinatorially.
- Fix: Track a visited Set inside visit() or dedupe by spanId before building childrenByParent.
- _adversarial
packages/api/src/mcp/tools/trace/waterfall.ts:133 -- a span with spanId === '' poisons idsInResult, so the real root (whose parentSpanId === '') is reclassified as a child of the empty-id row.
- Fix: Build idsInResult from spans with non-empty spanId only, or treat parentSpanId === '' as root unconditionally.
- _adversarial
packages/api/src/mcp/tools/trace/waterfall.ts:264 -- a fresh ClickhouseClient is constructed per invocation and never closed, doubling on the cross-connection log-source path at waterfall.ts:522; pre-existing pattern across the MCP tool set.
- Fix: Wrap the handler in try/finally and call client.close() on exit (apply across MCP tools, not just here).
- _{adversarial, reliability}
packages/api/src/mcp/tools/trace/waterfall.ts:420 -- the ClickHouse client supports abort_signal but neither tool passes one, so an MCP client disconnect leaves the in-flight query running until requestTimeout expires.
- Fix: Plumb an AbortController tied to the MCP request lifetime into each client.query call.
- _reliability
packages/api/src/mcp/tools/trace/breakdown.ts:255 -- the outer AND NOT (${input.parentFilter}) re-evaluates an arbitrary predicate on every child row; a parent_span_ids subselect in the CTE would be cheaper and clearer.
- Fix: Capture parent SpanIds in the CTE and exclude by SpanId NOT IN (parent_span_ids).
- _performance
packages/api/src/mcp/tools/trace/waterfall.ts:531 -- the trace tree query and the correlated-logs query are independent once pickedTraceId is resolved but run sequentially, adding the full logs RTT to every waterfall response.
- Fix: Kick off both queries with Promise.all after the TraceId is picked.
- _performance
packages/api/src/mcp/tools/trace/waterfall.ts:614 -- JSON.stringify(output, null, 2) at max caps (2000 spans + 1000 logs) inflates the response by ~20-30% versus compact JSON for no consumer benefit.
- Fix: Drop the 2-space indent.
- _performance
packages/api/src/mcp/tools/trace/waterfall.ts:342 -- the picker config is annotated BuilderChartConfigWithDateRange then cast as ChartConfigWithDateRange, where the idiomatic repo pattern is satisfies ChartConfigWithDateRange (see query/helpers.ts:231).
- Fix: Drop the type annotation and the as cast in favor of satisfies.
- _{kieran-typescript}
packages/api/src/mcp/tools/trace/waterfall.ts:345 -- queryChartConfig returns a typed ResponseJSON<Record<string, string | number>> but the result is widened back to { data?: Array<Record<string, unknown>> }, discarding information.
- Fix: Let the return type flow or use the canonical type from common-utils.
- _{kieran-typescript}
packages/api/src/mcp/__tests__/trace.test.ts:293 -- tests assert output.spans[0].depth === 0, which only works if the root happens to win timestamp ties; use the explicit output.rootSpan contract instead.
- Fix: Replace spans[0] assertions with assertions on output.rootSpan or spans.find(s => s.depth === 0).
- _testing
packages/api/src/mcp/__tests__/trace.test.ts -- output.operations[0] for topN=1 and the breakdown rank test rely on ORDER BY total_time_ms DESC alone without a deterministic tie-breaker.
- Fix: Add service, operation to the ORDER BY in breakdown.ts:257 so ties resolve consistently.
- _testing
packages/api/src/mcp/tools/trace/waterfall.ts:1 -- file is 620 lines vs the repo's 300-line guideline; sibling MCP tools also exceed, so the standard is loosely applied.
- Fix: Extract pickTrace, fetchTraceSpans, and fetchCorrelatedLogs into module-local helpers when the next iteration of this file happens.
- _{maintainability, project-standards}

Reviewers (10): correctness, security, adversarial, performance, testing, maintainability, api-contract, reliability, project-standards, kieran-typescript.

Testing gaps:

first_error SQL-with-filter and no-filter sub-branches (waterfall.ts:294, :303).
Picker no-TraceId-column error path (waterfall.ts:385); logs misconfig branches logsNote = 'connection for log source ... not found' (waterfall.ts:519) and logsNote = 'Failed to fetch correlated logs ...' (waterfall.ts:574).
Logs truncation accounting + logsNote (waterfall.ts:570, :599).
maxParentTraces cap-hit signal (breakdown.ts:241) and the ±60s child window widening (breakdown.ts:264).
p50Ms / p99Ms percentile values are returned but never asserted.
Schema enum values (pickBy, pickFilterLanguage) and defaults are not pinned.
Time-range edge cases (identical/negative-width windows) and the traceId-provided-but-no-spans-in-window hint path (waterfall.ts:458).

- Fix first_error pickBy description: 'root span' → 'a span' (matches actual any-error-in-trace semantics) - Add .max(4096) to parentFilter schema to cap input length - Use uniqExact() instead of count(DISTINCT) for in_parents aggregate - Add MCP.md entries for both new trace tools - Add tests: first_error pick mode, sql pickFilterLanguage, logSource edge cases (missing, wrong kind, non-existent)

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 22, 2026

vercel Bot had a problem deploying to Preview May 22, 2026 22:32 Failure

chore: add changeset for trace tools

0ffa7a3

fix: use string type for readonly clickhouse_settings

e6562cd

vercel Bot deployed to Preview May 22, 2026 22:38 View deployment

vercel Bot deployed to Preview May 22, 2026 22:47 View deployment

vercel Bot deployed to Preview May 22, 2026 22:56 View deployment

brandon-pereira requested review from a team, bot-hyperdx and dhable and removed request for a team and bot-hyperdx May 22, 2026 23:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): add trace waterfall and breakdown tools#2334

feat(mcp): add trace waterfall and breakdown tools#2334
brandon-pereira wants to merge 5 commits into
mainfrom
brandon/mcp-trace-tools

brandon-pereira commented May 22, 2026

Uh oh!

changeset-bot Bot commented May 22, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brandon-pereira commented May 22, 2026

What

hyperdx_trace_waterfall

hyperdx_trace_top_time_consuming_operations

Why

Testing

Uh oh!

changeset-bot Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

vercel Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔴 Tier 4 — Critical

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🟡 P2 -- recommended

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`hyperdx_trace_waterfall`

`hyperdx_trace_top_time_consuming_operations`

changeset-bot Bot commented May 22, 2026 •

edited

Loading

vercel Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading