Skip to content

feat(mcp): add trace waterfall and breakdown tools#2334

Open
brandon-pereira wants to merge 5 commits into
mainfrom
brandon/mcp-trace-tools
Open

feat(mcp): add trace waterfall and breakdown tools#2334
brandon-pereira wants to merge 5 commits into
mainfrom
brandon/mcp-trace-tools

Conversation

@brandon-pereira
Copy link
Copy Markdown
Member

What

Add two new MCP tools for trace investigation:

hyperdx_trace_waterfall

Fetch all spans in a single trace as a parent/child waterfall tree, pre-ordered for human-readable display.

  • Specific trace mode: pass traceId to fetch a known trace
  • Auto-pick mode: pass pickFilter + pickBy (slowest / first_error / most_recent) to find one matching trace
  • Includes correlated log rows when the trace source has a linked logSourceId
  • Respects maxSpans / maxLogs caps with truncation notes

hyperdx_trace_top_time_consuming_operations

Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app "Top Most Time Consuming Operations" chart.

  • Two-stage SQL: find distinct TraceIds matching the parent filter, then aggregate child spans
  • Supports minParentDurationMs to focus on slow parents
  • Returns ranked operations with totalTimeMs, calls, inParents, p50Ms, p99Ms, and shareOfTotalTime

Why

MCP agents investigating latency issues need a way to:

  1. Inspect a single concrete trace tree without writing self-JOINs (waterfall)
  2. Ask "where does the time go" across many slow traces of an operation (breakdown)

The existing builder tools (table/timeseries/search) can't express the TraceId subselect pattern needed for the breakdown.

Testing

Integration tests in packages/api/src/mcp/__tests__/trace.test.ts cover:

  • Schema serialization (expected input properties)
  • Error paths: non-existent source, non-trace source, invalid time range, bad SQL
  • Seeded data: waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN

Run with: make dev-int FILE=trace

Add two new MCP tools for trace investigation:

- hyperdx_trace_waterfall: Fetch all spans in a single trace as a
  parent/child waterfall tree. Supports auto-pick by slowest, first
  error, or most recent trace. Includes correlated logs when the trace
  source has a linked logSourceId.

- hyperdx_trace_top_time_consuming_operations: Aggregate breakdown of
  child operations consuming the most cumulative time across traces
  matching a parent-span filter. Same algorithm as the in-app
  'Top Most Time Consuming Operations' chart.

Both tools use source-configured expressions for attribute extraction,
set readonly:1 on ClickHouse queries for defense-in-depth, and output
JSON.

Includes integration tests covering schema serialization, error paths,
seeded data scenarios (waterfall tree structure, auto-pick modes,
truncation, correlated logs, breakdown ranking, minParentDurationMs
filtering, topN).
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 22, 2026

🦋 Changeset detected

Latest commit: c1abd1f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 22, 2026 10:56pm

Request Review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1007 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 4
  • Production lines changed: 1007 (+ 877 in test files, excluded from tier calculation)
  • Branch: brandon/mcp-trace-tools
  • Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 22, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

E2E Test Results

All tests passed • 176 passed • 3 skipped • 1201s

Status Count
✅ Passed 176
❌ Failed 0
⚠️ Flaky 6
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

…llisions

otel_traces is not truncated between tests (commented out in
clearClickhouseTables), so trace data accumulates across all test
suites. Use unique trc-test-/wf-test- prefixed names to prevent
collisions with other tests' data.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

Deep Review

Scope: PR #2334, base 937e043a. Two new MCP trace tools (~1900 LOC including tests).
Intent: Add hyperdx_trace_waterfall (single-trace tree with optional auto-pick + correlated logs) and hyperdx_trace_top_time_consuming_operations (aggregate child-op breakdown across matching parent traces).
Reviewer team (10): correctness, security, adversarial, performance, testing, maintainability, api-contract, reliability, project-standards, kieran-typescript.

No P0 or P1 findings — the diff is feature-additive with no regression path. The bar to clear is contract clarity and a handful of new failure modes the diff introduces.

🟡 P2 -- recommended

  • packages/api/src/mcp/tools/trace/breakdown.ts:136 -- the schema description advertises rows with total_time_ms/calls/in_parents/p50_ms/p99_ms but the response emits camelCase totalTimeMs/inParents/p50Ms/p99Ms plus an undocumented shareOfTotalTime, so an MCP agent following the schema gets undefined on every documented key.

    • Fix: Rewrite the RETURNS paragraph in the tool description to list the camelCase keys actually emitted, including shareOfTotalTime.
    • maintainability
  • packages/api/src/mcp/tools/trace/waterfall.ts:479 -- totalDurationMs is Math.max(...spans.map(s => s.durationMs)), i.e. the longest single span, which is neither cumulative child time nor wall-clock trace duration, so a downstream agent reasoning about "total" gets a misleading number.

    • Fix: Either rename the field to longestSpanDurationMs or replace the value with root.durationMs and call the field rootDurationMs.
    • correctness, adversarial, maintainability
  • packages/api/src/mcp/tools/trace/breakdown.ts:301 -- clickhouse_settings is built as { readonly: '1', ...source.querySettings }, so a source-admin querySetting with setting: 'readonly' silently overrides the safeguard the inline comment claims prevents DDL/DML via parentFilter; the same anti-pattern appears at waterfall.ts:431 and waterfall.ts:554.

    • Fix: Spread source.querySettings first, then apply readonly: '1' last, in all three locations.
    • correctness, security
  • packages/api/src/mcp/tools/trace/waterfall.ts:301 -- when pickBy: 'first_error' is combined with a non-empty lucene pickFilter, the code appends the literal AND StatusCode:STATUS_CODE_ERROR, hardcoding the OTEL-default column name and ignoring source.statusCodeExpression, so any source whose status column is mapped differently silently matches nothing or fails parsing.

    • Fix: Render the user's pickFilter alongside ${statusCodeExpr} = 'STATUS_CODE_ERROR' in SQL always, or reject first_error + lucene with a clear error.
    • adversarial, maintainability
  • packages/api/src/mcp/tools/trace/waterfall.ts:478 -- when the result spans form a cycle or every span's parent is in the result set, buildPreOrderTree produces roots=[] and an empty tree, then root = tree.find(...) ?? tree[0] is undefined, so the next line dereferences root.spanId and crashes with a TypeError surfaced as a generic MCP internal error.

    • Fix: After building the tree, check tree.length === 0 and return a structured response with the raw spans plus a note indicating the parent chain could not be resolved.
    • adversarial
  • packages/api/src/mcp/tools/trace/waterfall.ts:382 -- the picker locates the TraceId column by find(([k]) => k !== 'span_count' && k !== '__hdx_time_bucket'), which silently selects the wrong column the moment the builder layer emits any new synthetic key.

    • Fix: Route the TraceId expression through select with an explicit alias (e.g. trace_id) and read firstRow.trace_id deterministically.
    • adversarial, kieran-typescript
  • packages/api/src/mcp/tools/trace/breakdown.ts:56 -- startTime/endTime are marked REQUIRED in the description but the zod schema only enforces z.string() with no .optional(), while waterfall.ts:63 and the shared startTimeSchema/endTimeSchema in query/schemas.ts make them optional with a 15-minute default, leaving the two new tools and the existing query tools inconsistent.

    • Fix: Import the shared startTimeSchema/endTimeSchema, or enforce required with .min(1) and update the runtime check.
    • api-contract, maintainability
  • packages/api/src/mcp/tools/trace/waterfall.ts:441 -- (result as { json: () => Promise<SpanRow[]> }).json()) ?? [] defeats the typed BaseResultSet.json<T>() generic, hides the real return type, and the ?? [] is unreachable; the same shape appears at waterfall.ts:567 and breakdown.ts:311 with a different cast shape.

    • Fix: Replace each cast with await result.json<SpanRow[]>() / result.json<LogRow[]>() / result.json<{ data: Row[] }>() and drop the dead nullability fallbacks.
    • kieran-typescript, maintainability
  • packages/api/src/mcp/tools/trace/waterfall.ts:599 -- both the truncation note (success-path) and the misconfig/error note (failure-path) write to the same top-level logsNote key via separate spreads, so any future refactor that loosens the implicit mutual exclusion will silently drop one of the two messages.

    • Fix: Rename the truncation field to logsTruncationNote (or set logsTruncated: true alongside logs) and reserve logsNote for the misconfig/error path.
    • maintainability
  • packages/api/src/mcp/tools/trace/waterfall.ts:296 -- the comment says "Compromise: tell user to express the error filter in pickFilter directly" but the very next line does the opposite by appending StatusCode:STATUS_CODE_ERROR itself.

    • Fix: Delete the misleading half of the comment so future readers can trust the prose.
    • maintainability
  • packages/api/src/mcp/tools/trace/breakdown.ts:301 -- clickhouse_settings carries no max_execution_time, so a wide-window breakdown with maxParentTraces up to 1,000,000 can run for the default requestTimeout (1 hour per BaseClickhouseClient), pinning an API worker and a ClickHouse slot; same gap in waterfall.ts:431.

    • Fix: Set a default max_execution_time (e.g. 60) before spreading source.querySettings so admin overrides can still raise it.
    • reliability, adversarial
  • packages/api/src/mcp/tools/trace/breakdown.ts:241 -- LIMIT {maxParentTraces:UInt32} silently truncates when the parent filter matches more traces than the cap, but no field in the response signals the truncation and no test exercises hitting the cap, so users get under-reported aggregates without any indicator.

    • Fix: Compare the row count or add a parentTracesTruncated boolean to the response when the picker hit maxParentTraces, and add a test pinning the behavior.
    • testing
  • packages/api/src/mcp/__tests__/trace.test.ts -- the auto-pick suite seeds only one trace, so pickBy: 'slowest' vs pickBy: 'most_recent' cannot actually disambiguate ordering — both pass trivially.

    • Fix: Seed at least two traces with different durations and start times, then assert slowest returns the longer-duration TraceId and most_recent returns the later-timestamp TraceId.
    • testing
  • packages/api/src/mcp/tools/trace/waterfall.ts:294 -- only the lucene-with-filter branch of first_error is exercised by tests; the SQL-with-filter and no-filter sub-branches at lines 294 and 303 (which mutate the SQL sent to ClickHouse) have no regression coverage.

    • Fix: Add two tests: first_error with pickFilterLanguage: 'sql' and a non-empty pickFilter, and first_error with the default empty pickFilter.
    • testing
🔵 P3 nitpicks (14)
  • packages/api/src/mcp/tools/trace/breakdown.ts:97 -- durationDivisor is duplicated verbatim in waterfall.ts:159 minus its explanatory comment.

    • Fix: Move to a shared tools/trace/helpers.ts with the precision comment attached.
    • maintainability
  • packages/api/src/mcp/tools/trace/breakdown.ts:229 -- input.minParentDurationMs != null is truthy for 0, so passing 0 appends a tautological AND Duration >= 0.

    • Fix: Either guard with > 0 or use min(1) on the schema so 0 rejects at parse time.
    • correctness, testing
  • packages/api/src/mcp/tools/trace/waterfall.ts:138 -- a result containing duplicate spanId values causes each duplicate to be visited as its own root, multiplying child output combinatorially.

    • Fix: Track a visited Set inside visit() or dedupe by spanId before building childrenByParent.
    • adversarial
  • packages/api/src/mcp/tools/trace/waterfall.ts:133 -- a span with spanId === '' poisons idsInResult, so the real root (whose parentSpanId === '') is reclassified as a child of the empty-id row.

    • Fix: Build idsInResult from spans with non-empty spanId only, or treat parentSpanId === '' as root unconditionally.
    • adversarial
  • packages/api/src/mcp/tools/trace/waterfall.ts:264 -- a fresh ClickhouseClient is constructed per invocation and never closed, doubling on the cross-connection log-source path at waterfall.ts:522; pre-existing pattern across the MCP tool set.

    • Fix: Wrap the handler in try/finally and call client.close() on exit (apply across MCP tools, not just here).
    • adversarial, reliability
  • packages/api/src/mcp/tools/trace/waterfall.ts:420 -- the ClickHouse client supports abort_signal but neither tool passes one, so an MCP client disconnect leaves the in-flight query running until requestTimeout expires.

    • Fix: Plumb an AbortController tied to the MCP request lifetime into each client.query call.
    • reliability
  • packages/api/src/mcp/tools/trace/breakdown.ts:255 -- the outer AND NOT (${input.parentFilter}) re-evaluates an arbitrary predicate on every child row; a parent_span_ids subselect in the CTE would be cheaper and clearer.

    • Fix: Capture parent SpanIds in the CTE and exclude by SpanId NOT IN (parent_span_ids).
    • performance
  • packages/api/src/mcp/tools/trace/waterfall.ts:531 -- the trace tree query and the correlated-logs query are independent once pickedTraceId is resolved but run sequentially, adding the full logs RTT to every waterfall response.

    • Fix: Kick off both queries with Promise.all after the TraceId is picked.
    • performance
  • packages/api/src/mcp/tools/trace/waterfall.ts:614 -- JSON.stringify(output, null, 2) at max caps (2000 spans + 1000 logs) inflates the response by ~20-30% versus compact JSON for no consumer benefit.

    • Fix: Drop the 2-space indent.
    • performance
  • packages/api/src/mcp/tools/trace/waterfall.ts:342 -- the picker config is annotated BuilderChartConfigWithDateRange then cast as ChartConfigWithDateRange, where the idiomatic repo pattern is satisfies ChartConfigWithDateRange (see query/helpers.ts:231).

    • Fix: Drop the type annotation and the as cast in favor of satisfies.
    • kieran-typescript
  • packages/api/src/mcp/tools/trace/waterfall.ts:345 -- queryChartConfig returns a typed ResponseJSON<Record<string, string | number>> but the result is widened back to { data?: Array<Record<string, unknown>> }, discarding information.

    • Fix: Let the return type flow or use the canonical type from common-utils.
    • kieran-typescript
  • packages/api/src/mcp/__tests__/trace.test.ts:293 -- tests assert output.spans[0].depth === 0, which only works if the root happens to win timestamp ties; use the explicit output.rootSpan contract instead.

    • Fix: Replace spans[0] assertions with assertions on output.rootSpan or spans.find(s => s.depth === 0).
    • testing
  • packages/api/src/mcp/__tests__/trace.test.ts -- output.operations[0] for topN=1 and the breakdown rank test rely on ORDER BY total_time_ms DESC alone without a deterministic tie-breaker.

    • Fix: Add service, operation to the ORDER BY in breakdown.ts:257 so ties resolve consistently.
    • testing
  • packages/api/src/mcp/tools/trace/waterfall.ts:1 -- file is 620 lines vs the repo's 300-line guideline; sibling MCP tools also exceed, so the standard is loosely applied.

    • Fix: Extract pickTrace, fetchTraceSpans, and fetchCorrelatedLogs into module-local helpers when the next iteration of this file happens.
    • maintainability, project-standards

Reviewers (10): correctness, security, adversarial, performance, testing, maintainability, api-contract, reliability, project-standards, kieran-typescript.

Testing gaps:

  • first_error SQL-with-filter and no-filter sub-branches (waterfall.ts:294, :303).
  • Picker no-TraceId-column error path (waterfall.ts:385); logs misconfig branches logsNote = 'connection for log source ... not found' (waterfall.ts:519) and logsNote = 'Failed to fetch correlated logs ...' (waterfall.ts:574).
  • Logs truncation accounting + logsNote (waterfall.ts:570, :599).
  • maxParentTraces cap-hit signal (breakdown.ts:241) and the ±60s child window widening (breakdown.ts:264).
  • p50Ms / p99Ms percentile values are returned but never asserted.
  • Schema enum values (pickBy, pickFilterLanguage) and defaults are not pinned.
  • Time-range edge cases (identical/negative-width windows) and the traceId-provided-but-no-spans-in-window hint path (waterfall.ts:458).

- Fix first_error pickBy description: 'root span' → 'a span' (matches
  actual any-error-in-trace semantics)
- Add .max(4096) to parentFilter schema to cap input length
- Use uniqExact() instead of count(DISTINCT) for in_parents aggregate
- Add MCP.md entries for both new trace tools
- Add tests: first_error pick mode, sql pickFilterLanguage, logSource
  edge cases (missing, wrong kind, non-existent)
@brandon-pereira brandon-pereira requested review from a team, bot-hyperdx and dhable and removed request for a team and bot-hyperdx May 22, 2026 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant