feat(mcp): add trace waterfall and breakdown tools #2334
brandon-pereira wants to merge 5 commits into
Add two new MCP tools for trace investigation:
- hyperdx_trace_waterfall: Fetch all spans in a single trace as a parent/child waterfall tree. Supports auto-pick by slowest, first error, or most recent trace. Includes correlated logs when the trace source has a linked logSourceId.
- hyperdx_trace_top_time_consuming_operations: Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app 'Top Most Time Consuming Operations' chart.

Both tools use source-configured expressions for attribute extraction, set readonly:1 on ClickHouse queries for defense-in-depth, and output JSON.

Includes integration tests covering schema serialization, error paths, and seeded data scenarios (waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN).
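The waterfall ordering described above can be illustrated with a minimal sketch. It is not the code in this PR: the `WaterfallSpan` shape, the `buildWaterfall` name, and the `maxSpans` handling are assumptions made for illustration. It groups spans by parent, treats spans whose parent is missing from the result set as roots, and emits rows in pre-order with a depth for indentation.

```typescript
// Hypothetical span shape for illustration; not the HyperDX schema.
interface WaterfallSpan {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  startMs: number;
  durationMs: number;
}

interface WaterfallRow {
  depth: number;
  span: WaterfallSpan;
}

// Flatten spans into pre-order: each parent before its children,
// siblings ordered by start time. Stops after maxSpans rows.
function buildWaterfall(
  spans: WaterfallSpan[],
  maxSpans: number = Infinity,
): { rows: WaterfallRow[]; truncated: boolean } {
  const ids = new Set(spans.map((s) => s.spanId));
  const children = new Map<string, WaterfallSpan[]>();
  const roots: WaterfallSpan[] = [];

  for (const s of spans) {
    // Assumption: a span whose parent is absent from the input is shown as a root.
    if (s.parentSpanId !== null && ids.has(s.parentSpanId)) {
      const list = children.get(s.parentSpanId) ?? [];
      list.push(s);
      children.set(s.parentSpanId, list);
    } else {
      roots.push(s);
    }
  }

  const byStart = (a: WaterfallSpan, b: WaterfallSpan) => a.startMs - b.startMs;
  const rows: WaterfallRow[] = [];

  // Explicit stack instead of recursion, so very deep traces do not overflow.
  // Items are pushed in reverse so the earliest sibling is popped first.
  const stack: WaterfallRow[] = [...roots]
    .sort(byStart)
    .reverse()
    .map((span) => ({ depth: 0, span }));

  while (stack.length > 0 && rows.length < maxSpans) {
    const row = stack.pop();
    if (!row) break;
    rows.push(row);
    const kids = [...(children.get(row.span.spanId) ?? [])].sort(byStart).reverse();
    for (const span of kids) {
      stack.push({ depth: row.depth + 1, span });
    }
  }

  // Anything still on the stack was cut off by the maxSpans cap.
  return { rows, truncated: stack.length > 0 };
}
```

A caller could render each row as `"  ".repeat(row.depth) + row.span.name` and append a truncation note when `truncated` is true.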
🦋 Changeset detected
Latest commit: c1abd1f
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
🔴 Tier 4: Critical
Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.
Why this tier:
Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
Stats
E2E Test Results
✅ All tests passed • 176 passed • 3 skipped • 1201s
Tests ran across 4 shards in parallel.
…llisions

otel_traces is not truncated between tests (commented out in clearClickhouseTables), so trace data accumulates across all test suites. Use unique trc-test-/wf-test- prefixed names to prevent collisions with other tests' data.
Deep Review
Scope: PR #2334, base
No P0 or P1 findings: the diff is feature-additive with no regression path. The bar to clear is contract clarity and a handful of new failure modes the diff introduces.
🟡 P2 -- recommended
🔵 P3 nitpicks (14)
Reviewers (10): correctness, security, adversarial, performance, testing, maintainability, api-contract, reliability, project-standards, kieran-typescript.
Testing gaps:
- Fix first_error pickBy description: 'root span' → 'a span' (matches actual any-error-in-trace semantics)
- Add .max(4096) to parentFilter schema to cap input length
- Use uniqExact() instead of count(DISTINCT) for in_parents aggregate
- Add MCP.md entries for both new trace tools
- Add tests: first_error pick mode, sql pickFilterLanguage, logSource edge cases (missing, wrong kind, non-existent)
What
Add two new MCP tools for trace investigation:
hyperdx_trace_waterfall
Fetch all spans in a single trace as a parent/child waterfall tree, pre-ordered for human-readable display.
- traceId to fetch a known trace
- pickFilter + pickBy (slowest / first_error / most_recent) to find one matching trace
- Correlated logs when the trace source has a linked logSourceId
- maxSpans / maxLogs caps with truncation notes

hyperdx_trace_top_time_consuming_operations
Aggregate breakdown of child operations consuming the most cumulative time across traces matching a parent-span filter. Same algorithm as the in-app "Top Most Time Consuming Operations" chart.
- minParentDurationMs to focus on slow parents
- Reports totalTimeMs, calls, inParents, p50Ms, p99Ms, and shareOfTotalTime

Why
MCP agents investigating latency issues need a way to:
- inspect every span of a single trace as a parent/child tree (waterfall)
- see which child operations consume the most cumulative time across matching traces (breakdown)

The existing builder tools (table/timeseries/search) can't express the TraceId subselect pattern needed for the breakdown.
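To make the breakdown concrete, here is a rough in-memory sketch of the two-step shape that the TraceId subselect implies: first collect the traces containing a matching parent span, then aggregate the other spans in those traces by operation name. It is an illustration only, not the ClickHouse query in this PR. The `BreakdownSpan` shape, the function names, the nearest-rank percentile, and the choice to count `inParents` as distinct traces and to exclude the matching parent spans themselves are all assumptions.

```typescript
// Hypothetical span shape for illustration; not the HyperDX schema.
interface BreakdownSpan {
  traceId: string;
  spanId: string;
  name: string;
  durationMs: number;
}

interface OperationStats {
  operation: string;
  totalTimeMs: number;
  calls: number;
  inParents: number;
  p50Ms: number;
  p99Ms: number;
  shareOfTotalTime: number;
}

// Nearest-rank percentile over an ascending-sorted array (an assumed method).
function percentile(sorted: number[], q: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

function topTimeConsumingOperations(
  spans: BreakdownSpan[],
  isParent: (s: BreakdownSpan) => boolean,
  minParentDurationMs: number = 0,
  topN: number = 10,
): OperationStats[] {
  // Step 1 (the "subselect"): traces that contain a matching, slow-enough parent.
  // Assumption: the duration threshold is inclusive.
  const parents = spans.filter((s) => isParent(s) && s.durationMs >= minParentDurationMs);
  const parentIds = new Set(parents.map((s) => s.spanId));
  const traceIds = new Set(parents.map((s) => s.traceId));

  // Step 2: group the remaining spans of those traces by operation name.
  const groups = new Map<string, { durations: number[]; traces: Set<string> }>();
  let grandTotal = 0;
  for (const s of spans) {
    if (!traceIds.has(s.traceId) || parentIds.has(s.spanId)) continue;
    const g = groups.get(s.name) ?? { durations: [], traces: new Set<string>() };
    g.durations.push(s.durationMs);
    g.traces.add(s.traceId);
    groups.set(s.name, g);
    grandTotal += s.durationMs;
  }

  const stats = Array.from(groups.entries()).map(([operation, g]) => {
    const sorted = [...g.durations].sort((a, b) => a - b);
    const totalTimeMs = sorted.reduce((sum, d) => sum + d, 0);
    return {
      operation,
      totalTimeMs,
      calls: sorted.length,
      inParents: g.traces.size, // distinct traces, analogous to an exact distinct count
      p50Ms: percentile(sorted, 0.5),
      p99Ms: percentile(sorted, 0.99),
      shareOfTotalTime: grandTotal > 0 ? totalTimeMs / grandTotal : 0,
    };
  });

  // Rank by cumulative time and keep the top N operations.
  return stats.sort((a, b) => b.totalTimeMs - a.totalTimeMs).slice(0, topN);
}
```

The second step depends on the set of trace IDs produced by the first, which is the dependency a single flat filter-and-group query cannot express.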
Testing
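Integration tests in packages/api/src/mcp/__tests__/trace.test.ts cover:
- schema serialization
- error paths
- seeded data scenarios: waterfall tree structure, auto-pick modes, truncation, correlated logs, breakdown ranking, minParentDurationMs filtering, topN

Run with: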
make dev-int FILE=trace