
Cache Debug Trace #3339 (Draft)

Kbhat1 wants to merge 12 commits into main from trace-baker

Conversation

Kbhat1 (Contributor) commented Apr 29, 2026

Describe your changes and provide context

  • Cache debug_trace* results off the consensus path in a separate pebble cache so RPC nodes serve traces from cache instead of re-executing blocks in full
  • This brings debug_trace response times down to single-digit milliseconds versus the full re-execution cost today (especially helpful for indexers and other heavy trace consumers)
  • Add a background trace-baker that re-runs each committed block under callTracer on a worker goroutine
    and serves debug_traceTransaction / debug_traceBlockBy* from cache
  • Fully configurable (five new evm.* knobs, all default-off)
  • Cache debug_trace results during normal block flow, not on demand

Testing performed to validate your change

  • Ran fully on local node
  • Verifying on mainnet node
  • Unit tests

Kbhat1 and others added 7 commits April 29, 2026 14:10
Standalone pebble db at <home>/data/trace_cache so writes don't share
LSM with the chain state (the lesson from 42b7077, where the
sentinel-pointer experiment regressed avgTotal ~32% due to compaction
contention with chain pebble).

Key shape: "ts/" || height(BE,8) || tracerLen(1) || tracer || txHash(32).
Height is leading so Prune is a single range-delete by height window.
Tx hashes are globally unique on this chain, so (height, tracer, txHash)
collisions are impossible.
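
A minimal sketch of that key layout (a traceCacheKey helper is referenced by a later lint commit; the exact signature here is an assumption):

```go
package keeper

import "encoding/binary"

// traceCacheKey sketch: "ts/" || height(BE,8) || tracerLen(1) || tracer || txHash(32).
// Big-endian height leads so a height window is one contiguous key range
// and Prune can drop it with a single range delete.
func traceCacheKey(height int64, tracer string, txHash [32]byte) []byte {
	key := make([]byte, 0, 3+8+1+len(tracer)+32)
	key = append(key, "ts/"...)
	var h [8]byte
	binary.BigEndian.PutUint64(h[:], uint64(height)) //nolint:gosec // heights are non-negative
	key = append(key, h[:]...)
	key = append(key, byte(len(tracer))) // 1-byte length prefix keeps the key self-delimiting
	key = append(key, tracer...)
	key = append(key, txHash[:]...)
	return key
}
```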

Also defines TraceEnqueuer + a tiny indirection (SetTraceEnqueuer /
Enqueue) so the keeper can hold one *TraceCache field that owns both
the cache and the forwarder, without taking a hard dep on the baker
that lives in evmrpc.

All methods are nil-safe: callers can hold a single field and skip
init when the feature is off.
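
A hedged sketch of that indirection; everything beyond the TraceEnqueuer / SetTraceEnqueuer / Enqueue names is an assumption:

```go
package keeper

import "github.com/cockroachdb/pebble"

// TraceEnqueuer is the one-method seam the baker (in evmrpc) implements,
// so this package never has to import it.
type TraceEnqueuer interface {
	Enqueue(height int64)
}

// TraceCache owns the standalone pebble db and the forwarder.
type TraceCache struct {
	db       *pebble.DB
	enqueuer TraceEnqueuer // registered late by the RPC server
}

// SetTraceEnqueuer registers the baker. Nil-safe: a nil receiver means
// the feature is off and the call is a no-op.
func (tc *TraceCache) SetTraceEnqueuer(e TraceEnqueuer) {
	if tc == nil {
		return
	}
	tc.enqueuer = e
}

// Enqueue forwards a height to the registered baker, if any.
func (tc *TraceCache) Enqueue(height int64) {
	if tc == nil || tc.enqueuer == nil {
		return
	}
	tc.enqueuer.Enqueue(height)
}
```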

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-bakes debug_trace results so trace RPCs become a single PK lookup
in the trace cache instead of full re-execution. The baker is a
bounded-queue worker pool that pulls heights enqueued from EndBlock,
calls the existing tracers.API.TraceBlockByNumber for each configured
tracer, and writes the per-tx JSON into TraceCache.

Hard guarantee on consensus impact: Enqueue is a non-blocking channel
send (drops on full queue with sparse logging); all re-execution
happens on baker goroutines; reads from chain pebble go through
versioned MVCC (no locks); writes go to a separate pebble db.
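
A sketch of that non-blocking send; the struct fields and the every-Nth drop logging are assumptions:

```go
package evmrpc

import (
	"log/slog"
	"sync/atomic"
)

type TraceBaker struct {
	queue   chan int64 // bounded; sized by the queue-size config knob
	dropped atomic.Int64
	logger  *slog.Logger
}

// Enqueue never blocks the caller: a full queue drops the height, and the
// block falls through to on-demand re-execution at trace time.
func (b *TraceBaker) Enqueue(height int64) {
	select {
	case b.queue <- height:
	default:
		// Sparse logging: surface a stalled baker without flooding
		// consensus-path logs.
		if n := b.dropped.Add(1); n%1000 == 1 {
			b.logger.Info("trace baker queue full; dropping height",
				"height", height, "dropped_total", n)
		}
	}
}
```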

If the baker falls behind, dropped blocks fall through to today's
on-demand re-execution at trace time. No correctness loss.

Tracer indirection (blockTracer interface) keeps the baker testable
without standing up a real EVM/keeper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single *TraceCache field on the Keeper (nil-safe) plus an
Enqueue call from EndBlock that forwards the just-committed height to
the trace baker if one is registered. Skipped during tracing (re-entry
guard) so debug_trace replays don't recursively re-enqueue.

The Enqueue call is a non-blocking channel send via TraceCache (which
forwards to the registered TraceEnqueuer). When the baker queue is
full, the height is dropped and the block falls through to today's
on-demand re-execution at trace time. Consensus latency is unaffected
in any case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a cache lookup at the top of TraceTransaction. On hit (the baker
already produced the result for this tx + tracer), returns the cached
JSON directly. On miss (no cache, unbakeable tracer config, missing
receipt, or absent row) falls through to today's tracersAPI re-execution
path with no behavior change.

bakeableTracerName decides whether a config can be served from cache.
We only bake the standard named tracers (callTracer / prestateTracer /
flatCallTracer) without per-call TracerConfig — anything else (struct
logger, raw JS, custom config) misses by design so we can't return a
false hit.
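
A sketch of that gate, assuming go-ethereum's eth/tracers.TraceConfig shape (the name constants match a later lint commit; the function signature is an assumption):

```go
package evmrpc

import "github.com/ethereum/go-ethereum/eth/tracers"

const (
	callTracerName     = "callTracer"
	prestateTracerName = "prestateTracer"
	flatCallTracerName = "flatCallTracer"
)

// bakeableTracerName returns the tracer name to look up in the cache, or
// false when the request can't be served from cache: struct logger (no
// named tracer), raw JS, or any per-call TracerConfig. Missing by design
// beats returning a false hit.
func bakeableTracerName(cfg *tracers.TraceConfig) (string, bool) {
	if cfg == nil || cfg.Tracer == nil || cfg.TracerConfig != nil {
		return "", false
	}
	switch *cfg.Tracer {
	case callTracerName, prestateTracerName, flatCallTracerName:
		return *cfg.Tracer, true
	default:
		return "", false
	}
}
```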

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds five new evm.* config knobs (all default-off / sane defaults):

  trace_bake_enabled         (bool, default false)
  trace_bake_workers         (int, default 1)
  trace_bake_queue_size      (int, default 4096)
  trace_bake_tracers         ([]string, default ["callTracer"])
  trace_bake_window_blocks   (int64, default 0 = disabled)
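
For illustration, the matching config struct might look like this; the field and tag spellings are assumptions inferred from the knob names:

```go
package config

// TraceBakeConfig mirrors the five evm.* knobs (hypothetical shape).
type TraceBakeConfig struct {
	Enabled      bool     `mapstructure:"trace_bake_enabled"`       // default false
	Workers      int      `mapstructure:"trace_bake_workers"`       // default 1
	QueueSize    int      `mapstructure:"trace_bake_queue_size"`    // default 4096
	Tracers      []string `mapstructure:"trace_bake_tracers"`       // default ["callTracer"]
	WindowBlocks int64    `mapstructure:"trace_bake_window_blocks"` // default 0 = disabled
}
```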

When trace_bake_enabled=true:
  - app.go opens the trace cache pebble db at <home>/data/trace_cache
    and attaches it to the EVM keeper (so EndBlock can Enqueue heights).
  - The HTTP server constructs a TraceBaker that re-executes blocks via
    the existing tracers.API, registers it as the keeper's enqueuer, and
    starts the workers.

Validators leave it off and pay nothing. RPC nodes flip it on. The
keeper-side EndBlock enqueue is a non-blocking channel send that
short-circuits to a counter when the queue is full, so consensus
latency is bounded regardless of baker progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TraceBlockByNumber, TraceBlockByHash, and the *ExcludeTraceFail
variants now check the trace cache before falling through to live
re-execution. The cache lookup is "all-or-nothing": if every tx in
the block has a cached entry under the requested tracer, return the
assembled list; if any tx misses, fall through to the existing path
(no partial results to keep the live path simple and deterministic).

Cached entries are never errored (the baker skips errored traces),
so the ExcludeTraceFail filter applied to live traces is a no-op for
cache hits.

The inner cache lookup is a free function over (cache, height, txHashes,
config) so it stays unit-testable without standing up an EVM backend.
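
A hedged sketch of that free function; the cache accessor (Get) signature and the hash type are assumptions:

```go
package evmrpc

import (
	"encoding/json"

	"github.com/ethereum/go-ethereum/common"
)

// tryTraceCacheBlock returns the assembled per-tx results only when every
// tx in the block hits under the requested tracer. Any miss returns false
// so the caller falls through to live re-execution: no partial results.
func tryTraceCacheBlock(cache *TraceCache, height int64, txHashes []common.Hash, tracer string) ([]json.RawMessage, bool) {
	if cache == nil {
		return nil, false
	}
	out := make([]json.RawMessage, 0, len(txHashes))
	for _, h := range txHashes {
		bz, ok := cache.Get(height, tracer, h) // assumed accessor shape
		if !ok {
			return nil, false
		}
		out = append(out, bz)
	}
	return out, true
}
```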

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three related changes in one commit; the catch-up sweep and periodic prune both depend on TipFn, and both build on the new watermark:

last_baked_height watermark
  TraceCache gains SetLastBakedHeight (atomic-max under a small lock,
  out-of-order workers can't roll the watermark backwards) and
  LastBakedHeight (read). Stored under "meta/last_baked_height" so
  Prune's "ts/" range delete leaves it alone. The bakeBlock worker
  updates the watermark after every successful (block, tracer) bake.
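
A sketch of the atomic-max; the lock and field names are assumptions, and the pebble write under "meta/last_baked_height" is elided to a comment:

```go
package keeper

import "sync"

type traceWatermark struct {
	mu        sync.Mutex
	lastBaked int64
}

// SetLastBakedHeight advances the watermark monotonically. Workers can
// finish out of order (height N before N-1), so anything at or below the
// current watermark is ignored rather than rolling it backwards.
func (w *traceWatermark) SetLastBakedHeight(h int64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if h <= w.lastBaked {
		return
	}
	w.lastBaked = h
	// Persist under "meta/last_baked_height": outside the "ts/" prefix,
	// so Prune's range delete never touches it.
}
```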

Catch-up sweep
  When TipFn is set, Start() spawns a one-shot catchUpLoop that walks
  last_baked+1 .. tip, baking each height directly (bypasses the
  bounded queue so backfill can't drop). Bounded by WindowBlocks so a
  long-stopped node doesn't try to bake from genesis. Skipped when
  no prior watermark exists (operators who want a one-shot full
  backfill run it explicitly).

Periodic prune
  When TipFn is set AND WindowBlocks > 0, Start() spawns a pruneLoop
  ticking on PruneInterval (default 1m). Each tick calls
  cache.Prune(tip - WindowBlocks) — one DeleteRange on pebble, cheap.
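
A hedged sketch of that single range delete over the "ts/" keyspace, reusing the TraceCache sketch above (the db field is an assumption):

```go
package keeper

import (
	"encoding/binary"

	"github.com/cockroachdb/pebble"
)

// Prune drops every per-tx row strictly below cutoff with one range
// tombstone. Height is the leading key component, so [start, end) is a
// contiguous span; "meta/" keys sort outside "ts/" and are untouched.
func (tc *TraceCache) Prune(cutoff int64) error {
	if tc == nil {
		return nil
	}
	start := []byte("ts/")
	end := make([]byte, 0, 3+8)
	end = append(end, "ts/"...)
	var h [8]byte
	binary.BigEndian.PutUint64(h[:], uint64(cutoff)) //nolint:gosec // heights are non-negative
	end = append(end, h[:]...)
	return tc.db.DeleteRange(start, end, pebble.NoSync)
}
```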

Wiring: server.go passes TipFn := func() int64 { return
ctxProvider(LatestCtxHeight).BlockHeight() } and forwards
TraceBakeWindowBlocks from config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions (Bot) commented Apr 29, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build: ✅ passed | Format: ✅ passed | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Apr 30, 2026, 6:28 PM

Kbhat1 and others added 5 commits April 29, 2026 14:52
- evmrpc/tracers.go: drop the redundant json.RawMessage(bz) conversion
  flagged by unconvert. cache.Get already returns json.RawMessage so the
  result is the same byte sequence wrapped in the same type.
- x/evm/keeper/trace_cache.go: annotate the int64 -> uint64 conversion
  in traceCacheKey with //nolint:gosec; block heights are non-negative,
  matching the same annotation already used elsewhere in the file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- evmrpc/tracers.go: drop the second redundant json.RawMessage(bz)
  conversion in tryTraceCache (cache.Get already returns json.RawMessage).
- evmrpc/tracers.go: extract callTracerName / prestateTracerName /
  flatCallTracerName constants so the tracer names appear in one place
  (goconst was flagging "callTracer" with 3 occurrences).
- x/evm/keeper/trace_cache.go: handle the closer.Close() return value
  in lastBakedHeightUnlocked via "_ = closer.Close()" inside a deferred
  closure (matches the existing pattern in Get).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Off-by-one in EndBlock enqueue.
   When EndBlock(N) fires, height N isn't yet "safe latest" for
   geth tracer queries — the watermark sits at N-1. The baker was
   consistently failing every block with:
     "requested height N is not yet available; safe latest is N-1"
   Fix: enqueue (height - 1) from EndBlock; skip the genesis tick
   where height-1 wouldn't exist (sketched after this list).

2. Trace cache wasn't closed on graceful shutdown.
   Baker writes use pebble.NoSync, so SIGTERM lost in-memory data
   because nothing flushed the WAL on the way out. HandleClose now
   closes the cache before falling through to the receipt store
   close, mirroring the existing pattern.
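
A sketch of the fixed enqueue site, reusing the TraceCache sketch above; the isTracing helper is an assumed stand-in for the PR's re-entry guard:

```go
package keeper

import sdk "github.com/cosmos/cosmos-sdk/types"

// Minimal stand-in for the real keeper; only the field used here.
type Keeper struct {
	traceCache *TraceCache // nil when trace baking is off
}

// isTracing reports whether this call runs inside a debug_trace replay
// (assumed helper; the real guard is keeper-specific).
func isTracing(ctx sdk.Context) bool { return false /* keeper-specific check */ }

// Called from EndBlock(N). The safe-latest watermark for geth tracer
// queries sits at N-1 while EndBlock(N) runs, so enqueue N-1, not N.
func (k *Keeper) enqueueForTraceBake(ctx sdk.Context) {
	if isTracing(ctx) {
		return // debug_trace replays must not recursively re-enqueue
	}
	prev := ctx.BlockHeight() - 1
	if prev < 1 {
		return // genesis tick: height-1 doesn't exist
	}
	k.traceCache.Enqueue(prev) // nil-safe, non-blocking; drops when the baker queue is full
}
```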

Plus minor: log a debug-level "trace cache hit" line on the read
path and a startup banner from the baker, so this class of e2e bug
is visible to operators the next time they debug.

Verified end-to-end against a local sei-chain node (-chain-id sei-chain):
  - bake "n_results=1" log line for the block carrying our test tx
  - "trace cache hit" log line on the matching debug_traceTransaction
  - graceful shutdown flushed 13 WAL batches; reopened db shows
    last_baked_height advanced and the tx's row at "ts/<height>/<tracer>/<txHash>"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a parallel "tb/<height,8>/<tracerLen,1><tracer>" keyspace for the
assembled per-block trace result. Same height ordering as the per-tx
"ts/" keyspace so Prune is still cheap — one DeleteRange per prefix,
both bounded work regardless of row count.

Block-level reads (debug_traceBlockBy*) can now be a single PK seek
into "tb/" instead of N seeks under "ts/" + assembly. The baker (next
commit) writes both rows when the new flag is on so per-tx and
per-block paths each hit at one seek.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When debug_traceBlockBy* dominates the trace traffic, caching only the
per-tx rows costs N seeks per block lookup. With CacheBlockResults the
baker additionally writes the assembled JSON to a "tb/<height>/<tracer>"
row, so block-level reads hit at one PK seek instead of N. Per-tx
"ts/" rows are still written either way — the new flag is purely
additive.

Reader fast-path: tryBlockResultCache checks tb/ first; on miss falls
back to today's per-tx assembly. Per-tx hits are unchanged. Unbakeable
tracer configs (struct logger, custom JS, per-call TracerConfig)
short-circuit before touching either keyspace.
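
A hedged sketch of the reader fast-path; GetBlock is an assumed accessor over "tb/", and tryTraceCacheBlock is the per-tx sketch from the all-or-nothing commit above:

```go
package evmrpc

import (
	"encoding/json"

	"github.com/ethereum/go-ethereum/common"
)

// tryBlockResultCache: one seek into "tb/" for the assembled block JSON.
// On miss, fall back to the N-seek per-tx assembly over "ts/"; if that
// also misses, the caller re-executes live.
func tryBlockResultCache(cache *TraceCache, height int64, tracer string, txHashes []common.Hash) (json.RawMessage, bool) {
	if cache == nil {
		return nil, false
	}
	if bz, ok := cache.GetBlock(height, tracer); ok { // assumed accessor over "tb/"
		return bz, true
	}
	perTx, ok := tryTraceCacheBlock(cache, height, txHashes, tracer)
	if !ok {
		return nil, false
	}
	bz, err := json.Marshal(perTx)
	return bz, err == nil
}
```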

Empty blocks are skipped on the write side — per-tx assembly already
returns [] for them at zero cache cost, and json.Marshal(nil)="null"
would have been a format mismatch with the live path.

Verified live: tx in block 0xdf gets a tb/ row written; per-block
RPC returns the cached JSON; empty blocks fall through to the per-tx
path and return [] correctly. Final state: ts/ rows = 1 (one tx),
tb/ rows = 1 (one tx-bearing block), no empty-block garbage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>