TraceDB: Add async debug_trace caching#3359

Open
Kbhat1 wants to merge 2 commits into main from pr/trace-baker-main

Kbhat1 (Contributor) commented May 1, 2026

Describe your changes and provide context

  • Adds an opt-in TraceDB-backed cache for bakeable debug_trace* calls, stored under <home>/data/trace_db.
  • Runs a background TraceBaker off the consensus path to precompute configured tracers for committed blocks; cache misses still fall back to live tracing.
  • Adds bounded workers/queueing, startup catch-up, and optional rolling pruning via [evm] config.

Testing performed to validate your change

  • Ran fully on local node
  • Verifying on mainnet node
  • Unit tests

github-actions Bot commented May 1, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  May 5, 2026, 3:23 PM

Kbhat1 force-pushed the pr/trace-baker-main branch 4 times, most recently from b177656 to 7b8a363 on May 1, 2026 15:42
Kbhat1 changed the title from "Add TraceCache + TraceBaker for async debug_trace caching" to "TraceDB: Add TraceCache + TraceBaker for async debug_trace caching" on May 1, 2026
Kbhat1 force-pushed the pr/trace-baker-main branch 3 times, most recently from 6981a66 to fe9ec89 on May 1, 2026 21:02
jewei1997 (Contributor) commented

Ran through LLM review and got this

Major issues

1. Baker is never stopped — goroutine leak + use-after-close on pebble. evmrpc/server.go:95 calls StartTraceBakerForDebugAPI(...) and discards the return value. App.HandleClose (app/app.go:1073) calls tc.Close() on the TraceDB but the baker's Stop() is never invoked. Workers will continue to call b.cache.Put/PutBlock/SetLastBakedHeight against a closed pebble DB during shutdown — pebble returns errors rather than panicking on closed-DB writes, but they'll be silent (logged at Debug). Worse, catchUpLoop and pruneLoop continue to run.

Fix: capture the baker, register it for shutdown, and Stop it before closing the TraceDB so workers drain first.
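The ordering in that fix can be sketched as follows. This is a minimal illustration, not the PR's code: `fakeDB`, `baker`, `startBaker`, and `shutdown` are hypothetical stand-ins, and the fake store simply returns an error on writes after `Close`. The point is only that `Stop` closes the queue and waits for workers, so every queued write lands before the store is closed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// fakeDB is a hypothetical stand-in for the pebble-backed TraceDB.
// Writes fail once it has been closed.
type fakeDB struct {
	mu     sync.Mutex
	closed bool
	writes int
}

func (d *fakeDB) Put(height int64) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.closed {
		return errors.New("db closed")
	}
	d.writes++
	return nil
}

func (d *fakeDB) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.closed = true
	return nil
}

// baker is a hypothetical stand-in for the TraceBaker worker pool.
type baker struct {
	db     *fakeDB
	queue  chan int64
	wg     sync.WaitGroup
	failed atomic.Int64 // writes that hit a closed DB
}

func startBaker(db *fakeDB, workers, queueSize int) *baker {
	b := &baker{db: db, queue: make(chan int64, queueSize)}
	for i := 0; i < workers; i++ {
		b.wg.Add(1)
		go func() {
			defer b.wg.Done()
			// range drains buffered heights even after the queue is closed.
			for h := range b.queue {
				if err := b.db.Put(h); err != nil {
					b.failed.Add(1)
				}
			}
		}()
	}
	return b
}

// Stop closes the queue and blocks until every worker has exited.
func (b *baker) Stop() {
	close(b.queue)
	b.wg.Wait()
}

// shutdown stops the baker first, then closes the store.
func shutdown(b *baker, db *fakeDB) error {
	b.Stop()
	return db.Close()
}

func main() {
	db := &fakeDB{}
	b := startBaker(db, 2, 16)
	for h := int64(1); h <= 5; h++ {
		b.queue <- h
	}
	if err := shutdown(b, db); err != nil {
		fmt.Println("close failed:", err)
	}
	fmt.Println("writes:", db.writes, "failed:", b.failed.Load())
}
```

Reversing the two calls in `shutdown` would let workers reach `Put` on a closed store, which is the situation the review item describes.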

2. Enqueue will panic on send-to-closed-channel if Stop is ever wired up. Stop() does close(b.queue), but Enqueue selects on case b.queue <- height: with no done-channel guard:

select {
case b.queue <- height:
default:
    ...
}

If a block commit fires after Stop, the send panics. Add a case <-b.done: return in Enqueue (or guard with an atomic-bool/RWMutex). Currently latent because Stop is never called (issue 1), but the moment that's fixed this surfaces.
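A minimal sketch of the RWMutex variant mentioned above, using hypothetical names (`baker`, `newBaker`) rather than the PR's types. It takes the lock-based route because, in Go, a `select` whose send case targets a closed channel can still pick that case and panic even when a `done` case is also ready; holding a read lock across the send rules out the close happening mid-send.

```go
package main

import (
	"fmt"
	"sync"
)

// baker is a hypothetical stand-in for the PR's TraceBaker.
type baker struct {
	mu     sync.RWMutex
	closed bool
	queue  chan int64
}

func newBaker(queueSize int) *baker {
	return &baker{queue: make(chan int64, queueSize)}
}

// Enqueue never blocks. It reports whether the height was accepted.
func (b *baker) Enqueue(height int64) bool {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if b.closed {
		return false // baker stopped: drop instead of sending on a closed channel
	}
	select {
	case b.queue <- height:
		return true
	default:
		return false // queue full: drop, the caller falls back to live tracing
	}
}

// Stop closes the queue exactly once and is safe to call repeatedly.
func (b *baker) Stop() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closed {
		return
	}
	b.closed = true
	close(b.queue)
}

func main() {
	b := newBaker(1)
	fmt.Println(b.Enqueue(10)) // accepted
	fmt.Println(b.Enqueue(11)) // dropped: queue is full
	b.Stop()
	fmt.Println(b.Enqueue(12)) // dropped: baker is stopped, no panic
}
```

The read lock is uncontended in the normal path, so the cost on the block-commit side should stay small; whether that matters for the real baker is something only profiling on a node would show.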

3. bakeableTracerName ignores config fields that change semantics. It rejects only when TracerConfig is non-empty:

if len(config.TracerConfig) > 0 { return "" }

But tracers.TraceConfig also carries Reexec, Timeout, LogConfig (struct-logger fields), and EnableMemory/EnableReturnData/DisableStorage/DisableStack from the embedded logger.Config. A caller passing Reexec: 100 (force re-execution) silently gets a stale cache hit. Tighten the check to also require Reexec == 0, Timeout == nil, and zero-value LogConfig — or define a canonical tuple that goes into the cache key.

4. WindowBlocks: 0 (the default) means pruning is disabled — disk grows unbounded. Operator footgun: enabling trace_bake_enabled without setting a window quietly fills the disk. Either pick a sane non-zero default (e.g. 100k blocks ≈ a few days) or refuse to start when enabled with window=0.
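The "refuse to start" option could be as small as the check below. `bakeConfig` and `validateBakeConfig` are hypothetical names for illustration; the two fields correspond to trace_bake_enabled and the prune window discussed above.

```go
package main

import (
	"errors"
	"fmt"
)

// bakeConfig is a hypothetical subset of the [evm] trace-bake settings.
type bakeConfig struct {
	Enabled      bool
	WindowBlocks uint64 // 0 disables pruning
}

var errUnboundedTraceDB = errors.New(
	"trace baking is enabled with window 0: pruning is off and the trace DB will grow without bound")

// validateBakeConfig rejects the enabled-but-never-pruned combination.
func validateBakeConfig(c bakeConfig) error {
	if c.Enabled && c.WindowBlocks == 0 {
		return errUnboundedTraceDB
	}
	return nil
}

func main() {
	fmt.Println(validateBakeConfig(bakeConfig{Enabled: true, WindowBlocks: 0}))
	fmt.Println(validateBakeConfig(bakeConfig{Enabled: true, WindowBlocks: 100000}))
}
```

A softer variant would log a warning instead of returning an error, which keeps "0 disables prune" available to operators who want a full archive.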

Kbhat1 (Contributor, Author) commented May 4, 2026

@jewei1997

Will respond in more detail but don't think any of these are real triggerable bugs. Can clean up but can you take a look at the logic independently and lmk if any thoughts?

Kbhat1 (Contributor, Author) commented May 4, 2026

  1. Tendermint is stopped before pebble closes, so the workers can't finish a trace and never reach the Put.
  2. Stop() has no callers, so the panic path is unreachable. If we ever wire it up, we'd add the done-guard at the same time.
  3. Sei's StateAtBlock ignores Reexec entirely, and logger.Config only applies when Tracer is nil. The output is the same either way.
  4. That's the documented behavior: "0 disables prune" is right there in the config comment.

Will clean up with some of the suggestions, but these don't seem like blockers.

sei-protocol deleted a comment from codecov Bot on May 4, 2026
Kbhat1 force-pushed the pr/trace-baker-main branch from ae34e85 to f861506 on May 4, 2026 21:03
codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 66.74699% with 138 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.23%. Comparing base (6bf79dc) to head (ceeccd2).

Files with missing lines    Patch %   Lines
evmrpc/tracers.go           57.39%    44 Missing and 5 partials ⚠️
evmrpc/trace_baker.go       71.22%    28 Missing and 12 partials ⚠️
x/evm/keeper/trace_db.go    86.50%    10 Missing and 7 partials ⚠️
evmrpc/config/config.go     0.00%     10 Missing and 5 partials ⚠️
app/app.go                  0.00%     7 Missing and 2 partials ⚠️
evmrpc/server.go            0.00%     6 Missing and 1 partial ⚠️
x/evm/keeper/keeper.go      50.00%    1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3359      +/-   ##
==========================================
+ Coverage   59.07%   59.23%   +0.16%     
==========================================
  Files        2100     2099       -1     
  Lines      173066   172898     -168     
==========================================
+ Hits       102241   102421     +180     
+ Misses      61945    61595     -350     
- Partials     8880     8882       +2     

Flag            Coverage Δ
sei-chain-pr    62.66% <66.74%> (?)
sei-db          70.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines    Coverage Δ
x/evm/keeper/abci.go        57.14% <100.00%> (+0.83%) ⬆️
x/evm/keeper/keeper.go      54.89% <50.00%> (-0.03%) ⬇️
evmrpc/server.go            88.46% <0.00%> (-3.09%) ⬇️
app/app.go                  69.38% <0.00%> (-0.47%) ⬇️
evmrpc/config/config.go     74.31% <0.00%> (-11.86%) ⬇️
x/evm/keeper/trace_db.go    86.50% <86.50%> (ø)
evmrpc/trace_baker.go       71.22% <71.22%> (ø)
evmrpc/tracers.go           63.12% <57.39%> (-2.18%) ⬇️

... and 49 files with indirect coverage changes


Opt-in pebble-backed cache for debug_trace results. RPC nodes that
flip it on get tip-following debug_trace as ms-level cache hits
instead of seconds-level live re-execution.

- TraceCache (x/evm/keeper/trace_cache.go): pebble db at
  <home>/data/trace_cache, keyed (height, tracer, txHash) per tx and
  (height, tracer) per block. height-leading keys make windowed prune
  a single DeleteRange.

- TraceBaker (evmrpc/trace_baker.go): bounded worker pool that re-runs
  committed blocks through TraceBlockByNumber and writes results into
  TraceCache. Enqueue is non-blocking so consensus is unaffected;
  cache misses fall through to live re-execution. Includes startup
  catch-up (from last_baked+1) and rolling prune.

- DebugAPI hits the cache for both per-tx and per-block traces and
  falls through transparently on miss.

- EVM EndBlock enqueues height-1 (the latest block whose indexer
  state is guaranteed available).

- App opens NewTraceCache and Closes it on shutdown so the WAL flushes
  cleanly (writes use NoSync for throughput).

Configurable via [evm] in app.toml:
  trace_bake_enabled         (default false)
  trace_bake_workers         (default 1)
  trace_bake_queue_size      (default 4096)
  trace_bake_tracers         (default ["callTracer"])
  trace_bake_window_blocks   (default 0; 0 disables prune)
  trace_bake_block_results   (default false)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kbhat1 force-pushed the pr/trace-baker-main branch from 7c210c4 to 27dc9b0 on May 4, 2026 21:40
Kbhat1 changed the title from "TraceDB: Add TraceCache + TraceBaker for async debug_trace caching" to "TraceDB: Add async debug_trace caching" on May 5, 2026

2 participants