
feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API #3265

Merged
amir-deris merged 19 commits into main from amir/plt-264-migrate-evmrpc-to-standardized-otel
May 5, 2026

Conversation

@amir-deris (Contributor) commented Apr 16, 2026

Summary

Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").

  • Replaces utils/metrics calls in evmrpc (filter.go, utils.go, websockets.go) with a new metrics.go layer using otel.Meter("evmrpc") (the global MeterProvider convenience wrapper)
  • Keeps IncrementRpcRequestCounter, MeasureRpcRequestLatency, and IncWebsocketConnects in utils/metrics but annotates them TODO(PLT-326) for removal once dashboards are migrated to the new evmrpc_* OTel series
  • Removes num_blocks_fetched batch counters from internal processBatch call sites — these were unused and not wired to any dashboard or alert
  • Threads ctx through all RPC handler signatures (14+ files: association.go, block.go, info.go, subscribe.go, tracers.go, tx.go, txpool.go, and others) to support OTel's context-based propagation and adopt the new defer-with-recover() pattern
  • Adds success, error_class, and jsonrpc_code labels to the latency histogram so failed requests and error types can be tracked with low cardinality
  • Adds a ctx context.Context parameter to NewPendingTransactionFilter (previously missing)
  • Adds a test for the new telemetry layer

New metric names (OTel naming convention, exported via the process-wide MeterProvider):

  • evmrpc_request_latency_seconds — request latency histogram (labels: endpoint, connection, success, error_class, jsonrpc_code)
  • evmrpc_websocket_connects_total — websocket connection count
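
For concreteness, a minimal sketch of the new metrics.go layer, assembled from the instrument names above and the excerpts quoted in the review threads below; the field names and the must helper follow those excerpts, but the full file may differ:

package evmrpc

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

// Single package-level meter, backed by the process-wide MeterProvider.
var meter = otel.Meter("evmrpc")

var metrics = struct {
    requestLatencySeconds metric.Float64Histogram
    wsConnectionCount     metric.Int64Counter
}{
    requestLatencySeconds: must(meter.Float64Histogram(
        "evmrpc_request_latency_seconds",
        metric.WithDescription("RPC request latency in seconds"),
        metric.WithUnit("s"),
    )),
    wsConnectionCount: must(meter.Int64Counter(
        "evmrpc_websocket_connects_total",
        metric.WithDescription("Number of websocket connections"),
    )),
}

// must panics if instrument creation fails at package init.
func must[T any](m T, err error) T {
    if err != nil {
        panic(err)
    }
    return m
}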

Note

Metric name migration

Legacy metrics (sei_rpc_request_counter / sei_rpc_request_latency_ms / sei_websocket_connect) are still dual-emitted during the migration window so dashboards continue to receive data. They carry TODO markers and will be removed once dashboards are migrated to the new evmrpc_* OTel series.
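
The dual emission itself is a couple of lines at the record site; this fragment mirrors the diff excerpt quoted in the review below (the surrounding function is elided):

// TODO(PLT-326): remove legacy dual-emit once dashboards are migrated to evmrpc_* OTEL metrics.
metrics.IncrementRpcRequestCounter(apiMethod, string(connectionType), success)  // legacy sei_rpc_request_counter
metrics.MeasureRpcRequestLatency(apiMethod, string(connectionType), startTime)  // legacy sei_rpc_request_latency_ms
recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)  // new evmrpc_* series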

success label semantic correction

The old recordMetricsWithError had a logic bug — success := panicValue == nil || err != nil — which caused requests that returned an error without panicking to be recorded as success=true. This PR fixes that to success := panicValue == nil && err == nil.

The correction applies to both the new OTel metric and the dual-emitted legacy sei_rpc_request_counter series.

Deploy-time impact: A search of the platform repo found no Prometheus alert rules referencing these metrics, so no alerts will fire. Two Grafana dashboard panels filter on success="false" and will show a spike at deploy time:

  • clusters/prod/monitoring/grafana-dashboards-protocol.yaml — EVM RPC failed requests panel
  • clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json — EVM RPC failed requests panel

This spike is a visual artifact of the correction (previously miscounted errors now classified correctly), not a real increase in failures. No action is required, but heads-up for anyone watching these panels during the rollout.

github-actions Bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build: ✅ passed · Format: ✅ passed · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): May 5, 2026, 6:59 PM

@amir-deris changed the title from "used otel for evmrpc, added test" to "feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API" on Apr 16, 2026
codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 91.37255% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.10%. Comparing base (6bf79dc) to head (26aaf32).

Files with missing lines Patch % Lines
evmrpc/send.go 56.25% 7 Missing ⚠️
evmrpc/subscribe.go 63.63% 4 Missing ⚠️
evmrpc/association.go 82.35% 3 Missing ⚠️
evmrpc/metrics.go 95.34% 1 Missing and 1 partial ⚠️
evmrpc/net.go 0.00% 2 Missing ⚠️
evmrpc/state.go 85.71% 2 Missing ⚠️
evmrpc/utils.go 83.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3265      +/-   ##
==========================================
- Coverage   59.11%   59.10%   -0.02%     
==========================================
  Files        2100     2100              
  Lines      173066   173085      +19     
==========================================
- Hits       102312   102298      -14     
- Misses      61869    61924      +55     
+ Partials     8885     8863      -22     
Flag Coverage Δ
sei-chain-pr 65.31% <91.37%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
evmrpc/block.go 82.78% <100.00%> (+0.52%) ⬆️
evmrpc/filter.go 64.69% <100.00%> (+0.31%) ⬆️
evmrpc/info.go 78.52% <100.00%> (+1.11%) ⬆️
evmrpc/sei_legacy.go 73.13% <100.00%> (ø)
evmrpc/sei_legacy_http.go 80.15% <ø> (ø)
evmrpc/simulate.go 74.54% <100.00%> (+0.47%) ⬆️
evmrpc/trace_profile.go 70.23% <100.00%> (+0.72%) ⬆️
evmrpc/tracers.go 67.92% <100.00%> (+2.61%) ⬆️
evmrpc/tx.go 85.48% <100.00%> (ø)
evmrpc/txpool.go 76.47% <100.00%> (+1.47%) ⬆️
... and 10 more

... and 34 files with indirect coverage changes


Comment thread evmrpc/utils.go
// Automatically detect success/failure based on panic state
panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
Contributor Author:

The previous success value would become true whenever there was no panic (because || short-circuits), even when err was non-nil. Updated here to account for both panic and err.

@amir-deris amir-deris requested review from bdchatham and masih April 16, 2026 19:25
Comment threads (outdated): evmrpc/rpc_telemetry.go (×4), evmrpc/info.go
Comment thread evmrpc/filter.go
@masih (Collaborator) left a comment:

Yay glad to see OTEL integration picking up 🚀

Blockers:

  • standardise on seconds to measure time/duration/latency
  • Remove redundant counters

Left a bunch of naming suggestions and a few questions

Thanks @amir-deris 🙌

Comment thread evmrpc/rpc_telemetry.go Outdated
// configured by the application.

var (
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
Collaborator:

  • There would be a single meter for this entire package. I would simply name this meter.
  • I recommend removing the repetitive _rpc suffix.

Contributor Author:

Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go and its meter definition clashes with this one if we rename this meter:
meter = otel.Meter("evmrpc_workerpool")
Should we combine all meters into one file (metrics.go)?

Comment thread evmrpc/rpc_telemetry.go Outdated
var (
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

rpcTelemetryMetrics = struct {
Collaborator:

Similarly, I would call this metrics.

Comment thread evmrpc/rpc_telemetry.go Outdated
rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

rpcTelemetryMetrics = struct {
requests metric.Int64Counter
Collaborator:

  • General note on naming: this reads too abstract. I would add a Count at the end of it.
  • The counter itself is redundant, if we always measure latency in a histogram. See my other comment.

Contributor Author:

I will remove this counter, as discussed in histogram comment.

Comment thread evmrpc/rpc_telemetry.go Outdated

rpcTelemetryMetrics = struct {
requests metric.Int64Counter
requestLatencyMs metric.Float64Histogram
Collaborator:

Always standardise on seconds for latency/duration measurements. Duration.Seconds() conveniently returns a float64 without the loss of precision.
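
For instance, a sketch of the record site under this convention (attrs stands in for the attribute set built by the caller):

elapsedSeconds := time.Since(startTime).Seconds()  // float64 seconds, no precision loss
metrics.requestLatencySeconds.Record(ctx, elapsedSeconds, metric.WithAttributes(attrs...))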

Comment thread evmrpc/rpc_telemetry.go Outdated
rpcTelemetryMetrics = struct {
requests metric.Int64Counter
requestLatencyMs metric.Float64Histogram
websocketConnects metric.Int64Counter
Collaborator:

Similarly I would add a Count at the end, and perhaps pick a simpler name: wsConnectionCount or similar.

Comment thread evmrpc/rpc_telemetry.go Outdated
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
rpcTelemetryMetrics.requests.Add(ctx, 1,
metric.WithAttributes(
attribute.String("endpoint", endpoint),
Collaborator:

Define attrKeys for repeated tags as vars?
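
A sketch of what that could look like (key names taken from the attributes above; the var names are illustrative):

var (
    attrEndpoint   = attribute.Key("endpoint")
    attrConnection = attribute.Key("connection")
    attrSuccess    = attribute.Key("success")
)

// At the record site:
// metric.WithAttributes(attrEndpoint.String(endpoint), attrConnection.String(connection), attrSuccess.Bool(success))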

Comment thread evmrpc/rpc_telemetry.go Outdated
metric.WithAttributes(
attribute.String("endpoint", endpoint),
attribute.String("connection", connection),
attribute.Bool("success", success),
Collaborator:

A better approach is to take the error and tag by error type for a fixed set of well-known errors.

Also, do we measure the JSON-RPC response code somewhere? That would be super useful.

Contributor Author:

I added a new method, classifyRPCMetricError, to record the error type and response code with latency.
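
A sketch of what such a classifier might look like; the actual signature and error taxonomy in evmrpc/metrics.go may differ (rpc.Error is go-ethereum's error interface exposing ErrorCode()):

import (
    "context"
    "errors"

    "github.com/ethereum/go-ethereum/rpc"
)

func classifyRPCMetricError(err error, panicked bool) (errorClass string, jsonrpcCode int, hasCode bool) {
    switch {
    case panicked:
        return "panic", 0, false
    case err == nil:
        return "", 0, false // success path carries no error_class
    }
    var coder rpc.Error
    if errors.As(err, &coder) {
        return "jsonrpc_error", coder.ErrorCode(), true
    }
    if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
        return "timeout", 0, false
    }
    return "other", 0, false
}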

Comment thread evmrpc/filter.go
Comment thread evmrpc/association.go Outdated
func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
startTime := time.Now()
defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
Collaborator:

Why are we keeping recordMetricsWithError calls at all?

Is the idea to not break the existing metrics, roll out both then remove the old stuff?

Contributor Author:

I thought we needed to keep these metrics to stay backward compatible with any dashboards we might have. Should I remove them?

Comment thread evmrpc/rpc_telemetry.go Outdated
@@ -0,0 +1,65 @@
package evmrpc
Collaborator:

Conventionally I would simply call this file metrics.go. Each package would have one as needed; that Go file would initialise once on startup and that's it: all package-level vars.

@amir-deris amir-deris changed the base branch from main to amir/plt-296-observability-revamp-long-running-branch April 30, 2026 20:39
@masih (Collaborator) left a comment:

This is a stacked PR.

I advise against stacked PRs in general; from a dev perspective, the work is only done once it lands on main.

Approving and will re-review the full work on PR to main.
Please make sure the final PR doesn't get too big otherwise the punishment is to buy ice cream for the team 🍨 😜

@amir-deris amir-deris changed the base branch from amir/plt-296-observability-revamp-long-running-branch to main May 1, 2026 17:16
@bdchatham (Contributor) left a comment:

Fresh expert pass on top of my approval — five callouts worth considering. (1) is a real bug from the named-return wrapping pattern; (2) is a deploy-coordination concern that contradicts the "existing dashboards continue to work" framing; (3)–(5) are good-to-fix.

Comment thread evmrpc/utils.go Outdated
- func recordMetricsWithError(apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
+ func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
// Automatically detect success/failure based on panic state
panicValue := recover()
Contributor:

This recover() no longer catches anything.

The pre-PR pattern defer recordMetricsWithError(...) made recordMetricsWithError itself the deferred function, so recover() worked. After this PR, every caller wraps as defer func() { recordMetricsWithError(...) }() (e.g. association.go:60, info.go:88, send.go:73) to capture the named return. Per the Go spec: "recover returns nil if recover was not called directly by a deferred function." recordMetricsWithError is now invoked by the closure, not deferred itself — so panicValue is always nil.

Consequences:

  • errorClassPanic is dead — that label value is now unreachable.
  • success evaluates true on a panicked handler that returned a zero-value err.
  • The closing if panicValue != nil { panic(panicValue) } re-raise never fires (panics still propagate up to whichever recover does run, but this site's metric labeling is wrong).

Fix: have the closure call recover() and pass the result in:

defer func() {
    recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr, recover())
}()

…and accept panicValue any here instead of calling recover() internally.
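
A sketch of the receiving side under that fix (details may differ from the final evmrpc/utils.go; the success predicate and re-raise mirror the surrounding discussion):

func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType,
    startTime time.Time, err error, panicValue any) {
    success := panicValue == nil && err == nil
    recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
    if panicValue != nil {
        // Re-raise so panic propagation further up the stack is unchanged.
        panic(panicValue)
    }
}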

Comment thread evmrpc/utils.go
// Automatically detect success/failure based on panic state
panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
Contributor:

Correctness fix — but this silently changes the meaning of the legacy dual-emitted sei_rpc_request_counter{success="true"} series too.

The old || was buggy: non-panicking errors were marked success=true. After this lands, those flip to success=false on the legacy series. Existing alerts / SLO queries shaped like rate(sei_rpc_request_counter{success="true"}[5m]) / rate(sei_rpc_request_counter[5m]) will step down at deploy.

The PR description says "existing dashboards continue to work" — true at the metric-name level, but not at the semantic level. Suggest:

  • Call this out explicitly in the rollout note.
  • Sweep sei-protocol/sei-infra/monitoring/ and the alerting config for queries filtering on success and adjust in the same release window.

@amir-deris (Contributor Author) commented May 5, 2026:

Thanks for the feedback. That is a good point; it seems we don't have any alerts based on the success attribute. We only have two existing dashboards which filter for success=false and could show a spike after this PR's changes roll out, due to how we now correctly assign error status:

File | Query | Impact
clusters/prod/monitoring/grafana-dashboards-protocol.yaml (line 19189) | sei_rpc_request_counter{success="false"} | Will spike up at deploy
clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json (line 4251) | sei_rpc_request_counter{success="false"} | Will spike up at deploy

Updated PR description accordingly.

Comment thread evmrpc/utils.go Outdated
metrics.IncrementRpcRequestCounter(apiMethod, string(connectionType), success)
metrics.MeasureRpcRequestLatency(apiMethod, string(connectionType), startTime)
recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
// TODO: remove legacy dual-emit once dashboards are migrated to evmrpc_* OTEL metrics. Use metrics.requestLatencySeconds histogram instead.
Contributor:

Bare TODOs for "remove later" reliably rot. Suggest:

  1. File a tracking issue with explicit cleanup criteria — dashboards / alerts / recording rules migrated to evmrpc_*, ≥2 weeks of new-metric retention, then drop the dual emission.
  2. Replace the TODO bodies here, in evmrpc/websockets.go, and on the three legacy helpers in utils/metrics/metrics_util.go with // TODO(PLT-XXX): ... linking the issue.

Otherwise the dual emission + paired sei_* helpers will outlive everyone's memory of why they're here.

Contributor Author:

Done. Created this ticket: https://linear.app/seilabs/issue/PLT-326
Also updated the TODO comments with the ticket number.

Comment thread evmrpc/metrics.go
"evmrpc_request_latency_seconds",
metric.WithDescription("RPC request latency in seconds (labeled by success, bounded error_class, and jsonrpc_code when known)"),
metric.WithUnit("s"),
)),
Contributor:

Two histogram-shape concerns worth addressing before this scales out:

Cardinality. jsonrpc_code is an Int64 attribute; for error_class=jsonrpc_error it falls through to whatever coder.ErrorCode() returns. JSON-RPC 2.0 reserves -32000..-32099 for server-defined codes, plus go-ethereum surfaces additional ones. Worst-case bound (~30 endpoints × 2 connections × 7 error_class × ~30 codes seen × ~12 default buckets) is ~150k series per pod for this one histogram. Either:

  • Bucket the code on the histogram (e.g. "spec" / "server" / "other"; see the sketch after this comment), or
  • Drop jsonrpc_code from the histogram and put it on a low-cost counter only.

Also, success and error_class are partially redundant — drop error_class (or set to empty) on the success path so the high-volume happy path doesn't carry both.

Bucket boundaries. Default OTel buckets start at 5ms; cached eth_chainId / net_version collapse to bucket 0, and debug_traceBlockByNumber runs past the 10s ceiling. Consider explicit boundaries:

metric.WithExplicitBucketBoundaries(
    0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
    0.5, 1, 2.5, 5, 10, 30,
)
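
A sketch of the bucketing option, with the range split taken from the JSON-RPC 2.0 spec (the label values and function name are illustrative):

func bucketJSONRPCCode(code int) string {
    switch {
    case code >= -32099 && code <= -32000:
        return "server" // server-defined errors
    case code >= -32768 && code <= -32100:
        return "spec" // spec-reserved codes (parse error, invalid request, method not found, ...)
    default:
        return "other"
    }
}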

Contributor Author:

Noted. Fixed the histogram bucket boundaries and also added specific buckets for the error codes.

Comment thread evmrpc/metrics.go
metric.WithUnit("s"),
)),
wsConnectionCount: must(rpcTelemetryMeter.Int64Counter(
"evmrpc_websocket_connects_total",
Contributor:

_total is auto-appended on counters by the OTel Prometheus exporter (default behavior in go.opentelemetry.io/otel/exporters/prometheus). Naming the instrument evmrpc_websocket_connects_total likely emits sei_chain_evmrpc_websocket_connects_total_total after the namespace prefix and counter-suffix pass.

Worth verifying on a running pod's /metrics before the dashboard PR ships. Fix is to rename the instrument to evmrpc_websocket_connects and let the exporter add the suffix.

(The dashboard PR doesn't query this counter directly so it should be unaffected, but anyone writing future alerts against it would hit the double suffix.)

Contributor Author:

I verified this with the local cluster: make docker-cluster-start, then made a websocket request:
printf '%s\n' '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' | websocat -n1 ws://127.0.0.1:8546/
Then checking the metrics in the container:
curl -s http://localhost:26660/metrics | grep evmrpc_websocket_connects_total
It shows:
sei_chain_evmrpc_websocket_connects_total{otel_scope_name="evmrpc",otel_scope_schema_url="",otel_scope_version=""} 5
so it seems the _total is not doubled.
I also reviewed the otel package go.opentelemetry.io/otel/exporters/prometheus with Claude and it seems there is some logic in there to not duplicate the _total:
The naming is handled entirely by otlptranslator — your instrument name evmrpc_websocket_connects_total gets tokenized, "total" is stripped and re-appended at the end, so you always get exactly one _total. No code change needed on your side.

@amir-deris amir-deris added this pull request to the merge queue May 5, 2026
Merged via the queue into main with commit acef70c May 5, 2026
38 checks passed
@amir-deris amir-deris deleted the amir/plt-264-migrate-evmrpc-to-standardized-otel branch May 5, 2026 19:32