feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API #3265

amir-deris merged 19 commits into main
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #3265    +/-  ##
==========================================
- Coverage   59.11%   59.10%   -0.02%
==========================================
  Files        2100     2100
  Lines      173066   173085      +19
==========================================
- Hits       102312   102298      -14
- Misses      61869    61924      +55
+ Partials     8885     8863      -22
```

Flags with carried forward coverage won't be shown.
```diff
  // Automatically detect success/failure based on panic state
  panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
```
The previous `success` value would become true whenever there was no panic (due to OR short-circuiting), even when `err` was not nil. Updated here to account for both the panic and the error.
masih left a comment

Yay, glad to see OTEL integration picking up 🚀
Blockers:

- Standardise on `seconds` to measure time/duration/latency
- Remove redundant counters

Left a bunch of naming suggestions and a few questions.

Thanks @amir-deris 🙌
```go
// configured by the application.

var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
```
- There would be a single meter for this entire package. I would simply name this `meter`.
- I recommend removing the repetitive `_rpc` suffix.
Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go, and its meter definition clashes with this one if we rename this meter:

```go
meter = otel.Meter("evmrpc_workerpool")
```

Should we combine all meters into one file (metrics.go)?
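For illustration, a minimal sketch of the combined-file idea, assuming a single unsuffixed meter name as suggested above (this is a sketch of the suggestion, not the merged code):

```go
// evmrpc/metrics.go (sketch): one package-level meter shared by every
// instrument in the package, initialised once at startup.
package evmrpc

import "go.opentelemetry.io/otel"

// otel.Meter resolves against the global MeterProvider lazily, so it picks
// up whatever provider the application configures.
var meter = otel.Meter("evmrpc")
```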
```go
var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
```
Similarly, I would call this `metrics`.
```go
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
		requests metric.Int64Counter
```
- General note on naming: this reads too abstract. I would add a `Count` at the end of it.
- The counter itself is redundant if we always measure latency in a histogram. See my other comment.
I will remove this counter, as discussed in the histogram comment.
```go
	rpcTelemetryMetrics = struct {
		requests         metric.Int64Counter
		requestLatencyMs metric.Float64Histogram
```
Always standardise on seconds for latency/duration measurements. `Duration.Seconds()` conveniently returns a `float64` without loss of precision.
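As a concrete illustration of that convention, a minimal sketch assuming a `requestLatencySeconds` histogram created with `metric.WithUnit("s")` (the variable name is hypothetical):

```go
// Record handler latency in seconds rather than milliseconds.
start := time.Now()
defer func() {
	requestLatencySeconds.Record(ctx, time.Since(start).Seconds())
}()
```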
```go
	rpcTelemetryMetrics = struct {
		requests          metric.Int64Counter
		requestLatencyMs  metric.Float64Histogram
		websocketConnects metric.Int64Counter
```
Similarly I would add a `Count` at the end, and perhaps pick a simpler name: `wsConnectionCount` or similar.
```go
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
	rpcTelemetryMetrics.requests.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
```
Define `attrKey` vars for the repeated tags?
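A sketch of that suggestion using `attribute.Key` from `go.opentelemetry.io/otel/attribute`; the var names are illustrative:

```go
// Package-level attribute keys, defined once and reused at every call site.
var (
	attrEndpoint   = attribute.Key("endpoint")
	attrConnection = attribute.Key("connection")
	attrSuccess    = attribute.Key("success")
)

// At the call site:
rpcTelemetryMetrics.requests.Add(ctx, 1, metric.WithAttributes(
	attrEndpoint.String(endpoint),
	attrConnection.String(connection),
	attrSuccess.Bool(success),
))
```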
```go
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
			attribute.String("connection", connection),
			attribute.Bool("success", success),
```
A better approach is to take the error and tag by error type for a fixed set of well-known errors.
Also, do we measure the JSON-RPC response code somewhere? That would be super useful.
I added a new method `classifyRPCMetricError` to record the error type and response code along with latency.
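The merged classifier isn't shown in this thread; a hypothetical sketch of its shape, assuming go-ethereum's `rpc.Error` interface supplies the JSON-RPC code (as `coder.ErrorCode()` is referenced later in this review):

```go
// classifyRPCMetricError maps an outcome onto a bounded error_class label
// and, when the error carries one, a JSON-RPC error code.
func classifyRPCMetricError(err error, panicked bool) (errorClass string, jsonrpcCode int64) {
	switch {
	case panicked:
		return "panic", 0
	case err == nil:
		return "", 0
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout", 0
	case errors.Is(err, context.Canceled):
		return "canceled", 0
	}
	var coder rpc.Error // go-ethereum's JSON-RPC error interface
	if errors.As(err, &coder) {
		return "jsonrpc_error", int64(coder.ErrorCode())
	}
	return "other", 0
}
```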
```diff
 func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
```
Why are we keeping the `recordMetricsWithError` calls at all?
Is the idea to not break the existing metrics, roll out both, then remove the old stuff?
I thought we needed to keep these metrics for backward compatibility with any dashboards we might have. Should I remove them?
```diff
@@ -0,0 +1,65 @@
+package evmrpc
```
Conventionally I would simply call this file metrics.go. Each package would have one as needed. That Go file would initialise once on startup and that's it: all package-level vars.
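The instrument constructors shown later in this review are wrapped in a `must(...)` helper whose definition doesn't appear in the thread; a plausible shape for such a startup-time initialiser would be:

```go
// must unwraps an (instrument, error) pair at package initialisation,
// panicking if creation fails; acceptable here because a failure means
// the process is misconfigured and should not start.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}
```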
masih left a comment
This is a stacked PR.
I advise against stacked PRs in general; from a dev perspective the work is done once it lands on main.
Approving, and will re-review the full work on the PR to main.
Please make sure the final PR doesn't get too big, otherwise the punishment is to buy ice cream for the team 🍨 😜
bdchatham left a comment
Fresh expert pass on top of my approval — five callouts worth considering. (1) is a real bug from the named-return wrapping pattern; (2) is a deploy-coordination concern that contradicts the "existing dashboards continue to work" framing; (3)–(5) are good-to-fix.
```diff
-func recordMetricsWithError(apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
+func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
 	// Automatically detect success/failure based on panic state
 	panicValue := recover()
```
This `recover()` no longer catches anything.

The pre-PR pattern `defer recordMetricsWithError(...)` made `recordMetricsWithError` itself the deferred function, so `recover()` worked. After this PR, every caller wraps as `defer func() { recordMetricsWithError(...) }()` (e.g. association.go:60, info.go:88, send.go:73) to capture the named return. Per the Go spec: "recover returns nil if recover was not called directly by a deferred function." `recordMetricsWithError` is now invoked by the closure, not deferred itself — so `panicValue` is always nil.

Consequences:

- `errorClassPanic` is dead — that label value is now unreachable.
- `success` evaluates `true` on a panicked handler that returned a zero-value `err`.
- The closing `if panicValue != nil { panic(panicValue) }` re-raise never fires (panics still propagate up to whichever recover does run, but this site's metric labeling is wrong).
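A minimal standalone program demonstrating the spec behaviour described above:

```go
package main

import "fmt"

func report() {
	// Returns nil: report is called BY the deferred closure,
	// it is not itself the deferred function.
	fmt.Println("recover inside helper:", recover())
}

func main() {
	defer func() {
		report()                                      // recover inside helper: <nil>
		fmt.Println("recover in closure:", recover()) // recover in closure: boom
	}()
	panic("boom")
}
```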
Fix: have the closure call `recover()` and pass the result in:

```go
defer func() {
	recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr, recover())
}()
```

…and accept `panicValue any` here instead of calling `recover()` internally.
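For concreteness, a hypothetical shape of the helper under that fix; the parameter order and body are assumptions, not the merged code:

```go
// panicValue is recovered by the deferred closure at the call site and
// passed in, because recover() only observes a panic when called directly
// by the deferred function itself.
func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType,
	startTime time.Time, err error, panicValue any) {
	success := panicValue == nil && err == nil
	recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
	if panicValue != nil {
		panic(panicValue) // re-raise so the panic still propagates
	}
}
```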
```diff
  // Automatically detect success/failure based on panic state
  panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
```
Correctness fix — but this silently changes the meaning of the legacy dual-emitted `sei_rpc_request_counter{success="true"}` series too.

The old `||` was buggy: non-panicking errors were marked `success=true`. After this lands, those flip to `success=false` on the legacy series. Existing alerts / SLO queries shaped like `rate(sei_rpc_request_counter{success="true"}[5m]) / rate(sei_rpc_request_counter[5m])` will step down at deploy.

The PR description says "existing dashboards continue to work" — true at the metric-name level, but not at the semantic level. Suggest:

- Call this out explicitly in the rollout note.
- Sweep `sei-protocol/sei-infra/monitoring/` and the alerting config for queries filtering on `success`, and adjust in the same release window.
Thanks for the feedback. That is a good point. It seems we don't have any alerts based on the success attribute. We only have two existing dashboards that filter for success=false; they can potentially show a spike after the rollout of this PR's changes, due to how we now correctly assign error status:

| File | Query | Impact |
|---|---|---|
| `clusters/prod/monitoring/grafana-dashboards-protocol.yaml` (line 19189) | `sei_rpc_request_counter{success="false"}` | Will spike up at deploy |
| `clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json` (line 4251) | `sei_rpc_request_counter{success="false"}` | Will spike up at deploy |

Updated the PR description accordingly.
```go
metrics.IncrementRpcRequestCounter(apiMethod, string(connectionType), success)
metrics.MeasureRpcRequestLatency(apiMethod, string(connectionType), startTime)
recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
// TODO: remove legacy dual-emit once dashboards are migrated to evmrpc_* OTEL metrics. Use metrics.requestLatencySeconds histogram instead.
```
Bare TODOs for "remove later" reliably rot. Suggest:

- File a tracking issue with explicit cleanup criteria — dashboards / alerts / recording rules migrated to `evmrpc_*`, ≥2 weeks of new-metric retention, then drop the dual emission.
- Replace the TODO bodies here, in `evmrpc/websockets.go`, and on the three legacy helpers in `utils/metrics/metrics_util.go` with `// TODO(PLT-XXX): ...` linking the issue.

Otherwise the dual emission + paired `sei_*` helpers will outlive everyone's memory of why they're here.
Done. Created this ticket: https://linear.app/seilabs/issue/PLT-326

Also updated the TODO comments with the ticket number.
| "evmrpc_request_latency_seconds", | ||
| metric.WithDescription("RPC request latency in seconds (labeled by success, bounded error_class, and jsonrpc_code when known)"), | ||
| metric.WithUnit("s"), | ||
| )), |
Two histogram-shape concerns worth addressing before this scales out:

Cardinality. `jsonrpc_code` is an Int64 attribute; for `error_class=jsonrpc_error` it falls through to whatever `coder.ErrorCode()` returns. JSON-RPC 2.0 reserves -32000..-32099 for server-defined codes, plus go-ethereum surfaces additional ones. Worst-case bound (~30 endpoints × 2 connections × 7 error_class × ~30 codes seen × ~12 default buckets) is ~150k series per pod for this one histogram. Either:

- Bucket the code on the histogram (e.g. `"spec"` / `"server"` / `"other"`), or
- Drop `jsonrpc_code` from the histogram and put it on a low-cost counter only.

Also, `success` and `error_class` are partially redundant — drop `error_class` (or set it to empty) on the success path so the high-volume happy path doesn't carry both.

Bucket boundaries. Default OTel buckets start at 5ms; cached `eth_chainId` / `net_version` collapse to bucket 0, and `debug_traceBlockByNumber` runs past the 10s ceiling. Consider explicit boundaries:

```go
metric.WithExplicitBucketBoundaries(
	0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
	0.5, 1, 2.5, 5, 10, 30,
)
```
Noted. Fixed the histogram bucket boundaries and also added specific buckets for the error codes.
| metric.WithUnit("s"), | ||
| )), | ||
| wsConnectionCount: must(rpcTelemetryMeter.Int64Counter( | ||
| "evmrpc_websocket_connects_total", |
`_total` is auto-appended on counters by the OTel Prometheus exporter (default behavior in `go.opentelemetry.io/otel/exporters/prometheus`). Naming the instrument `evmrpc_websocket_connects_total` likely emits `sei_chain_evmrpc_websocket_connects_total_total` after the namespace prefix and counter-suffix pass.

Worth verifying on a running pod's `/metrics` before the dashboard PR ships. The fix is to rename the instrument to `evmrpc_websocket_connects` and let the exporter add the suffix.

(The dashboard PR doesn't query this counter directly so it should be unaffected, but anyone writing future alerts against it would hit the double suffix.)
I verified this with the local cluster: `make docker-cluster-start`, then making a websocket request:

```sh
printf '%s\n' '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' | websocat -n1 ws://127.0.0.1:8546/
```

Then checking the metrics in the container:

```sh
curl -s http://localhost:26660/metrics | grep evmrpc_websocket_connects_total
```

It shows:

```
sei_chain_evmrpc_websocket_connects_total{otel_scope_name="evmrpc",otel_scope_schema_url="",otel_scope_version=""} 5
```

so it seems the `_total` is not doubled.

I also reviewed the otel package `go.opentelemetry.io/otel/exporters/prometheus` with Claude, and it seems there is some logic in there to not duplicate the `_total`:

> The naming is handled entirely by otlptranslator — your instrument name `evmrpc_websocket_connects_total` gets tokenized, "total" is stripped and re-appended at the end, so you always get exactly one `_total`. No code change needed on your side.
Summary

Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").

- Replaces the `utils/metrics` calls in `evmrpc` (`filter.go`, `utils.go`, `websockets.go`) with a new `metrics.go` layer using `otel.Meter("evmrpc")` (the global MeterProvider convenience wrapper)
- Keeps `IncrementRpcRequestCounter`, `MeasureRpcRequestLatency`, and `IncWebsocketConnects` in `utils/metrics` but annotates them `TODO(PLT-326)` for removal once dashboards are migrated to the new `evmrpc_*` OTel series
- Removes the `num_blocks_fetched` batch counters from internal `processBatch` call sites — these were unused and not wired to any dashboard or alert
- Threads `ctx` through all RPC handler signatures (14+ files: `association.go`, `block.go`, `info.go`, `subscribe.go`, `tracers.go`, `tx.go`, `txpool.go`, and others) to support OTel's context-based propagation and adopt the new defer-with-`recover()` pattern
- Adds `success`, `error_class`, and `jsonrpc_code` labels to the latency histogram so failed requests and error types can be tracked with low cardinality
- Adds a `ctx context.Context` parameter to `NewPendingTransactionFilter` (previously missing)
New metric names (OTel naming convention, exported via the process-wide MeterProvider):

- `evmrpc_request_latency_seconds` — request latency histogram (labels: `endpoint`, `connection`, `success`, `error_class`, `jsonrpc_code`)
- `evmrpc_websocket_connects_total` — websocket connection count

Note

Metric name migration

Legacy metrics (`sei_rpc_request_counter` / `sei_rpc_request_latency_ms` / `sei_websocket_connect`) are still dual-emitted during the migration window so dashboards continue to receive data. They carry TODO markers and will be removed once dashboards are migrated to the new `evmrpc_*` OTel series.

`success` label semantic correction

The old `recordMetricsWithError` had a logic bug — `success := panicValue == nil || err != nil` — which caused requests that returned an error without panicking to be recorded as `success=true`. This PR fixes that to `success := panicValue == nil && err == nil`. The correction applies to both the new OTel metric and the dual-emitted legacy `sei_rpc_request_counter` series.

Deploy-time impact: A search of the platform repo found no Prometheus alert rules referencing these metrics, so no alerts will fire. Two Grafana dashboard panels filter on `success="false"` and will show a spike at deploy time:

- `clusters/prod/monitoring/grafana-dashboards-protocol.yaml` — EVM RPC failed requests panel
- `clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json` — EVM RPC failed requests panel

This spike is a visual artifact of the correction (previously miscounted errors now classified correctly), not a real increase in failures. No action is required, but heads-up for anyone watching these panels during the rollout.