feat(evmrpc): migrate RPC telemetry to OpenTelemetry Meter API #3265

amir-deris merged 19 commits into main
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main    #3265    +/-  ##
==========================================
- Coverage   59.11%   59.10%   -0.02%
==========================================
  Files        2100     2100
  Lines      173066   173085      +19
==========================================
- Hits       102312   102298      -14
- Misses      61869    61924      +55
+ Partials     8885     8863      -22
```

Flags with carried forward coverage won't be shown.
```diff
  // Automatically detect success/failure based on panic state
  panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
```
The previous `success` value would become true whenever there was no panic (due to OR short-circuiting), even when `err` was not nil. Updated here to account for both the panic and the error.
masih left a comment

Yay, glad to see OTEL integration picking up 🚀
Blockers:

- Standardise on `seconds` to measure time/duration/latency
- Remove redundant counters

Left a bunch of naming suggestions and a few questions.

Thanks @amir-deris 🙌
```go
// configured by the application.

var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")
```
- There would be a single meter for this entire package. I would simply name this `meter`.
- I recommend removing the repetitive `_rpc` suffix.
Actually I just noticed there is another meter definition in evmrpc/worker_pool_metrics.go, and its meter definition clashes with this one if we rename this meter:

```go
meter = otel.Meter("evmrpc_workerpool")
```

Should we combine all meters into one file (metrics.go)?
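For illustration, a minimal sketch of the combined-file idea, assuming a single unsuffixed meter name as suggested above (this is a sketch of the suggestion, not the merged code):

```go
// evmrpc/metrics.go (sketch): one package-level meter shared by every
// instrument in the package, initialised once at startup.
package evmrpc

import "go.opentelemetry.io/otel"

// otel.Meter resolves against the global MeterProvider lazily, so it picks
// up whatever provider the application configures.
var meter = otel.Meter("evmrpc")
```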
```go
var (
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
```
Similarly, I would call this `metrics`.
```go
	rpcTelemetryMeter = otel.Meter("evmrpc_rpc")

	rpcTelemetryMetrics = struct {
		requests metric.Int64Counter
```
- General note on naming: this reads too abstract. I would add a `Count` at the end of it.
- The counter itself is redundant if we always measure latency in a histogram. See my other comment.
I will remove this counter, as discussed in the histogram comment.
```go
	rpcTelemetryMetrics = struct {
		requests         metric.Int64Counter
		requestLatencyMs metric.Float64Histogram
```
Always standardise on seconds for latency/duration measurements. `Duration.Seconds()` conveniently returns a `float64` without loss of precision.
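As a concrete illustration of that convention, a minimal sketch assuming a `requestLatencySeconds` histogram created with `metric.WithUnit("s")` (the variable name is hypothetical):

```go
// Record handler latency in seconds rather than milliseconds.
start := time.Now()
defer func() {
	requestLatencySeconds.Record(ctx, time.Since(start).Seconds())
}()
```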
```go
	rpcTelemetryMetrics = struct {
		requests          metric.Int64Counter
		requestLatencyMs  metric.Float64Histogram
		websocketConnects metric.Int64Counter
```
Similarly I would add a `Count` at the end, and perhaps pick a simpler name: `wsConnectionCount` or similar.
```go
func recordRPCRequest(ctx context.Context, endpoint, connection string, success bool) {
	rpcTelemetryMetrics.requests.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
```
Define `attrKey` vars for the repeated tags?
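A sketch of that suggestion using `attribute.Key` from `go.opentelemetry.io/otel/attribute`; the var names are illustrative:

```go
// Package-level attribute keys, defined once and reused at every call site.
var (
	attrEndpoint   = attribute.Key("endpoint")
	attrConnection = attribute.Key("connection")
	attrSuccess    = attribute.Key("success")
)

// At the call site:
rpcTelemetryMetrics.requests.Add(ctx, 1, metric.WithAttributes(
	attrEndpoint.String(endpoint),
	attrConnection.String(connection),
	attrSuccess.Bool(success),
))
```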
```go
		metric.WithAttributes(
			attribute.String("endpoint", endpoint),
			attribute.String("connection", connection),
			attribute.Bool("success", success),
```
A better approach is to take the error and tag by error type for a fixed set of well-known errors.
Also, do we measure the JSON-RPC response code somewhere? That would be super useful.
I added a new method `classifyRPCMetricError` to record the error type and response code along with latency.
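The merged classifier isn't shown in this thread; a hypothetical sketch of its shape, assuming go-ethereum's `rpc.Error` interface supplies the JSON-RPC code (as `coder.ErrorCode()` is referenced later in this review):

```go
// classifyRPCMetricError maps an outcome onto a bounded error_class label
// and, when the error carries one, a JSON-RPC error code.
func classifyRPCMetricError(err error, panicked bool) (errorClass string, jsonrpcCode int64) {
	switch {
	case panicked:
		return "panic", 0
	case err == nil:
		return "", 0
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout", 0
	case errors.Is(err, context.Canceled):
		return "canceled", 0
	}
	var coder rpc.Error // go-ethereum's JSON-RPC error interface
	if errors.As(err, &coder) {
		return "jsonrpc_error", int64(coder.ErrorCode())
	}
	return "other", 0
}
```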
```diff
 func (t *AssociationAPI) Associate(ctx context.Context, req *AssociateRequest) (returnErr error) {
 	startTime := time.Now()
-	defer recordMetricsWithError("sei_associate", t.connectionType, startTime, returnErr)
+	defer recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr)
```
Why are we keeping the `recordMetricsWithError` calls at all?
Is the idea to not break the existing metrics, roll out both, then remove the old stuff?
I thought we needed to keep these metrics for backward compatibility with any dashboards we might have. Should I remove them?
```diff
@@ -0,0 +1,65 @@
+package evmrpc
```
Conventionally I would simply call this file metrics.go. Each package would have one as needed. That Go file would initialise once on startup and that's it: all package-level vars.
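The instrument constructors shown later in this review are wrapped in a `must(...)` helper whose definition doesn't appear in the thread; a plausible shape for such a startup-time initialiser would be:

```go
// must unwraps an (instrument, error) pair at package initialisation,
// panicking if creation fails; acceptable here because a failure means
// the process is misconfigured and should not start.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}
```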
masih left a comment
This is a stacked PR.
I advise against stacked PRs in general; from a dev perspective the work is done once it lands on main.
Approving, and will re-review the full work on the PR to main.
Please make sure the final PR doesn't get too big, otherwise the punishment is to buy ice cream for the team 🍨 😜
bdchatham left a comment
Fresh expert pass on top of my approval — five callouts worth considering. (1) is a real bug from the named-return wrapping pattern; (2) is a deploy-coordination concern that contradicts the "existing dashboards continue to work" framing; (3)–(5) are good-to-fix.
```diff
-func recordMetricsWithError(apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
+func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType, startTime time.Time, err error) {
 	// Automatically detect success/failure based on panic state
 	panicValue := recover()
```
This `recover()` no longer catches anything.

The pre-PR pattern `defer recordMetricsWithError(...)` made `recordMetricsWithError` itself the deferred function, so `recover()` worked. After this PR, every caller wraps as `defer func() { recordMetricsWithError(...) }()` (e.g. association.go:60, info.go:88, send.go:73) to capture the named return. Per the Go spec: "recover returns nil if recover was not called directly by a deferred function." `recordMetricsWithError` is now invoked by the closure, not deferred itself — so `panicValue` is always nil.

Consequences:

- `errorClassPanic` is dead — that label value is now unreachable.
- `success` evaluates `true` on a panicked handler that returned a zero-value `err`.
- The closing `if panicValue != nil { panic(panicValue) }` re-raise never fires (panics still propagate up to whichever recover does run, but this site's metric labeling is wrong).
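A minimal standalone program demonstrating the spec behaviour described above:

```go
package main

import "fmt"

func report() {
	// Returns nil: report is called BY the deferred closure,
	// it is not itself the deferred function.
	fmt.Println("recover inside helper:", recover())
}

func main() {
	defer func() {
		report()                                      // recover inside helper: <nil>
		fmt.Println("recover in closure:", recover()) // recover in closure: boom
	}()
	panic("boom")
}
```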
Fix: have the closure call `recover()` and pass the result in:

```go
defer func() {
	recordMetricsWithError(ctx, "sei_associate", t.connectionType, startTime, returnErr, recover())
}()
```

…and accept `panicValue any` here instead of calling `recover()` internally.
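For concreteness, a hypothetical shape of the helper under that fix; the parameter order and body are assumptions, not the merged code:

```go
// panicValue is recovered by the deferred closure at the call site and
// passed in, because recover() only observes a panic when called directly
// by the deferred function itself.
func recordMetricsWithError(ctx context.Context, apiMethod string, connectionType ConnectionType,
	startTime time.Time, err error, panicValue any) {
	success := panicValue == nil && err == nil
	recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
	if panicValue != nil {
		panic(panicValue) // re-raise so the panic still propagates
	}
}
```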
```diff
  // Automatically detect success/failure based on panic state
  panicValue := recover()
- success := panicValue == nil || err != nil
+ success := panicValue == nil && err == nil
```
Correctness fix — but this silently changes the meaning of the legacy dual-emitted `sei_rpc_request_counter{success="true"}` series too.

The old `||` was buggy: non-panicking errors were marked `success=true`. After this lands, those flip to `success=false` on the legacy series. Existing alerts / SLO queries shaped like `rate(sei_rpc_request_counter{success="true"}[5m]) / rate(sei_rpc_request_counter[5m])` will step down at deploy.

The PR description says "existing dashboards continue to work" — true at the metric-name level, but not at the semantic level. Suggest:

- Call this out explicitly in the rollout note.
- Sweep `sei-protocol/sei-infra/monitoring/` and the alerting config for queries filtering on `success`, and adjust in the same release window.
Thanks for the feedback. That is a good point. It seems we don't have any alerts based on the success attribute. We only have two existing dashboards that filter for success=false; they can potentially show a spike after the rollout of this PR's changes, due to how we now correctly assign error status:

| File | Query | Impact |
|---|---|---|
| `clusters/prod/monitoring/grafana-dashboards-protocol.yaml` (line 19189) | `sei_rpc_request_counter{success="false"}` | Will spike up at deploy |
| `clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json` (line 4251) | `sei_rpc_request_counter{success="false"}` | Will spike up at deploy |

Updated the PR description accordingly.
```go
metrics.IncrementRpcRequestCounter(apiMethod, string(connectionType), success)
metrics.MeasureRpcRequestLatency(apiMethod, string(connectionType), startTime)
recordRPCLatency(ctx, apiMethod, string(connectionType), success, err, panicValue != nil, startTime)
// TODO: remove legacy dual-emit once dashboards are migrated to evmrpc_* OTEL metrics. Use metrics.requestLatencySeconds histogram instead.
```
Bare TODOs for "remove later" reliably rot. Suggest:

- File a tracking issue with explicit cleanup criteria — dashboards / alerts / recording rules migrated to `evmrpc_*`, ≥2 weeks of new-metric retention, then drop the dual emission.
- Replace the TODO bodies here, in `evmrpc/websockets.go`, and on the three legacy helpers in `utils/metrics/metrics_util.go` with `// TODO(PLT-XXX): ...` linking the issue.

Otherwise the dual emission + paired `sei_*` helpers will outlive everyone's memory of why they're here.
Done. Created this ticket: https://linear.app/seilabs/issue/PLT-326

Also updated the TODO comments with the ticket number.
| "evmrpc_request_latency_seconds", | ||
| metric.WithDescription("RPC request latency in seconds (labeled by success, bounded error_class, and jsonrpc_code when known)"), | ||
| metric.WithUnit("s"), | ||
| )), |
Two histogram-shape concerns worth addressing before this scales out:

Cardinality. `jsonrpc_code` is an Int64 attribute; for `error_class=jsonrpc_error` it falls through to whatever `coder.ErrorCode()` returns. JSON-RPC 2.0 reserves -32000..-32099 for server-defined codes, plus go-ethereum surfaces additional ones. Worst-case bound (~30 endpoints × 2 connections × 7 error_class × ~30 codes seen × ~12 default buckets) is ~150k series per pod for this one histogram. Either:

- Bucket the code on the histogram (e.g. `"spec"` / `"server"` / `"other"`), or
- Drop `jsonrpc_code` from the histogram and put it on a low-cost counter only.

Also, `success` and `error_class` are partially redundant — drop `error_class` (or set it to empty) on the success path so the high-volume happy path doesn't carry both.

Bucket boundaries. Default OTel buckets start at 5ms; cached `eth_chainId` / `net_version` collapse to bucket 0, and `debug_traceBlockByNumber` runs past the 10s ceiling. Consider explicit boundaries:

```go
metric.WithExplicitBucketBoundaries(
	0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25,
	0.5, 1, 2.5, 5, 10, 30,
)
```
Noted. Fixed the histogram bucket boundaries and also added specific buckets for the error codes.
| metric.WithUnit("s"), | ||
| )), | ||
| wsConnectionCount: must(rpcTelemetryMeter.Int64Counter( | ||
| "evmrpc_websocket_connects_total", |
`_total` is auto-appended on counters by the OTel Prometheus exporter (default behavior in `go.opentelemetry.io/otel/exporters/prometheus`). Naming the instrument `evmrpc_websocket_connects_total` likely emits `sei_chain_evmrpc_websocket_connects_total_total` after the namespace prefix and counter-suffix pass.

Worth verifying on a running pod's `/metrics` before the dashboard PR ships. The fix is to rename the instrument to `evmrpc_websocket_connects` and let the exporter add the suffix.

(The dashboard PR doesn't query this counter directly so it should be unaffected, but anyone writing future alerts against it would hit the double suffix.)
I verified this with the local cluster: `make docker-cluster-start`, then making a websocket request:

```sh
printf '%s\n' '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' | websocat -n1 ws://127.0.0.1:8546/
```

Then checking the metrics in the container:

```sh
curl -s http://localhost:26660/metrics | grep evmrpc_websocket_connects_total
```

It shows:

```
sei_chain_evmrpc_websocket_connects_total{otel_scope_name="evmrpc",otel_scope_schema_url="",otel_scope_version=""} 5
```

so it seems the `_total` is not doubled.

I also reviewed the otel package `go.opentelemetry.io/otel/exporters/prometheus` with Claude, and it seems there is some logic in there to not duplicate the `_total`:

> The naming is handled entirely by otlptranslator — your instrument name `evmrpc_websocket_connects_total` gets tokenized, "total" is stripped and re-appended at the end, so you always get exactly one `_total`. No code change needed on your side.
Summary

Follow-up to #3253, rebased to use the standardized OpenTelemetry Meter API per reviewer feedback (@masih: "standardise on OTEL and use runtime bindings to report to prometheus instead of direct metrics reporting via prometheus client").

- Replaces the `utils/metrics` calls in `evmrpc` (`filter.go`, `utils.go`, `websockets.go`) with a new `metrics.go` layer using `otel.Meter("evmrpc")` (the global MeterProvider convenience wrapper)
- Keeps `IncrementRpcRequestCounter`, `MeasureRpcRequestLatency`, and `IncWebsocketConnects` in `utils/metrics` but annotates them `TODO(PLT-326)` for removal once dashboards are migrated to the new `evmrpc_*` OTel series
- Removes the `num_blocks_fetched` batch counters from internal `processBatch` call sites — these were unused and not wired to any dashboard or alert
- Threads `ctx` through all RPC handler signatures (14+ files: `association.go`, `block.go`, `info.go`, `subscribe.go`, `tracers.go`, `tx.go`, `txpool.go`, and others) to support OTel's context-based propagation and adopt the new defer-with-`recover()` pattern
- Adds `success`, `error_class`, and `jsonrpc_code` labels to the latency histogram so failed requests and error types can be tracked with low cardinality
- Adds a `ctx context.Context` parameter to `NewPendingTransactionFilter` (previously missing)
New metric names (OTel naming convention, exported via the process-wide MeterProvider):

- `evmrpc_request_latency_seconds` — request latency histogram (labels: `endpoint`, `connection`, `success`, `error_class`, `jsonrpc_code`)
- `evmrpc_websocket_connects_total` — websocket connection count

Note

Metric name migration

Legacy metrics (`sei_rpc_request_counter` / `sei_rpc_request_latency_ms` / `sei_websocket_connect`) are still dual-emitted during the migration window so dashboards continue to receive data. They carry TODO markers and will be removed once dashboards are migrated to the new `evmrpc_*` OTel series.

`success` label semantic correction

The old `recordMetricsWithError` had a logic bug — `success := panicValue == nil || err != nil` — which caused requests that returned an error without panicking to be recorded as `success=true`. This PR fixes that to `success := panicValue == nil && err == nil`. The correction applies to both the new OTel metric and the dual-emitted legacy `sei_rpc_request_counter` series.

Deploy-time impact: A search of the platform repo found no Prometheus alert rules referencing these metrics, so no alerts will fire. Two Grafana dashboard panels filter on `success="false"` and will show a spike at deploy time:

- `clusters/prod/monitoring/grafana-dashboards-protocol.yaml` — EVM RPC failed requests panel
- `clusters/dev/monitoring/dashboards/protocol/sei_node_monitoring.json` — EVM RPC failed requests panel

This spike is a visual artifact of the correction (previously miscounted errors now classified correctly), not a real increase in failures. No action is required, but heads-up for anyone watching these panels during the rollout.