add OpenTelemetry metrics instrumentation#3110
add OpenTelemetry metrics instrumentation#3110PavelPashov wants to merge 46 commits intoredis:masterfrom
Conversation
a11117e to
5fb1791
Compare
1be4565 to
03db1a2
Compare
There was a problem hiding this comment.
❌ The following Jit checks failed to run:
- license-compliance-checker
- secret-detection-trufflehog
- software-component-analysis-js
- static-code-analysis-semgrep-pro
#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.
More info in the Jit platform.
1106f8a to
242e98d
Compare
|
This is an exciting feature, I hope it makes it to main! |
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
364ff86 to
30b0e8c
Compare
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
1 similar comment
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
0e91b34 to
e2cc888
Compare
…sh and receive paths
af9f0fb to
d0f42d3
Compare
| ); | ||
| // Estimate bytes saved by avoiding network round-trip | ||
| // Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers) | ||
| const bytesEstimate = JSON.stringify(cacheEntry.value).length; |
There was a problem hiding this comment.
What do you think about wrapping it with Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8');? If data to be stored isn't just ASCII it makes sense to measure UTF-8 bytes count
There was a problem hiding this comment.
That's a good catch!
| } from "./types"; | ||
| import { noopFunction } from "./utils"; | ||
|
|
||
| export class NoopCommandMetrics implements IOTelCommandMetrics { |
There was a problem hiding this comment.
Why do you need this object? Just to ensure that no commands will be actually send to a collector?
There was a problem hiding this comment.
Because we have multiple metric groups and don’t know what the user will opt into, I kept each group in its own class and defaulted them to noop. When the user initializes metrics, we swap in the real implementations based on config, so all checks happen once during OpenTelemetry.init(...). This keeps runtime code simple (no repeated conditional checks) and ensures disabled metrics have near-zero overhead
There was a problem hiding this comment.
So this is basically Null object pattern where in case if Otel is disabled empty methods are called?
| } catch (error) { | ||
| recordBatch(error as Error); | ||
| throw error; | ||
| } |
There was a problem hiding this comment.
Double metric recording for WatchError in _executeMulti
Medium Severity
When execResult is null (a WatchError), recordBatch(error) is called on line 1488, then the error is thrown and caught by the catch block on line 1498, which calls recordBatch(error) again on line 1499. This causes the db.client.operation.duration histogram to double-count the MULTI operation for every WatchError, inflating the metric.
| ); | ||
| // Estimate bytes saved by avoiding network round-trip | ||
| // Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers) | ||
| const bytesEstimate = JSON.stringify(cacheEntry.value).length; |
There was a problem hiding this comment.
Byte estimate uses character count not byte length
Low Severity
JSON.stringify(cacheEntry.value).length returns the number of UTF-16 characters, not the number of bytes. The redis.client.csc.network_saved metric uses unit "By" (bytes), so this will undercount for non-ASCII data. Using Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8') would give the correct UTF-8 byte count.
| true, | ||
| client._clientId, | ||
| i | ||
| ); |
There was a problem hiding this comment.
Cluster redirections double-count errors in resiliency metrics
Medium Severity
Every ASK/MOVED error in cluster mode is recorded twice in redis.client.errors. The underlying RedisClient.sendCommand .catch() handler records the error with internal: false, and then the cluster's _execute method catches the same error and records it again with internal: true. This inflates the error counter for every cluster redirection.
Additional Locations (1)
There was a problem hiding this comment.
@nkaradzhov I think it's worth discussing this, because there seem to be trade offs to the possible solutions
| readonly #reconnectStrategy; | ||
| readonly #socketFactory; | ||
| readonly #socketTimeout; | ||
| readonly #clientId: string; |
There was a problem hiding this comment.
Socket retains stale clientId after maintenance socket swap
Medium Severity
After a maintenance socket swap, the main client's new socket still carries the temporary client's #clientId (which is readonly). Since the temporary client is destroyed and unregistered, subsequent metric recordings from the socket (connection closed, relaxed timeout, connection create time) will use an orphaned clientId that no longer resolves in ClientRegistry, causing all socket-level metrics to lose host/port/db attribution after any maintenance handoff.
Additional Locations (1)
| TRANSFORM_LEGACY_REPLY?: boolean; | ||
| transformReply: TransformReply | Record<RespVersions, TransformReply>; | ||
| unstableResp3?: boolean; | ||
| onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void; |
There was a problem hiding this comment.
@nkaradzhov we might want to rename onSuccess to something more metrics/otel related
| if (command.onSuccess) { | ||
| command.onSuccess(parser.redisArgs, finalReply, this._self._clientId); |
There was a problem hiding this comment.
@nkaradzhov there might be a better way to record specific command related metrics that require the server response
| // Build the appropriate function based on options | ||
| if (options.hasIncludeCommands || options.hasExcludeCommands) { | ||
| // Version with filtering | ||
| this.createRecordOperationDuration = this.#createWithFiltering.bind(this); | ||
| } else { | ||
| this.createRecordOperationDuration = | ||
| this.#createWithoutFiltering.bind(this); | ||
| } | ||
| } |


Description
TODO
Checklist
npm testpass with this change (including linting)?Note
Medium Risk
Touches core request/response paths (
sendCommand, socket lifecycle, pubsub/streams/CSC) to emit metrics, so regressions could affect performance or error handling even though behavior is intended to be observational and gated behindOpenTelemetry.init(). Dependency/peer-dependency changes may impact consumers that rely on strict lockfiles or bundling expectations.Overview
Adds first-class, opt-in OpenTelemetry metrics support via a new
OpenTelemetry.init()API (exported from@redis/client), plus new docs (docs/otel-metrics.md) and a runnable example (examples/otel-metrics.js).Instrumentation is wired into core client flows to record command and batch durations, connection lifecycle/closures, maintenance notifications/handoffs, pubsub publish/receive, stream lag (
XREAD/XREADGROUP), and client-side-caching hits/misses/evictions/bytes-saved. This introduces a per-client identity model +ClientRegistryused by observable gauges, updates queue/cache/socket internals to expose needed signals (egpendingCount, cachesize()), and adds comprehensive unit/e2e tests for the metrics system.Also updates dependencies to include optional
@opentelemetry/api(peer) and adds OTel SDK packages for development/testing.Written by Cursor Bugbot for commit 08c7413. This will update automatically on new commits. Configure here.