add OpenTelemetry metrics instrumentation by PavelPashov · Pull Request #3110 · redis/node-redis

PavelPashov · 2025-10-24T14:15:00Z

Description

TODO

Add docs
Add examples

Checklist

Does npm test pass with this change (including linting)?
Is the new or changed code fully tested?
Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

Note

Medium Risk
Touches core request/response paths (sendCommand, socket lifecycle, pubsub/streams/CSC) to emit metrics, so regressions could affect performance or error handling even though behavior is intended to be observational and gated behind OpenTelemetry.init(). Dependency/peer-dependency changes may impact consumers that rely on strict lockfiles or bundling expectations.

Overview
Adds first-class, opt-in OpenTelemetry metrics support via a new OpenTelemetry.init() API (exported from @redis/client), plus new docs (docs/otel-metrics.md) and a runnable example (examples/otel-metrics.js).

Instrumentation is wired into core client flows to record command and batch durations, connection lifecycle/closures, maintenance notifications/handoffs, pubsub publish/receive, stream lag (XREAD/XREADGROUP), and client-side-caching hits/misses/evictions/bytes-saved. This introduces a per-client identity model and a ClientRegistry used by observable gauges, updates queue/cache/socket internals to expose the needed signals (e.g. pendingCount, cache size()), and adds comprehensive unit/e2e tests for the metrics system.

Also updates dependencies to include optional @opentelemetry/api (peer) and adds OTel SDK packages for development/testing.

Written by Cursor Bugbot for commit 08c7413. This will update automatically on new commits.

jit-ci

❌ The following Jit checks failed to run:

license-compliance-checker
secret-detection-trufflehog
software-component-analysis-js
static-code-analysis-semgrep-pro

#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.

More info in the Jit platform.

elimelt · 2025-12-08T21:20:11Z

This is an exciting feature, I hope it makes it to main!

jit-ci · 2026-01-19T08:28:50Z

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository

💡 Need to bypass this check? Comment @sera bypass to override.

jit-ci · 2026-01-19T08:35:55Z

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository

💡 Need to bypass this check? Comment @sera bypass to override.

jit-ci · 2026-01-19T09:28:19Z

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository

💡 Need to bypass this check? Comment @sera bypass to override.

vladvildanov · 2026-02-25T14:28:16Z

packages/client/lib/client/cache.ts

+        );
+        // Estimate bytes saved by avoiding network round-trip
+        // Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers)
+        const bytesEstimate = JSON.stringify(cacheEntry.value).length;


What do you think about wrapping it as Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8')? If the stored data isn't pure ASCII, it makes more sense to measure the UTF-8 byte count.

That's a good catch!

vladvildanov · 2026-02-26T10:53:43Z

packages/client/lib/opentelemetry/noop-metrics.ts

+} from "./types";
+import { noopFunction } from "./utils";
+
+export class NoopCommandMetrics implements IOTelCommandMetrics {


Why do you need this object? Just to ensure that no commands will actually be sent to a collector?

Because we have multiple metric groups and don't know what the user will opt into, I kept each group in its own class and defaulted them to noop. When the user initializes metrics, we swap in the real implementations based on config, so all checks happen once during OpenTelemetry.init(...). This keeps runtime code simple (no repeated conditional checks) and ensures disabled metrics have near-zero overhead.

So this is basically the Null Object pattern, where empty methods are called when OTel is disabled?

Yes, exactly
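
The exchange above can be illustrated with a minimal sketch of the Null Object pattern. All names here (CommandMetricsSketch, NoopMetricsSketch, RecordingMetricsSketch, MetricsHolderSketch, sendCommandSketch) are hypothetical stand-ins and not the PR's actual classes; the point is only that the call site never branches on whether metrics are enabled.

```typescript
// Hypothetical interface standing in for one metric group.
interface CommandMetricsSketch {
  recordDuration(command: string, ms: number): void;
}

// Null object: same shape, empty body, so disabled metrics cost one no-op call.
class NoopMetricsSketch implements CommandMetricsSketch {
  recordDuration(): void {}
}

// Stand-in for a real implementation; it only stores samples in memory.
class RecordingMetricsSketch implements CommandMetricsSketch {
  readonly samples: Array<{ command: string; ms: number }> = [];

  recordDuration(command: string, ms: number): void {
    this.samples.push({ command, ms });
  }
}

// Holder that defaults to the no-op and swaps in the real group once, at init time.
class MetricsHolderSketch {
  command: CommandMetricsSketch = new NoopMetricsSketch();

  init(config: { commandMetrics: boolean }): void {
    if (config.commandMetrics) {
      this.command = new RecordingMetricsSketch();
    }
  }
}

// Hot path: no "is OTel enabled?" check, just a method call.
function sendCommandSketch(holder: MetricsHolderSketch, name: string): void {
  holder.command.recordDuration(name, 1.5);
}
```

The configuration check happens once inside init(); every later call pays only for a method dispatch.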

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

cursor · 2026-02-26T12:45:58Z

packages/client/lib/client/index.ts

+    } catch (error) {
+      recordBatch(error as Error);
+      throw error;
+    }


Double metric recording for WatchError in _executeMulti

Medium Severity

When execResult is null (a WatchError), recordBatch(error) is called on line 1488, then the error is thrown and caught by the catch block on line 1498, which calls recordBatch(error) again on line 1499. This causes the db.client.operation.duration histogram to double-count the MULTI operation for every WatchError, inflating the metric.
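
One possible way to avoid the double count is to record failures only in the catch block and successes only after the try block. The sketch below is a simplified, synchronous illustration under that assumption; executeMultiSketch and WatchErrorSketch are hypothetical names, and the real _executeMulti is not reproduced here.

```typescript
// Local stand-in for the client's WatchError; hypothetical, for illustration only.
class WatchErrorSketch extends Error {}

type RecordBatch = (error?: Error) => void;

function executeMultiSketch(
  execResult: unknown[] | null,
  recordBatch: RecordBatch
): unknown[] {
  let replies: unknown[];
  try {
    if (execResult === null) {
      // Do not record here; the catch block records every failure exactly once.
      throw new WatchErrorSketch("One (or more) of the watched keys has been changed");
    }
    replies = execResult;
  } catch (error) {
    recordBatch(error as Error);
    throw error;
  }
  // Success is recorded outside the try block, so it can never reach the catch.
  recordBatch();
  return replies;
}
```

With this shape, each call to executeMultiSketch results in exactly one recordBatch call, whether it succeeds or throws.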

cursor · 2026-02-26T12:45:58Z

packages/client/lib/client/cache.ts

+        );
+        // Estimate bytes saved by avoiding network round-trip
+        // Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers)
+        const bytesEstimate = JSON.stringify(cacheEntry.value).length;


Byte estimate uses character count not byte length

Low Severity

JSON.stringify(cacheEntry.value).length returns the number of UTF-16 code units, not the number of bytes. The redis.client.csc.network_saved metric uses unit "By" (bytes), so this will undercount for non-ASCII data. Using Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8') would give the correct UTF-8 byte count.
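
The difference is easy to see on a small value. The helper names below (charLengthEstimate, utf8ByteEstimate) are hypothetical and exist only for this illustration; the sketch assumes a Node.js runtime, where Buffer.byteLength is available.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper mirroring the original estimate.
function charLengthEstimate(value: unknown): number {
  // String#length counts UTF-16 code units, not bytes.
  return JSON.stringify(value).length;
}

// Hypothetical helper mirroring the suggested fix.
function utf8ByteEstimate(value: unknown): number {
  // Buffer.byteLength counts the bytes of the UTF-8 encoding of the string.
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

const asciiValue = "hello";    // serialized as "hello" with quotes: 7 chars, 7 bytes
const nonAsciiValue = "héllo€"; // "é" takes 2 bytes and "€" takes 3 bytes in UTF-8

const asciiGap = utf8ByteEstimate(asciiValue) - charLengthEstimate(asciiValue);          // 0
const nonAsciiGap = utf8ByteEstimate(nonAsciiValue) - charLengthEstimate(nonAsciiValue); // 3
```

For pure ASCII payloads the two estimates agree, so the undercount only shows up once cached values contain multi-byte characters. Both remain approximations of the RESP wire size, as the code comment in the diff already notes.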

cursor · 2026-02-26T12:45:58Z

packages/client/lib/cluster/index.ts

+            true,
+            client._clientId,
+            i
+          );


Cluster redirections double-count errors in resiliency metrics

Medium Severity

Every ASK/MOVED error in cluster mode is recorded twice in redis.client.errors. The underlying RedisClient.sendCommand .catch() handler records the error with internal: false, and then the cluster's _execute method catches the same error and records it again with internal: true. This inflates the error counter for every cluster redirection.

Additional Locations (1)

packages/client/lib/client/index.ts#L1195-L1199

@nkaradzhov I think it's worth discussing this, because there seem to be trade-offs to the possible solutions
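
As one input to that discussion, here is a sketch of a single candidate approach: remember which error objects have already been counted, so that a second layer recording the same object becomes a no-op. ErrorRecorderSketch and its fields are hypothetical and not part of the PR. The sketch also shows one trade-off: whichever layer records first decides the internal attribute, so the cluster's internal: true label would be lost for redirections.

```typescript
// Hypothetical shape of one recorded error; not the PR's actual attributes.
type ErrorRecordSketch = { message: string; internal: boolean };

class ErrorRecorderSketch {
  // WeakSet lets recorded errors be garbage-collected normally.
  private readonly seen = new WeakSet<Error>();
  readonly records: ErrorRecordSketch[] = [];

  // Returns true if the error was counted, false if it had been counted before.
  record(error: Error, internal: boolean): boolean {
    if (this.seen.has(error)) {
      return false;
    }
    this.seen.add(error);
    this.records.push({ message: error.message, internal });
    return true;
  }
}

// Simulates the two layers described above handling the same error object.
function simulateRedirection(recorder: ErrorRecorderSketch): void {
  const moved = new Error("MOVED 3999 127.0.0.1:6381");
  recorder.record(moved, false); // client layer records first
  recorder.record(moved, true);  // cluster layer sees the same object and is skipped
}
```

Deduplicating by object identity only works if both layers see the same Error instance; a layer that wraps or re-creates the error would still be counted separately.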

cursor · 2026-02-26T12:45:58Z

packages/client/lib/client/socket.ts

  readonly #reconnectStrategy;
  readonly #socketFactory;
  readonly #socketTimeout;
+  readonly #clientId: string;


Socket retains stale clientId after maintenance socket swap

Medium Severity

After a maintenance socket swap, the main client's new socket still carries the temporary client's #clientId (which is readonly). Since the temporary client is destroyed and unregistered, subsequent metric recordings from the socket (connection closed, relaxed timeout, connection create time) will use an orphaned clientId that no longer resolves in ClientRegistry, causing all socket-level metrics to lose host/port/db attribution after any maintenance handoff.

Additional Locations (1)

packages/client/lib/client/enterprise-maintenance-manager.ts#L292-L297

@nkaradzhov we can discuss this as well

PavelPashov · 2026-02-27T14:58:56Z

packages/client/lib/RESP/types.ts

  TRANSFORM_LEGACY_REPLY?: boolean;
  transformReply: TransformReply | Record<RespVersions, TransformReply>;
  unstableResp3?: boolean;
+  onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void;


@nkaradzhov we might want to rename onSuccess to something more metrics/otel related

PavelPashov · 2026-02-27T15:00:38Z

packages/client/lib/client/index.ts

+      if (command.onSuccess) {
+        command.onSuccess(parser.redisArgs, finalReply, this._self._clientId);


@nkaradzhov there might be a better way to record specific command related metrics that require the server response

nkaradzhov · 2026-03-04T11:32:33Z

packages/client/lib/opentelemetry/metrics.ts

+    // Build the appropriate function based on options
+    if (options.hasIncludeCommands || options.hasExcludeCommands) {
+      // Version with filtering
+      this.createRecordOperationDuration = this.#createWithFiltering.bind(this);
+    } else {
+      this.createRecordOperationDuration =
+        this.#createWithoutFiltering.bind(this);
+    }
+  }


why have two fns?

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 4 times, most recently from a11117e to 5fb1791 Compare October 31, 2025 11:39

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 1be4565 to 03db1a2 Compare November 5, 2025 11:03

jit-ci bot reviewed Nov 5, 2025

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 3 times, most recently from 1106f8a to 242e98d Compare November 6, 2025 13:55

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from 364ff86 to 30b0e8c Compare January 19, 2026 08:35

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 0e91b34 to e2cc888 Compare February 20, 2026 15:32

PavelPashov marked this pull request as ready for review February 25, 2026 11:28

PavelPashov added 12 commits February 25, 2026 13:30

wip: add OpenTelemetry metrics instrumentation

2d569f7

add noop metrics

364ed40

fix: check if metrics are initilized in sendCommand

2e3cb2a

fix(otel): optimize metrics tracking by eliminating promise chaining …

cd31dfa

…overhead

refactor(otel): organize metrics into specialized metric groups

0d55b20

fix(otel): revert metrics to be tracked in the sendCommand method

67d99ca

perf(metrics): optimize command metrics with factory pattern and inli…

ce74dc1

…ne attributes

feat: Add new OTEL_ATTRIBUTES constants

0eedb01

feat: Add client-side-caching metric group with 4 new metric names

a14d159

feat: Create error categorization stub function

46ebd09

feat: Add pool name formatting utility function

cfe88dc

feat: Refactor recordConnectionCreateTime to use closure pattern

bccbdf7

PavelPashov added 19 commits February 25, 2026 13:30

feat: Wire redis.client.csc.evictions metric with eviction reason

9a08be7

feat: Wire redis.client.csc.network_saved metric to estimate bytes sa…

41f0188

…ved on cache hit

refactor: Add count parameter to recordCacheEviction to batch evictio…

d22a0fb

…n recording

fix: typo

996d4ee

feat: Implement redis.client.pubsub.messages metric for pub/sub publi…

c0feedb

…sh and receive paths

feat: Implement redis.client.stream.produce.messages metric for strea…

406a9c6

…m producers

refactor: move pub/sub and stream metrics recording to command wrapper

d167cc9

refactor: align metric groups with instrumentation spec

d377b92

feat: add error classification helper and enrich metrics attributes

ce23121

refactor: replace command wrapper with onSuccess hook for metrics

89aaf1b

feat(client): add client identity tracking for OpenTelemetry metrics

912e27d

feat(otel): convert metrics to observable gauges with client registry

ad2fd8d

feat(otel): refine metrics coverage and remove connection use_time in…

e843d0d

…strument

refactor(otel): resolve metric attributes via clientId registry lookup

6d14d9c

refactor(otel): propagate clientId through runtime metrics paths

4305d30

test(otel): expand metrics coverage and add test utilities

19c95b8

test(otel): fix flaky test

34cd991

feat(otel): align metric attributes/config and expand observability c…

ee5f2b0

…overage

test(otel): add maintenance metrics e2e scenario with standalone FI c…

d0f42d3

…onfig

PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from af9f0fb to d0f42d3 Compare February 25, 2026 11:34

PavelPashov changed the title ~~wip: add OpenTelemetry metrics instrumentation~~ add OpenTelemetry metrics instrumentation Feb 25, 2026

refactor(otel): use instrumentation scope name and stop injecting res…

3e19cad

…ource attrs

vladvildanov requested changes Feb 25, 2026

vladvildanov reviewed Feb 26, 2026

PavelPashov added 2 commits February 26, 2026 14:37

fix(opentelemetry): rename stream bucket config

baac384

docs(opentelemetry): add metrics docs/examples

08c7413

cursor bot reviewed Feb 26, 2026

PavelPashov commented Feb 27, 2026

nkaradzhov reviewed Mar 4, 2026
