Skip to content

add OpenTelemetry metrics instrumentation#3110

Open
PavelPashov wants to merge 46 commits intoredis:masterfrom
PavelPashov:feat/add-opentelemetry-metrics
Open

add OpenTelemetry metrics instrumentation#3110
PavelPashov wants to merge 46 commits intoredis:masterfrom
PavelPashov:feat/add-opentelemetry-metrics

Conversation

@PavelPashov
Copy link
Contributor

@PavelPashov PavelPashov commented Oct 24, 2025

Description

TODO

  • Add docs
  • Add examples

Checklist

  • Does npm test pass with this change (including linting)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

Note

Medium Risk
Touches core request/response paths (sendCommand, socket lifecycle, pubsub/streams/CSC) to emit metrics, so regressions could affect performance or error handling even though behavior is intended to be observational and gated behind OpenTelemetry.init(). Dependency/peer-dependency changes may impact consumers that rely on strict lockfiles or bundling expectations.

Overview
Adds first-class, opt-in OpenTelemetry metrics support via a new OpenTelemetry.init() API (exported from @redis/client), plus new docs (docs/otel-metrics.md) and a runnable example (examples/otel-metrics.js).

Instrumentation is wired into core client flows to record command and batch durations, connection lifecycle/closures, maintenance notifications/handoffs, pubsub publish/receive, stream lag (XREAD/XREADGROUP), and client-side-caching hits/misses/evictions/bytes-saved. This introduces a per-client identity model + ClientRegistry used by observable gauges, updates queue/cache/socket internals to expose needed signals (eg pendingCount, cache size()), and adds comprehensive unit/e2e tests for the metrics system.

Also updates dependencies to include optional @opentelemetry/api (peer) and adds OTel SDK packages for development/testing.

Written by Cursor Bugbot for commit 08c7413. This will update automatically on new commits. Configure here.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 4 times, most recently from a11117e to 5fb1791 Compare October 31, 2025 11:39
@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 1be4565 to 03db1a2 Compare November 5, 2025 11:03
Copy link

@jit-ci jit-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ The following Jit checks failed to run:

  • license-compliance-checker
  • secret-detection-trufflehog
  • software-component-analysis-js
  • static-code-analysis-semgrep-pro

#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.

More info in the Jit platform.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 3 times, most recently from 1106f8a to 242e98d Compare November 6, 2025 13:55
@elimelt
Copy link
Contributor

elimelt commented Dec 8, 2025

This is an exciting feature, I hope it makes it to main!

@jit-ci
Copy link

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from 364ff86 to 30b0e8c Compare January 19, 2026 08:35
@jit-ci
Copy link

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

1 similar comment
@jit-ci
Copy link

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 0e91b34 to e2cc888 Compare February 20, 2026 15:32
@PavelPashov PavelPashov marked this pull request as ready for review February 25, 2026 11:28
@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from af9f0fb to d0f42d3 Compare February 25, 2026 11:34
@PavelPashov PavelPashov changed the title wip: add OpenTelemetry metrics instrumentation add OpenTelemetry metrics instrumentation Feb 25, 2026
);
// Estimate bytes saved by avoiding network round-trip
// Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers)
const bytesEstimate = JSON.stringify(cacheEntry.value).length;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about wrapping it with Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8');? If data to be stored isn't just ASCII it makes sense to measure UTF-8 bytes count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch!

} from "./types";
import { noopFunction } from "./utils";

export class NoopCommandMetrics implements IOTelCommandMetrics {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this object? Just to ensure that no commands will be actually send to a collector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we have multiple metric groups and don’t know what the user will opt into, I kept each group in its own class and defaulted them to noop. When the user initializes metrics, we swap in the real implementations based on config, so all checks happen once during OpenTelemetry.init(...). This keeps runtime code simple (no repeated conditional checks) and ensures disabled metrics have near-zero overhead

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is basically Null object pattern where in case if Otel is disabled empty methods are called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

} catch (error) {
recordBatch(error as Error);
throw error;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double metric recording for WatchError in _executeMulti

Medium Severity

When execResult is null (a WatchError), recordBatch(error) is called on line 1488, then the error is thrown and caught by the catch block on line 1498, which calls recordBatch(error) again on line 1499. This causes the db.client.operation.duration histogram to double-count the MULTI operation for every WatchError, inflating the metric.

Fix in Cursor Fix in Web

);
// Estimate bytes saved by avoiding network round-trip
// Note: JSON.stringify approximation; actual RESP wire size may differ (especially for Buffers)
const bytesEstimate = JSON.stringify(cacheEntry.value).length;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte estimate uses character count not byte length

Low Severity

JSON.stringify(cacheEntry.value).length returns the number of UTF-16 characters, not the number of bytes. The redis.client.csc.network_saved metric uses unit "By" (bytes), so this will undercount for non-ASCII data. Using Buffer.byteLength(JSON.stringify(cacheEntry.value), 'utf8') would give the correct UTF-8 byte count.

Fix in Cursor Fix in Web

true,
client._clientId,
i
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cluster redirections double-count errors in resiliency metrics

Medium Severity

Every ASK/MOVED error in cluster mode is recorded twice in redis.client.errors. The underlying RedisClient.sendCommand .catch() handler records the error with internal: false, and then the cluster's _execute method catches the same error and records it again with internal: true. This inflates the error counter for every cluster redirection.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov I think it's worth discussing this, because there seem to be trade offs to the possible solutions

readonly #reconnectStrategy;
readonly #socketFactory;
readonly #socketTimeout;
readonly #clientId: string;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Socket retains stale clientId after maintenance socket swap

Medium Severity

After a maintenance socket swap, the main client's new socket still carries the temporary client's #clientId (which is readonly). Since the temporary client is destroyed and unregistered, subsequent metric recordings from the socket (connection closed, relaxed timeout, connection create time) will use an orphaned clientId that no longer resolves in ClientRegistry, causing all socket-level metrics to lose host/port/db attribution after any maintenance handoff.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov we can discuss this as well

TRANSFORM_LEGACY_REPLY?: boolean;
transformReply: TransformReply | Record<RespVersions, TransformReply>;
unstableResp3?: boolean;
onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov we might want to rename onSuccess to something more metrics/otel related

Comment on lines +1127 to +1128
if (command.onSuccess) {
command.onSuccess(parser.redisArgs, finalReply, this._self._clientId);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov there might be a better way to record specific command related metrics that require the server response

Comment on lines +65 to +73
// Build the appropriate function based on options
if (options.hasIncludeCommands || options.hasExcludeCommands) {
// Version with filtering
this.createRecordOperationDuration = this.#createWithFiltering.bind(this);
} else {
this.createRecordOperationDuration =
this.#createWithoutFiltering.bind(this);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have two fns?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants