Conversation
📝 Walkthrough

OpenTelemetry metrics and tracing support is added via a new OTEL statter backend.

Changes

OpenTelemetry Metrics & Tracing Integration
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant OTELInit
    participant OTELStatter
    participant MeterProvider
    participant OTLPExporter
    Client->>OTELInit: Init(cfg with OTELAgent)
    OTELInit->>OTELInit: newOTELStatter(endpoint, headers, baseTags)
    OTELInit->>OTELInit: newOTELResource(baseTags)
    OTELInit->>OTLPExporter: Create OTLP metric exporter<br/>(endpoint, insecure, headers)
    OTELInit->>MeterProvider: Create OTel MeterProvider<br/>with exporter & periodic reader
    OTELInit->>OTELInit: Set global MeterProvider
    OTELInit->>OTELStatter: Initialize with<br/>cached instruments
    OTELInit->>Client: Return OTELStatter instance
    Client->>OTELStatter: Timing("func.timing", duration, tags)
    OTELStatter->>OTELStatter: tagsToAttrs(tags)
    OTELStatter->>OTELStatter: getHistogram("func.timing")
    OTELStatter->>MeterProvider: Record(duration in seconds, attributes)
    MeterProvider->>OTLPExporter: Batch & export metrics
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks (✅ 3 | ❌ 2)

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
otel.go (1)
29-41: 💤 Low value — Deduplicate the `key=value` tag parser.

`newOTELResource` and `(*otelStatter).tagsToAttrs` implement the same `key=value` → `attribute.KeyValue` conversion. The receiver on `tagsToAttrs` is also unused. Consider extracting a single helper and dropping the receiver.

♻️ Proposed refactor

```diff
-func newOTELResource(baseTags []string) *resource.Resource {
-	attrs := []attribute.KeyValue{}
-	for _, tag := range baseTags {
-		if idx := strings.IndexByte(tag, '='); idx > 0 {
-			attrs = append(attrs, attribute.String(tag[:idx], tag[idx+1:]))
-		}
-	}
-	res, err := resource.New(context.Background(), resource.WithAttributes(attrs...))
+func parseKVTagsToAttrs(tags []string) []attribute.KeyValue {
+	attrs := make([]attribute.KeyValue, 0, len(tags))
+	for _, tag := range tags {
+		if idx := strings.IndexByte(tag, '='); idx > 0 {
+			attrs = append(attrs, attribute.String(tag[:idx], tag[idx+1:]))
+		}
+	}
+	return attrs
+}
+
+func newOTELResource(baseTags []string) *resource.Resource {
+	res, err := resource.New(context.Background(), resource.WithAttributes(parseKVTagsToAttrs(baseTags)...))
 	if err != nil {
 		return resource.Default()
 	}
 	return res
 }
@@
-// tagsToAttrs parses "key=value" tag strings into OTel attributes.
-func (s *otelStatter) tagsToAttrs(tags []string) []attribute.KeyValue {
-	attrs := make([]attribute.KeyValue, 0, len(tags))
-	for _, tag := range tags {
-		idx := strings.IndexByte(tag, '=')
-		if idx <= 0 {
-			continue
-		}
-		attrs = append(attrs, attribute.String(tag[:idx], tag[idx+1:]))
-	}
-	return attrs
-}
+// (use parseKVTagsToAttrs at the call sites)
```

…and replace `s.tagsToAttrs(tags)` with `parseKVTagsToAttrs(tags)` in the metric methods.

Also applies to: 102-113
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@otel.go` around lines 29 - 41, newOTELResource duplicates the key=value → attribute.KeyValue parsing logic found in (*otelStatter).tagsToAttrs and the otelStatter receiver is unused; extract a single helper (e.g., parseKVTagsToAttrs) that accepts []string and returns []attribute.KeyValue, replace the parsing code in newOTELResource with a call to that helper, update all uses of s.tagsToAttrs(tags) to parseKVTagsToAttrs(tags), and remove the tagsToAttrs method (and its receiver) to avoid duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@client.go`:
- Around line 184-194: The traceProvider shutdown closure uses an unbounded
context which can hang process exit; update the traceProviderShutdownFn (set
where newOTELTracerProvider is used and where traceProvider.Shutdown is called)
to create a context with a finite timeout (mirror the pattern used in
otelStatter.Close()), call traceProvider.Shutdown with that timed context and
ensure the cancel function is deferred/called; additionally, review the use of
addr passed into newOTELTracerProvider (it currently reuses the metrics
endpoint) and either document that traces and metrics must share addr or add a
separate OTELTraceAddr configuration to avoid mixing endpoints.
- Around line 111-122: The Close function currently ignores errors from
client.Close() and traceProviderShutdownFn(), so capture both return errors and
either return a combined error from Close (change signature to Close() error and
update callers) or, if keeping void, at minimum log failures via the existing
logger; specifically modify the Close function (guarded by clientMux) to check
the error from client.Close() and the error from traceProviderShutdownFn() (if
it returns error), aggregate them (e.g., wrap or join messages) and return that
error, or log each error with clear context like "otel client close failed" /
"trace provider shutdown failed" if you opt not to change the signature.
In `@otel.go`:
- Around line 230-232: The Close method currently calls s.meterProvider.Shutdown
with context.Background(), which can hang indefinitely; replace this with a
context that has a bounded timeout (e.g., context.WithTimeout) and pass that
cancellable context to s.meterProvider.Shutdown in otelStatter.Close, deferring
cancel() so Shutdown returns promptly even if the OTLP exporter/collector
stalls.
- Around line 211-219: The Timing method records durations in seconds but OTel's
default histogram buckets are millisecond-scaled; update otelStatter.Timing so
it records milliseconds (use value.Milliseconds() as a float64) instead of
value.Seconds() to match Telegraf/Datadog and preserve histogram resolution, and
keep the existing use of s.getHistogram(name) and s.tagsToAttrs(tags) unchanged;
alternatively, if you prefer seconds, register a corresponding View with custom
sub-second bucket boundaries for the histogram returned by getHistogram.
---
Nitpick comments:
In `@otel.go`:
- Around line 29-41: newOTELResource duplicates the key=value →
attribute.KeyValue parsing logic found in (*otelStatter).tagsToAttrs and the
otelStatter receiver is unused; extract a single helper (e.g.,
parseKVTagsToAttrs) that accepts []string and returns []attribute.KeyValue,
replace the parsing code in newOTELResource with a call to that helper, update
all uses of s.tagsToAttrs(tags) to parseKVTagsToAttrs(tags), and remove the
tagsToAttrs method (and its receiver) to avoid duplication.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 872426a4-b516-454b-b3df-39ec005bf587
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (5)
`client.go`, `go.mod`, `metrics.go`, `metrics_test.go`, `otel.go`
💤 Files with no reviewable changes (1)
- metrics.go
```diff
 func Close() {
 	clientMux.RLock()
 	defer clientMux.RUnlock()
-	if client == nil {
-		return
+	if client != nil {
+		client.Close()
 	}
+	if traceProviderShutdownFn != nil {
+		traceProviderShutdownFn()
+	}
-	client.Close()
 }
```
Shutdown errors are silently dropped (errcheck) and Close() itself returns nothing.
Static analysis flags both client.Close() and traceProviderShutdownFn() as unchecked. Beyond the linter, swallowing these errors hides exporter shutdown failures (e.g., backed‑up OTLP queue), which is exactly when operators need visibility. Consider logging at minimum, and ideally surfacing an error to callers.
🛠️ Proposed fix
```diff
 func Close() {
 	clientMux.RLock()
 	defer clientMux.RUnlock()
 	if client != nil {
-		client.Close()
+		if err := client.Close(); err != nil {
+			log.WithError(err).Errorln("metrics client close failed")
+		}
 	}
 	if traceProviderShutdownFn != nil {
-		traceProviderShutdownFn()
+		if err := traceProviderShutdownFn(); err != nil {
+			log.WithError(err).Errorln("trace provider shutdown failed")
+		}
 	}
 }
```

If you'd rather keep the void signature unchanged, a follow-up could change Close() to return error and have callers decide.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func Close() {
	clientMux.RLock()
	defer clientMux.RUnlock()
	if client != nil {
		if err := client.Close(); err != nil {
			log.WithError(err).Errorln("metrics client close failed")
		}
	}
	if traceProviderShutdownFn != nil {
		if err := traceProviderShutdownFn(); err != nil {
			log.WithError(err).Errorln("trace provider shutdown failed")
		}
	}
}
```
🧰 Tools
🪛 golangci-lint (2.12.1)
[error] 116-116: Error return value of `client.Close` is not checked (errcheck)
[error] 120-120: Error return value is not checked (errcheck)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client.go` around lines 111 - 122, The Close function currently ignores
errors from client.Close() and traceProviderShutdownFn(), so capture both return
errors and either return a combined error from Close (change signature to
Close() error and update callers) or, if keeping void, at minimum log failures
via the existing logger; specifically modify the Close function (guarded by
clientMux) to check the error from client.Close() and the error from
traceProviderShutdownFn() (if it returns error), aggregate them (e.g., wrap or
join messages) and return that error, or log each error with clear context like
"otel client close failed" / "trace provider shutdown failed" if you opt not to
change the signature.
```go
} else if cfg.Agent == OTELAgent && cfg.TracingEnabled {
	traceProvider, err := newOTELTracerProvider(addr, cfg.OTELInsecure, cfg.OTELHeaders, config.BaseTags())
	if err != nil {
		return errors.Wrap(err, "otel tracer provider init failed")
	}

	otel.SetTracerProvider(traceProvider)
	tracer = otel.Tracer(prefix)

	traceProviderShutdownFn = func() error {
		return traceProvider.Shutdown(context.Background())
	}
}
```
Tracer shutdown closure should use a bounded context.
traceProvider.Shutdown(context.Background()) will block process exit indefinitely if the OTLP collector is unreachable or slow. Wrap with a timeout, mirroring what should be done in otelStatter.Close() as well.
🛡️ Proposed fix
```diff
 traceProviderShutdownFn = func() error {
-	return traceProvider.Shutdown(context.Background())
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	return traceProvider.Shutdown(ctx)
 }
```

Also worth verifying: `addr` is reused as the OTLP trace endpoint (line 185) while it was originally the metrics endpoint. In typical collector setups they coincide on `:4317`, but if a deployment ever splits metrics and traces this becomes a footgun. Consider an explicit `OTELTraceAddr` (or document the expectation).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@client.go` around lines 184 - 194, The traceProvider shutdown closure uses an
unbounded context which can hang process exit; update the
traceProviderShutdownFn (set where newOTELTracerProvider is used and where
traceProvider.Shutdown is called) to create a context with a finite timeout
(mirror the pattern used in otelStatter.Close()), call traceProvider.Shutdown
with that timed context and ensure the cancel function is deferred/called;
additionally, review the use of addr passed into newOTELTracerProvider (it
currently reuses the metrics endpoint) and either document that traces and
metrics must share addr or add a separate OTELTraceAddr configuration to avoid
mixing endpoints.
```go
func (s *otelStatter) Timing(name string, value time.Duration, tags []string, rate float64) error {
	h, err := s.getHistogram(name)
	if err != nil {
		return err
	}

	// OTel convention: duration histograms use seconds
	h.Record(context.Background(), value.Seconds(), otelmetric.WithAttributes(s.tagsToAttrs(tags)...))
	return nil
}
```
🧩 Analysis chain
🌐 Web query:
OpenTelemetry Go SDK default explicit bucket boundaries for Float64Histogram v1.43.0
💡 Result:
Default explicit bucket boundaries used by the OpenTelemetry Go SDK for Float64Histogram v1.43.0 are: [1][2] [3]. Note: OpenTelemetry’s SDK keeps these defaults unless you explicitly override boundaries via APIs like WithExplicitBucketBoundaries (or equivalent mechanisms such as metric views / exporter configuration); a maintainer issue documents these defaults and explains they’re oriented around milliseconds, not seconds [3].
Citations:
- 1: Update default values for explicit histogram bucket boundaries to better handle seconds open-telemetry/opentelemetry-python#4527
- 2: Use []float64 for histogram boundaries, not []metric.Number open-telemetry/opentelemetry-go#758
- 3: Default values for histogram bucket boundaries are oriented around milliseconds rather than seconds open-telemetry/opentelemetry-go#5821
🏁 Script executed:

```shell
# First, let's find the relevant files in the codebase
find . -name "otel.go" -o -name "*dogstatsd*" -o -name "*statsd*" -o -name "metrics.go" | head -20
```

Repository: InjectiveLabs/metrics
Length of output: 88

🏁 Script executed:

```shell
# Look at the otel.go file to confirm the Timing implementation
fd "otel\.go" -x cat -n {}
```

Repository: InjectiveLabs/metrics
Length of output: 7808

🏁 Script executed:

```shell
# Search for Datadog/Telegraf Statter implementations to verify they use milliseconds
fd -e "go" | xargs rg -l "Timing.*time\.Duration" | head -10
```

Repository: InjectiveLabs/metrics
Length of output: 111

🏁 Script executed:

```shell
# Check for any metric configuration or default bucket boundaries in the codebase
rg "Histogram|bucket|Boundary" --type go -i | grep -i "default\|bucket" | head -15
```

Repository: InjectiveLabs/metrics
Length of output: 1186

🏁 Script executed:

```shell
# Look at the imports and dependencies to understand which OTEL version is being used
rg "go\.opentelemetry\.io" | head -20
```

Repository: InjectiveLabs/metrics
Length of output: 1660

🏁 Script executed:

```shell
# Check telegraf.go Timing implementation
cat -n telegraf.go | grep -A 5 "func (t \*telegrafStatter) Timing"
```

Repository: InjectiveLabs/metrics
Length of output: 322

🏁 Script executed:

```shell
# Check metrics.go to see the Statter interface and any dogstatsd implementation
cat -n metrics.go | head -100
```

Repository: InjectiveLabs/metrics
Length of output: 3625

🏁 Script executed:

```shell
# Search for dogstatsd or statsd implementation
fd -e "go" | xargs rg -l "dogstatsd|statsd" | head -10
```

Repository: InjectiveLabs/metrics
Length of output: 87

🏁 Script executed:

```shell
# Look for any documentation or code about default histogram buckets in otel or client.go
cat -n client.go | head -150
```

Repository: InjectiveLabs/metrics
Length of output: 5398

🏁 Script executed:

```shell
# Check dogstatsd implementation - look at how the client is used
cat -n client.go | grep -A 30 "case DatadogAgent:"
```

Repository: InjectiveLabs/metrics
Length of output: 1987

🏁 Script executed:

```shell
# Verify dogstatsd library uses milliseconds by searching documentation or checking if there's any explicit conversion
rg "dogstatsd|statsd" -A 5 -B 5 client.go | head -40
```

Repository: InjectiveLabs/metrics
Length of output: 976

🏁 Script executed:

```shell
# Search for default histogram bucket boundaries in OTEL documentation or any config
web_search
OpenTelemetry Go SDK v1.43.0 Float64Histogram default bucket boundaries
```

Repository: InjectiveLabs/metrics
Length of output: 168
🌐 Web query:
OpenTelemetry Go SDK v1.43.0 Float64Histogram default bucket boundaries milliseconds
💡 Result:
Default Float64Histogram bucket boundaries in the OpenTelemetry Go SDK (v1.43.0) are: [1][2][3][4] (in milliseconds when used for duration/latency metrics) [5]. Explanation: the Go SDK’s default histogram buckets are explicitly described as oriented around milliseconds, with the specific boundary list above [5].
Citations:
- 1: https://oneuptime.com/blog/post/2026-02-06-metric-views-histogram-bucket-boundaries/view
- 2: https://opentelemetry.io/docs/zero-code/obi/configure/metrics-histograms/
- 3: https://www.telemetryhub.com
- 4: https://www.sdk-koeln.de
- 5: Default values for histogram bucket boundaries are oriented around milliseconds rather than seconds open-telemetry/opentelemetry-go#5821
Timing records in seconds but default histogram buckets are millisecond-scaled, causing unit mismatch with other backends and loss of resolution.
Timing records value.Seconds(), while the Telegraf backend converts to milliseconds (int(value/time.Millisecond)) and Datadog's dogstatsd library uses the standard statsd protocol which expects milliseconds. The OpenTelemetry Go SDK v1.43.0 default histogram bucket boundaries (0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000) are explicitly oriented around milliseconds. Recording seconds into these millisecond-scaled buckets collapses nearly all typical latencies into the lowest bucket, destroying histogram resolution and breaking consistency with other agents.
Either:
- Record `float64(value.Milliseconds())` to match Telegraf and Datadog, or
- Keep seconds and register a `View` with custom bucket boundaries appropriate for sub-second durations.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@otel.go` around lines 211 - 219, The Timing method records durations in
seconds but OTel's default histogram buckets are millisecond-scaled; update
otelStatter.Timing so it records milliseconds (use value.Milliseconds() as a
float64) instead of value.Seconds() to match Telegraf/Datadog and preserve
histogram resolution, and keep the existing use of s.getHistogram(name) and
s.tagsToAttrs(tags) unchanged; alternatively, if you prefer seconds, register a
corresponding View with custom sub-second bucket boundaries for the histogram
returned by getHistogram.
```go
func (s *otelStatter) Close() error {
	return s.meterProvider.Shutdown(context.Background())
}
```
Close can hang indefinitely on a stalled OTLP exporter.
meterProvider.Shutdown is invoked with context.Background(), so a misbehaving collector or hung gRPC connection blocks shutdown forever. Bound this with a timeout so process exit is deterministic.
🛡️ Proposed fix
```diff
 func (s *otelStatter) Close() error {
-	return s.meterProvider.Shutdown(context.Background())
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	return s.meterProvider.Shutdown(ctx)
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func (s *otelStatter) Close() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return s.meterProvider.Shutdown(ctx)
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@otel.go` around lines 230 - 232, The Close method currently calls
s.meterProvider.Shutdown with context.Background(), which can hang indefinitely;
replace this with a context that has a bounded timeout (e.g.,
context.WithTimeout) and pass that cancellable context to
s.meterProvider.Shutdown in otelStatter.Close, deferring cancel() so Shutdown
returns promptly even if the OTLP exporter/collector stalls.
```go
s.mu.RLock()
h, ok := s.histograms[name]
s.mu.RUnlock()
if ok {
	return h, nil
}

s.mu.Lock()
defer s.mu.Unlock()
if h, ok = s.histograms[name]; ok {
	return h, nil
}
```
dang this is smart, almost thought it was racy...
Summary by CodeRabbit
New Features
Chores