Add fallback endpoint support for OTLP exporters #8197
sridharsurvi1 wants to merge 1 commit into open-telemetry:main
When the primary OTLP endpoint fails with a transport error (after retries are exhausted), the exporter automatically attempts to send telemetry data to a configurable fallback endpoint. This enables high-availability setups where a secondary collector can receive data when the primary is unavailable.

Configuration via environment variables / system properties:
- otel.exporter.otlp.fallback.endpoint (generic)
- otel.exporter.otlp.<signal>.fallback.endpoint (signal-specific)

Programmatic configuration via builder:
- setFallbackEndpoint(String) on all exporter builders

Supported for all signal types (traces, metrics, logs) and both HTTP/protobuf and gRPC protocols.
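The failover behavior described above can be sketched in a few lines. This is a minimal, synchronous illustration and not the PR's implementation: `Sender`, `TransportException`, `FailoverSender`, and `FailoverSketch` are hypothetical names invented for this example, the real exporters are asynchronous, and the choice to make a single attempt against the fallback is an assumption.

```java
/** Hypothetical unchecked exception standing in for a transport-level failure. */
final class TransportException extends RuntimeException {
  TransportException(String message) {
    super(message);
  }
}

/** Hypothetical transport abstraction; not an opentelemetry-java type. */
interface Sender {
  /** Returns normally on success and throws TransportException on a transport error. */
  void send(byte[] payload);
}

/** Tries the primary up to maxAttempts times, then makes one attempt on the fallback. */
final class FailoverSender implements Sender {
  private final Sender primary;
  private final Sender fallback; // null when no fallback endpoint is configured
  private final int maxAttempts;

  FailoverSender(Sender primary, Sender fallback, int maxAttempts) {
    if (primary == null || maxAttempts < 1) {
      throw new IllegalArgumentException("primary sender and maxAttempts >= 1 are required");
    }
    this.primary = primary;
    this.fallback = fallback;
    this.maxAttempts = maxAttempts;
  }

  @Override
  public void send(byte[] payload) {
    TransportException lastPrimaryError = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        primary.send(payload);
        return; // primary succeeded, so the fallback is never contacted
      } catch (TransportException e) {
        lastPrimaryError = e; // a real exporter would back off before retrying
      }
    }
    if (fallback == null) {
      throw lastPrimaryError; // no fallback configured: surface the primary failure
    }
    try {
      fallback.send(payload); // assumption: one attempt, no retry on the fallback
    } catch (TransportException e) {
      e.addSuppressed(lastPrimaryError); // keep the primary failure visible
      throw e;
    }
  }
}

final class FailoverSketch {
  public static void main(String[] args) {
    Sender primary = payload -> {
      throw new TransportException("primary unreachable");
    };
    Sender fallback =
        payload -> System.out.println("fallback received " + payload.length + " bytes");
    new FailoverSender(primary, fallback, 3).send(new byte[] {1, 2, 3});
  }
}
```

Only transport errors trigger the fallback in this sketch; how the PR classifies HTTP or gRPC status responses is not shown here.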

Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##               main    #8197      +/-   ##
============================================
- Coverage     90.29%   90.10%   -0.19%
- Complexity     7652     7666      +14
============================================
  Files           843      843
  Lines         23066    23172     +106
  Branches       2310     2327      +17
============================================
+ Hits          20827    20879      +52
- Misses         1520     1564      +44
- Partials        719      729      +10

jack-berg left a comment:
The behavior of our OTLP exporters and corresponding environment variables is dictated by the spec: https://github.com/open-telemetry/opentelemetry-java/blob/main/CONTRIBUTING.md#project-scope
We have some examples of Java-specific programmatic configuration options, like the ability to set the executor service and proxy options. But these accommodate well-established configuration expectations of network clients, i.e. the absence of these options would be a glaring deficiency in the API.
This fallback endpoint is more complicated and more controversial, and so I would like to see it go through the spec before we consider adding it in opentelemetry-java.
Personally, wearing my other hat as a spec contributor, I would expect this problem to be solved through load balancing and retry against a single endpoint. I.e. a single endpoint routes to multiple backing instances. If an attempt against the first fails, it does so in a way that triggers the retry policy to execute a subsequent request, which has the opportunity to resolve a different instance.
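The alternative outlined in this comment, retrying against a single endpoint that routes to multiple backing instances, can be illustrated with a small sketch. Everything here is hypothetical: `RoundRobinEndpoint` stands in for a load balancer or DNS name, `RetryingClient` for the exporter's retry policy, and the instance names are made up. Round-robin is only one possible routing strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical stand-in for one logical endpoint backed by several instances. */
final class RoundRobinEndpoint {
  private final List<String> instances;
  private final AtomicInteger next = new AtomicInteger();

  RoundRobinEndpoint(List<String> instances) {
    if (instances.isEmpty()) {
      throw new IllegalArgumentException("at least one instance is required");
    }
    this.instances = new ArrayList<>(instances);
  }

  /** Each call may resolve to a different backing instance. */
  String resolve() {
    return instances.get(Math.floorMod(next.getAndIncrement(), instances.size()));
  }
}

/** Hypothetical retry loop: the caller only ever knows about the single endpoint. */
final class RetryingClient {
  /** Returns the instance that accepted the request, or null if every attempt failed. */
  static String sendWithRetry(
      RoundRobinEndpoint endpoint, Predicate<String> trySend, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      String instance = endpoint.resolve(); // re-resolve on every attempt
      if (trySend.test(instance)) {
        return instance;
      }
      // a real retry policy would apply backoff here
    }
    return null;
  }

  public static void main(String[] args) {
    RoundRobinEndpoint endpoint =
        new RoundRobinEndpoint(Arrays.asList("collector-a", "collector-b"));
    // Simulate collector-a being down: only collector-b accepts requests.
    String acceptedBy = sendWithRetry(endpoint, "collector-b"::equals, 2);
    System.out.println("accepted by " + acceptedBy);
  }
}
```

In this model the exporter needs no second endpoint setting; availability depends on the routing layer returning a healthy instance on a later attempt.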
Thanks for the review @jack-berg, appreciate the detailed feedback. I completely understand and respect the spec-first approach; that's the right governance model for cross-language consistency.

That said, I'd like to share the operational context that motivated this:

Why load balancing alone doesn't solve this cleanly:
The case for SDK-level fallback:

Proposed next step: I'd like to take this through the spec process. I'll open an issue (or OTEP if appropriate) in opentelemetry-specification proposing fallback/failover endpoint support for OTLP exporters. That way the broader community can weigh in on the design, and if accepted, implementations can land consistently across languages.

Would you be open to keeping this PR as a reference implementation while the spec discussion happens? Happy to close it if you'd prefer, and re-open once there's spec alignment. Thanks again for pointing me in the right direction.
Summary
- Configurable via environment variables / system properties (otel.exporter.otlp.fallback.endpoint, otel.exporter.otlp.<signal>.fallback.endpoint) or programmatically via setFallbackEndpoint(String) on all exporter builders

Motivation
Currently OTLP exporters only support a single endpoint. In production environments, having a fallback collector endpoint improves reliability — if the primary collector goes down, telemetry data is not lost.
Changes
- HttpExporter and GrpcExporter: failover logic on transport errors
- HttpExporterBuilder and GrpcExporterBuilder: setFallbackEndpoint() creates a secondary sender
- OtlpConfigUtil: parses otel.exporter.otlp.fallback.endpoint and signal-specific variants
- setFallbackEndpoint(String) added to all 6 public exporter builders

Test plan
- OtlpConfigUtilTest: fallback endpoint configuration parsing (generic, signal-specific, HTTP path appending, gRPC)
- HttpExporterTest: failover on primary transport error, no failover on success, both endpoints fail
- GrpcExporterTest: same three failover scenarios
- ./gradlew spotlessApply passes
- ./gradlew japicmp: API diff included (additive only)

🤖 Generated with Claude Code
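
The configuration parsing covered by the test plan above (generic, signal-specific, HTTP path appending, gRPC) might resolve properties roughly as follows. This is a sketch under stated assumptions, not the code in OtlpConfigUtil: `FallbackEndpointConfig` and `resolve` are hypothetical names, and the precedence and the `v1/<signal>` path suffix are assumed to mirror how the existing primary `otel.exporter.otlp.endpoint` setting behaves for HTTP/protobuf.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; the property names come from the PR, the logic is assumed. */
final class FallbackEndpointConfig {
  private static final String GENERIC_KEY = "otel.exporter.otlp.fallback.endpoint";

  /**
   * Resolves the fallback endpoint for a signal ("traces", "metrics" or "logs").
   * Returns null when no fallback is configured.
   */
  static String resolve(Map<String, String> properties, String signal, boolean httpProtobuf) {
    // Assumption: a signal-specific value wins and is used exactly as given.
    String specific = properties.get("otel.exporter.otlp." + signal + ".fallback.endpoint");
    if (specific != null && !specific.isEmpty()) {
      return specific;
    }
    String generic = properties.get(GENERIC_KEY);
    if (generic == null || generic.isEmpty()) {
      return null;
    }
    if (!httpProtobuf) {
      return generic; // gRPC: no path is appended
    }
    // Assumption: HTTP/protobuf appends the per-signal path to the generic value.
    String base = generic.endsWith("/") ? generic : generic + "/";
    return base + "v1/" + signal;
  }

  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    properties.put(GENERIC_KEY, "http://backup-collector:4318"); // made-up host
    System.out.println(resolve(properties, "traces", true));
  }
}
```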