
Add fallback endpoint support for OTLP exporters #8197

Open

sridharsurvi1 wants to merge 1 commit into open-telemetry:main from sridharsurvi1:feature/otlp-fallback-endpoint

Conversation

@sridharsurvi1

Summary

  • Adds configurable fallback endpoint for OTLP exporters (HTTP and gRPC) across all signal types (traces, metrics, logs)
  • When the primary endpoint fails with a transport error (after retries are exhausted), the exporter automatically attempts to send to the fallback endpoint
  • Configurable via environment variables (otel.exporter.otlp.fallback.endpoint, otel.exporter.otlp.<signal>.fallback.endpoint) or programmatically via setFallbackEndpoint(String) on all exporter builders

Motivation

Currently OTLP exporters only support a single endpoint. In production environments, having a fallback collector endpoint improves reliability — if the primary collector goes down, telemetry data is not lost.

Changes

  • Core: HttpExporter and GrpcExporter — failover logic on transport errors
  • Builders: HttpExporterBuilder and GrpcExporterBuilder — setFallbackEndpoint() creates a secondary sender
  • Config: OtlpConfigUtil — parses otel.exporter.otlp.fallback.endpoint and signal-specific variants
  • Public API: Added setFallbackEndpoint(String) to all 6 public exporter builders
  • Providers: All 3 autoconfigure providers wired up with fallback endpoint support
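The failover rule wired into the core exporters can be sketched in a few lines of dependency-free Java. This is an illustrative sketch only — the class and method names below are invented, not the PR's actual code, and the real change integrates with the exporters' sender pipelines rather than plain suppliers:

```java
import java.util.function.Supplier;

/** Minimal sketch of primary-then-fallback export (names hypothetical). */
final class FallbackSender {

  /** Stands in for a transport-level failure after retries are exhausted. */
  static final class TransportException extends RuntimeException {
    TransportException(String message) {
      super(message);
    }
  }

  /**
   * Tries the primary sender; on a transport error, attempts the fallback.
   * Any other failure mode (e.g. a server-side rejection) would propagate
   * without triggering failover.
   */
  static String export(Supplier<String> primary, Supplier<String> fallback) {
    try {
      return primary.get();
    } catch (TransportException e) {
      return fallback.get();
    }
  }

  public static void main(String[] args) {
    // Primary healthy: the fallback is never consulted.
    System.out.println(export(() -> "primary-ack", () -> "fallback-ack")); // primary-ack

    // Primary down: the batch is re-sent to the fallback endpoint.
    System.out.println(
        export(
            () -> {
              throw new TransportException("connection refused");
            },
            () -> "fallback-ack")); // fallback-ack
  }
}
```

The key design point the sketch captures is that failover fires only on transport errors, after the normal retry policy has been exhausted against the primary.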

Test plan

  • OtlpConfigUtilTest — fallback endpoint configuration parsing (generic, signal-specific, HTTP path appending, gRPC)
  • HttpExporterTest — failover on primary transport error, no failover on success, both endpoints fail
  • GrpcExporterTest — same three failover scenarios
  • ./gradlew spotlessApply passes
  • ./gradlew japicmp — API diff included (additive only)
  • Integration test with real collector failover

🤖 Generated with Claude Code

When the primary OTLP endpoint fails with a transport error (after
retries are exhausted), the exporter will automatically attempt to
send telemetry data to a configurable fallback endpoint. This enables
high-availability setups where a secondary collector can receive data
when the primary is unavailable.

Configuration via environment variables / system properties:
- otel.exporter.otlp.fallback.endpoint (generic)
- otel.exporter.otlp.<signal>.fallback.endpoint (signal-specific)

Programmatic configuration via builder:
- setFallbackEndpoint(String) on all exporter builders

Supported for all signal types (traces, metrics, logs) and both
HTTP/protobuf and gRPC protocols.
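A hedged configuration example for the proposed variables. Note these are this PR's proposal, not released behavior, and the environment-variable names below assume the standard OpenTelemetry mapping from system properties (dots to underscores, uppercased); the collector hostnames are placeholders:

```shell
# Primary endpoint (existing behavior) and the proposed fallback.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://collector-primary:4318"
export OTEL_EXPORTER_OTLP_FALLBACK_ENDPOINT="http://collector-fallback:4318"

# Proposed signal-specific override, e.g. for traces only.
export OTEL_EXPORTER_OTLP_TRACES_FALLBACK_ENDPOINT="http://collector-fallback:4318"
```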
@sridharsurvi1 sridharsurvi1 requested a review from a team as a code owner on March 17, 2026 at 17:41
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 17, 2026

CLA Missing ID

@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 54.62185% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.10%. Comparing base (2acd434) to head (be25ac6).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
...ry/exporter/internal/http/HttpExporterBuilder.java 13.33% 10 Missing and 3 partials ⚠️
...ry/exporter/internal/grpc/GrpcExporterBuilder.java 14.28% 10 Missing and 2 partials ⚠️
...telemetry/exporter/internal/http/HttpExporter.java 76.19% 3 Missing and 2 partials ⚠️
...telemetry/exporter/internal/grpc/GrpcExporter.java 83.33% 2 Missing and 1 partial ⚠️
...lp/http/logs/OtlpHttpLogRecordExporterBuilder.java 0.00% 3 Missing ⚠️
...lp/http/metrics/OtlpHttpMetricExporterBuilder.java 0.00% 3 Missing ⚠️
...r/otlp/http/trace/OtlpHttpSpanExporterBuilder.java 0.00% 3 Missing ⚠️
...lemetry/exporter/otlp/internal/OtlpConfigUtil.java 85.71% 1 Missing and 2 partials ⚠️
...er/otlp/logs/OtlpGrpcLogRecordExporterBuilder.java 0.00% 3 Missing ⚠️
...er/otlp/metrics/OtlpGrpcMetricExporterBuilder.java 0.00% 3 Missing ⚠️
... and 1 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8197      +/-   ##
============================================
- Coverage     90.29%   90.10%   -0.19%     
- Complexity     7652     7666      +14     
============================================
  Files           843      843              
  Lines         23066    23172     +106     
  Branches       2310     2327      +17     
============================================
+ Hits          20827    20879      +52     
- Misses         1520     1564      +44     
- Partials        719      729      +10     


@jack-berg (Member) left a comment

The behavior of our OTLP exporters and corresponding environment variables is dictated by the spec: https://github.com/open-telemetry/opentelemetry-java/blob/main/CONTRIBUTING.md#project-scope

We have some examples of java specific programmatic configuration options, like the ability to set the executor service and proxy options. But these accommodate well established configuration expectations of network clients. I.e. the absence of options would be a glaring deficiency in the API.

This fallback endpoint is more complicated and more controversial, and so I would like to see it go through the spec before we consider adding it in opentelemetry-java.

Personally, wearing my other hat as a spec contributor, I would expect this problem to be solved through load balancing and retry against a single endpoint. I.e. a single endpoint routes to multiple backing instances. If an attempt against the first fails, it does so in a way that triggers the retry policy to execute a subsequent request, which has the opportunity to resolve a different instance.
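The retry-against-a-single-endpoint pattern described in this comment can be sketched generically: because endpoint resolution happens per attempt, a retry can land on a different healthy instance behind the same address, giving failover without a second configured URL. This is an illustrative sketch with invented names, not the SDK's actual retry implementation:

```java
import java.util.function.Supplier;

/** Illustrative retry loop; each attempt re-resolves the load-balanced endpoint. */
final class RetryExport {

  /**
   * Attempts the export up to maxAttempts times, rethrowing the last failure
   * if all attempts fail. A real retry policy would also apply exponential
   * backoff with jitter between attempts.
   */
  static String export(Supplier<String> resolveAndSend, int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return resolveAndSend.get();
      } catch (RuntimeException e) {
        last = e;
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    // First "instance" is unreachable; the retry resolves a healthy one.
    int[] calls = {0};
    String result =
        export(
            () -> {
              if (calls[0]++ == 0) {
                throw new RuntimeException("instance A unreachable");
              }
              return "instance-B-ack";
            },
            3);
    System.out.println(result); // instance-B-ack
  }
}
```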

@sridharsurvi1
Author

Thanks for the review @jack-berg, appreciate the detailed feedback.

I completely understand and respect the spec-first approach — that's the right governance model for cross-language consistency.

That said, I'd like to share the operational context that motivated this:

Why load balancing alone doesn't solve this cleanly:

  • In environments with thousands of OTel collectors (each serving dedicated programs/services), setting up and managing load balancers in front of each collector solely for failover adds significant operational overhead — infrastructure provisioning, health checks, monitoring the LBs themselves, and cost.
  • A load balancer is a heavyweight solution when the actual need is simple: "if this endpoint is down, try that one." The SDK already has the context to make this decision at export time.
  • For many teams, especially those running collectors as sidecars or per-node daemonsets, there's no natural place to put a load balancer without re-architecting the deployment topology.

The case for SDK-level fallback:

  • It's zero-infrastructure — no additional components to deploy, monitor, or pay for
  • It's self-contained — each application defines its own failover, no central coordination needed
  • It complements (not replaces) load balancing — teams with LBs can ignore it, teams without get a simple safety net
  • It's a common pattern in other SDKs and clients (e.g., database drivers, HTTP clients) for exactly this reason

Proposed next step:

I'd like to take this through the spec process. I'll open an issue (or OTEP if appropriate) in opentelemetry-specification proposing fallback/failover endpoint support for OTLP exporters. That way the broader community can weigh in on the design, and if accepted, implementations can land consistently across languages.

Would you be open to keeping this PR as a reference implementation while the spec discussion happens? Happy to close it if you'd prefer, and re-open once there's spec alignment.

Thanks again for pointing me in the right direction.
