
Add connection pooling support to OTLP exporter for high-throughput scenarios#14364

Closed
Arunodoy18 wants to merge 4 commits into open-telemetry:main from Arunodoy18:feature/otlp-exporter-connection-pool

Conversation

@Arunodoy18

Description

This PR adds connection pooling support to the OTLP exporter to resolve performance issues in high-throughput and high-latency environments.

Motivation

As reported in the issue, users hit reliability problems with the OTLP exporter under:

  • High throughput scenarios (10K+ spans/sec)
  • High-latency network connections (e.g., cross-region deployments)
  • AWS ALB limiting HTTP/2 streams to 128

The single gRPC connection becomes a bottleneck, causing queue overflow and dropped spans.

Changes:

Core Implementation

  • Added `connection_pool_size` configuration parameter to the `Config` struct

    • Default: 0 (uses 1 connection for backward compatibility)
    • Range: 0-256 connections
    • Validated in `Config.Validate()`
  • Implemented connection pool in `baseExporter`

    • Maintains multiple gRPC connections in a slice
    • Round-robin load balancing using an atomic counter
    • All data types (traces, metrics, logs, profiles) use the connection pool
  • Thread-safe round-robin distribution

    • `getNextExporterIndex()` method uses `atomic.Uint32`
    • Optimized for the single-connection case (no atomic ops)
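The round-robin distribution described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: `connPool` and `nextExporterIndex` are hypothetical names (the PR calls its method `getNextExporterIndex()`), and strings stand in for `*grpc.ClientConn` values.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connPool holds multiple connections and an atomic round-robin counter.
type connPool struct {
	conns []string // stands in for []*grpc.ClientConn
	next  atomic.Uint32
}

// nextExporterIndex picks the next connection in round-robin order.
// The single-connection case skips the atomic op, as the PR describes.
func (p *connPool) nextExporterIndex() int {
	if len(p.conns) == 1 {
		return 0
	}
	n := p.next.Add(1) - 1
	return int(n % uint32(len(p.conns)))
}

func main() {
	p := &connPool{conns: []string{"c0", "c1", "c2"}}
	for i := 0; i < 5; i++ {
		fmt.Print(p.nextExporterIndex(), " ")
	}
	fmt.Println() // prints: 0 1 2 0 1
}
```

Because `atomic.Uint32.Add` is a single atomic instruction, concurrent callers distribute requests across connections without a mutex.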

Documentation

  • Updated README.md with configuration details and examples
  • Added changelog entry in .chloggen/
  • Included high-throughput configuration example


Testing

  • ✅ All existing tests pass
  • ✅ No compilation errors
  • ✅ Configuration validation works correctly
  • ✅ Backward compatible

Usage Example

```yaml
exporters:
  otlp/high-throughput:
    endpoint: otel-gateway:443
    connection_pool_size: 5  # Creates 5 gRPC connections
    compression: snappy
    timeout: 20s
    sending_queue:
      num_consumers: 100
      queue_size: 2000
```

When service.telemetry.metrics.level is set to 'none', the collector
should skip registering process metrics to avoid errors on platforms
where gopsutil is not supported (such as AIX).

This change conditionally registers process metrics only when the
metrics level is not LevelNone, preventing the 'failed to register
process metrics: not implemented yet' error on unsupported platforms.

Fixes regression introduced in v0.136.0 where the check for metrics
level was removed.
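The conditional registration this commit message describes can be sketched roughly as below. This is a simplified stand-in, not the collector's actual code: the `Level` constants and `registerProcessMetrics` are illustrative versions of the collector's telemetry types and its gopsutil-backed registration.

```go
package main

import (
	"errors"
	"fmt"
)

// Level is a simplified stand-in for the collector's telemetry metrics level.
type Level int

const (
	LevelNone Level = iota
	LevelBasic
	LevelNormal
	LevelDetailed
)

// registerProcessMetrics stands in for the gopsutil-backed registration,
// which fails on unsupported platforms such as AIX.
func registerProcessMetrics() error {
	return errors.New("failed to register process metrics: not implemented yet")
}

// setupProcessMetrics applies the fix: skip registration entirely when
// the metrics level is "none", so unsupported platforms never hit the error.
func setupProcessMetrics(level Level) error {
	if level == LevelNone {
		return nil
	}
	return registerProcessMetrics()
}

func main() {
	fmt.Println(setupProcessMetrics(LevelNone))   // <nil>: no error when disabled
	fmt.Println(setupProcessMetrics(LevelNormal)) // error surfaces only when enabled
}
```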
Similar to the resolution for pcommon.Value in previous changes, this update
ensures consistent documentation across all pdata types by clarifying that
calling functions on zero-initialized instances is invalid usage.

Changes:
- Updated template files (one_of_field.go, one_of_message_value.go) to generate
  improved comment wording
- Updated pcommon/value.go comments manually
- Updated all generated pdata files to use consistent wording:
  'is invalid and will cause a panic' instead of 'will cause a panic'

This makes it clearer that using zero-initialized instances is not just
dangerous but explicitly invalid usage, improving API documentation clarity.

…onfig file endpoints

Fixes open-telemetry#14286

When both OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable and
a configured endpoint in the config file are present, the URL scheme
from the environment variable was incorrectly overriding the scheme
from the config file, resulting in mixed endpoints (e.g., http scheme
from env var + path from config file).

This fix ensures that environment variables do not override explicitly
configured endpoints by temporarily unsetting the OTEL_EXPORTER_OTLP_*_ENDPOINT
environment variables before creating the SDK, then restoring them afterward.

According to the OpenTelemetry specification, explicit configuration
should take precedence over environment variables.

Changes:
- Modified sdk.go to temporarily unset OTEL_EXPORTER_OTLP_*_ENDPOINT
  environment variables before calling config.NewSDK()
- Added helper functions unsetOTLPEndpointEnvVars() and restoreEnvVars()
- Added comprehensive tests to verify env vars don't override config
…cenarios

This enhancement adds a connection_pool_size configuration option to the OTLP
exporter, enabling multiple gRPC connections with round-robin load balancing.

Key changes:
- Add connection_pool_size config parameter (default: 0, uses 1 connection)
- Implement round-robin load balancing across multiple connections
- Support for 1-256 concurrent gRPC connections
- Backward compatible: default behavior unchanged

This resolves performance issues in high-throughput environments (10K+ spans/sec)
and high-latency network scenarios where a single gRPC connection becomes a
bottleneck.

Also fixes unrelated service.go issue per contributor feedback on PR open-telemetry#14342.
@Arunodoy18 Arunodoy18 requested review from a team, bogdandrutu and dmitryax as code owners January 6, 2026 07:23
@Arunodoy18
Author

I hope this works well, as was discussed in the issue. If any unrelated changes or anything else wrong turns up, please let me know in the review.
Thank you

Member

@bogdandrutu bogdandrutu left a comment


Can you show me some data that demonstrates this is needed? gRPC says that you don't need to do this and that it will automatically use multiple sockets, etc.

@tank-500m
Contributor

It doesn’t seem general enough to justify inclusion in the core component.
Also, there appear to be viable alternatives (e.g., the loadbalancing exporter in opentelemetry-collector-contrib).

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions
Contributor

github-actions Bot commented Feb 5, 2026

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@djluck

djluck commented Feb 5, 2026

@tank-500m I wanted to put across my perspective as a user that is running into this problem.

I wrote up #14249 that outlines the problem that this solution would solve.

To summarize why I think this would be a good change:

  • gRPC uses a single connection and multiplexes streams over it, which brings a couple of disadvantages:
    • Connections are subject to head-of-line (HOL) blocking at the TCP level
    • Some LBs (e.g. AWS ALB) put an upper limit on the number of concurrent streams a single connection can support
  • The loadbalancing exporter is an option, but it requires maintaining duplicate DNS records and a chunk of repetitive configuration.

@Arunodoy18
Author

Thanks for pointing me back to this thread.

I understand now that I should have continued the discussion here instead of opening new PRs for similar approaches — apologies for the noise earlier.

For now, I’m not planning to push implementation changes. I’ll keep this thread in mind if there are future design discussions around OTLP exporter connection behavior or related performance topics.

In the meantime, I’ll focus on smaller, localized contributions while continuing to learn more about the Collector architecture and past design decisions in this area.

Thanks for the guidance.

