So here is what's happening and how it works right now.
How it Works
Right now we have 2 transports:
- curl (uses a JSON serializer)
- grpc (uses a Protobuf serializer)
Transport has pretty much 2 important methods:
- Transport::sendXXX() - async send (logs / metrics / traces)
- Transport::shutdown() - iterates over all "promises" and checks their results
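To make the contract concrete, here is a minimal sketch of that send/shutdown shape. All names here (`TransportInterface`, `PendingRequest`, `FakeTransport`) are illustrative stand-ins, not the actual SDK API - the point is only the "fire async, collect statuses at shutdown" flow:

```php
<?php
// Hypothetical sketch of the transport contract described above.

interface TransportInterface
{
    /** Queues a serialized batch for async delivery; returns a deferred result. */
    public function send(string $payload): PendingRequest;

    /** Resolves all pending requests and reports whether every one succeeded. */
    public function shutdown(): bool;
}

final class PendingRequest
{
    public function __construct(private int $statusCode) {}

    public function await(): int
    {
        // In the real curl/grpc transports this would block on the handle;
        // here we just return a pre-baked status for illustration.
        return $this->statusCode;
    }
}

final class FakeTransport implements TransportInterface
{
    /** @var PendingRequest[] */
    private array $pending = [];

    public function send(string $payload): PendingRequest
    {
        // Pretend every request eventually succeeds with 200.
        $request = new PendingRequest(200);
        $this->pending[] = $request;
        return $request;
    }

    public function shutdown(): bool
    {
        $allOk = true;
        foreach ($this->pending as $request) {
            $status = $request->await();
            $allOk = $allOk && $status >= 200 && $status < 300;
        }
        return $allOk;
    }
}

$transport = new FakeTransport();
$transport->send('{"resourceLogs": []}');
var_dump($transport->shutdown()); // bool(true)
```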
When the OTEL Collector is working and spawned locally, everything is fine - the app barely feels the presence of telemetry.
Signals (logs / metrics / traces) are first stored in memory; they all go through a BatchingProcessor (a processor is a mechanism that takes raw signals and passes them to an exporter - exporters simply pass signals on to transports).
So transports always receive signals in batches. For example, during one PHP process lifetime we might collect:
- 1 trace with 20 spans
- 10 log records
- 4 metrics
When those are passed to the Transport, we send the following requests:
- 1x request with 20 spans
- 1x request with 10 logs
- 1x request with 4 metrics
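That per-signal batching can be sketched as follows. `BatchingProcessor` here is an illustrative stand-in for the real SDK class - it only demonstrates the invariant that every non-empty in-memory buffer becomes exactly one request at flush time:

```php
<?php
// Sketch: signals are buffered in memory and handed to the exporter as one
// batch (one request) per signal type.

final class BatchingProcessor
{
    /** @var array<string, array<int, mixed>> buffered signals by type */
    private array $buffers = ['traces' => [], 'logs' => [], 'metrics' => []];

    /** @var string[] payloads that were "sent" - one request per batch */
    public array $sentRequests = [];

    public function add(string $type, mixed $signal): void
    {
        $this->buffers[$type][] = $signal;
    }

    /** On shutdown every non-empty buffer becomes exactly one request. */
    public function flush(): void
    {
        foreach ($this->buffers as $type => $signals) {
            if ($signals !== []) {
                $this->sentRequests[] = sprintf('%s: %d items', $type, count($signals));
                $this->buffers[$type] = [];
            }
        }
    }
}

$processor = new BatchingProcessor();
for ($i = 0; $i < 20; $i++) { $processor->add('traces', "span-$i"); }
for ($i = 0; $i < 10; $i++) { $processor->add('logs', "record-$i"); }
for ($i = 0; $i < 4; $i++)  { $processor->add('metrics', "metric-$i"); }
$processor->flush();

print_r($processor->sentRequests); // 3 requests: traces 20, logs 10, metrics 4
```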
And those (according to best practices) should go to a local instance of the OTEL Collector, which can internally buffer them to the filesystem, process them if necessary (e.g. sampling or labeling), and then push them to an APM like DataDog / Sentry / Aspire / etc.
I visualized this process here: https://miro.com/app/board/uXjVHYUi0kA=/?share_link_id=996309443669
The Problem
If the OTEL Collector responds with anything other than 2xx to any of those 3 requests, we need to decide internally what to do.
Spans / Metrics - here we can simply ignore the failure, since those are often sampled anyway.
Logs - and here is our real problem.
The problem is that because we send requests asynchronously, we don't know right away that a request failed.
This means we can send, for example, 5 requests with logs:
- request 1 - 50 log records
- request 2 - 50 log records
- request 3 - 50 log records
- request 4 - 50 log records
- request 5 - 10 log records
And then at the end of the process when we check statuses of those requests we might get this:
- request 1 - 50 log records - 200 success
- request 2 - 50 log records - 200 success
- request 3 - 50 log records - 422 failure
- request 4 - 50 log records - 200 success
- request 5 - 10 log records - 200 success
So the problem is that at the shutdown level we:
- no longer have the original requests
- no longer have those logs in memory
- can't really retry
OTEL Recommendations
So the OTEL recommendation is the "bounded retry queue + 5-minute backoff" pattern (Go/Java/.NET), which unfortunately is not a fit for PHP's short-lived, per-request processes.
OTEL Requirements
https://opentelemetry.io/docs/specs/otel/error-handling/
The key part is this one:
> The SDK MUST NOT throw unhandled exceptions for errors in their own operations. For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.
Possible Solutions
WAL (Write-Ahead Log)
We could create a special transport that additionally serializes all records passed to it to JSON and writes them to a temporary location - each batch in its own file.
Then, during shutdown, for responses other than 2xx we could move the related log files to a location that the OTEL Collector scans and reads from.
For 2xx responses we could simply remove those files from disk.
This way, even if the communication fails, we still have those logs on disk in a format the OTEL Collector can pick up as if they had been sent to it directly.
Pros: logs are safe - regardless of what's happening with the collector, we are not losing anything
Cons: increased IOPS, and those writes would be synchronous
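A rough sketch of the WAL idea as a decorator-style transport. The file layout, naming, and the simulated statuses are all hypothetical - the point is the write-before-send, then move-or-delete-at-shutdown flow:

```php
<?php
// Sketch: persist every batch to a WAL dir before sending; at shutdown,
// delete the files for delivered batches and move failed ones to a
// directory the collector is assumed to scan.

final class WalTransport
{
    /** @var array<string, int> batch file => pretend HTTP status */
    private array $inFlight = [];

    public function __construct(
        private string $walDir,      // temporary location for batches
        private string $failoverDir, // directory the collector scans
    ) {}

    public function send(string $jsonBatch, int $simulatedStatus): void
    {
        // 1. Write-ahead: persist the batch before sending it.
        $file = $this->walDir . '/' . uniqid('batch_', true) . '.json';
        file_put_contents($file, $jsonBatch);

        // 2. Fire the async request (simulated here by a canned status).
        $this->inFlight[$file] = $simulatedStatus;
    }

    public function shutdown(): void
    {
        foreach ($this->inFlight as $file => $status) {
            if ($status >= 200 && $status < 300) {
                unlink($file); // delivered - the WAL copy is no longer needed
            } else {
                // Failed - hand the batch over to the collector's scan dir.
                rename($file, $this->failoverDir . '/' . basename($file));
            }
        }
        $this->inFlight = [];
    }
}

$walDir  = sys_get_temp_dir() . '/wal_' . getmypid();
$failDir = sys_get_temp_dir() . '/failover_' . getmypid();
mkdir($walDir);
mkdir($failDir);

$transport = new WalTransport($walDir, $failDir);
$transport->send('{"batch": 1}', 200); // will be deleted at shutdown
$transport->send('{"batch": 2}', 422); // will be moved to failover
$transport->shutdown();

echo count(glob($failDir . '/*.json')); // 1 - only the failed batch survives
```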
Make Transports Synchronous
(Or at least create synchronous versions of them)
If we make transports synchronous, we can immediately detect that a response failed and only then (while we still have the signals in memory) store them in some failover location. After those logs are written, the OTEL Collector handles them.
We would also need to align this with a very short timeout.
Pros: More control over the delivery and reduced risk of losing logs
Cons: higher impact on the process, which is now exposed to networking latency. In the case of a timeout / error we block the whole process while handling the failover.
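The synchronous variant could look roughly like this. The HTTP call is stubbed out (in a real transport it would be a blocking curl call with a timeout of a few hundred milliseconds), and the failover directory naming is made up for the sketch:

```php
<?php
// Sketch: send synchronously, check the status immediately, and fall back
// to disk while the batch is still in memory.

final class SyncFailoverTransport
{
    public function __construct(private string $failoverDir) {}

    public function send(string $jsonBatch): bool
    {
        $status = $this->doRequest($jsonBatch);

        if ($status >= 200 && $status < 300) {
            return true;
        }

        // Failure detected right away - the batch is still in memory,
        // so we can persist it for the collector to pick up later.
        $file = $this->failoverDir . '/' . uniqid('failed_', true) . '.json';
        file_put_contents($file, $jsonBatch);
        return false;
    }

    private function doRequest(string $jsonBatch): int
    {
        // Stub: pretend the collector rejects batches containing "bad".
        return str_contains($jsonBatch, 'bad') ? 422 : 200;
    }
}

$failDir = sys_get_temp_dir() . '/sync_failover_' . getmypid();
mkdir($failDir);

$transport = new SyncFailoverTransport($failDir);
var_dump($transport->send('{"logs": ["ok"]}'));  // bool(true)
var_dump($transport->send('{"logs": ["bad"]}')); // bool(false), batch on disk
```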
Anything else?
Suggestions/recommendations are very much welcome 🙏