So here is what's happening and how it works right now.
How it Works
Right now we have 2 transports:
- curl (uses a JSON serializer)
- grpc (uses a Protobuf serializer)
Transport has pretty much 2 important methods:
- Transport::sendXXX() - async send (logs / metrics / traces)
- Transport::shutdown() - iterates over all "promises" and checks their results
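To make the contract concrete, here is a minimal sketch of that send/shutdown shape. All names here (`TransportInterface`, `PendingRequest`, `FakeTransport`) are illustrative stand-ins, not the actual SDK API - the point is only the "fire async, collect statuses at shutdown" flow:

```php
<?php
// Hypothetical sketch of the transport contract described above.

interface TransportInterface
{
    /** Queues a serialized batch for async delivery; returns a deferred result. */
    public function send(string $payload): PendingRequest;

    /** Resolves all pending requests and reports whether every one succeeded. */
    public function shutdown(): bool;
}

final class PendingRequest
{
    public function __construct(private int $statusCode) {}

    public function await(): int
    {
        // In the real curl/grpc transports this would block on the handle;
        // here we just return a pre-baked status for illustration.
        return $this->statusCode;
    }
}

final class FakeTransport implements TransportInterface
{
    /** @var PendingRequest[] */
    private array $pending = [];

    public function send(string $payload): PendingRequest
    {
        // Pretend every request eventually succeeds with 200.
        $request = new PendingRequest(200);
        $this->pending[] = $request;
        return $request;
    }

    public function shutdown(): bool
    {
        $allOk = true;
        foreach ($this->pending as $request) {
            $status = $request->await();
            $allOk = $allOk && $status >= 200 && $status < 300;
        }
        return $allOk;
    }
}

$transport = new FakeTransport();
$transport->send('{"resourceLogs": []}');
var_dump($transport->shutdown()); // bool(true)
```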
When the OTEL Collector is working and spawned locally, everything is fine - the app barely feels the presence of telemetry.
Signals (logs / metrics / traces) are first stored in memory; they all go through a BatchingProcessor (a processor is a mechanism that takes raw signals and passes them to an exporter - exporters simply pass signals on to transports).
So transports always receive signals in batches. For example, during one PHP process lifetime we might collect:
- 1 trace with 20 spans
- 10 log records
- 4 metrics
When those are passed to the Transport, we send the following requests:
- 1x request with 20 spans
- 1x request with 10 logs
- 1x request with 4 metrics
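That per-signal batching can be sketched as follows. `BatchingProcessor` here is an illustrative stand-in for the real SDK class - it only demonstrates the invariant that every non-empty in-memory buffer becomes exactly one request at flush time:

```php
<?php
// Sketch: signals are buffered in memory and handed to the exporter as one
// batch (one request) per signal type.

final class BatchingProcessor
{
    /** @var array<string, array<int, mixed>> buffered signals by type */
    private array $buffers = ['traces' => [], 'logs' => [], 'metrics' => []];

    /** @var string[] payloads that were "sent" - one request per batch */
    public array $sentRequests = [];

    public function add(string $type, mixed $signal): void
    {
        $this->buffers[$type][] = $signal;
    }

    /** On shutdown every non-empty buffer becomes exactly one request. */
    public function flush(): void
    {
        foreach ($this->buffers as $type => $signals) {
            if ($signals !== []) {
                $this->sentRequests[] = sprintf('%s: %d items', $type, count($signals));
                $this->buffers[$type] = [];
            }
        }
    }
}

$processor = new BatchingProcessor();
for ($i = 0; $i < 20; $i++) { $processor->add('traces', "span-$i"); }
for ($i = 0; $i < 10; $i++) { $processor->add('logs', "record-$i"); }
for ($i = 0; $i < 4; $i++)  { $processor->add('metrics', "metric-$i"); }
$processor->flush();

print_r($processor->sentRequests); // 3 requests: traces 20, logs 10, metrics 4
```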
And those (according to best practices) should go to a local instance of the OTEL Collector, which can internally buffer them to the filesystem, process them if necessary (e.g. sampling or labeling), and then push them to an APM like DataDog / Sentry / Aspire / etc.
I visualized this process here: https://miro.com/app/board/uXjVHYUi0kA=/?share_link_id=996309443669
The Problem
If the OTEL Collector responds with anything other than 2xx to any of those 3 requests, we need to decide internally what to do.
Spans / Metrics - here we can simply ignore the failure, since those are often sampled anyway.
Logs - and here is our real problem.
The problem is that because we send requests asynchronously, we don't know right away that a request failed.
This means we can send, for example, 5 requests with logs:
- request 1 - 50 log records
- request 2 - 50 log records
- request 3 - 50 log records
- request 4 - 50 log records
- request 5 - 10 log records
And then at the end of the process when we check statuses of those requests we might get this:
- request 1 - 50 log records - 200 success
- request 2 - 50 log records - 200 success
- request 3 - 50 log records - 422 failure
- request 4 - 50 log records - 200 success
- request 5 - 10 log records - 200 success
So the problem is that at the shutdown level we:
- no longer have the original requests
- no longer have those logs in memory
- can't really retry
OTEL Recommendations
So the OTEL recommendation is the "bounded retry queue + 5-minute backoff" pattern (Go/Java/.NET), which unfortunately is not a fit for PHP's short-lived, per-request processes.
OTEL Requirements
https://opentelemetry.io/docs/specs/otel/error-handling/
The key part is this one:
> The SDK MUST NOT throw unhandled exceptions for errors in their own operations. For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.
Possible Solutions
WAL (Write-Ahead Log)
We could create a special transport that additionally serializes all records passed to it to JSON and writes them to a temporary location - each batch in its own file.
Then, during shutdown, for responses other than 2xx we could move the related log files to a location that the OTEL Collector scans and reads from.
For 2xx responses we could simply remove those files from disk.
This way, even if the communication fails, we still have those logs on disk in a format the OTEL Collector can pick up as if they had been sent to it directly.
Pros: logs are safe - regardless of what's happening with the collector, we are not losing anything
Cons: increased IOPS, and those writes would be synchronous
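A rough sketch of the WAL idea as a decorator-style transport. The file layout, naming, and the simulated statuses are all hypothetical - the point is the write-before-send, then move-or-delete-at-shutdown flow:

```php
<?php
// Sketch: persist every batch to a WAL dir before sending; at shutdown,
// delete the files for delivered batches and move failed ones to a
// directory the collector is assumed to scan.

final class WalTransport
{
    /** @var array<string, int> batch file => pretend HTTP status */
    private array $inFlight = [];

    public function __construct(
        private string $walDir,      // temporary location for batches
        private string $failoverDir, // directory the collector scans
    ) {}

    public function send(string $jsonBatch, int $simulatedStatus): void
    {
        // 1. Write-ahead: persist the batch before sending it.
        $file = $this->walDir . '/' . uniqid('batch_', true) . '.json';
        file_put_contents($file, $jsonBatch);

        // 2. Fire the async request (simulated here by a canned status).
        $this->inFlight[$file] = $simulatedStatus;
    }

    public function shutdown(): void
    {
        foreach ($this->inFlight as $file => $status) {
            if ($status >= 200 && $status < 300) {
                unlink($file); // delivered - the WAL copy is no longer needed
            } else {
                // Failed - hand the batch over to the collector's scan dir.
                rename($file, $this->failoverDir . '/' . basename($file));
            }
        }
        $this->inFlight = [];
    }
}

$walDir  = sys_get_temp_dir() . '/wal_' . getmypid();
$failDir = sys_get_temp_dir() . '/failover_' . getmypid();
mkdir($walDir);
mkdir($failDir);

$transport = new WalTransport($walDir, $failDir);
$transport->send('{"batch": 1}', 200); // will be deleted at shutdown
$transport->send('{"batch": 2}', 422); // will be moved to failover
$transport->shutdown();

echo count(glob($failDir . '/*.json')); // 1 - only the failed batch survives
```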
Make Transports Synchronous
(Or at least create synchronous versions of them)
If we make transports synchronous, we can immediately detect that a response failed and only then (while we still have the signals in memory) store them in some failover location. After those logs are written, the OTEL Collector handles them.
We would also need to align this with a very short timeout.
Pros: More control over the delivery and reduced risk of losing logs
Cons: higher impact on the process, which is now exposed to networking latency. In the case of a timeout / error we block the whole process while handling the failover.
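The synchronous variant could look roughly like this. The HTTP call is stubbed out (in a real transport it would be a blocking curl call with a timeout of a few hundred milliseconds), and the failover directory naming is made up for the sketch:

```php
<?php
// Sketch: send synchronously, check the status immediately, and fall back
// to disk while the batch is still in memory.

final class SyncFailoverTransport
{
    public function __construct(private string $failoverDir) {}

    public function send(string $jsonBatch): bool
    {
        $status = $this->doRequest($jsonBatch);

        if ($status >= 200 && $status < 300) {
            return true;
        }

        // Failure detected right away - the batch is still in memory,
        // so we can persist it for the collector to pick up later.
        $file = $this->failoverDir . '/' . uniqid('failed_', true) . '.json';
        file_put_contents($file, $jsonBatch);
        return false;
    }

    private function doRequest(string $jsonBatch): int
    {
        // Stub: pretend the collector rejects batches containing "bad".
        return str_contains($jsonBatch, 'bad') ? 422 : 200;
    }
}

$failDir = sys_get_temp_dir() . '/sync_failover_' . getmypid();
mkdir($failDir);

$transport = new SyncFailoverTransport($failDir);
var_dump($transport->send('{"logs": ["ok"]}'));  // bool(true)
var_dump($transport->send('{"logs": ["bad"]}')); // bool(false), batch on disk
```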
Anything else?
Suggestions/recommendations are very much welcome 🙏