Summary
The test "ObsConnMetrics: HTTP/1.1 first request bumps protocol gauge" fails intermittently on shared CI runners (GitHub Actions ubuntu-latest). Observed 1/1894 failure on a clean build with no code changes.
Failure output
[FAIL] ObsConnMetrics: HTTP/1.1 first request bumps protocol gauge
Error: got_200=1 h1=0.000000 h2=0.000000
The test receives a 200 OK (got_200=1) but the protocol gauge reads 0.0 for both h1 and h2.
Root cause
AttachTransportObservability runs on the accept dispatcher thread AFTER the connection is handed to a socket worker via EnQueue. The socket worker can parse the HTTP/1.1 request headers and call MarkApplicationProtocolConfirmed("http/1.1") before the accept dispatcher executes AttachTransportObservability. When this happens, obs_attached_.load(acquire) returns false at connection_handler.cc:68, the gauge bump is skipped entirely, and re-confirmation on subsequent keep-alive requests is suppressed by the http_protocol_label_ != nullptr early-return at line 69 (which never triggers because the label was never set in the first place).
This is the documented race from pitfalls/REACTOR_CORE.md ("Cross-thread observability pointer write without publication barrier"). The atomic guard correctly prevents undefined behavior, but it means the gauge publication is silently skipped for fast-parsed connections where the worker outpaces the accept dispatcher.
The H2 variant (TestH2PrefaceSetsProtocolGauge) does not exhibit this flake because the H2 connection setup involves multiple round trips (preface + SETTINGS + SETTINGS ACK), giving the accept dispatcher enough time to run.
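The ordering dependence described above can be replayed deterministically on a single thread. The following is a minimal sketch, not the server's actual code: ConnModel, h1_gauge, and GaugeAfter are hypothetical names, and the method bodies only assume the guard order stated in this issue (the obs_attached_ check first, then the http_protocol_label_ check).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the per-connection state. Only obs_attached_ and
// http_protocol_label_ are names taken from this issue; everything else is
// an assumption made for illustration.
struct ConnModel {
    std::atomic<bool> obs_attached_{false};
    const char* http_protocol_label_ = nullptr;
    double h1_gauge = 0.0;

    // In the real server this runs on the accept dispatcher thread.
    void AttachTransportObservability() {
        obs_attached_.store(true, std::memory_order_release);
    }

    // In the real server this runs on the socket worker after header parsing.
    void MarkApplicationProtocolConfirmed(const char* label) {
        if (!obs_attached_.load(std::memory_order_acquire)) return;  // silent skip
        if (http_protocol_label_ != nullptr) return;                 // already confirmed
        http_protocol_label_ = label;
        h1_gauge += 1.0;
    }
};

// Replays one of the two possible orderings and returns the resulting gauge.
inline double GaugeAfter(bool worker_first) {
    ConnModel c;
    if (worker_first) {
        c.MarkApplicationProtocolConfirmed("http/1.1");
        c.AttachTransportObservability();
    } else {
        c.AttachTransportObservability();
        c.MarkApplicationProtocolConfirmed("http/1.1");
    }
    return c.h1_gauge;
}

// Usage: the attach-first order bumps the gauge, the worker-first order does not.
inline void CheckOrderings() {
    assert(GaugeAfter(false) == 1.0);
    assert(GaugeAfter(true) == 0.0);
}
```

In this model the worker-first order leaves the gauge at 0.0 even though the request itself succeeds, which matches the got_200=1 h1=0.000000 output.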
Relevant code
- Test: test/observability/observability_connection_metrics_test.h:280 (TestH1FirstRequestSetsProtocolGauge)
- Gauge publish: server/connection_handler.cc:59 (MarkApplicationProtocolConfirmed)
- Attach point: server/connection_handler.cc:37 (AttachTransportObservability)
- Call site: server/http/http_connection_handler.cc:2817 (headers_complete path)
Possible fixes
- Test-side retry loop: poll the snapshot a few times (e.g., 5 × 50 ms) before declaring failure. Cheapest fix, addresses the test flake without touching production code.
- Deferred protocol confirmation: if obs_attached_ is false at MarkApplicationProtocolConfirmed time, stash the label and let AttachTransportObservability publish it when it runs. This would fix the silent skip for production too (the gauge undercounts on fast connections today).
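Both options can be sketched roughly as follows. This is an illustration under stated assumptions, not a patch: PollUntil, DeferredConn, pending_label_, and gauge() are hypothetical names, and the sketch uses a mutex for simplicity where the real code uses an atomic guard. Polling can only help if the gauge is eventually published; if the bump is skipped outright, as the root-cause section describes, only the deferred approach closes the gap.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Option 1 (test side): re-check a condition a bounded number of times,
// sleeping between attempts. Returns true as soon as ready() holds.
inline bool PollUntil(const std::function<bool()>& ready, int attempts,
                      std::chrono::milliseconds interval) {
    for (int i = 0; i < attempts; ++i) {
        if (ready()) return true;
        if (i + 1 < attempts) std::this_thread::sleep_for(interval);
    }
    return false;
}

// Option 2 (production side): hypothetical connection model that stashes the
// label when observability is not attached yet and publishes it on attach.
class DeferredConn {
public:
    void AttachTransportObservability() {
        std::lock_guard<std::mutex> lock(mu_);
        attached_ = true;
        if (pending_label_ != nullptr) {
            PublishLocked(pending_label_);  // flush the stashed confirmation
            pending_label_ = nullptr;
        }
    }

    void MarkApplicationProtocolConfirmed(const char* label) {
        std::lock_guard<std::mutex> lock(mu_);
        if (published_label_ != nullptr || pending_label_ != nullptr) return;
        if (!attached_) {
            pending_label_ = label;  // defer instead of skipping
            return;
        }
        PublishLocked(label);
    }

    double gauge() const {
        std::lock_guard<std::mutex> lock(mu_);
        return gauge_;
    }

private:
    // Caller must hold mu_.
    void PublishLocked(const char* label) {
        published_label_ = label;
        gauge_ += 1.0;
    }

    mutable std::mutex mu_;
    bool attached_ = false;
    const char* pending_label_ = nullptr;
    const char* published_label_ = nullptr;
    double gauge_ = 0.0;
};

// Usage: the worker-first order still ends with exactly one gauge bump.
inline void DeferredConfirmationDemo() {
    DeferredConn c;
    c.MarkApplicationProtocolConfirmed("http/1.1");  // worker wins the race
    assert(c.gauge() == 0.0);
    c.AttachTransportObservability();                // dispatcher catches up
    assert(c.gauge() == 1.0);
}
```

In a test, the poll helper would wrap the snapshot read, for example PollUntil(check, 5, std::chrono::milliseconds(50)) with check being whatever predicate the test already evaluates once.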
Environment
- Ubuntu 24.04 (GitHub Actions ubuntu-latest)
- GCC default, no sanitizers
- Reproduced on shared runners; not observed locally (likely timing-dependent)