Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43

Description

@mwfj

Summary

The test "ObsConnMetrics: HTTP/1.1 first request bumps protocol gauge" fails intermittently on shared CI runners (GitHub Actions ubuntu-latest). Observed 1/1894 failure on a clean build with no code changes.

Failure output

[FAIL] ObsConnMetrics: HTTP/1.1 first request bumps protocol gauge
       Error: got_200=1 h1=0.000000 h2=0.000000

The test receives a 200 OK (got_200=1) but the protocol gauge reads 0.0 for both h1 and h2.

Root cause

AttachTransportObservability runs on the accept dispatcher thread after the connection has been handed to a socket worker via EnQueue, so the socket worker can parse the HTTP/1.1 request headers and call MarkApplicationProtocolConfirmed("http/1.1") before the accept dispatcher executes AttachTransportObservability. When this happens, obs_attached_.load(acquire) returns false at connection_handler.cc:68 and the gauge bump is skipped entirely. The http_protocol_label_ != nullptr early-return at line 69, which normally suppresses re-confirmation on subsequent keep-alive requests, never triggers because the label was never set in the first place.

This is the documented race from pitfalls/REACTOR_CORE.md ("Cross-thread observability pointer write without publication barrier"). The atomic guard correctly prevents undefined behavior, but it means the gauge publication is silently skipped for fast-parsed connections where the worker outpaces the accept dispatcher.

The H2 variant (TestH2PrefaceSetsProtocolGauge) does not exhibit this flake because the H2 connection setup involves multiple round trips (preface + SETTINGS + SETTINGS ACK), giving the accept dispatcher enough time to run.
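
The two orderings described above can be replayed deterministically in a small single-threaded model. This is a sketch, not the server's code: ConnModel, h1_gauge, and GaugeAfter are hypothetical names, and the real class layout and gauge registry are assumptions. Only the order of the two guards (attached check, then label check) is taken from the description.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the per-connection state. Member names mirror
// the issue text; everything else is an assumption made for illustration.
struct ConnModel {
    std::atomic<bool> obs_attached_{false};
    const char* http_protocol_label_ = nullptr;
    double h1_gauge = 0.0;  // stands in for the shared protocol gauge

    // Runs on the accept dispatcher in the real server.
    void AttachTransportObservability() {
        obs_attached_.store(true, std::memory_order_release);
    }

    // Runs on the socket worker in the real server.
    void MarkApplicationProtocolConfirmed(const char* label) {
        if (!obs_attached_.load(std::memory_order_acquire)) return;  // silent skip
        if (http_protocol_label_ != nullptr) return;  // already confirmed
        http_protocol_label_ = label;
        h1_gauge += 1.0;
    }
};

// Replays one interleaving and returns the gauge value the test would read.
double GaugeAfter(bool attach_first) {
    ConnModel c;
    if (attach_first) {
        c.AttachTransportObservability();
        c.MarkApplicationProtocolConfirmed("http/1.1");
    } else {
        c.MarkApplicationProtocolConfirmed("http/1.1");  // worker wins the race
        c.AttachTransportObservability();
    }
    return c.h1_gauge;
}
```

In this model, GaugeAfter(false) corresponds to the failing run (the response is served but the gauge stays at zero), and GaugeAfter(true) corresponds to the usual passing run.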

Relevant code

  • Test: test/observability/observability_connection_metrics_test.h:280 (TestH1FirstRequestSetsProtocolGauge)
  • Gauge publish: server/connection_handler.cc:59 (MarkApplicationProtocolConfirmed)
  • Attach point: server/connection_handler.cc:37 (AttachTransportObservability)
  • Call site: server/http/http_connection_handler.cc:2817 (headers_complete path)

Possible fixes

  1. Test-side retry loop — poll the snapshot a few times (e.g., 5 × 50ms) before declaring failure. Cheapest fix, addresses the test flake without touching production code.

  2. Deferred protocol confirmation — if obs_attached_ is false at MarkApplicationProtocolConfirmed time, stash the label and let AttachTransportObservability publish it when it runs. This would fix the silent-skip for production too (the gauge undercounts on fast connections today).
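
For option 1, a polling helper along the following lines could wrap the snapshot read. PollUntil is a hypothetical name, not an existing helper in the test suite, and the defaults simply mirror the 5 × 50ms figure suggested above.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical test helper: evaluates `predicate` up to `attempts` times,
// sleeping `interval` between tries, and returns true on the first success.
inline bool PollUntil(const std::function<bool()>& predicate,
                      int attempts = 5,
                      std::chrono::milliseconds interval =
                          std::chrono::milliseconds(50)) {
    for (int i = 0; i < attempts; ++i) {
        if (predicate()) return true;
        // Skip the sleep after the final attempt so failures return promptly.
        if (i + 1 < attempts) std::this_thread::sleep_for(interval);
    }
    return false;
}
```

The test would pass a lambda that takes a fresh snapshot and checks the h1 gauge. One caveat worth verifying before adopting this: if publication is skipped outright, as described under Root cause, re-reading the snapshot alone may never observe a non-zero value, so the predicate might also need to re-issue a request on the same connection.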

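For option 2, the stash-and-publish idea might look roughly like the sketch below. It uses a mutex in place of the existing acquire/release atomic purely to keep the example short and obviously correct. DeferredConn, pending_label_, publishes, and PublishCount are hypothetical names, and the real publish step (bumping the gauge) is reduced to a counter.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of deferred protocol confirmation. The mutex is an
// assumption for brevity; the real code guards this state with an atomic.
struct DeferredConn {
    std::mutex mu_;
    bool obs_attached_ = false;
    const char* pending_label_ = nullptr;        // stashed while not attached
    const char* http_protocol_label_ = nullptr;  // set once published
    int publishes = 0;                           // stands in for the gauge bump

    void MarkApplicationProtocolConfirmed(const char* label) {
        std::lock_guard<std::mutex> lock(mu_);
        // Confirm at most once, whether already published or still pending.
        if (http_protocol_label_ != nullptr || pending_label_ != nullptr) return;
        if (!obs_attached_) {
            pending_label_ = label;  // defer instead of silently skipping
            return;
        }
        PublishLocked(label);
    }

    void AttachTransportObservability() {
        std::lock_guard<std::mutex> lock(mu_);
        obs_attached_ = true;
        if (pending_label_ != nullptr) {
            PublishLocked(pending_label_);  // flush the stashed confirmation
            pending_label_ = nullptr;
        }
    }

private:
    // Caller must hold mu_.
    void PublishLocked(const char* label) {
        http_protocol_label_ = label;
        ++publishes;
    }
};

// Replays one ordering and returns how many times the label was published.
int PublishCount(bool worker_first) {
    DeferredConn c;
    if (worker_first) {
        c.MarkApplicationProtocolConfirmed("http/1.1");
        c.AttachTransportObservability();
    } else {
        c.AttachTransportObservability();
        c.MarkApplicationProtocolConfirmed("http/1.1");
    }
    c.MarkApplicationProtocolConfirmed("http/1.1");  // keep-alive re-confirmation
    return c.publishes;
}
```

Under these assumptions the label is published exactly once in either ordering, so the gauge would no longer depend on which thread runs first.
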
Environment

  • Ubuntu 24.04 (GitHub Actions ubuntu-latest)
  • GCC default, no sanitizers
  • Reproduced on shared runners; not observed locally (likely timing-dependent)
