Skip to content

Optimize week valgrind ci job#44

Merged
mwfj merged 3 commits into
mainfrom
fix-valgrind-check-memory-leak-issue-2
May 27, 2026
Merged

Optimize week valgrind ci job#44
mwfj merged 3 commits into
mainfrom
fix-valgrind-check-memory-leak-issue-2

Conversation

@mwfj

@mwfj mwfj commented May 26, 2026

Copy link
Copy Markdown
Owner

fix(observability, ci): protocol-gauge race fix, 6-shard valgrind matrix, and code-review hardening

Closes #43.

Summary

Three groups of changes driven by the weekly-valgrind workflow + post-merge code review:

  1. Code (server/http/http_router.cc, prior commit) — broke an AsyncPendingState reference cycle that leaked across the auth-introspection async path. (Covers the three Valgrind shards that failed with make_shared<AsyncPendingState> leaks.)

  2. CI topology (.github/workflows/weekly-valgrind.yml) — split auth-crypto into three balanced shards (auth-jwks / auth-mgr / auth-intro) using measured per-suite times, renamed proxy-heavyproxy-mixed to flag its cross-theme content, dropped auth_race per the race-suite-under-valgrind exclusion, skipped three timing-sensitive concurrency tests under VALGRIND_TEST. Final topology: 6 shards, ~25 min critical path (down from 78 min on the prior run).

  3. Issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 fix + code-review hardening (server/connection_handler.cc, include/connection_handler.h) — fixed the flaky ObsConnMetrics: HTTP/1.1 first request bumps protocol gauge test by adding a stash + replay protocol so MarkApplicationProtocolConfirmed no longer silently drops the gauge bump when it races ahead of AttachTransportObservability. Subsequent xhigh-effort code review surfaced 12 additional defects in the initial stash mechanism; all fixed in this PR.

Group 3 in detail (post-review revision)

The initial stash + replay fix introduced a WebSocket-upgrade race: a late-arriving replay could +1 the http/1.1 gauge for a connection that had already become a WebSocket, causing sum(http.connections.active{protocol=*}) to double-count throughout the WS session (verifier-CONFIRMED with full thread interleaving). Plus 11 latent / hardening / doc issues across atomics, exception safety, and observability.

Fixes applied:

# Finding Fix
1 WS race (CONFIRMED real defect): replay re-bumps http/1.1 after HandOffToWebSocket ran without protection Added std::atomic<bool> ws_handoff_done_HandOffToWebSocket sets it (release) + drains stash; MarkApplicationProtocolConfirmed checks it (acquire) at entry. Both run on the same worker dispatcher, so program-order + release/acquire arbitrate all interleavings.
2 TestWsUpgradeTransitionsProtocolGauge flake-regression Fixed by #1 — verified with 50× stress loop locally.
3 Exchange-then-EnQueue throw window loses stash with no rollback Pre-load shared_from_this() in try/catch(std::bad_weak_ptr) + pre-check event_dispatcher_ non-null BEFORE the exchange. Either failure logs + returns without consuming the stash.
4 AttachTransportObservability's idempotency early-return runs BEFORE the stash drain Split into two blocks: first-call publish work + always-run stash drain (idempotent via exchange semantics). A defensive second attach call now still drains.
5 shared_from_this() precondition undocumented Wrapped in try/catch + warn-log (per LOGGING_STANDARDS.md). Future caller invoking attach pre-shared_ptr-wrap now degrades gracefully instead of crashing the accept worker.
6 CAS-fail dual-label silent drop Added error-log when expected != protocol_label on CAS failure (mirrors the existing same-label-check pattern). Locks the single-protocol-per-connection invariant for future h2c-upgrade / classifier-hook contributors.
7 EnQueue/null-dispatcher silent drop + misleading "+1/-1 pairing" comment Replaced the misleading comment with a truthful explanation; added warn-logs at both null-dispatcher and bad_weak_ptr branches (per LOGGING_STANDARDS.md).
8 Header docstring claimed "single-reader" but worker also reads Rewrote field docstring to document the actual two-writer/two-reader access pattern and how exchange arbitrates.
10 Take-back identity taken == protocol_label not asserted Added error-log when they differ. Locks the single-writer invariant for future maintainers who might add a second writer to the stash slot.
11 ResetForPoolReuse missing reset for pending_http_protocol_label_ / ws_handoff_done_ Added both resets. Forward-compat for any future code path that calls Mark / HandOff on a recycled outbound connection.
13 auth_intro lived in shard proxy-heavy despite being an auth umbrella sub-suite Renamed shard proxy-heavyproxy-mixed (signals cross-theme content). Updated all doc references.
14 No hendrikmuhs/ccache-action in workflow Added ccache-linux-valgrind action with 500M cap (matches sanitizer-class size from ci.yml). Six shards share one ccache key since they build identical valgrind-flagged binaries.

Deferred (not in this PR — separate larger refactors)

  • Finding Enhance log system #9 (dtor non-atomic read TSan exposure) — would require promoting http_protocol_label_ to std::atomic<const char*> everywhere (dtor + HandOff + Mark + class header). Today safe via dispatcher-thread serialization; not needed for the immediate flake fix.
  • Finding Remove Reactor Server Demo Legacy Code #12 (const char* lifetime API enforcement) — would require API change to a tagged enum or string_view + intern table. Existing header comment ("Holds a pointer to a static label literal") already documents the contract; all current callers comply.

Test plan

  • Default -O2 build — clean, no warnings
  • Valgrind CI build (-O1 -g -fno-omit-frame-pointer) — clean, no warnings
  • VALGRIND_TEST=1 valgrind --error-exitcode=1 --leak-check=full --show-leak-kinds=definite,indirect --errors-for-leak-kinds=definite --track-origins=yes --num-callers=20 ./test_runner obs_connection_metricsERROR SUMMARY: 0 errors from 0 contexts, exit 0
  • obs_connection_metrics × 50 stress loop locally → 0 failures (issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 reproducer no longer flakes)
  • Default full sweep — 1894 / 1894 tests pass
  • YAML matrix integrity — 56 suite invocations across 6 shards, all flags resolve to real ./test_runner suites, no cross-shard duplicates
  • Next manual / scheduled weekly-valgrind.yml run — all 6 shards green; predicted critical path ~25 min (was 78 min before the auth split)
  • Issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 reproducer fixed in CI: TestH1FirstRequestSetsProtocolGauge stops flaking

Risk

Code changes — minimal.

  • ws_handoff_done_ is a new atomic with release-store from HandOffToWebSocket and acquire-load from MarkApplicationProtocolConfirmed's entry. Both run on the worker dispatcher (serialized), so program order plus the release/acquire pair carry the synchronisation correctly across all four worker-vs-attach interleavings (verifier-walked).
  • pending_http_protocol_label_ atomic is now consistently typed (header docstring corrected) and reset in ResetForPoolReuse for forward-compat.
  • All silent-drop branches now carry warn/error logs (LOGGING_STANDARDS.md compliance).
  • No new heap allocations on the hot path; one extra atomic load per Mark call.

CI changes — low.

  • weekly-valgrind.yml is independent of ci.yml; matrix failures only affect this scheduled workflow's badge.
  • proxy-mixed rename is purely cosmetic on the GitHub side (shard names appear in the Actions UI; no external dependencies).
  • ccache-action is the standard pattern used by every other Linux CI job per DEVELOPMENT_RULES.md; first run after merge builds cold, subsequent runs hit the warm cache.

Related (noted for follow-up)

  • A repo-wide audit for additional make_shared<AsyncPendingState> leak sites turned up no others — the cycle break in http_router.cc is structurally complete.
  • Issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 references pitfalls/REACTOR_CORE.md's "Cross-thread observability pointer write without publication barrier" rule. This PR's fix (stash + replay + WS-commitment flag) extends that pattern's coverage to "race-loser publishes via the late attach call" — the rule entry will be updated post-merge to capture the new pattern.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses memory leaks caused by self-cycle references in AsyncPendingState by moving instead of copying the resume_cb_ callback in Complete and ArmResume, and clearing it in TripCancel. It also skips several high-overhead concurrent tests under Valgrind to prevent test timeouts. The reviewer identified a remaining memory leak scenario in ArmResume if TripCancel is invoked before ArmResume is called, and suggested checking the cancellation state to clear the callback immediately.

Comment thread server/http/http_router.cc
@mwfj

mwfj commented May 27, 2026

Copy link
Copy Markdown
Owner Author

LGTM

@mwfj mwfj merged commit a1bc1c7 into main May 27, 2026
6 checks passed
@mwfj mwfj deleted the fix-valgrind-check-memory-leak-issue-2 branch May 27, 2026 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability

1 participant