Optimize week valgrind ci job#44
Merged
Merged
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses memory leaks caused by self-cycle references in AsyncPendingState by moving instead of copying the resume_cb_ callback in Complete and ArmResume, and clearing it in TripCancel. It also skips several high-overhead concurrent tests under Valgrind to prevent test timeouts. The reviewer identified a remaining memory leak scenario in ArmResume if TripCancel is invoked before ArmResume is called, and suggested checking the cancellation state to clear the callback immediately.
Owner
Author
|
LGTM |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(observability, ci): protocol-gauge race fix, 6-shard valgrind matrix, and code-review hardening
Closes #43.
Summary
Three groups of changes driven by the weekly-valgrind workflow + post-merge code review:
Code (
server/http/http_router.cc, prior commit) — broke anAsyncPendingStatereference cycle that leaked across the auth-introspection async path. (Covers the three Valgrind shards that failed withmake_shared<AsyncPendingState>leaks.)CI topology (
.github/workflows/weekly-valgrind.yml) — splitauth-cryptointo three balanced shards (auth-jwks/auth-mgr/auth-intro) using measured per-suite times, renamedproxy-heavy→proxy-mixedto flag its cross-theme content, droppedauth_raceper the race-suite-under-valgrind exclusion, skipped three timing-sensitive concurrency tests underVALGRIND_TEST. Final topology: 6 shards, ~25 min critical path (down from 78 min on the prior run).Issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 fix + code-review hardening (
server/connection_handler.cc,include/connection_handler.h) — fixed the flakyObsConnMetrics: HTTP/1.1 first request bumps protocol gaugetest by adding a stash + replay protocol soMarkApplicationProtocolConfirmedno longer silently drops the gauge bump when it races ahead ofAttachTransportObservability. Subsequent xhigh-effort code review surfaced 12 additional defects in the initial stash mechanism; all fixed in this PR.Group 3 in detail (post-review revision)
The initial stash + replay fix introduced a WebSocket-upgrade race: a late-arriving replay could
+1thehttp/1.1gauge for a connection that had already become a WebSocket, causingsum(http.connections.active{protocol=*})to double-count throughout the WS session (verifier-CONFIRMED with full thread interleaving). Plus 11 latent / hardening / doc issues across atomics, exception safety, and observability.Fixes applied:
http/1.1afterHandOffToWebSocketran without protectionstd::atomic<bool> ws_handoff_done_—HandOffToWebSocketsets it (release) + drains stash;MarkApplicationProtocolConfirmedchecks it (acquire) at entry. Both run on the same worker dispatcher, so program-order + release/acquire arbitrate all interleavings.TestWsUpgradeTransitionsProtocolGaugeflake-regressionshared_from_this()intry/catch(std::bad_weak_ptr)+ pre-checkevent_dispatcher_non-null BEFORE the exchange. Either failure logs + returns without consuming the stash.AttachTransportObservability's idempotency early-return runs BEFORE the stash drainshared_from_this()precondition undocumentedtry/catch+ warn-log (per LOGGING_STANDARDS.md). Future caller invoking attach pre-shared_ptr-wrap now degrades gracefully instead of crashing the accept worker.error-log whenexpected != protocol_labelon CAS failure (mirrors the existing same-label-check pattern). Locks the single-protocol-per-connection invariant for future h2c-upgrade / classifier-hook contributors.bad_weak_ptrbranches (per LOGGING_STANDARDS.md).exchangearbitrates.taken == protocol_labelnot assertederror-log when they differ. Locks the single-writer invariant for future maintainers who might add a second writer to the stash slot.ResetForPoolReusemissing reset forpending_http_protocol_label_/ws_handoff_done_Mark/HandOffon a recycled outbound connection.auth_introlived in shardproxy-heavydespite being an auth umbrella sub-suiteproxy-heavy→proxy-mixed(signals cross-theme content). Updated all doc references.hendrikmuhs/ccache-actionin workflowccache-linux-valgrindaction with 500M cap (matches sanitizer-class size fromci.yml). Six shards share one ccache key since they build identical valgrind-flagged binaries.Deferred (not in this PR — separate larger refactors)
http_protocol_label_tostd::atomic<const char*>everywhere (dtor +HandOff+Mark+ class header). Today safe via dispatcher-thread serialization; not needed for the immediate flake fix.const char*lifetime API enforcement) — would require API change to a tagged enum orstring_view + intern table. Existing header comment ("Holds a pointer to a static label literal") already documents the contract; all current callers comply.Test plan
-O2build — clean, no warnings-O1 -g -fno-omit-frame-pointer) — clean, no warningsVALGRIND_TEST=1 valgrind --error-exitcode=1 --leak-check=full --show-leak-kinds=definite,indirect --errors-for-leak-kinds=definite --track-origins=yes --num-callers=20 ./test_runner obs_connection_metrics→ ERROR SUMMARY: 0 errors from 0 contexts, exit 0obs_connection_metrics× 50 stress loop locally → 0 failures (issue Flaky test: ObsConnMetrics H1 protocol gauge read races AttachTransportObservability #43 reproducer no longer flakes)./test_runnersuites, no cross-shard duplicatesweekly-valgrind.ymlrun — all 6 shards green; predicted critical path ~25 min (was 78 min before the auth split)TestH1FirstRequestSetsProtocolGaugestops flakingRisk
Code changes — minimal.
ws_handoff_done_is a new atomic with release-store fromHandOffToWebSocketand acquire-load fromMarkApplicationProtocolConfirmed's entry. Both run on the worker dispatcher (serialized), so program order plus the release/acquire pair carry the synchronisation correctly across all four worker-vs-attach interleavings (verifier-walked).pending_http_protocol_label_atomic is now consistently typed (header docstring corrected) and reset inResetForPoolReusefor forward-compat.Markcall.CI changes — low.
weekly-valgrind.ymlis independent ofci.yml; matrix failures only affect this scheduled workflow's badge.proxy-mixedrename is purely cosmetic on the GitHub side (shard names appear in the Actions UI; no external dependencies).DEVELOPMENT_RULES.md; first run after merge builds cold, subsequent runs hit the warm cache.Related (noted for follow-up)
make_shared<AsyncPendingState>leak sites turned up no others — the cycle break inhttp_router.ccis structurally complete.pitfalls/REACTOR_CORE.md's "Cross-thread observability pointer write without publication barrier" rule. This PR's fix (stash + replay + WS-commitment flag) extends that pattern's coverage to "race-loser publishes via the late attach call" — the rule entry will be updated post-merge to capture the new pattern.