bvar: fix bad_weak_ptr crash in AgentCombiner::get_or_create_tls_agent#3306
Open
Gmymymy wants to merge 1 commit into
Open
bvar: fix bad_weak_ptr crash in AgentCombiner::get_or_create_tls_agent#3306Gmymymy wants to merge 1 commit into
Gmymymy wants to merge 1 commit into
Conversation
When a bvar (Reducer/IntRecorder/Percentile) is destroyed while another thread is concurrently writing to it, the AgentCombiner may no longer be managed by any shared_ptr by the time get_or_create_tls_agent() calls shared_from_this(). This causes std::bad_weak_ptr to be thrown and the process to terminate. Fix this by wrapping shared_from_this() in a try-catch: if the combiner is no longer alive, silently return NULL and skip the recording. This is safe because the metric is being torn down anyway. Also remove the now-incorrect LOG(FATAL) in the three operator<< callers. For allocation failures, get_or_create_tls_agent() already calls LOG(FATAL) internally (and aborts); the outer LOG(FATAL) was unreachable in that case and would incorrectly abort the process for the combiner-expired case. Fixes apache#3288
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a bvar object (
Reducer,IntRecorder,Percentile) is destroyedwhile another thread is concurrently writing to it via
operator<<, thefollowing crash occurs:
This has been observed in high-concurrency RDMA performance testing
(see #3288).
Root Cause
AgentCombinerinherits fromstd::enable_shared_from_this.In
get_or_create_tls_agent(), when an agent'scombinerweak_ptr hasexpired (the previous combiner was destroyed and the slot was reused), the
method calls
this->shared_from_this()to re-bind the agent to thecurrent combiner.
shared_from_this()throwsstd::bad_weak_ptrif the object is notcurrently managed by any
shared_ptr. This can happen in a race where:AgentCombineris the last object keeping a bvar aliveshared_ptrto it (e.g., the owningReducergoes out of scope or is destroyed during program shutdown)get_or_create_tls_agent(),past the
agent->combiner.expired()check, and callsshared_from_this()on the now-unmanaged object →
bad_weak_ptr→terminate()Fix
Wrap
shared_from_this()in atry/catch(std::bad_weak_ptr). When caught,return
NULLand silently skip the recording. This is safe: the metric isbeing torn down, so dropping a write in flight is acceptable and far
preferable to crashing.
Also remove the
LOG(FATAL)in the threeoperator<<callers that firewhen
get_or_create_tls_agent()returnsNULL:get_or_create_tls_agent()already callsLOG(FATAL)internally (which aborts); the outerLOG(FATAL)wasunreachable in that case.
LOG(FATAL)wouldincorrectly abort the process for a benign race during teardown.
Affected files
src/bvar/detail/combiner.hbad_weak_ptringet_or_create_tls_agent()src/bvar/reducer.hLOG(FATAL)fromoperator<<src/bvar/recorder.hLOG(FATAL)fromoperator<<src/bvar/detail/percentile.cppLOG(FATAL)fromoperator<<Fixes #3288