[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693
[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693afterincomparableyum wants to merge 2 commits into
Conversation
19327e1 to
d9a3d8f
Compare
|
ping @SteNicholas for review please. |
d9a3d8f to
5dc5fb6
Compare
… callbacks Surfaced by the Bolt e2e Celeborn tests while adding support for push-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort. The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader's destructor on the IO executor's own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process. Hand the final reference off to a dedicated CPUThreadPoolExecutor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback. The executor is constructed directly rather than obtained from folly::getGlobalCPUExecutor() to avoid pulling in folly's singleton vault, which would otherwise require folly::init() and break unit tests.
5dc5fb6 to
2e13df9
Compare
1d92a40 to
cf8d472
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a correctness issue in the C++ client fetch path where WorkerPartitionReader could be destroyed on the TransportClient IO thread, leading to a self-pthread_join (EDEADLK) abort when the embedded IO executor is torn down from within its own callback thread.
Changes:
- Introduces an off-IO-thread
folly::CPUThreadPoolExecutorto release the lastshared_ptr<WorkerPartitionReader>asynchronously. - Updates
onSuccess_andonFailure_fetch callbacks to hand off the final reference to that executor to ensure destruction happens on a different thread.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@afterincomparableyum, thanks for contribution. Please take a look at comment of copilot. |
|
@afterincomparableyum, any update? |
|
Will address comments by tomorrow @SteNicholas |
|
regarding CoPilot comments @SteNicholas, it is right that this posts a task per callback. But I don't think the use_count() == 1 guard is safe, so I'd prefer to keep the unconditional handoff. The whole point of the offload is that the reader's last reference must never be dropped on the fetch callback threaders it destroys the reader on the IO thread and triggers the EDEADLK issue this PR fixes. The problem with use_count() is that it's a stale snapshot. The reader's owner lives on another thread and can drop its reference at any moment (cleanupReader() is just currReader_ = nullptr). So:
So the guard trades a correct path for a racy one whose failure mode is a process abort. With multiple chunks in flight, this interleaving is realistic. Regarding overhead. I will update this PR and add comments with a TODO to explore optimizations if needed. |
… behavior and a potential TODO optimization.
f6bb2d2 to
1173782
Compare
What changes were proposed in this pull request?
push-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort.
The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader on the IO executor's own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process.
Hand the final reference off to the global CPU executor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback.
Why are the changes needed?
When running the bytedance bolt celeborn e2e tests for push-merged data I am working on, I run into this error.
Does this PR resolve a correctness bug?
Yes
Does this PR introduce any user-facing change?
No
How was this patch tested?
I ran the Celeborn bolt e2e tests with this change and the tests passed with push merged data support.