[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks by afterincomparableyum · Pull Request #3693 · apache/celeborn

afterincomparableyum · 2026-05-18T05:45:35Z

What changes were proposed in this pull request?

push-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort.

The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader on the IO executor's own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process.

Hand the final reference off to the global CPU executor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback.

Why are the changes needed?

When running the bytedance bolt celeborn e2e tests for push-merged data I am working on, I run into this error.

Does this PR resolve a correctness bug?

Yes

Does this PR introduce any user-facing change?

No

How was this patch tested?

I ran the Celeborn bolt e2e tests with this change and the tests passed with push merged data support.

afterincomparableyum · 2026-05-25T21:58:27Z

ping @SteNicholas for review please.

… callbacks Surfaced by the Bolt e2e Celeborn tests while adding support for push-merged data, which exercises this fetch path more aggressively and reliably triggered an EDEADLK abort. The onSuccess_/onFailure_ callbacks are invoked on the TransportClient's IO thread and capture a weak_ptr that is lifted to a shared_ptr inside the callback body. When that local shared_ptr happens to hold the last reference, dropping it inline runs WorkerPartitionReader's destructor on the IO executor's own thread, which transitively destroys the embedded TransportClient and its IOThreadPoolExecutor. The executor then attempts to pthread_join the thread that is currently executing the callback and the join fails with EDEADLK, aborting the process. Hand the final reference off to a dedicated CPUThreadPoolExecutor so destruction of the reader (and the IO executor underneath it) always happens on a different thread than the one running the callback. The executor is constructed directly rather than obtained from folly::getGlobalCPUExecutor() to avoid pulling in folly's singleton vault, which would otherwise require folly::init() and break unit tests.

Copilot

Pull request overview

This PR addresses a correctness issue in the C++ client fetch path where WorkerPartitionReader could be destroyed on the TransportClient IO thread, leading to a self-pthread_join (EDEADLK) abort when the embedded IO executor is torn down from within its own callback thread.

Changes:

Introduces an off-IO-thread folly::CPUThreadPoolExecutor to release the last shared_ptr<WorkerPartitionReader> asynchronously.
Updates onSuccess_ and onFailure_ fetch callbacks to hand off the final reference to that executor to ensure destruction happens on a different thread.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

SteNicholas · 2026-05-27T12:17:31Z

@afterincomparableyum, thanks for contribution. Please take a look at comment of copilot.

SteNicholas · 2026-06-02T11:48:48Z

@afterincomparableyum, any update?

afterincomparableyum · 2026-06-03T06:59:47Z

Will address comments by tomorrow @SteNicholas

afterincomparableyum · 2026-06-04T06:02:03Z

regarding CoPilot comments @SteNicholas, it is right that this posts a task per callback. But I don't think the use_count() == 1 guard is safe, so I'd prefer to keep the unconditional handoff.

The whole point of the offload is that the reader's last reference must never be dropped on the fetch callback threaders it destroys the reader on the IO thread and triggers the EDEADLK issue this PR fixes.

The problem with use_count() is that it's a stale snapshot. The reader's owner lives on another thread and can drop its reference at any moment (cleanupReader() is just currReader_ = nullptr). So:

Callback lifts shared_this --> refcount = 2
Checks use_count() --> sees 2 --> decides to drop inline
Owner thread drops its reference --> refcount = 1
Callback returns --> shared_this destructs on the IO thread --> same EDEADLK crash

So the guard trades a correct path for a racy one whose failure mode is a process abort. With multiple chunks in flight, this interleaving is realistic.

Regarding overhead. I will update this PR and add comments with a TODO to explore optimizations if needed.

… behavior and a potential TODO optimization.

github-actions Bot added the module:cpp label May 18, 2026

afterincomparableyum mentioned this pull request May 18, 2026

[Feature] use mergeData for bolt NativeCelebornClient bytedance/bolt#370

Open

afterincomparableyum marked this pull request as draft May 18, 2026 05:59

afterincomparableyum force-pushed the celeborn-2332 branch from 19327e1 to d9a3d8f Compare May 25, 2026 21:57

afterincomparableyum marked this pull request as ready for review May 25, 2026 21:58

afterincomparableyum force-pushed the celeborn-2332 branch from d9a3d8f to 5dc5fb6 Compare May 25, 2026 22:00

github-actions Bot added the module:common label May 25, 2026

afterincomparableyum force-pushed the celeborn-2332 branch from 5dc5fb6 to 2e13df9 Compare May 25, 2026 22:00

github-actions Bot removed the module:common label May 25, 2026

SteNicholas force-pushed the main branch 2 times, most recently from 1d92a40 to cf8d472 Compare May 27, 2026 02:11

SteNicholas requested a review from Copilot May 27, 2026 05:08

Copilot started reviewing on behalf of SteNicholas May 27, 2026 05:08 View session

SteNicholas changed the title ~~[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader f…~~ [CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks May 27, 2026

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread cpp/celeborn/client/reader/WorkerPartitionReader.cpp

Comment thread cpp/celeborn/client/reader/WorkerPartitionReader.cpp

address comments of CoPilot by adding comments explaining the current…

1173782

… behavior and a potential TODO optimization.

afterincomparableyum force-pushed the celeborn-2332 branch from f6bb2d2 to 1173782 Compare June 4, 2026 06:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693

[CELEBORN-2332] Fix self join deadlock in C++ WorkerPartitionReader fetch callbacks#3693
afterincomparableyum wants to merge 2 commits into
apache:mainfrom
afterincomparableyum:celeborn-2332

afterincomparableyum commented May 18, 2026

Uh oh!

afterincomparableyum commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

SteNicholas commented May 27, 2026

Uh oh!

SteNicholas commented Jun 2, 2026

Uh oh!

afterincomparableyum commented Jun 3, 2026

Uh oh!

afterincomparableyum commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

afterincomparableyum commented May 18, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR resolve a correctness bug?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

afterincomparableyum commented May 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

SteNicholas commented May 27, 2026

Uh oh!

SteNicholas commented Jun 2, 2026

Uh oh!

afterincomparableyum commented Jun 3, 2026

Uh oh!

afterincomparableyum commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants