fixes for race conditions on disconnects by ryanofsky · Pull Request #249 · bitcoin-core/libmultiprocess

ryanofsky · 2026-03-11T03:50:25Z

The PR fixes 3 race conditions on disconnects that were detected in Bitcoin core CI runs and by antithesis:

DrahtBot · 2026-03-11T03:50:54Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	Sjors
Stale ACK	ismaelsadeeq

If your review is incorrectly listed, please copy-paste  into the comment that the bot should ignore.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

acquires the it -> acquires it [Typo/misplaced “the” makes the sentence harder to read.]

^{2026-03-26 13:27:54}

ryanofsky · 2026-03-11T15:06:21Z

Closing this, replaced by #250!

ryanofsky · 2026-03-12T17:56:44Z

Updated 9536b63 -> 884c846 (pr/disraces.1 -> pr/disraces.2, compare) incorporating changes from #250 and making a lot of test improvements to make them more reliable and better documented and fix some bugs. Also rearranged commits so tests are added before fixes (otherwise it was difficult to revert the fixes because of merge conflicts).

Updated 884c846 -> 2fb97e8 (pr/disraces.2 -> pr/disraces.3, compare) to fix iwyu error https://github.com/bitcoin-core/libmultiprocess/actions/runs/23016144578

Sjors · 2026-03-13T10:17:26Z

ACK 2fb97e8

I checked that the fixes themselves are still the same as when I last looked (#250 (comment)). Since the tests here precede their fixes, it was also easy to confirm that each test actually caught something and the fix made it go away. I also lightly checked that they failed for the right reasons.

The improved test code looks good to me as well.

CI passed on Bitcoin Core. Our new TSan Bitcoin Core job passed too on #257, but it might be good to manually restart it a couple of times.

I'll let the TSan job run locally for a while to see if it finds anything.

test/mp/test/test.cpp

Sjors · 2026-03-13T15:38:50Z

I'll let the TSan job run locally for a while to see if it finds anything.

It's been running for well over an hour now without hitting anything.

ismaelsadeeq · 2026-03-17T13:21:41Z

Concept ACK

sedited

lgtm

Changes make sense to me, and I'm not seeing the same failures anymore after testing this out a bit, though as Sjors mentioned, hitting them can be flaky.

ismaelsadeeq

Code review ACK 2fb97e8

88cacd4 "test: worker thread destroyed before it is initialized"
f09731e "race fix: m_on_cancel called after request finishes"
75c5425 "test: getParams() called after request cancel"
e69b6bf race fix: getParams() called after request cancel
846a43a "test: m_on_cancel called after request finishes"
2fb97e8 "race fix: m_on_cancel called after request finishes"

The current test hooks in the event loop and the branching approach is a bit awkward imo.
A potentially better approach could be to have virtual notification methods in the event loop that are triggered at points in the code that tests are interested in.
In certain code paths, the event loop simply invokes those notifications unconditionally.
By default, the implementations of those methods are no-ops and cheap — no branching, no null checks.
In tests, we then subclass EventLoop and override those methods to inject test-specific behaviour, keeping test concerns entirely out of production code.

see diff

diff --git a/include/mp/proxy-io.h b/include/mp/proxy-io.h
index 3594708..5174fe8 100644
--- a/include/mp/proxy-io.h
+++ b/include/mp/proxy-io.h
@@ -251,7 +251,7 @@ public:
                 LogFn{[old_callback = std::move(old_callback)](LogMessage log_data) {old_callback(log_data.level == Log::Raise, std::move(log_data.message));}},
                 context){}

-    ~EventLoop();
+    virtual ~EventLoop();

     //! Run event loop. Does not return until shutdown. This should only be
     //! called once from the m_thread_id thread. This will block until
@@ -341,21 +341,22 @@ public:
     //! External context pointer.
     void* m_context;

-    //! Hook called when ProxyServer<ThreadMap>::makeThread() is called.
-    std::function<void()> testing_hook_makethread;
+    //! Virtual method called on the event loop thread when ProxyServer<ThreadMap>::makeThread()
+    //! is called to create a new remote worker thread.
+    virtual void onThreadCreate() { /* Nothing to notify by default. */ }

-    //! Hook called on the worker thread inside makeThread(), after the thread
-    //! context is set up and thread_context promise is fulfilled, but before it
-    //! starts waiting for requests.
-    std::function<void()> testing_hook_makethread_created;
+    //! Virtual method called on a new worker thread after the thread context
+    //! is set up and the thread_context promise is fulfilled, but before the
+    //! thread starts waiting for requests.
+    virtual void onThreadCreated() { /* Nothing to notify by default. */ }

-    //! Hook called on the worker thread when it starts to execute an async
-    //! request. Used by tests to control timing or inject behavior at this
-    //! point in execution.
-    std::function<void()> testing_hook_async_request_start;
+    //! Virtual method called on a worker thread when it starts to execute an
+    //! asynchronous request.
+    virtual void onAsyncRequestStart() { /* Nothing to notify by default. */ }

-    //! Hook called on the worker thread just before returning results.
-    std::function<void()> testing_hook_async_request_done;
+    //! Virtual method called on a worker thread after an asynchronous request
+    //! finishes executing, but before the response is sent.
+    virtual void onAsyncRequestDone() { /* Nothing to notify by default. */ }
 };

 //! Single element task queue used to handle recursive capnp calls. (If the
diff --git a/include/mp/type-context.h b/include/mp/type-context.h
index 9c7f21b..12218c3 100644
--- a/include/mp/type-context.h
+++ b/include/mp/type-context.h
@@ -73,7 +73,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     auto invoke = [self = kj::mv(self), call_context = kj::mv(server_context.call_context), &server, req, fn, args...](CancelMonitor& cancel_monitor) mutable {
                 MP_LOG(*server.m_context.loop, Log::Debug) << "IPC server executing request #" << req;
                 EventLoop& loop = *server.m_context.loop;
-                if (loop.testing_hook_async_request_start) loop.testing_hook_async_request_start();
+                loop.onAsyncRequestStart();
                 ServerContext server_context{server, call_context, req};
                 {
                     // Before invoking the function, store a reference to the
@@ -192,7 +192,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                     }
                     // End of scope: if KJ_DEFER was reached, it runs here
                 }
-                if (loop.testing_hook_async_request_done) loop.testing_hook_async_request_done();
+                loop.onAsyncRequestDone();
                 return call_context;
             };

diff --git a/src/mp/proxy.cpp b/src/mp/proxy.cpp
index f36e19f..0c9d146 100644
--- a/src/mp/proxy.cpp
+++ b/src/mp/proxy.cpp
@@ -411,7 +411,7 @@ ProxyServer<ThreadMap>::ProxyServer(Connection& connection) : m_connection(conne

 kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
 {
-    if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
+    m_connection.m_loop->onThreadCreate();
     const std::string from = context.getParams().getName();
     std::promise<ThreadContext*> thread_context;
     std::thread thread([&thread_context, from, this]() {
@@ -420,7 +420,7 @@ kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
         g_thread_context.waiter = std::make_unique<Waiter>();
         Lock lock(g_thread_context.waiter->m_mutex);
         thread_context.set_value(&g_thread_context);
-        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
+        loop.onThreadCreated();
         // Wait for shutdown signal from ProxyServer<Thread> destructor (signal
         // is just waiter getting set to null.)
         g_thread_context.waiter->wait(lock, [] { return !g_thread_context.waiter; });
diff --git a/test/mp/test/test.cpp b/test/mp/test/test.cpp
index 4f71a55..d3dd6d8 100644
--- a/test/mp/test/test.cpp
+++ b/test/mp/test/test.cpp
@@ -60,6 +60,24 @@ static_assert(std::is_integral_v<decltype(kMP_MINOR_VERSION)>, "MP_MINOR_VERSION
  * client_disconnect manually, but false allows testing more ProxyClient
  * behavior and the "IPC client method called after disconnect" code path.
  */
+
+//! EventLoop subclass used by tests to override virtual methods and inject
+//! test-specific behavior.
+class TestEventLoop : public EventLoop
+{
+public:
+    using EventLoop::EventLoop;
+    std::function<void()> on_thread_create;
+    std::function<void()> on_thread_created;
+    std::function<void()> on_async_request_start;
+    std::function<void()> on_async_request_done;
+
+    void onThreadCreate() override { if (on_thread_create) on_thread_create(); }
+    void onThreadCreated() override { if (on_thread_created) on_thread_created(); }
+    void onAsyncRequestStart() override { if (on_async_request_start) on_async_request_start(); }
+    void onAsyncRequestDone() override { if (on_async_request_done) on_async_request_done(); }
+};
+
 class TestSetup
 {
 public:
@@ -69,17 +87,19 @@ public:
     std::promise<std::unique_ptr<ProxyClient<messages::FooInterface>>> client_promise;
     std::unique_ptr<ProxyClient<messages::FooInterface>> client;
     ProxyServer<messages::FooInterface>* server{nullptr};
+    TestEventLoop* test_loop{nullptr};
     //! Thread variable should be after other struct members so the thread does
     //! not start until the other members are initialized.
     std::thread thread;

     TestSetup(bool client_owns_connection = true)
-        : thread{[&] {
-              EventLoop loop("mptest", [](mp::LogMessage log) {
+        : thread{[&, client_owns_connection] {
+              TestEventLoop loop("mptest", [](mp::LogMessage log) {
                   // Info logs are not printed by default, but will be shown with `mptest --verbose`
                   KJ_LOG(INFO, log.level, log.message);
                   if (log.level == mp::Log::Raise) throw std::runtime_error(log.message);
               });
+              test_loop = &loop;
               auto pipe = loop.m_io_context.provider->newTwoWayPipe();

               auto server_connection =
@@ -336,31 +356,31 @@ KJ_TEST("Worker thread destroyed before it is initialized")
     // Regression test for bitcoin/bitcoin#34711, bitcoin/bitcoin#34756
     // where worker thread is destroyed before it starts.
     //
-    // The test works by using the `makethread` hook to start a disconnect as
-    // soon as ProxyServer<ThreadMap>::makeThread is called, and using the
-    // `makethread_created` hook to sleep 100ms after the thread is created but
-    // before it starts waiting, so without the bugfix,
+    // The test works by overriding the `onThreadCreate` method to start a
+    // disconnect as soon as ProxyServer<ThreadMap>::makeThread is called, and
+    // overriding the `onThreadCreated` method to sleep after the thread is
+    // created but before it starts waiting, so without the bugfix,
     // ProxyServer<Thread>::~ProxyServer would run and destroy the waiter,
     // causing a SIGSEGV in the worker thread after the sleep.
-    TestSetup setup;
-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
-    foo->initThreadMap();
-    setup.server->m_impl->m_fn = [] {};
-
-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
-    loop.testing_hook_makethread = [&] {
-        setup.server_disconnect_later();
-    };
-    loop.testing_hook_makethread_created = [&] {
-        std::this_thread::sleep_for(std::chrono::milliseconds(10));
-    };
-
     bool disconnected{false};
-    try {
-        foo->callFnAsync();
-    } catch (const std::runtime_error& e) {
-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
-        disconnected = true;
+    {
+        TestSetup setup;
+        setup.test_loop->on_thread_create = [&] {
+            setup.server_disconnect_later();
+        };
+        setup.test_loop->on_thread_created = [&] {
+            std::this_thread::sleep_for(std::chrono::milliseconds(10));
+        };
+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
+        foo->initThreadMap();
+        setup.server->m_impl->m_fn = [] {};
+
+        try {
+            foo->callFnAsync();
+        } catch (const std::runtime_error& e) {
+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+            disconnected = true;
+        }
     }
     KJ_EXPECT(disconnected);
 }
@@ -370,29 +390,31 @@ KJ_TEST("Calling async IPC method, with server disconnect racing the call")
     // Regression test for bitcoin/bitcoin#34777 heap-use-after-free where
     // an async request is canceled before it starts to execute.
     //
-    // Use testing_hook_async_request_start to trigger a disconnect from the
+    // Override onAsyncRequestStart to trigger a disconnect from the
     // worker thread as soon as it begins to execute an async request. Without
     // the bugfix, the worker thread would trigger a SIGSEGV after this by
     // calling call_context.getParams().
-    TestSetup setup;
-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
-    foo->initThreadMap();
-    setup.server->m_impl->m_fn = [] {};
-
-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
-    loop.testing_hook_async_request_start = [&] {
-        setup.server_disconnect();
-        // Sleep is neccessary to let the event loop fully clean up after the
-        // disconnect and trigger the SIGSEGV.
-        std::this_thread::sleep_for(std::chrono::milliseconds(10));
-    };
-
-    try {
-        foo->callFnAsync();
-        KJ_EXPECT(false);
-    } catch (const std::runtime_error& e) {
-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+    bool disconnected{false};
+    {
+        TestSetup setup;
+        setup.test_loop->on_async_request_start = [&] {
+            setup.server_disconnect();
+            // Sleep is neccessary to let the event loop fully clean up after the
+            // disconnect and trigger the SIGSEGV.
+            std::this_thread::sleep_for(std::chrono::milliseconds(10));
+        };
+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
+        foo->initThreadMap();
+        setup.server->m_impl->m_fn = [] {};
+
+        try {
+            foo->callFnAsync();
+        } catch (const std::runtime_error& e) {
+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+            disconnected = true;
+        }
     }
+    KJ_EXPECT(disconnected);
 }

 KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
@@ -401,27 +423,29 @@ KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
     // an async request is canceled after it finishes executing but before the
     // response is sent.
     //
-    // Use testing_hook_async_request_done to trigger a disconnect from the
+    // Override onAsyncRequestDone to trigger a disconnect from the
     // worker thread after it execute an async requests but before it returns.
     // Without the bugfix, the m_on_cancel callback would be called at this
     // point accessing the cancel_mutex stack variable that had gone out of
     // scope.
-    TestSetup setup;
-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
-    foo->initThreadMap();
-    setup.server->m_impl->m_fn = [] {};
-
-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
-    loop.testing_hook_async_request_done = [&] {
-        setup.server_disconnect();
-    };
-
-    try {
-        foo->callFnAsync();
-        KJ_EXPECT(false);
-    } catch (const std::runtime_error& e) {
-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+    bool disconnected{false};
+    {
+        TestSetup setup;
+        setup.test_loop->on_async_request_done = [&] {
+            setup.server_disconnect();
+        };
+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
+        foo->initThreadMap();
+        setup.server->m_impl->m_fn = [] {};
+
+        try {
+            foo->callFnAsync();
+        } catch (const std::runtime_error& e) {
+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
+            disconnected = true;
+        }
     }
+    KJ_EXPECT(disconnected);
 }

 KJ_TEST("Make simultaneous IPC calls on single remote thread")

src/mp/proxy.cpp

ismaelsadeeq · 2026-03-18T09:50:51Z

test/mp/test/test.cpp

@@ -63,6 +64,7 @@ class TestSetup
 {


In "test: worker thread destroyed before it is initialized" 88cacd4

Comment can be adjusted to with disconnect_later addition

- * Provides client_disconnect and server_disconnect lambdas that can be used to - * trigger disconnects and test handling of broken and closed connections. + * Provides disconnection lambdas that can be used to trigger + * disconnects and test handling of broken and closed connections. *

re: #249 (comment)

Comment can be adjusted to with disconnect_later addition

Good catch, applied update

ismaelsadeeq · 2026-03-18T12:32:38Z

src/mp/proxy.cpp

In "race fix: m_on_cancel called after request finishes" f09731e

IIUC, the reason why there is a race is that we do not lock before setting the thread context to the thread context promise.
When a disconnect occurs in the interval between setting the thread context and acquiring the lock, a race happens because in that interval the client drops the connection, ProxyServer is destroyed, and then the worker thread tries to acquire the lock and access the waiter — but it has already been destroyed, causing a SIGSEGV.
Acquiring the lock before calling set_value prevents this by ensuring the worker thread holds the mutex before the main thread can proceed. The ProxyServer destructor must acquire the same mutex to signal shutdown, so it blocks until the worker thread is safely inside its wait loop, guaranteeing the waiter is never destroyed while the worker thread is still trying to access it.

g_thread_context.waiter = std::make_unique<Waiter>(); thread_context.set_value(&g_thread_context); // ← signals main thread // ← race window opens here Lock lock(g_thread_context.waiter->m_mutex); // ← worker tries to lock

When set_value is called, the main thread unblocks and proceeds:

Main thread Worker thread ──────────────────────────────── ──────────────────────────────── thread_context.get_future().get() set_value() → main thread wakes creates ProxyServer<Thread> client drops reference ~ProxyServer<Thread> runs waiter set to nullptr waiter destroyed Lock(g_thread_context.waiter->m_mutex) ← waiter is gone → SIGSEGV

It will be nice to mention this in the commit message.

re: #249 (comment)

IIUC, the reason why there is a race is that we do not lock before setting the thread context to the thread context promise.

Yes your description is exactly right. I tried to describe the failure in some detail in the regression test but now added more information to the commit message as well including a link to your comment.

include/mp/type-context.h

ismaelsadeeq · 2026-03-18T13:05:07Z

include/mp/type-context.h

+                            // we do not want to be notified because
+                            // cancel_mutex and server_context could be out of
+                            // scope when it happens.
+                            cancel_monitor.m_on_cancel = nullptr;


Gemini wrote a mermaid diagram

Before 2fb97e8

sequenceDiagram participant EL as Event Loop Thread participant W as Worker Thread participant CM as CancelMonitor (m_on_cancel) Note over W: --- PHASE: EXECUTION --- W->>W: fn.invoke(server_context, cancel_mutex) Note over W: invocation finished W->>W: [Pop Stack Frame] Note right of W: server_context & cancel_mutex are now FREED Note over EL, W: --- THE RACE WINDOW --- Note right of EL: Client Disconnects NOW EL->>CM: Trigger ~CancelProbe() CM->>CM: Execute m_on_cancel() CM-->>W: Access freed server_context / cancel_mutex Note over CM: CRASH: Use-After-Free W->>EL: m_loop->sync() (Too late!)

Loading

After 2fb97e8

sequenceDiagram participant EL as Event Loop Thread participant W as Worker Thread participant CM as CancelMonitor (m_on_cancel) Note over W: --- PHASE: EXECUTION --- W->>W: fn.invoke(server_context, cancel_mutex) Note over W: --- PHASE: CLEANUP --- W->>EL: m_loop->sync() [START] Note right of EL: SYNC BARRIER (On Event Loop) EL->>CM: m_on_cancel = nullptr Note right of EL: DISARMED: Callback is gone W->>W: [Pop Stack Frame] Note right of W: server_context & cancel_mutex are now FREED Note over EL, W: --- THE DISCONNECT WINDOW --- EL->>EL: Trigger ~CancelProbe() EL->>EL: Check m_on_cancel (is NULL) Note right of EL: SAFE: No callback to trigger EL-->>W: sync() returns [END] W->>W: Worker exits safely

Loading

re: #249 (comment)

Gemini wrote a mermaid diagram

Yes, this is right. In the following code:

libmultiprocess/include/mp/proxy-io.h

Lines 744 to 749 in db8f76a

kj::Maybe<kj::Exception> exception{kj::runCatchingExceptions([&]{ result_value.emplace(fn(*cancel_monitor_ptr)); })};

m_loop->sync([this, &result_value, &exception, self = kj::mv(self), result_fulfiller = kj::mv(result_fulfiller), cancel_monitor_ptr = kj::mv(cancel_monitor_ptr)]() mutable {

// Destroy CancelMonitor here before fulfilling or rejecting the

// promise so it doesn't get triggered when the promise is

// destroyed.

cancel_monitor_ptr = nullptr;

There is a brief window after fn() returns but before cancel_monitor_ptr = nullptr runs where the m_on_cancel callback previously could fire and access stack variables in fn() that were out of scope.

Sjors · 2026-03-24T14:07:01Z

@ryanofsky should we add a "Bitcoin Core v31" milestone or tag? I think only this PR is needed.

ismaelsadeeq · 2026-03-25T15:22:27Z

rfm?

cc @Eunovo

ryanofsky · 2026-03-25T16:30:57Z

@ryanofsky should we add a "Bitcoin Core v31" milestone or tag? I think only this PR is needed.

Hmm, I was thinking this should not be part of v31 because the bugs that this fixes only happen if clients disconnect unclearly or take the odd step of immediately destroying a worker thread after creating it. Also my understanding is even those things will not trigger bugs without unlucky timing, requiring tests to be run hundreds of times or more, and sometimes also needing asan or tsan or an overloaded system with lots of preemption as well.

On the other hand, the fixes in this PR are all pretty simple, just moving some assignments around, so they do seem like fairly safe changes.

Overall I'd lean not to merging this into v31 unless there is a specific reason to do that. But no strong opinion, and I've been a little out of the loop, and there's actually more time than I thought before 31.0 (bitcoin/bitcoin#33607) so it could be reasonable to target.

In any case, I'll make some minor updates here and respond to all the review comments and also refresh bitcoin/bitcoin#34804.

Sjors · 2026-03-25T18:04:30Z

I agree the fixes aren't v31 release blocking, but I do think they're worth including if possible.

Unclean disconnects is something SRI developers and users are quite likely to encounter because they stack a bunch of applications: node -> template provider -> job declarator -> translator. They can be shut down in any order, or just crash, and that's not always going to be graceful. We then get bug reports of things we already know.

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

ryanofsky · 2026-03-26T09:51:08Z

Rebased 81c2169 -> 396473f (pr/disraces.5 -> pr/disraces.6, compare) with no changes after #263, to trigger a CI run and see if interface_ipc.py is still hanging waiting for 'canceled while executing' messages (https://github.com/bitcoin-core/libmultiprocess/actions/runs/23564112620/job/68621668478?pr=249#step:21:1991)

Rebased 396473f -> ff0eed1 (pr/disraces.6 -> pr/disraces.7, compare) with no changes after #264, to trigger a CI run and see if "exit code 143" errors are fixed (https://github.com/bitcoin-core/libmultiprocess/actions/runs/23587830246)

Add test for race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The test currently crashes and will be fixed in the next commit. Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

Add test for disconnect race condition in the mp.Context PassField() overload that can currently trigger segfaults as reported in bitcoin/bitcoin#34777. The test currently crashes and will be fixed in the next commit. Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34777 when it calls call_context.getParams() after a disconnect. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because PassField checks for cancellation and returns early before actually using the getParams() result. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit, requests were not canceled and would continue to execute (Cap'n Proto would just discard the responses) so it was ok to call getParams(). This fix was originally posted in bitcoin/bitcoin#34777 (comment)

Add test disconnect for race condition in the mp.Context PassField() overload reported in bitcoin/bitcoin#34782. The test crashes currently with AddressSanitizer, but will be fixed in the next commit. It's also possible to reproduce the bug without AddressSanitizer by adding an assert: ```diff --- a/include/mp/type-context.h +++ b/include/mp/type-context.h @@ -101,2 +101,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn& server_context.cancel_lock = &cancel_lock; + KJ_DEFER(server_context.cancel_lock = nullptr); server.m_context.loop->sync([&] { @@ -111,2 +112,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn& MP_LOG(*server.m_context.loop, Log::Info) << "IPC server request #" << req << " canceled while executing."; + assert(server_context.cancel_lock); // Lock cancel_mutex here to block the event loop ``` Co-authored-by: Ryan Ofsky <ryan@ofsky.org> git-bisect-skip: yes

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

This change has no effect on behavior, it just narrows the scope of the params variable to avoid potential bugs if a cancellation happens and makes them no longer valid.

Replace `server.m_context.loop` references with `loop` in Context PassField implmentation after a `loop` variable was introduced in a recent commit. Also adjust PassField scopes and indentation without changing behavior. This commit is easiest to review ignoring whitespace.

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

This fixes a race condition in makeThread that can currently trigger segfaults as reported: bitcoin/bitcoin#34711 bitcoin/bitcoin#34756 The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the waiter pointer is null. The waiter pointer can be null if the worker thread is destroyed immediately after it is created, because `~ProxyServer<Thread>` sets it to null. The fix works by moving the lock line above the `thread_context.set_value()` line so the worker thread can't be destroyed before it is fully initialized. A more detailed description of the bug and fix can be found in bitcoin-core#249 (comment) The bug can be reproduced by running the unit test added in the previous commit or by calling makeThread and immediately disconnecting or destroying the returned thread. The bug is not new and has existed since makeThread was implemented, but it was found due to a new functional test in bitcoin core and with antithesis testing (see details in linked issues). The fix was originally posted in bitcoin/bitcoin#34711 (comment)

This fixes a race condition in the mp.Context PassField() overload which is used to execute async requests, that can currently trigger segfaults as reported in bitcoin/bitcoin#34782 when a cancellation happens after the request executes but before it returns. The bug can be reproduced by running the unit test added in the previous commit and was also seen in antithesis (see details in linked issue), but should be unlikely to happen normally because the cancellation would have to happen in a very short window for there to be a problem. This bug was introduced commit in 0174450 which started to cancel requests on disconnects. Before that commit a cancellation callback was not present. This fix was originally posted in bitcoin/bitcoin#34782 (comment) and there is a sequence diagram explaining the bug in bitcoin-core#249 (comment)

Sjors · 2026-03-26T13:37:16Z

ACK ff0eed1

…da8f 70f632bda8f Merge bitcoin-core/libmultiprocess#265: ci: set LC_ALL in shell scripts 8e8e564259a Merge bitcoin-core/libmultiprocess#249: fixes for race conditions on disconnects 05d34cc2ec3 ci: set LC_ALL in shell scripts e606fd84a8c Merge bitcoin-core/libmultiprocess#264: ci: reduce nproc multipliers ff0eed1bf18 refactor: Use loop variable in type-context.h ff1d8ba172a refactor: Move type-context.h getParams() call closer to use 1dbc59a4aa3 race fix: m_on_cancel called after request finishes 1643d05ba07 test: m_on_cancel called after request finishes f5509a31fcc race fix: getParams() called after request cancel 4a60c39f24a test: getParams() called after request cancel f11ec29ed20 race fix: worker thread destroyed before it is initialized a1d643348f4 test: worker thread destroyed before it is initialized 336023382c4 ci: reduce nproc multipliers b090beb9651 Merge bitcoin-core/libmultiprocess#256: ci: cache gnu32 nix store be8622816da ci: cache gnu32 nix store 975270b619c Merge bitcoin-core/libmultiprocess#263: ci: bump timeout factor to 40 09f10e5a598 ci: bump timeout factor to 40 db8f76ad290 Merge bitcoin-core/libmultiprocess#253: ci: run some Bitcoin Core CI jobs 55a9b557b19 ci: set Bitcoin Core CI test repetition fb0fc84d556 ci: add TSan job with instrumented libc++ 0f29c38725b ci: add Bitcoin Core IPC tests (ASan + macOS) 3f64320315d Merge bitcoin-core/libmultiprocess#262: ci: enable clang-tidy in macOS job, use nullptr cd9f8bdc9f0 Merge bitcoin-core/libmultiprocess#258: log: add socket connected info message and demote destroy logs to debug b5d6258a42f Merge bitcoin-core/libmultiprocess#255: fix: use unsigned char cast and sizeof in LogEscape escape sequence d94688e2c32 Merge bitcoin-core/libmultiprocess#251: Improved CustomBuildField for std::optional in IPC/libmultiprocess a9499fad755 mp: use nullptr with pthread_threadid_np f499e37850f ci: enable clang-tidy in macOS job 98f1352159d log: add socket connected info message and demote destroy logs to debug 554a481ea73 fix: use unsigned char cast and sizeof in LogEscape escape sequence 1977b9f3f65 Use std::forward in CustomBuildField for std::optional to allow move semantics, resolves FIXME 22bec918c97 Merge bitcoin-core/libmultiprocess#247: type-map: Work around LLVM 22 "out of bounds index" error 8a5e3ae6ed2 Merge bitcoin-core/libmultiprocess#242: proxy-types: add CustomHasField hook to map Cap'n Proto values to null C++ values e8d35246918 Merge bitcoin-core/libmultiprocess#246: doc: Bump version 8 > 9 97d877053b6 proxy-types: add CustomHasField hook for nullable decode paths 8c2f10252c9 refactor: add missing includes to mp/type-data.h b1638aceb40 doc: Bump version 8 > 9 f61af487217 type-map: Work around LLVM 22 "out of bounds index" error git-subtree-dir: src/ipc/libmultiprocess git-subtree-split: 70f632bda8f80449b6240f98da768206a535a04e

…n disconnects 2478a15 Squashed 'src/ipc/libmultiprocess/' changes from 1868a84451f..70f632bda8f (Ryan Ofsky) Pull request description: Includes: - bitcoin-core/libmultiprocess#246 - bitcoin-core/libmultiprocess#242 - bitcoin-core/libmultiprocess#247 - bitcoin-core/libmultiprocess#251 - bitcoin-core/libmultiprocess#255 - bitcoin-core/libmultiprocess#258 - bitcoin-core/libmultiprocess#262 - bitcoin-core/libmultiprocess#253 - bitcoin-core/libmultiprocess#263 - bitcoin-core/libmultiprocess#256 - bitcoin-core/libmultiprocess#264 - bitcoin-core/libmultiprocess#249 - bitcoin-core/libmultiprocess#265 The main change is bitcoin-core/libmultiprocess#249 which fixes 3 intermittent race conditions detected in bitcoin core CI and antithesis: #34711/#34756, #34777, and #34782. The changes can be verified by running `test/lint/git-subtree-check.sh src/ipc/libmultiprocess` as described in [developer notes](https://github.com/bitcoin/bitcoin/blob/master/doc/developer-notes.md#subtrees) and [lint instructions](https://github.com/bitcoin/bitcoin/tree/master/test/lint#git-subtree-checksh) ACKs for top commit: Sjors: ACK 613a548 ismaelsadeeq: ACK 613a548 Tree-SHA512: d99eebc8b4f45b3c3099298167362cf5e7f3e9e622eef9f17af56388ee5207d77a04b915b2a5a894493e0395aeda70111216f2da0d2a6553f4f6396b3d31a744

ryanofsky mentioned this pull request Mar 11, 2026

mptest: mp::CancelMonitor: AddressSanitizer: stack-use-after-return bitcoin/bitcoin#34782

Open

Sjors mentioned this pull request Mar 11, 2026

Fix race conditions on disconnects #250

Closed

ryanofsky closed this Mar 11, 2026

ryanofsky reopened this Mar 12, 2026

ryanofsky force-pushed the pr/disraces branch from 9536b63 to 884c846 Compare March 12, 2026 17:51

ryanofsky force-pushed the pr/disraces branch from 884c846 to 2fb97e8 Compare March 12, 2026 17:59

ryanofsky mentioned this pull request Mar 12, 2026

Update libmultiprocess subtree to fix race conditions on disconnects bitcoin/bitcoin#34804

Merged

Sjors mentioned this pull request Mar 13, 2026

test: #249 + #253 #257

Closed

Sjors reviewed Mar 13, 2026

View reviewed changes

test/mp/test/test.cpp Outdated Show resolved Hide resolved

Sjors mentioned this pull request Mar 13, 2026

ci: run some Bitcoin Core CI jobs #253

Merged

sedited approved these changes Mar 18, 2026

View reviewed changes

ismaelsadeeq reviewed Mar 18, 2026

View reviewed changes

This was referenced Mar 20, 2026

bug: EventLoop::post() deadlocks when posted function throws #259

Open

Mining interface tracking issue bitcoin/bitcoin#33777

Open

ryanofsky force-pushed the pr/disraces branch from 2fb97e8 to 25eeb8e Compare March 25, 2026 20:38

ryanofsky force-pushed the pr/disraces branch from 25eeb8e to 81c2169 Compare March 25, 2026 21:05

ryanofsky mentioned this pull request Mar 26, 2026

ci: bump timeout factor to 40 #263

Merged

Sjors and others added 8 commits March 26, 2026 14:19

refactor: Move type-context.h getParams() call closer to use

ff1d8ba

This change has no effect on behavior, it just narrows the scope of the params variable to avoid potential bugs if a cancellation happens and makes them no longer valid.

ryanofsky force-pushed the pr/disraces branch from 396473f to ef22041 Compare March 26, 2026 13:22

ryanofsky force-pushed the pr/disraces branch from ef22041 to ff0eed1 Compare March 26, 2026 13:27

DrahtBot requested a review from ismaelsadeeq March 26, 2026 13:37

ryanofsky merged commit 8e8e564 into bitcoin-core:master Mar 27, 2026
13 checks passed

	kj::Maybe<kj::Exception> exception{kj::runCatchingExceptions([&]{ result_value.emplace(fn(*cancel_monitor_ptr)); })};
	m_loop->sync([this, &result_value, &exception, self = kj::mv(self), result_fulfiller = kj::mv(result_fulfiller), cancel_monitor_ptr = kj::mv(cancel_monitor_ptr)]() mutable {
	// Destroy CancelMonitor here before fulfilling or rejecting the
	// promise so it doesn't get triggered when the promise is
	// destroyed.
	cancel_monitor_ptr = nullptr;

Conversation

ryanofsky commented Mar 11, 2026

Uh oh!

DrahtBot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews

LLM Linter (✨ experimental)

Uh oh!

ryanofsky commented Mar 11, 2026

Uh oh!

ryanofsky commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sjors commented Mar 13, 2026

Uh oh!

Uh oh!

Sjors commented Mar 13, 2026

Uh oh!

ismaelsadeeq commented Mar 17, 2026

Uh oh!

sedited left a comment

Choose a reason for hiding this comment

Uh oh!

ismaelsadeeq left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ismaelsadeeq Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

ryanofsky Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

ismaelsadeeq Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

ryanofsky Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ismaelsadeeq Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

ryanofsky Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Sjors commented Mar 24, 2026

Uh oh!

ismaelsadeeq commented Mar 25, 2026

Uh oh!

ryanofsky commented Mar 25, 2026

Uh oh!

Sjors commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryanofsky commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Sjors commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

DrahtBot commented Mar 11, 2026 •

edited

Loading

ryanofsky commented Mar 12, 2026 •

edited

Loading

ismaelsadeeq left a comment •

edited

Loading

Sjors commented Mar 25, 2026 •

edited

Loading

ryanofsky commented Mar 26, 2026 •

edited

Loading