Summary
Ros2FaultServiceTransport::get_snapshots() and get_rosbag() use a blocking
future.wait_for() pattern that requires the rclcpp Client to be spun on a
separate executor thread. Under ThreadSanitizer this produces flaky data race
warnings between the executor thread cleaning up the client's pending request
and the main thread destroying the local response shared_ptr at the end of
the transport function. See https://github.com/selfpatch/ros2_medkit/actions/runs/25823593340/job/75870897762
for the failing run on PR #398; the stack hits rclcpp::Executor::execute_client
in the spin thread and ~shared_ptr<Response> at ros2_fault_service_transport.cpp:361
in the main thread.
PR #398 mitigates the flake test-side by joining the spin thread before reading
the result (see test_fault_manager.cpp
after that PR merges), but the underlying transport design still requires every
caller to drive an external spin thread, which is fragile.
Proposed solution
Change Ros2FaultServiceTransport to drive the client itself instead of relying
on an external executor:
- Accept an executor (or build a single-threaded one internally for the call)
- Replace
auto future = client_->async_send_request(request); future.wait_for(timeout);
with rclcpp::spin_until_future_complete(executor, future, timeout)
- Remove the requirement that callers
start_spinning() a separate thread for
these RPC paths
- Drop the per-method
std::lock_guard if the executor itself serializes calls
This eliminates the race by ensuring the client's pending-request cleanup
(execute_client) and the response destruction happen on the same thread, in a
well-defined order. It also removes the matching footgun from
FaultManagerTest.* and from anywhere else in the gateway that calls into
these transport methods.
Additional context
Files involved:
src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cpp
src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2/transports/ros2_fault_service_transport.hpp
src/ros2_medkit_gateway/test/test_fault_manager.cpp (callers can drop
start_spinning() / stop_spinning() around get_* calls once this lands)
Related: PR #398 ships a test-side workaround for the same race.
Summary
Ros2FaultServiceTransport::get_snapshots()andget_rosbag()use a blockingfuture.wait_for()pattern that requires the rclcpp Client to be spun on aseparate executor thread. Under ThreadSanitizer this produces flaky data race
warnings between the executor thread cleaning up the client's pending request
and the main thread destroying the local
responseshared_ptr at the end ofthe transport function. See https://github.com/selfpatch/ros2_medkit/actions/runs/25823593340/job/75870897762
for the failing run on PR #398; the stack hits
rclcpp::Executor::execute_clientin the spin thread and
~shared_ptr<Response>atros2_fault_service_transport.cpp:361in the main thread.
PR #398 mitigates the flake test-side by joining the spin thread before reading
the result (see
test_fault_manager.cppafter that PR merges), but the underlying transport design still requires every
caller to drive an external spin thread, which is fragile.
Proposed solution
Change
Ros2FaultServiceTransportto drive the client itself instead of relyingon an external executor:
auto future = client_->async_send_request(request); future.wait_for(timeout);with
rclcpp::spin_until_future_complete(executor, future, timeout)start_spinning()a separate thread forthese RPC paths
std::lock_guardif the executor itself serializes callsThis eliminates the race by ensuring the client's pending-request cleanup
(
execute_client) and the response destruction happen on the same thread, in awell-defined order. It also removes the matching footgun from
FaultManagerTest.*and from anywhere else in the gateway that calls intothese transport methods.
Additional context
Files involved:
src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cppsrc/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2/transports/ros2_fault_service_transport.hppsrc/ros2_medkit_gateway/test/test_fault_manager.cpp(callers can dropstart_spinning()/stop_spinning()aroundget_*calls once this lands)Related: PR #398 ships a test-side workaround for the same race.