Skip to content

Refactor Ros2FaultServiceTransport to use spin_until_future_complete to remove client/spin-thread race #399

@bburda

Description

@bburda

Summary

Ros2FaultServiceTransport::get_snapshots() and get_rosbag() use a blocking
future.wait_for() pattern that requires the rclcpp Client to be spun on a
separate executor thread. Under ThreadSanitizer this produces flaky data race
warnings between the executor thread cleaning up the client's pending request
and the main thread destroying the local response shared_ptr at the end of
the transport function. See https://github.com/selfpatch/ros2_medkit/actions/runs/25823593340/job/75870897762
for the failing run on PR #398; the stack hits rclcpp::Executor::execute_client
in the spin thread and ~shared_ptr<Response> at ros2_fault_service_transport.cpp:361
in the main thread.

PR #398 mitigates the flake test-side by joining the spin thread before reading
the result (see test_fault_manager.cpp
after that PR merges), but the underlying transport design still requires every
caller to drive an external spin thread, which is fragile.

Proposed solution

Change Ros2FaultServiceTransport to drive the client itself instead of relying
on an external executor:

  • Accept an executor (or build a single-threaded one internally for the call)
  • Replace auto future = client_->async_send_request(request); future.wait_for(timeout);
    with rclcpp::spin_until_future_complete(executor, future, timeout)
  • Remove the requirement that callers start_spinning() a separate thread for
    these RPC paths
  • Drop the per-method std::lock_guard if the executor itself serializes calls

This eliminates the race by ensuring the client's pending-request cleanup
(execute_client) and the response destruction happen on the same thread, in a
well-defined order. It also removes the matching footgun from
FaultManagerTest.* and from anywhere else in the gateway that calls into
these transport methods.

Additional context

Files involved:

  • src/ros2_medkit_gateway/src/ros2/transports/ros2_fault_service_transport.cpp
  • src/ros2_medkit_gateway/include/ros2_medkit_gateway/ros2/transports/ros2_fault_service_transport.hpp
  • src/ros2_medkit_gateway/test/test_fault_manager.cpp (callers can drop
    start_spinning() / stop_spinning() around get_* calls once this lands)

Related: PR #398 ships a test-side workaround for the same race.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions