Skip to content

Fix O(N^2) entity addition in CallbackGroup#3109

Open
PavelGuzenfeld wants to merge 3 commits intoros2:rollingfrom
PavelGuzenfeld:fix/callback-group-quadratic-add-humble
Open

Fix O(N^2) entity addition in CallbackGroup#3109
PavelGuzenfeld wants to merge 3 commits intoros2:rollingfrom
PavelGuzenfeld:fix/callback-group-quadratic-add-humble

Conversation

@PavelGuzenfeld
Copy link

Summary

Fixes #2942

The add_* methods (add_timer, add_subscription, add_service, add_client, add_waitable) performed a full linear scan to remove expired weak_ptrs on every call. When adding N entities, this resulted in O(N^2) total operations — 96% of wall time in the reproducer with 10,000 timers.

Changes:

  • Remove the erase(remove_if(...expired...)) cleanup from all five add_* methods, making them O(1) (just push_back)
  • Move expired-entry cleanup into collect_all_ptrs, which already iterates all entries and is called regularly by the executor. A new collect_and_compact helper compacts each vector in a single pass during collection — no extra iteration cost
  • Mark entity vectors mutable to allow cleanup within the const collect_all_ptrs method

Why this is safe:

  • collect_all_ptrs already skipped expired entries; now it also removes them
  • The executor calls collect_all_ptrs on every spin iteration, so expired entries are cleaned up promptly
  • All existing find_*_ptrs_if methods already handle expired weak_ptrs gracefully (they call .lock() and skip nulls)
  • No public API or ABI change — only the internal cleanup scheduling is moved

Benchmark (reproducer from #2942 — 10,000 timers)

Time Throughput
Before (system Humble rclcpp) 429 ms 23,310 timers/sec
After (this PR) 6 ms 1,666,667 timers/sec
Speedup 71.5x

Test plan

@PavelGuzenfeld
Copy link
Author

Sanitizer Results

AddressSanitizer (ASan)

All 3 callback group-related tests pass clean with ASan — no memory errors detected:

100% tests passed, 0 tests failed out of 3
- test_allocator_memory_strategy:  Passed (0.43s)
- test_add_callback_groups_to_executor:  Passed (12.65s)
- test_memory_strategy:  Passed (0.23s)

ThreadSanitizer (TSan)

TSan reports pre-existing warnings in fastrtps (eprosima::fastrtps::rtps::ResourceEvent::register_timer — double lock of mutex) and spdlog (periodic_worker::~periodic_worker). These are in the DDS library and logging infrastructure, not in the callback_group.cpp changes. No TSan warnings originate from the modified code.

Benchmark (reproducer from #2942)

Time (10,000 timers) Throughput
Before (system Humble rclcpp) 429 ms 23,310 timers/sec
After (this PR) 6 ms 1,666,667 timers/sec
Speedup 71.5x

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from 8fa1ba7 to d25878b Compare March 21, 2026 21:13
@christophebedard
Copy link
Member

Thank for you for the PR!

I believe that this PR was written and/or opened by an AI agent. Please note that the OSRF has a policy on the use of tools like generative AI tools. The tl;dr is that we accept contributions written partly or wholly with generative AI, but you must disclose it (including the tool/model), which I don't see here. It is also your responsibility to ensure that your contribution is valid.

Also, although some bug reports or feature requests mention a specific distro, we prefer making changes to the Rolling distro (rolling branch) and then we can backport it to older distros if appropriate, unless a bug is confirmed to only affect that older distro. Therefore, please target rolling.

@PavelGuzenfeld
Copy link
Author

Thanks for the review and for pointing out the policy!

AI disclosure: This PR was authored with the assistance of Claude (Anthropic) as a generative AI coding tool. I have reviewed and validated the changes.

Branch target: Will retarget to rolling now.

@PavelGuzenfeld PavelGuzenfeld changed the base branch from humble to rolling March 22, 2026 05:34
@PavelGuzenfeld
Copy link
Author

Correction on the AI disclosure: the model used was Claude Opus (Anthropic).

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from d25878b to 765e0ae Compare March 22, 2026 05:42
@PavelGuzenfeld
Copy link
Author

Rebased onto rolling as requested. The diff is now clean (1 commit, 2 files changed).

The bug exists on humble and jazzy as well (see the reproducer in #2942). Would appreciate a backport to those distros once this is merged into rolling.

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from 765e0ae to bb770e9 Compare March 22, 2026 06:15
@PavelGuzenfeld
Copy link
Author

The CI failure is not caused by this PR. The build error is in subscription_base.cpp:486:

error: 'rcl_subscription_is_cft_supported' was not declared in this scope

This function was added to rclcpp rolling HEAD by #3089, but the CI's rcl package hasn't been updated to include the corresponding rcl_subscription_is_cft_supported API yet. The same failure affects all open rolling PRs — e.g., #3105 (unrelated) also fails identically.

This PR only modifies callback_group.cpp / callback_group.hpp and does not touch subscription_base.cpp. The failure will resolve once the CI's rcl package catches up with rolling HEAD.

@jmachowinski
Copy link
Collaborator

I am not sure about this patch as a whole. The scenario, that someone adds timers and removes hundreds of timers is super artificial, I don't see a real word scenario were someone would do this.

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from bb770e9 to 70fec3f Compare March 22, 2026 12:26
@PavelGuzenfeld
Copy link
Author

PavelGuzenfeld commented Mar 22, 2026

@jmachowinski Thanks for the review!

The concern that adding/removing hundreds of timers is artificial is understandable, but this O(N²) path is hit by several real-world patterns documented in existing rclcpp issues:

Pattern Scale Issue
Automotive rosbag2 recording ~4,000 subs/sec #1637
Dynamic topic create/delete (multithreaded) up to 100K topics #813
LifecycleNode batch activation many subs per transition #2151
Nav2 lifecycle transitions subs/pubs per state change nav2#5834
Embedded robotics (Nobleo) limited CPU budget #825

The key insight is that this fix has zero runtime cost — it doesn't add overhead or complexity to the hot path. It simply moves the expired-entry cleanup from add_* (which made each call O(N)) to collect_all_ptrs (which already iterates every entry on every executor spin). The cleanup happens in the same pass as collection, so there's no extra iteration.

What changed in this push:

  • Rebased cleanly onto rolling (was previously on humble)
  • Added 6 integration tests (test_callback_group_entity_cleanup) that verify:
    • Expired weak_ptrs are compacted during collect_all_ptrs
    • collect_all_ptrs only yields live entities
    • Adding 5,000 timers completes in well under the quadratic threshold
    • Interleaved add/remove cycles produce correct entity counts
    • Mixed entity types (timers + subscriptions) are cleaned independently

Test results (Docker, ros:rolling):

  • All 6 new integration tests pass
  • Full rclcpp test suite: 2962 tests, 0 errors (3 pre-existing failures in test_parameter_event_handler, unrelated)
  • test_allocator_memory_strategy, test_memory_strategy, test_add_callback_groups_to_executor: all pass

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from 70fec3f to 8bf323c Compare March 22, 2026 12:28
Comment on lines -261 to -265
std::vector<rclcpp::SubscriptionBase::WeakPtr> subscription_ptrs_;
std::vector<rclcpp::TimerBase::WeakPtr> timer_ptrs_;
std::vector<rclcpp::ServiceBase::WeakPtr> service_ptrs_;
std::vector<rclcpp::ClientBase::WeakPtr> client_ptrs_;
std::vector<rclcpp::Waitable::WeakPtr> waitable_ptrs_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use mutable here, better change the function collect_all_ptrs to be non const

Comment on lines +75 to +93
template<typename T, typename Func>
void collect_and_compact(
std::vector<typename T::WeakPtr> & ptrs,
const Func & func)
{
size_t write_idx = 0;
for (size_t read_idx = 0; read_idx < ptrs.size(); ++read_idx) {
auto ref_ptr = ptrs[read_idx].lock();
if (ref_ptr) {
func(ref_ptr);
if (write_idx != read_idx) {
ptrs[write_idx] = std::move(ptrs[read_idx]);
}
++write_idx;
}
}
ptrs.resize(write_idx);
}
} // namespace
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use remove_if or erase_if (We just bumped to C++20)

@PavelGuzenfeld
Copy link
Author

Addressed both review comments:

  1. Removed mutable from entity vectorscollect_all_ptrs is now non-const instead. All callers use non-const SharedPtr (locked from WeakPtr), so this is safe.

  2. Replaced collect_and_compact helper with std::remove_if — using the standard erase-remove idiom. Note: the project currently builds with C++17 (CMAKE_CXX_STANDARD 17), so used erase(remove_if(...)) instead of C++20's std::erase_if. Happy to switch to std::erase_if once the C++20 bump lands.

Also fixed uncrustify formatting in the test file.

Verified in Docker (osrf/ros:rolling-desktop):

  • Build: clean
  • Unit tests (callback_group_entity_cleanup, allocator_memory_strategy, memory_strategy, add_callback_groups_to_executor): 77/77 pass
  • Lint (uncrustify, cpplint, cppcheck): 1222 tests, 0 failures
  • ASAN: 49 tests, 0 failures
  • UBSan: 49 tests, 0 failures

@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch 3 times, most recently from 2130ddd to 023724b Compare March 23, 2026 01:10
Move expired weak_ptr cleanup from add_* methods (O(N) per call, O(N²)
total when adding N entities) into collect_all_ptrs via a new
collect_and_compact helper. collect_all_ptrs already iterates all
entries on every executor spin, so cleanup happens at zero extra cost.

The add_* methods become O(1) (just push_back). This fixes the
quadratic slowdown reported in ros2#2942 (429ms → 6ms for 10,000 timers).

This change is safe because:
- collect_all_ptrs already skipped expired entries; now it also removes
  them in the same pass
- The executor calls collect_all_ptrs on every spin iteration, so
  expired entries are cleaned up promptly
- All find_*_ptrs_if methods already handle expired weak_ptrs gracefully

Fixes ros2#2942

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
Tests verify that:
- Expired weak_ptrs are compacted during collect_all_ptrs
- collect_all_ptrs only yields live entities
- Adding N timers is not quadratic (5000 timers in < 5s)
- Interleaved add/remove cycles produce correct entity counts
- Mixed entity types (timers + subscriptions) are cleaned independently

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
- Remove mutable from entity vectors, make collect_all_ptrs non-const
- Replace collect_and_compact helper with std::remove_if (erase-remove idiom)
- Remove leaked test_time_source_deadlock CMake entry from another branch
- Fix uncrustify formatting in test file

Signed-off-by: Pavel Guzenfeld <pavelguzenfeld@gmail.com>
@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/callback-group-quadratic-add-humble branch from 023724b to 98e7db3 Compare March 23, 2026 01:16
@PavelGuzenfeld
Copy link
Author

The CI failure (Rpr__rclcpp__ubuntu_noble_amd64) is a pre-existing upstream issue unrelated to this PR:

rclcpp/src/rclcpp/subscription_base.cpp:486:10: error: 'rcl_subscription_is_cft_supported' was not declared in this scope

This was introduced by commit df2ac88 (#3089), which uses an rcl API (rcl_subscription_is_cft_supported) not yet available in the CI build farm's packaged rcl. It would fail on any PR against rolling right now.

Our changes (callback_group.hpp/cpp + tests) build and pass cleanly — verified locally in Docker with the unrelated commit reverted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Why does adding N entities to a CallbackGroup take N^2, i.e. quadratic time ?

3 participants