Fix: fanin producer dedup lookup cost#1066
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughReplaces O(N²) linear-scan producer deduplication in the a2a3 and a5 PTO2 orchestrators and the common hierarchical orchestrator with O(1) mechanisms. The PTO2 path gains per-ring epoch buffers (new fields, arena reservation/init/wire/destroy lifecycle), a ChangesEpoch-based O(1) fan-in deduplication
Sequence Diagram(s)sequenceDiagram
participant submit_task_common
participant next_fanin_seen_epoch
participant PTO2FaninBuilder
participant append_fanin_or_fail
participant mark_seen
submit_task_common->>next_fanin_seen_epoch: advance epoch (zero rings on wrap)
next_fanin_seen_epoch-->>submit_task_common: seen_epoch
submit_task_common->>PTO2FaninBuilder: construct(orch, seen_epoch)
note over submit_task_common,PTO2FaninBuilder: replaces prior contains() scan approach
loop each producer dependency (explicit or tensormap-driven)
submit_task_common->>append_fanin_or_fail: producer_task_id, prod_state, builder, ring_id
append_fanin_or_fail->>mark_seen: producer_task_id
mark_seen-->>append_fanin_or_fail: slot epoch matches? → already seen
append_fanin_or_fail-->>submit_task_common: skip duplicate / record fanin slot
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces an epoch-based fan-in deduplication mechanism to the orchestrator runtimes to prevent duplicate dependencies, along with corresponding unit tests. The reviewer's feedback highlights two key areas for improvement: first, adding runtime assertions to guarantee that the task window sizes are powers of two, which is required for the bitwise indexing used in the deduplication logic; second, replacing the newly introduced std::unordered_set in the hot-path dependency inference with a linear scan to avoid dynamic memory allocation overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 204-207: The epoch dedup table uses bitmasking
(producer_task_id.local() & (capacity - 1)) which requires capacity to be a
positive power of two; if not, unrelated producers can alias causing missed
dependencies. At
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
lines 204-207, add validation to check that task_window_sizes[r] is a positive
power of two before reserving the seen_epoch_bytes for
layout.fanin_seen_epoch[r]. Apply the same power-of-two validation at
src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
lines 204-207. At
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp lines
250-265, add an assertion to verify that layout.fanin_seen_capacity[r] is a
positive power of two before using the bitmask index operation. Apply the same
assertion at
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp lines
245-260.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3f9ba8a9-96fd-490e-99c3-cff9039534c2
📒 Files selected for processing (11)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cppsrc/common/hierarchical/orchestrator.cpptests/ut/cpp/CMakeLists.txttests/ut/cpp/a2a3/test_orchestrator_fanin.cpptests/ut/cpp/a5/test_orchestrator_fanin.cpptests/ut/cpp/hierarchical/test_orchestrator.cpp
62c8c62 to
ead2517
Compare
ead2517 to
3f19e04
Compare
0fc0c90 to
d381d2f
Compare
- Add epoch-table fanin producer dedup for a2a3 and a5 runtime orchestrators. - Guard the power-of-two window invariant used by bitmask slot lookup. - Keep hierarchical producer dedup scalable and add fanin regression coverage. - Add an a2a3 fanin lookup performance scene.
d381d2f to
bb1354a
Compare
Summary
Fixes #1013
Tests
ctest --test-dir tests/ut/cpp/build-oldabi --output-on-failure-> 41/41 passedNotes