Skip to content

Fix: fanin producer dedup lookup cost#1066

Open
sunkaixuan2018 wants to merge 1 commit into
hw-native-sys:mainfrom
sunkaixuan2018:fix-issue-1013
Open

Fix: fanin producer dedup lookup cost#1066
sunkaixuan2018 wants to merge 1 commit into
hw-native-sys:mainfrom
sunkaixuan2018:fix-issue-1013

Conversation

@sunkaixuan2018

Copy link
Copy Markdown
Contributor

Summary

  • Replace linear fanin producer dedup checks with per-ring epoch markers in a2a3/a5 runtime fanin building.
  • Deduplicate producers in the common hierarchical infer_deps path with a set.
  • Add runtime C++ UT coverage for duplicate explicit producers entering fanin only once on both a2a3 and a5.

Fixes #1013

Tests

  • Remote a2a3/a5 orchestrator fanin UT binaries: passed
  • Remote C++ UT: ctest --test-dir tests/ut/cpp/build-oldabi --output-on-failure -> 41/41 passed
  • Earlier remote Python no-hardware UT: 391 passed, 10 deselected
  • Earlier a2a3 hardware scene smokes: vector_example, test_l3_dependency.py, test_l3_group.py passed

Notes

  • Full hardware UT still has pre-existing HCCL/domain failures in comm allocation/lifecycle paths (507899/507018 class), not in the fanin dedup path.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f66acd7f-f425-46dd-a8f9-0bc621dd43a7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces O(N²) linear-scan producer deduplication in the a2a3 and a5 PTO2 orchestrators and the common hierarchical orchestrator with O(1) mechanisms. The PTO2 path gains per-ring epoch buffers (new fields, arena reservation/init/wire/destroy lifecycle), a next_fanin_seen_epoch helper, and a mark_seen slot-indexing check inside PTO2FaninBuilder. The hierarchical path switches to std::unordered_set. Unit tests covering duplicate-dependency scenarios are added for all three paths.

Changes

Epoch-based O(1) fan-in deduplication

Layer / File(s) Summary
PTO2OrchestratorLayout and State: epoch field declarations
src/a2a3/.../pto_orchestrator.h, src/a5/.../pto_orchestrator.h
Adds off_fanin_seen_epoch[], fanin_seen_capacity[], fanin_seen_epoch[], and fanin_seen_current_epoch fields to PTO2OrchestratorLayout and PTO2OrchestratorState in both a2a3 and a5 headers.
Arena reserve, init, wire, and destroy for epoch buffers
src/a2a3/.../shared/pto_runtime2_init.cpp, src/a5/.../shared/pto_runtime2_init.cpp
Extends reserve_layout, init_data_from_layout, wire_arena_pointers, and destroy in both shared init files to allocate, zero-initialize, wire, and clear per-ring fanin_seen_epoch buffers sized from task_window_sizes.
next_fanin_seen_epoch, PTO2FaninBuilder::mark_seen, and append_fanin_or_fail
src/a2a3/.../pto_orchestrator.cpp, src/a5/.../pto_orchestrator.cpp
Adds next_fanin_seen_epoch() (counter advance and per-ring zeroing on wrap), extends PTO2FaninBuilder with orch pointer and epoch, implements mark_seen() using producer_task_id.local() masked by capacity-1, and updates append_fanin_or_fail signature and all call sites in submit_task_common.
Hierarchical orchestrator: unordered_set dedup in infer_deps
src/common/hierarchical/orchestrator.cpp
Replaces per-candidate linear scan of the producers vector with a pre-reserved std::unordered_set<TaskSlot>, adding the required <unordered_set> header.
Unit tests: duplicate producer deduplication
tests/ut/cpp/CMakeLists.txt, tests/ut/cpp/a2a3/test_orchestrator_fanin.cpp, tests/ut/cpp/a5/test_orchestrator_fanin.cpp, tests/ut/cpp/hierarchical/test_orchestrator.cpp
Adds OrchestratorFaninTest fixtures for a2a3 and a5 asserting fanin_actual_count == 1 for a consumer with duplicate explicit deps, GroupDuplicateInputsKeepOneProducer for the hierarchical path, and CMake helper functions plus test targets for both runtime variants.

Sequence Diagram(s)

sequenceDiagram
  participant submit_task_common
  participant next_fanin_seen_epoch
  participant PTO2FaninBuilder
  participant append_fanin_or_fail
  participant mark_seen

  submit_task_common->>next_fanin_seen_epoch: advance epoch (zero rings on wrap)
  next_fanin_seen_epoch-->>submit_task_common: seen_epoch
  submit_task_common->>PTO2FaninBuilder: construct(orch, seen_epoch)
  note over submit_task_common,PTO2FaninBuilder: replaces prior contains() scan approach
  loop each producer dependency (explicit or tensormap-driven)
    submit_task_common->>append_fanin_or_fail: producer_task_id, prod_state, builder, ring_id
    append_fanin_or_fail->>mark_seen: producer_task_id
    mark_seen-->>append_fanin_or_fail: slot epoch matches? → already seen
    append_fanin_or_fail-->>submit_task_common: skip duplicate / record fanin slot
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 No more O(N²) for this bunny's hop!
Each epoch I stamp makes the scanning stop.
A ring buffer, a mask, and a counter of one—
Duplicate deps? They're already done.
Hash-based sets where linear once crept,
My fan-in is tidy, my throughput adept! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description clearly relates to the changeset, explaining the optimization approach (epoch markers and set-based dedup), test coverage, and testing results.
Linked Issues check ✅ Passed All requirements from issue #1013 are met: replaced O(N²) linear scans with per-ring epoch markers in a2a3/a5 [#1013], added set-based dedup in hierarchical path [#1013], and preserved producer ordering while achieving O(N) complexity [#1013].
Out of Scope Changes check ✅ Passed All changes are directly related to the issue's deduplication performance improvements: fanin epoch tracking, set-based lookups, test coverage for duplicate producers, and CMake test configuration.
Title check ✅ Passed The title 'Fix: fanin producer dedup lookup cost' directly addresses the main change—replacing O(N²) linear scan deduplication with constant-time epoch/set-based lookups to improve fanin producer deduplication performance.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an epoch-based fan-in deduplication mechanism to the orchestrator runtimes to prevent duplicate dependencies, along with corresponding unit tests. The reviewer's feedback highlights two key areas for improvement: first, adding runtime assertions to guarantee that the task window sizes are powers of two, which is required for the bitwise indexing used in the deduplication logic; second, replacing the newly introduced std::unordered_set in the hot-path dependency inference with a linear scan to avoid dynamic memory allocation overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp Outdated
Comment thread src/common/hierarchical/orchestrator.cpp
Comment thread src/common/hierarchical/orchestrator.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp`:
- Around line 204-207: The epoch dedup table uses bitmasking
(producer_task_id.local() & (capacity - 1)) which requires capacity to be a
positive power of two; if not, unrelated producers can alias causing missed
dependencies. At
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
lines 204-207, add validation to check that task_window_sizes[r] is a positive
power of two before reserving the seen_epoch_bytes for
layout.fanin_seen_epoch[r]. Apply the same power-of-two validation at
src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
lines 204-207. At
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp lines
250-265, add an assertion to verify that layout.fanin_seen_capacity[r] is a
positive power of two before using the bitmask index operation. Apply the same
assertion at
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp lines
245-260.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f9ba8a9-96fd-490e-99c3-cff9039534c2

📥 Commits

Reviewing files that changed from the base of the PR and between 19b2c0b and f3e3a4f.

📒 Files selected for processing (11)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • src/common/hierarchical/orchestrator.cpp
  • tests/ut/cpp/CMakeLists.txt
  • tests/ut/cpp/a2a3/test_orchestrator_fanin.cpp
  • tests/ut/cpp/a5/test_orchestrator_fanin.cpp
  • tests/ut/cpp/hierarchical/test_orchestrator.cpp

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp Outdated
@sunkaixuan2018 sunkaixuan2018 force-pushed the fix-issue-1013 branch 3 times, most recently from 62c8c62 to ead2517 Compare June 16, 2026 12:51
@sunkaixuan2018 sunkaixuan2018 changed the title Fix fanin producer dedup lookup cost Fix: fanin producer dedup lookup cost Jun 16, 2026
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h Outdated
Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Outdated
@sunkaixuan2018 sunkaixuan2018 force-pushed the fix-issue-1013 branch 9 times, most recently from 0fc0c90 to d381d2f Compare June 17, 2026 13:11
- Add epoch-table fanin producer dedup for a2a3 and a5 runtime orchestrators.

- Guard the power-of-two window invariant used by bitmask slot lookup.

- Keep hierarchical producer dedup scalable and add fanin regression coverage.

- Add an a2a3 fanin lookup performance scene.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Performance] Orchestration fanin dedup uses O(N^2) linear scans during TensorMap lookup

2 participants