perf(l3): remove useless exchange buffer from ring allreduce kernel#1059
perf(l3): remove useless exchange buffer from ring allreduce kernel#1059georgebisbas wants to merge 5 commits into
Conversation
…al, add CopyChunk Remove vestigial exchange publish buffer (peers read chunks[] directly via CommRemotePtr, no local publish needed). Compact per-round signal matrix from kMaxSupportedRanks-stride to nranks-stride (50-87% memory saved). Extract shared CopyChunk template for stage-in/stage-out. Issue remote and local TLOADs back-to-back in RS loop to overlap latency. Add explicit signal zero-initialization at kernel entry. Algorithm-neutral: same RS+AG schedule, same golden.
The back-to-back dual-TLOAD in the reduce-scatter loop caused 507018 (aicore error) on both simulator and NPU — all 4 ranks deadlocked in RoundBarrier. On A2/A3, MTE2 can only pipeline one outstanding vector load; issuing a second TLOAD before the first wait_flag corrupts the event pipeline, preventing TNOTIFY from reaching peers. Root cause: hw constraint, not a logic error. Restore original pattern: TLOAD(remote) -> wait -> TLOAD(local) -> wait -> TADD -> TSTORE. All other v2 improvements (exchange removal, compact signal, CopyChunk helper, signal zero-init) are unaffected.
Explicit zero-initialization of signal slots at kernel entry races with in-flight peer TNOTIFY AtomicAdds across the HCCL window. This was the root cause of the 507018 aicore error — the same RingZeroSignals race documented and fixed in PR hw-native-sys#975 (commit 690efbc). The HCCL window is guaranteed zero-initialized by the runtime; per-round AtomicAdd 0→1 + TWAIT GE 1 is safe without explicit reset. Restore the original simpler:main pattern: no signal zeroing, rely on fresh window.
…er:main patterns The CopyChunk template function and the compact nranks-stride signal layout both caused 507015/507018 on NPU. Revert to inline Phase 1/4 code and kMaxSupportedRanks signal stride — both verified working in simpler:main and in the pypto runtime copy of this kernel. Only the exchange buffer removal remains as the intentional diff from simpler:main.
📝 WalkthroughWalkthroughThe ring allreduce scratch window layout is simplified by removing the intermediate per-round exchange/publish slot. Both the Python host sizing ( ChangesRing Allreduce Scratch Layout Simplification
Possibly Related PRs
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request simplifies the distributed all-reduce ring implementation by removing the separate exchange publish buffer, allowing peers to read directly from each other's chunk slots. This eliminates redundant data transfers and reduces the scratch buffer size. The review feedback suggests optimizing the scratch offset calculations in both the C++ kernel and Python orchestrator by using the compile-time constant ALLREDUCE_COUNT directly, which avoids runtime calculations and prevents memory over-allocation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/workers/l3/allreduce_ring_distributed/main.py (1)
31-33: 💤 Low valueMinor: Ambiguous multiplication sign character.
The docstring uses
×(Unicode multiplication sign) which Ruff flags as potentially confusing. Consider usingxfor consistency with ASCII text, though this is purely cosmetic.-Scratch layout: P equal chunk slots followed by a 2(P-1)×kMax signal +Scratch layout: P equal chunk slots followed by a 2(P-1) x kMax signal🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/workers/l3/allreduce_ring_distributed/main.py` around lines 31 - 33, The docstring in the scratch layout comment uses a Unicode multiplication sign `×` instead of an ASCII character, which Ruff flags as potentially confusing. In the docstring starting at line 31 that describes the scratch layout with "2(P-1)×kMax signal matrix", replace the `×` character with a regular ASCII `x` to improve consistency and avoid linting warnings.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@examples/workers/l3/allreduce_ring_distributed/main.py`:
- Around line 31-33: The docstring in the scratch layout comment uses a Unicode
multiplication sign `×` instead of an ASCII character, which Ruff flags as
potentially confusing. In the docstring starting at line 31 that describes the
scratch layout with "2(P-1)×kMax signal matrix", replace the `×` character with
a regular ASCII `x` to improve consistency and avoid linting warnings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c2bbc3eb-2e92-4f50-ad10-530e667f168b
📒 Files selected for processing (2)
examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cppexamples/workers/l3/allreduce_ring_distributed/main.py
…e x) - clang-format: keep RoundBarrier signature on a single line - main.py docstring: replace U+00D7 multiplication sign with ASCII x
Summary
Remove the never-read
exchangepublish buffer from the ring allreduce AIV kernel. The kernel wrote a send chunk to a localexchangeslot before each per-round barrier, but peers always read directly from each other'schunks[]slots viaCommRemotePtr— theexchangewas written to but never consumed. Pure dead-code removal; the RS+AG schedule, signal layout, and golden are unchanged.What changed
kernels/aiv/allreduce_ring_kernel.cppexchangevariable, both RS+AG publish blocks (TLOAD→TSTORE→pipe_barrier), deadsend_idxcomputationmain.pyscratch_float_elems():(P+1)*chunk→P*chunk(saves 512 B for P=2); update docstringWhy
The exchange buffer was an artifact from the original PR #975 design where peers were expected to read from each other's
exchangeslot. That design was changed during review (commit690efbcf) — peers readchunks[]directly — but the exchange publish code and its memory allocation were never cleaned up.