Skip to content

perf(l3): remove useless exchange buffer from ring allreduce kernel#1059

Open
georgebisbas wants to merge 5 commits into
hw-native-sys:mainfrom
georgebisbas:feat/simpler/l3-ring-allreduce-v2
Open

perf(l3): remove useless exchange buffer from ring allreduce kernel#1059
georgebisbas wants to merge 5 commits into
hw-native-sys:mainfrom
georgebisbas:feat/simpler/l3-ring-allreduce-v2

Conversation

@georgebisbas

@georgebisbas georgebisbas commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Remove the never-read exchange publish buffer from the ring allreduce AIV kernel. The kernel wrote a send chunk to a local exchange slot before each per-round barrier, but peers always read directly from each other's chunks[] slots via CommRemotePtr — the exchange was written to but never consumed. Pure dead-code removal; the RS+AG schedule, signal layout, and golden are unchanged.

What changed

File Change
kernels/aiv/allreduce_ring_kernel.cpp Remove exchange variable, both RS+AG publish blocks (TLOAD→TSTORE→pipe_barrier), dead send_idx computation
main.py scratch_float_elems(): (P+1)*chunkP*chunk (saves 512 B for P=2); update docstring

Why

The exchange buffer was an artifact from the original PR #975 design where peers were expected to read from each other's exchange slot. That design was changed during review (commit 690efbcf) — peers read chunks[] directly — but the exchange publish code and its memory allocation were never cleaned up.

…al, add CopyChunk

Remove vestigial exchange publish buffer (peers read chunks[] directly via
CommRemotePtr, no local publish needed). Compact per-round signal matrix
from kMaxSupportedRanks-stride to nranks-stride (50-87% memory saved).
Extract shared CopyChunk template for stage-in/stage-out. Issue remote
and local TLOADs back-to-back in RS loop to overlap latency. Add explicit
signal zero-initialization at kernel entry.

Algorithm-neutral: same RS+AG schedule, same golden.
The back-to-back dual-TLOAD in the reduce-scatter loop caused 507018
(aicore error) on both simulator and NPU — all 4 ranks deadlocked in
RoundBarrier. On A2/A3, MTE2 can only pipeline one outstanding vector
load; issuing a second TLOAD before the first wait_flag corrupts the
event pipeline, preventing TNOTIFY from reaching peers.

Root cause: hw constraint, not a logic error. Restore original pattern:
TLOAD(remote) -> wait -> TLOAD(local) -> wait -> TADD -> TSTORE.
All other v2 improvements (exchange removal, compact signal, CopyChunk
helper, signal zero-init) are unaffected.
Explicit zero-initialization of signal slots at kernel entry races with
in-flight peer TNOTIFY AtomicAdds across the HCCL window.  This was the
root cause of the 507018 aicore error — the same RingZeroSignals race
documented and fixed in PR hw-native-sys#975 (commit 690efbc).

The HCCL window is guaranteed zero-initialized by the runtime; per-round
AtomicAdd 0→1 + TWAIT GE 1 is safe without explicit reset.  Restore the
original simpler:main pattern: no signal zeroing, rely on fresh window.
…er:main patterns

The CopyChunk template function and the compact nranks-stride signal
layout both caused 507015/507018 on NPU. Revert to inline Phase 1/4
code and kMaxSupportedRanks signal stride — both verified working in
simpler:main and in the pypto runtime copy of this kernel.

Only the exchange buffer removal remains as the intentional diff from
simpler:main.
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The ring allreduce scratch window layout is simplified by removing the intermediate per-round exchange/publish slot. Both the Python host sizing (SCRATCH_FLOAT_ELEMS_MAX, scratch_float_elems) and the C++ kernel (signal_base offset, Phase 2 reduce-scatter, Phase 3 allgather step loops) are updated to reflect that peers now read directly from each other's chunks[] slots via CommRemotePtr after a RoundBarrier.

Changes

Ring Allreduce Scratch Layout Simplification

Layer / File(s) Summary
Scratch layout contract: sizing constants and signal_base
examples/workers/l3/allreduce_ring_distributed/main.py, examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp
SCRATCH_FLOAT_ELEMS_MAX changes from (K_MAX_SUPPORTED_RANKS+1)*CHUNK_MAX to K_MAX_SUPPORTED_RANKS*CHUNK_MAX; scratch_float_elems() returns nranks*chunk; signal_base in the kernel now points to scratch + nranks*chunk_elems; docs in both files updated to remove mention of exchange buffer.
Phase 2/3 loop simplification: remove exchange-publish path
examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp
Reduce-scatter and allgather step-loop preludes drop the exchange-store + pipe_barrier path; loops now perform RoundBarrier then load from the neighbor's chunks[] slot via CommRemotePtr; RoundBarrier comments updated to document fresh-window zero-init and MTE pipe-drain contract.

Possibly Related PRs

  • hw-native-sys/simpler#975: Introduced the ring allreduce scratch layout and indexing that this PR modifies, specifically the exchange region and signal_base placement being changed here.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 No more exchange slot in between,
The chunks now speak directly, clean!
signal_base slides up a notch,
RoundBarrier drains the MTE watch.
Less scratch, same allreduce—
This bunny approves the tighter use! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately describes the main change: removing an unused exchange buffer from the ring allreduce kernel to reduce memory allocation. It directly aligns with the changeset.
Description check ✅ Passed The PR description accurately describes the changeset: removal of a never-read exchange buffer from the ring allreduce kernel, with supporting details about which files changed and why.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request simplifies the distributed all-reduce ring implementation by removing the separate exchange publish buffer, allowing peers to read directly from each other's chunk slots. This eliminates redundant data transfers and reduces the scratch buffer size. The review feedback suggests optimizing the scratch offset calculations in both the C++ kernel and Python orchestrator by using the compile-time constant ALLREDUCE_COUNT directly, which avoids runtime calculations and prevents memory over-allocation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread examples/workers/l3/allreduce_ring_distributed/main.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
examples/workers/l3/allreduce_ring_distributed/main.py (1)

31-33: 💤 Low value

Minor: Ambiguous multiplication sign character.

The docstring uses × (Unicode multiplication sign) which Ruff flags as potentially confusing. Consider using x for consistency with ASCII text, though this is purely cosmetic.

-Scratch layout: P equal chunk slots followed by a 2(P-1)×kMax signal
+Scratch layout: P equal chunk slots followed by a 2(P-1) x kMax signal
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/workers/l3/allreduce_ring_distributed/main.py` around lines 31 - 33,
The docstring in the scratch layout comment uses a Unicode multiplication sign
`×` instead of an ASCII character, which Ruff flags as potentially confusing. In
the docstring starting at line 31 that describes the scratch layout with
"2(P-1)×kMax signal matrix", replace the `×` character with a regular ASCII `x`
to improve consistency and avoid linting warnings.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/workers/l3/allreduce_ring_distributed/main.py`:
- Around line 31-33: The docstring in the scratch layout comment uses a Unicode
multiplication sign `×` instead of an ASCII character, which Ruff flags as
potentially confusing. In the docstring starting at line 31 that describes the
scratch layout with "2(P-1)×kMax signal matrix", replace the `×` character with
a regular ASCII `x` to improve consistency and avoid linting warnings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2bbc3eb-2e92-4f50-ad10-530e667f168b

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and 200ab2b.

📒 Files selected for processing (2)
  • examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp
  • examples/workers/l3/allreduce_ring_distributed/main.py

…e x)

- clang-format: keep RoundBarrier signature on a single line
- main.py docstring: replace U+00D7 multiplication sign with ASCII x
@georgebisbas georgebisbas changed the title perf(l3): remove vestigial exchange buffer from ring allreduce kernel perf(l3): remove useless exchange buffer from ring allreduce kernel Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant