perf(l3): remove useless exchange buffer from ring allreduce kernel by georgebisbas · Pull Request #1059 · hw-native-sys/simpler

georgebisbas · 2026-06-15T12:48:29Z

Summary

Remove the never-read exchange publish buffer from the ring allreduce AIV kernel. The kernel wrote a send chunk to a local exchange slot before each per-round barrier, but peers always read directly from each other's chunks[] slots via CommRemotePtr — the exchange was written to but never consumed. Pure dead-code removal; the RS+AG schedule, signal layout, and golden are unchanged.

What changed

File	Change
`kernels/aiv/allreduce_ring_kernel.cpp`	Remove `exchange` variable, both RS+AG publish blocks (TLOAD→TSTORE→pipe_barrier), dead `send_idx` computation
`main.py`	`scratch_float_elems()`: `(P+1)chunk` → `Pchunk` (saves 512 B for P=2); update docstring

Why

The exchange buffer was an artifact from the original PR #975 design where peers were expected to read from each other's exchange slot. That design was changed during review (commit 690efbcf) — peers read chunks[] directly — but the exchange publish code and its memory allocation were never cleaned up.

…al, add CopyChunk Remove vestigial exchange publish buffer (peers read chunks[] directly via CommRemotePtr, no local publish needed). Compact per-round signal matrix from kMaxSupportedRanks-stride to nranks-stride (50-87% memory saved). Extract shared CopyChunk template for stage-in/stage-out. Issue remote and local TLOADs back-to-back in RS loop to overlap latency. Add explicit signal zero-initialization at kernel entry. Algorithm-neutral: same RS+AG schedule, same golden.

The back-to-back dual-TLOAD in the reduce-scatter loop caused 507018 (aicore error) on both simulator and NPU — all 4 ranks deadlocked in RoundBarrier. On A2/A3, MTE2 can only pipeline one outstanding vector load; issuing a second TLOAD before the first wait_flag corrupts the event pipeline, preventing TNOTIFY from reaching peers. Root cause: hw constraint, not a logic error. Restore original pattern: TLOAD(remote) -> wait -> TLOAD(local) -> wait -> TADD -> TSTORE. All other v2 improvements (exchange removal, compact signal, CopyChunk helper, signal zero-init) are unaffected.

Explicit zero-initialization of signal slots at kernel entry races with in-flight peer TNOTIFY AtomicAdds across the HCCL window. This was the root cause of the 507018 aicore error — the same RingZeroSignals race documented and fixed in PR hw-native-sys#975 (commit 690efbc). The HCCL window is guaranteed zero-initialized by the runtime; per-round AtomicAdd 0→1 + TWAIT GE 1 is safe without explicit reset. Restore the original simpler:main pattern: no signal zeroing, rely on fresh window.

…er:main patterns The CopyChunk template function and the compact nranks-stride signal layout both caused 507015/507018 on NPU. Revert to inline Phase 1/4 code and kMaxSupportedRanks signal stride — both verified working in simpler:main and in the pypto runtime copy of this kernel. Only the exchange buffer removal remains as the intentional diff from simpler:main.

coderabbitai · 2026-06-15T12:48:43Z

📝 Walkthrough

Walkthrough

The ring allreduce scratch window layout is simplified by removing the intermediate per-round exchange/publish slot. Both the Python host sizing (SCRATCH_FLOAT_ELEMS_MAX, scratch_float_elems) and the C++ kernel (signal_base offset, Phase 2 reduce-scatter, Phase 3 allgather step loops) are updated to reflect that peers now read directly from each other's chunks[] slots via CommRemotePtr after a RoundBarrier.

Changes

Ring Allreduce Scratch Layout Simplification

Layer / File(s)	Summary
Scratch layout contract: sizing constants and signal_base `examples/workers/l3/allreduce_ring_distributed/main.py`, `examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp`	`SCRATCH_FLOAT_ELEMS_MAX` changes from `(K_MAX_SUPPORTED_RANKS+1)CHUNK_MAX` to `K_MAX_SUPPORTED_RANKSCHUNK_MAX`; `scratch_float_elems()` returns `nrankschunk`; `signal_base` in the kernel now points to `scratch + nrankschunk_elems`; docs in both files updated to remove mention of exchange buffer.
Phase 2/3 loop simplification: remove exchange-publish path `examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp`	Reduce-scatter and allgather step-loop preludes drop the exchange-store + `pipe_barrier` path; loops now perform `RoundBarrier` then load from the neighbor's `chunks[]` slot via `CommRemotePtr`; `RoundBarrier` comments updated to document fresh-window zero-init and MTE pipe-drain contract.

Possibly Related PRs

hw-native-sys/simpler#975: Introduced the ring allreduce scratch layout and indexing that this PR modifies, specifically the exchange region and signal_base placement being changed here.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 No more exchange slot in between,
The chunks now speak directly, clean!
signal_base slides up a notch,
RoundBarrier drains the MTE watch.
Less scratch, same allreduce—
This bunny approves the tighter use! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately describes the main change: removing an unused exchange buffer from the ring allreduce kernel to reduce memory allocation. It directly aligns with the changeset.
Description check	✅ Passed	The PR description accurately describes the changeset: removal of a never-read exchange buffer from the ring allreduce kernel, with supporting details about which files changed and why.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request simplifies the distributed all-reduce ring implementation by removing the separate exchange publish buffer, allowing peers to read directly from each other's chunk slots. This eliminates redundant data transfers and reduces the scratch buffer size. The review feedback suggests optimizing the scratch offset calculations in both the C++ kernel and Python orchestrator by using the compile-time constant ALLREDUCE_COUNT directly, which avoids runtime calculations and prevents memory over-allocation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

🧹 Nitpick comments (1)

examples/workers/l3/allreduce_ring_distributed/main.py (1)
31-33: 💤 Low value

Minor: Ambiguous multiplication sign character.

The docstring uses × (Unicode multiplication sign) which Ruff flags as potentially confusing. Consider using x for consistency with ASCII text, though this is purely cosmetic.
-Scratch layout: P equal chunk slots followed by a 2(P-1)×kMax signal
+Scratch layout: P equal chunk slots followed by a 2(P-1) x kMax signal
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/workers/l3/allreduce_ring_distributed/main.py` around lines 31 - 33,
The docstring in the scratch layout comment uses a Unicode multiplication sign
`×` instead of an ASCII character, which Ruff flags as potentially confusing. In
the docstring starting at line 31 that describes the scratch layout with
"2(P-1)×kMax signal matrix", replace the `×` character with a regular ASCII `x`
to improve consistency and avoid linting warnings.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/workers/l3/allreduce_ring_distributed/main.py`:
- Around line 31-33: The docstring in the scratch layout comment uses a Unicode
multiplication sign `×` instead of an ASCII character, which Ruff flags as
potentially confusing. In the docstring starting at line 31 that describes the
scratch layout with "2(P-1)×kMax signal matrix", replace the `×` character with
a regular ASCII `x` to improve consistency and avoid linting warnings.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2bbc3eb-2e92-4f50-ad10-530e667f168b

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and 200ab2b.

📒 Files selected for processing (2)

examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp
examples/workers/l3/allreduce_ring_distributed/main.py

…e x) - clang-format: keep RoundBarrier signature on a single line - main.py docstring: replace U+00D7 multiplication sign with ASCII x

georgebisbas added 4 commits June 15, 2026 14:47

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp

Comment thread examples/workers/l3/allreduce_ring_distributed/main.py

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

style: fix clang-format (RoundBarrier sig) and CodeRabbit nit (Unicod…

583d415

…e x) - clang-format: keep RoundBarrier signature on a single line - main.py docstring: replace U+00D7 multiplication sign with ASCII x

georgebisbas changed the title ~~perf(l3): remove vestigial exchange buffer from ring allreduce kernel~~ perf(l3): remove useless exchange buffer from ring allreduce kernel Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(l3): remove useless exchange buffer from ring allreduce kernel#1059

perf(l3): remove useless exchange buffer from ring allreduce kernel#1059
georgebisbas wants to merge 5 commits into
hw-native-sys:mainfrom
georgebisbas:feat/simpler/l3-ring-allreduce-v2

georgebisbas commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Walkthrough

Changes

Possibly Related PRs

Estimated Code Review Effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

georgebisbas commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Why

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly Related PRs

Estimated Code Review Effort

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

georgebisbas commented Jun 15, 2026 •

edited

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading