Fix: reset signal slots after Phase 2 barrier for reentrant collectiv… by georgebisbas · Pull Request #1 · georgebisbas/simpler

georgebisbas · 2026-05-26T08:49:22Z

…e kernels

The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each signal slot permanently at 1 after the first kernel invocation. A second call on the same HCCL window (e.g. inside a training loop) would find signal_base[peer] >= 1 already, causing every TWAIT to return immediately without actually synchronising — a silent correctness bug.

Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window and flush with a second pipe_barrier(PIPE_ALL).

Safety of the reset:
signal_base[i] in OUR window is where peer i writes its arrival
notification. TWAIT(signal_base[i], GE 1) returning guarantees that
peer i's AtomicAdd DMA has already committed to our local memory; no
further write to signal_base[i] can arrive from the current invocation.
We are the exclusive writer of our own signal_base, so zeroing it here
races with nothing.

Applied uniformly to all three collective kernels:

allreduce_distributed
allgather_distributed (new, added in parent commit)
reduce_scatter_distributed (new, added in parent commit)

Addresses the HIGH-priority barrier comment from the Gemini review on hw-native-sys#842

…e kernels The TNOTIFY(AtomicAdd, 1) + TWAIT(GE, 1) barrier pattern leaves each signal slot permanently at 1 after the first kernel invocation. A second call on the same HCCL window (e.g. inside a training loop) would find signal_base[peer] >= 1 already, causing every TWAIT to return immediately without actually synchronising — a silent correctness bug. Fix: after all TWAIT calls complete and pipe_barrier(PIPE_ALL) drains in-flight MTE2 reads, zero signal_base[0..nranks-1] in the local window and flush with a second pipe_barrier(PIPE_ALL). Safety of the reset: signal_base[i] in OUR window is where peer i writes its arrival notification. TWAIT(signal_base[i], GE 1) returning guarantees that peer i's AtomicAdd DMA has already committed to our local memory; no further write to signal_base[i] can arrive from the current invocation. We are the exclusive writer of our own signal_base, so zeroing it here races with nothing. Applied uniformly to all three collective kernels: - allreduce_distributed - allgather_distributed (new, added in parent commit) - reduce_scatter_distributed (new, added in parent commit) Addresses the HIGH-priority barrier comment from the Gemini review on hw-native-sys#842

Keep kernel source concise while preserving full safety rationale in prior commit message on this branch (60db4df). Behavior is unchanged.

…unning-onboard (hw-native-sys#990) `.claude/rules/task-submit-isolation.md` asserted a specific causal chain — "another user on the same physical chip" → contend on FFTS / shared L2 / MMU TLB / register MMIO → AICore ACK-no-FIN → 507018 — and treated it as a known mechanism. The whole chain rests on a single anecdote (the "npu-lock 8 held 5 hours" example) where the fix was "rerun under task-submit lock and observe pass," which is correlation, not a controlled measurement of the mechanism. There is no controlled repro of "two users, same chip, sibling die, deterministic 507018." In a recent debugging session a stale build artifact (binary built before later C++ edits) reproduced 507018 deterministically across 4 attempts on different chip pairs with full chip-pair locks. Re-running pip install fixed it. The repeated "must be chip-shared contention" hypothesis I cited from this rule was wrong and burned several diagnostic rounds before someone called it out. Changes: * Rename .claude/rules/task-submit-isolation.md -> running-onboard.md. The file's title is already "NPU Hardware Isolation" — sim variants (a2a3sim / a5sim) are silicon-agnostic and never need task-submit; the filename now reflects that scope. 5 references in skills and cann-examples READMEs updated to the new path. * "Why" rewritten as a queue-management rule: shared dev box, polite device allocation, parity with CI which always uses task-submit. No causal claim about specific failure modes. * "If unlocked results contradict CI ... check another process on the same chip" replaced with "rule out binary/source skew, wrong arch, stale CMake cache first." * Anti-pattern #1 reworded: unlocked shell loops break queue accounting and race other users for devices — not "your iter results are entangled" (which implied the chip-shared contention model). * Quick-reference rename: "Per-die contention map" -> "Per-die utilization + process table." npu-smi info shows utilization, not a contention map. Kept: * The rule that hardware work goes through task-submit when available. * The fallback / unisolated guidance for environments without task-submit. * The arch-precheck section (the 507018/507899 wrong-arch failure mode is documented separately in onboard-arch-precheck and is backed by controlled repro). Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>

georgebisbas added 2 commits May 26, 2026 10:25

Chore: condense barrier-reset rationale comments to 2 lines

9ae0fc8

Keep kernel source concise while preserving full safety rationale in prior commit message on this branch (60db4df). Behavior is unchanged.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: reset signal slots after Phase 2 barrier for reentrant collectiv…#1

Fix: reset signal slots after Phase 2 barrier for reentrant collectiv…#1
georgebisbas wants to merge 2 commits into
feat/allgather-reduce-scatter-examplesfrom
feat/reentrant-barrier-signal-reset

georgebisbas commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

georgebisbas commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant