
docs(hardware): MMIO performance reference + 2 cann-example probe tools#1057

Open
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:docs/mmio-hardware-and-cann-refs

@hw-native-sys-bot

Collaborator

Summary

  • New docs/hardware/mmio-performance.md is the authoritative reference for AIC_CTRL register-window behavior: the memory attribute (proven to be Device-nGnRE from driver source, not Device-nGnRnE as previously claimed in 4 places), single STR/LDR cost numbers, single-thread serial vs multi-thread parallel LDR scaling, and the constraint that DATA_MAIN_BASE (abbreviated DMB in this PR; not the Arm DMB barrier instruction) is hardware-unidirectional.
  • Two new tools/cann-examples/ probe tools make the measurements reproducible without depending on the runtime in src/:
    • aicpu-mmio-probes/ — AICPU SO + host launcher, Phase 4 (STR DMB burst + round trip) + Phase 12 (single + multi-thread LDR COND scaling).
    • aicore-notification-perf/ — AICore producer + AICPU consumer + host launcher, Phase 13 (idle LDR rate) + Phase 14 (per-event E2E latency for GM-dcci vs COND MMIO paths).
  • Two docs/investigations/ entries close out the negative findings (AICore can't reach SPR MMIO from inside; CCEC has no MOV DATA_MAIN_BASE, x encoding; GM vs COND notification tradeoff matrix).
  • A complementary docs/hardware/cann-source-references.md catalogs the gitcode.com/cann/{driver,hccl,hcomm} open-source trees we consult most often during simpler development, along with the clone-to-build/ convention.
  • Fixes the Device-nGnRnE -> Device-nGnRE claim in docs/hardware/cache-coherency.md (3 places) + src/a2a3/platform/shared/aicpu/platform_regs.cpp (1 place); the prior claim was contradicted by burst-STR posted-write behavior measured here.
  • .claude/rules/ascend.md gets an always-loaded hard-constraint section so future proposals to make AICore write DATA_MAIN_BASE or reach SPR MMIO are short-circuited.

How the measurements line up with the new docs

| Phase | What | Measured | Where it lives |
|---|---|---|---|
| 4 | Burst STR DMB + round trip | 5 ns/STR, 240–300 ns | aicpu-mmio-probes |
| 12-A/B | Single thread LDR COND, same / rotating cores | 95 ns/LDR | aicpu-mmio-probes |
| 12-C | Multi-thread LDR COND (M = 1..3) | 95 ns/thread (linear scale) | aicpu-mmio-probes |
| 13 | Idle-state LDR rate (GM cache-hot vs COND MMIO) | 3 ns vs 100 ns | aicore-notification-perf |
| 14 | E2E "AICore writes -> AICPU sees" | GM 1040 ns avg / COND 600 ns avg | aicore-notification-perf |
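
The figures in the table can be related with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of the PR, and the throughput model assumes each polling thread independently sustains the single-thread LDR cost, which is what the Phase 12-C row reports for M = 1..3.

```cpp
#include <cstdint>

// Aggregate COND polls per microsecond when `threads` AICPU threads each
// complete one LDR every `ns_per_ldr` nanoseconds (assumes linear scaling).
constexpr double AggregateLdrPerUs(uint32_t threads, double ns_per_ldr) {
    return static_cast<double>(threads) * 1000.0 / ns_per_ldr;
}

// Ratio of two average latencies, e.g. the GM path vs the COND path in Phase 14.
constexpr double LatencyRatio(double slower_ns, double faster_ns) {
    return slower_ns / faster_ns;
}
```

With the table values, `LatencyRatio(1040.0, 600.0)` is about 1.73, and `AggregateLdrPerUs(3, 95.0)` is about 31.6 polls per microsecond across three threads.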

Test plan

  • pre-commit run — clang-format / clang-tidy / cpplint / markdownlint all pass.
  • Build the two tools end-to-end with ASCEND_HOME_PATH set, run via task-submit against a free NPU, verify the numbers reproduce within ±10%.
  • Cross-link review against docs/hardware/cache-coherency.md and docs/aicore-kernel-programming.md to ensure terminology stays consistent.
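
The "within ±10%" criterion in the test plan can be made mechanical. This is a minimal sketch with a hypothetical helper name; it assumes the reference value is the one listed in the measurement table and that the tolerance is relative to that reference.

```cpp
#include <cmath>

// Returns true when `measured` lies within +/- `rel_tol` (a fraction, e.g.
// 0.10 for 10%) of `reference`. Non-positive references and negative
// tolerances are rejected rather than silently accepted.
inline bool WithinTolerance(double measured, double reference, double rel_tol = 0.10) {
    if (reference <= 0.0 || rel_tol < 0.0) return false;
    return std::fabs(measured - reference) <= rel_tol * reference;
}
```

For example, a rerun that reports 100 ns/LDR against the 95 ns/LDR reference passes, while 110 ns/LDR does not.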

Notes

  • No production runtime code is touched apart from the file-header comment in platform_regs.cpp. The experimental probe code that produced the numbers stays on experiment/dmb-64bit-probe and is not part of this PR; the tools in this PR are the standalone, branch-independent reproduction path.
  • Phase 5/6 AIC-vs-AIV microarchitecture findings (AIC == AIV on hot set_cond+read_reg; AIC rotation no penalty, AIV rotation 5x) are documented in docs/aicore-kernel-programming.md §4.1 with an explicit "single-run, verify before relying" caveat.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026


Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 18826a44-5fbf-4fa8-8027-415eaab888cb

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR corrects the ARM64 MMIO memory-type label from Device-nGnRnE to Device-nGnRE across documentation and a production source comment, adds hardware-constraint documentation (mmio-performance.md, aicore-kernel-programming.md extensions, two investigation write-ups, CANN source references), and introduces two standalone benchmark tools—aicpu-mmio-probes and aicore-notification-perf—each with shared ABI headers, device-side implementations, and C++ host launchers.

Changes

Hardware Constraint Documentation + Memory-Type Correction

| Layer / File(s) | Summary |
|---|---|
| Device-nGnRE memory-type correction: docs/hardware/cache-coherency.md, src/a2a3/platform/shared/aicpu/platform_regs.cpp, .claude/rules/ascend.md | Replaces Device-nGnRnE with Device-nGnRE in three call-sites in the cache-coherency doc, the MMIO ordering comment in platform_regs.cpp, and the AI rules hard-constraints section. |
| New mmio-performance.md and AICore hard-constraints doc: docs/hardware/mmio-performance.md, docs/aicore-kernel-programming.md | Adds mmio-performance.md covering Device-nGnRE attribute evidence, per-operation cost table, concurrency model, DATA_MAIN_BASE directionality constraints, COND-vs-GM trade-off table, and rerun instructions. Extends aicore-kernel-programming.md with a hard-constraints section and AIC-vs-AIV SPR self-access timing measurements. |
| Investigation write-ups and index: docs/investigations/2026-06-aicore-mmio-to-spr.md, docs/investigations/2026-06-cond-vs-gm-notification.md, docs/investigations/README.md | Adds the AICore→SPR MMIO investigation (Phase 10 hang, Phase 11 compile error) and the COND-vs-GM notification comparison (polling and latency tables, COND-as-default conclusion), with two new entries in the investigations index. |
| CANN source references and navigation additions: docs/hardware/cann-source-references.md, docs/hardware/chip-architecture.md, tools/README.md | Adds cann-source-references.md (repo list, clone script, per-repo cheat sheets), two new rows in the chip-architecture "What to read next" table, and two new tool subsections in tools/README.md. |

AICPU MMIO Probes and AICore Notification-Perf Benchmark Tools

| Layer / File(s) | Summary |
|---|---|
| Shared ABI headers for both tools: tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h, tools/cann-examples/aicore-notification-perf/shared/handshake.h | Defines MmioProbeResult, MmioProbeDeviceArgs, and register constants for the MMIO probes tool; defines NotifPerfHandshake, NotifPerfResult, NotifPerfDeviceArgs, NotifPerfMode enum, and bitmask constants for the notification-perf tool. Both use static_assert-guarded cache-line-aligned layouts. |
| aicpu-mmio-probes, device-side probe SO: tools/cann-examples/aicpu-mmio-probes/device/probes.cpp, tools/cann-examples/aicpu-mmio-probes/device/CMakeLists.txt | Implements SysCntAicpu() timing via cntvct_el0, Phase 4 STR-burst DMB, Phase 12-A/B single-thread LDR-COND loops, Phase 12-C multi-threaded LDR-COND with pthread fan-out and best-effort cleanup, and exported simpler_aicpu_init/simpler_aicpu_run entrypoints. |
| aicpu-mmio-probes, host launcher and docs: tools/cann-examples/aicpu-mmio-probes/host/launch.cpp, tools/cann-examples/aicpu-mmio-probes/host/CMakeLists.txt, tools/cann-examples/aicpu-mmio-probes/README.md | Implements ELF build-id fingerprinting with FNV-1a fallback, RAII ACL wrappers, halMemCtl AIC-ctrl-base query, JSON descriptor generation, rtsLaunchCpuKernel dispatch, D2H result copy, and PrintResultTable covering Phase 4 and Phase 12 per-concurrency metrics. |
| aicore-notification-perf, AICore producer and AICPU consumer: tools/cann-examples/aicore-notification-perf/device-aicore/producer.cce, tools/cann-examples/aicore-notification-perf/device-aicore/CMakeLists.txt, tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp, tools/cann-examples/aicore-notification-perf/device-aicpu/CMakeLists.txt | AICore kernel: dcci-invalidated go-wait loop, mode-aware GM vs COND publish paths with set_cond, throttle spin. AICPU consumer: RunSubtestGm polling p_seq, RunSubtestCond polling MMIO cond_addr, idle-LDR sweep, simpler_aicpu_run with COND address computation. |
| aicore-notification-perf, host launcher and docs: tools/cann-examples/aicore-notification-perf/host/launch.cpp, tools/cann-examples/aicore-notification-perf/host/CMakeLists.txt, tools/cann-examples/aicore-notification-perf/README.md | Host launcher: ELF fingerprinting, DeviceContext with two streams, rtsBinaryLoadFromFile for consumer SO, rtRegisterAllKernel for producer .o, dual-stream launch and synchronize, D2H NotifPerfResult copy, PrintResultTable emitting GM-vs-COND latency and idle-LDR rates. README documents the full pipeline, build/run steps, expected output, and portability caveats. |
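
The launcher summary mentions an FNV-1a fallback for fingerprinting when no ELF build-id is available. The PR's actual hashing code is not shown in this thread, so the following is only a sketch of the standard 64-bit FNV-1a algorithm; the function name is hypothetical, and the choice of the 64-bit variant is an assumption.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a: XOR each input byte into the hash, then multiply by the
// FNV prime. Arithmetic wraps modulo 2^64 by virtue of uint64_t.
constexpr uint64_t Fnv1a64(std::string_view bytes) {
    uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : bytes) {
        hash ^= c;
        hash *= 0x100000001b3ULL;           // FNV 64-bit prime
    }
    return hash;
}
```

Hashing the whole binary this way gives a stable fingerprint for identical files, but unlike a build-id it is not embedded in the ELF itself and has to be recomputed from the file contents.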

Sequence Diagram

sequenceDiagram
    participant Host as launch_notif_perf
    participant ACL as ACL runtime
    participant AICoreStream as AICore stream
    participant AICPUStream as AICPU stream
    participant GM as NotifPerfHandshake
    participant COND as COND register

    Host->>ACL: aclInit, alloc GM buffers
    Host->>AICPUStream: bootstrap rtsLaunchCpuKernel init
    Host->>ACL: rtsBinaryLoadFromFile consumer SO
    Host->>ACL: rtRegisterAllKernel producer object
    Host->>AICoreStream: launch notif_perf_producer
    Host->>AICPUStream: launch simpler_aicpu_run
    AICPUStream->>GM: set mode=GM, go=1
    AICoreStream->>GM: dcci invalidate, write p_tw + p_seq, dcci flush
    AICPUStream->>GM: poll p_seq change, record tick delta
    AICPUStream->>GM: set mode=COND, go=1
    AICoreStream->>GM: dcci invalidate, write p_tw
    AICoreStream->>COND: set_cond with encoded seq
    AICPUStream->>COND: poll cond_addr change, record tick delta
    Host->>ACL: synchronize both streams
    Host->>Host: D2H NotifPerfResult, PrintResultTable

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hoppity-hop through the MMIO lane,
Device-nGnRE — the right name, hooray!
Two kernels now measure on AICore and CPU,
COND beats GM by a tick or two.
The rabbit has benchmarked, the hardware confessed,
STR posts fast, LDR serialized best! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main additions: a new MMIO performance documentation reference and two CANN example probe tools for reproducible measurements. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing a detailed summary of the new documentation, probe tools, corrections to prior claims, and a measurement-to-documentation mapping table. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds extensive documentation and standalone microbenchmarking tools (aicpu-mmio-probes and aicore-notification-perf) detailing the hardware constraints, memory attributes, and performance characteristics of the Ascend chip architecture. It corrects the memory attribute of the AIC_CTRL MMIO region from Device-nGnRnE to Device-nGnRE and documents critical constraints, such as the AICore's inability to write to DATA_MAIN_BASE or access the SPR MMIO window. The code review feedback suggests optimizing a hot-path modulo operation in the MMIO probes tool by using a power-of-two buffer size and a bitwise AND operation to avoid expensive integer division.
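
The modulo-to-mask suggestion can be illustrated as follows. This is a sketch under assumptions: the constant and function names are hypothetical, and the buffer size of 1024 is chosen only for illustration, not taken from probes.cpp.

```cpp
#include <cstddef>

constexpr std::size_t kRingSize = 1024;  // hypothetical size; must be a power of two
static_assert(kRingSize != 0 && (kRingSize & (kRingSize - 1)) == 0,
              "kRingSize must be a power of two for the mask to be valid");

// Same result as i % kRingSize for unsigned i, expressed as a single AND.
constexpr std::size_t WrapIndex(std::size_t i) {
    return i & (kRingSize - 1);
}
```

Compilers typically perform this rewrite on their own for an unsigned modulo by a compile-time power-of-two constant, so the explicit mask matters mainly when the size is a runtime value or the index is signed.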

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/cann-examples/aicpu-mmio-probes/device/probes.cpp
@hw-native-sys-bot hw-native-sys-bot force-pushed the docs/mmio-hardware-and-cann-refs branch from 18ca388 to 3f220fd Compare June 16, 2026 01:36
@hw-native-sys-bot

Collaborator Author

The 3 CI failures (st-onboard-a2a3, st-sim-a2a3 ubuntu, st-sim-a2a3 macos) are all the same test — spmd_paged_attention_highperf::test_run — and are pre-existing on upstream/main. Tracked as #1022.

Verification: same test fails on recent main commits 2668be5a (#1042), ea4d2045 (#1051) without this PR in the picture. This PR only changes a file-header comment in src/a2a3/platform/shared/aicpu/platform_regs.cpp and otherwise adds docs + standalone CMake tools under tools/cann-examples/ — none of it can affect the PA highperf golden output.

Not blocking the PR on these; they should clear when #1022 is fixed.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h (1)

46-71: 💤 Low value

Consider adding static_assert for ABI stability.

The sibling header handshake.h uses static_assert(sizeof(NotifPerfHandshake) == 128, ...) to guard against cross-compiler layout drift. MmioProbeResult lacks this protection, yet it is shared between the host (x86-64 or aarch64) and the AICPU device (aarch64). Adding a size assertion would catch accidental layout changes during maintenance.

Suggested addition after line 71
 };
+static_assert(sizeof(MmioProbeResult) == 176, "MmioProbeResult layout must match across compilers");
 
 constexpr uint32_t kMmioProbeResultMagic = 0xABCD1234u;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h` around lines 46
- 71, Add a static_assert after the MmioProbeResult struct definition to verify
its size and guard against accidental layout changes. Following the pattern used
in handshake.h for NotifPerfHandshake, insert a static_assert that checks the
sizeof(MmioProbeResult) equals the expected size (you will need to determine the
correct expected size based on the struct's current layout). This protects
against cross-compiler ABI drift since MmioProbeResult is shared between the
host and AICPU device.
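
As an illustration of the layout guard described above, the sketch below uses a hypothetical struct, not the real MmioProbeResult, whose fields and expected size are not shown in this thread. It assumes a 64-byte cache line and pins both the total size and one field offset at compile time.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record shared between a host and a device build.
struct alignas(64) ExampleProbeResult {
    uint32_t magic;
    uint32_t rc;
    uint64_t ticks_begin;
    uint64_t ticks_end;
    uint64_t samples[4];
};

// 56 bytes of fields, padded up to the 64-byte alignment.
static_assert(sizeof(ExampleProbeResult) == 64, "ExampleProbeResult size drifted");
static_assert(alignof(ExampleProbeResult) == 64, "ExampleProbeResult must be cache-line aligned");
static_assert(offsetof(ExampleProbeResult, samples) == 24, "samples offset drifted");
```

Asserting an offset as well as the size catches field reorderings that happen to leave the total size unchanged.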
tools/cann-examples/aicpu-mmio-probes/host/launch.cpp (1)

450-458: 💤 Low value

init_handle is retrieved but never invoked.

The kernel framework typically expects simpler_aicpu_init to be called before simpler_aicpu_run. While the current simpler_aicpu_init is a no-op (returns 0), skipping it entirely may cause issues if the init function is extended later or if the runtime framework expects the init callback to have been registered/invoked.

Consider invoking init before run
     RT_CHECK(rtsFuncGetByName(binary_handle, init_op, &init_handle), "rtsFuncGetByName init");
     RT_CHECK(rtsFuncGetByName(binary_handle, run_op, &run_handle), "rtsFuncGetByName run");
 }
-(void)init_handle;
+
+// ---- Launch init (no-op, but keeps contract consistent) ----
+{
+    struct LaunchArgs {
+        uint64_t _pad[5] = {0};
+        uint64_t device_args_ptr = 0;
+        uint64_t reserved[20] = {0};
+    } la = {};
+    la.device_args_ptr = reinterpret_cast<uint64_t>(dev_args.ptr);
+    rtCpuKernelArgs_t cpu_args = {};
+    cpu_args.baseArgs.args = &la;
+    cpu_args.baseArgs.argsSize = sizeof(la);
+    rtLaunchKernelAttr_t attr = {};
+    rtKernelLaunchCfg_t cfg = {&attr, 0};
+    RT_CHECK(rtsLaunchCpuKernel(init_handle, 1u, stream, &cfg, &cpu_args), "rtsLaunchCpuKernel init");
+}
+ACL_CHECK(aclrtSynchronizeStream(stream), "sync after init");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cann-examples/aicpu-mmio-probes/host/launch.cpp` around lines 450 -
458, The `init_handle` is retrieved using `rtsFuncGetByName` but is never
invoked before the run function; instead it is marked as unused with
`(void)init_handle;`. The kernel framework expects `simpler_aicpu_init` to be
called before `simpler_aicpu_run` to properly initialize the kernel. Remove the
unused variable cast and add an invocation of the `init_handle` function before
the `run_handle` is called, following the proper initialization-then-execution
pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp`:
- Around line 62-67: The WaitForChange function contains an unbounded polling
loop that can spin indefinitely if the poll condition is never satisfied,
blocking host stream synchronization. Add a timeout or deadline parameter to the
function and implement a bounded wait mechanism that exits after a reasonable
time interval. When the timeout is exceeded without a successful poll result,
propagate the failure by setting an appropriate error code to
result->consumer_rc instead of returning an arbitrary value, ensuring callers
can detect and handle this timeout condition gracefully.
- Around line 108-110: The consumer is setting hank->go = 0 at multiple
locations (at the start of each subtest phase) which signals the producer to
exit, but the producer checks go == 0 as its exit condition and will terminate
before the COND test phase completes, causing deadlock. Refactor to maintain a
single producer lifetime across both E2E subtests by removing the go = 0
assignments before the subtest phases at all three affected sites (lines
108-110, 160-161, and 217-218 in consumer.cpp), and instead set go = 0 only once
at the very end after all subtests complete. Alternatively, introduce explicit
pause and terminate state variables to distinguish between temporarily stopping
the producer and permanently exiting it, rather than overloading the go flag for
both purposes.
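
A bounded wait of the kind requested for WaitForChange might look like the sketch below. It is host-runnable by design: std::atomic and std::chrono::steady_clock stand in for the MMIO/GM word and the cntvct_el0 counter that the real consumer uses, and the function name and return convention are hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <optional>

// Polls `word` until its value differs from `last_seen` or `timeout` elapses.
// Returns the new value on change, or std::nullopt on timeout so the caller
// can record an error code instead of using an arbitrary value.
inline std::optional<uint32_t> WaitForChangeBounded(const std::atomic<uint32_t>& word,
                                                    uint32_t last_seen,
                                                    std::chrono::nanoseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const uint32_t v = word.load(std::memory_order_acquire);
        if (v != last_seen) return v;
        if (std::chrono::steady_clock::now() >= deadline) return std::nullopt;
    }
}
```

Reading the clock on every iteration adds overhead to the poll loop, so a latency benchmark would likely check the deadline only every N iterations to avoid perturbing the measurement.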

In `@tools/cann-examples/aicore-notification-perf/host/launch.cpp`:
- Around line 332-334: The code casts results from std::atoi directly to
uint32_t for device_id, target_core_idx, and n_samples without validating that
the parsed integers are non-negative. Negative values from std::atoi become
large unsigned values when cast to uint32_t, causing invalid COND MMIO
addressing and pathological runtime behavior. Add validation after each
std::atoi call to check that the parsed integer is non-negative before casting
to uint32_t, and handle invalid input appropriately (e.g., exit with an error or
use a default value).
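
One way to validate such arguments is to parse with std::strtoll and range-check before narrowing. The helper name below is hypothetical, and this is a sketch rather than the fix that was actually pushed to the PR.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <optional>

// Parses a decimal, non-negative integer that fits in uint32_t.
// Returns std::nullopt for null or empty input, trailing junk, negative
// values, or values above UINT32_MAX.
inline std::optional<uint32_t> ParseU32(const char* text) {
    if (text == nullptr || *text == '\0') return std::nullopt;
    errno = 0;
    char* end = nullptr;
    const long long v = std::strtoll(text, &end, 10);
    if (errno == ERANGE || end == text || *end != '\0') return std::nullopt;
    if (v < 0 || v > static_cast<long long>(std::numeric_limits<uint32_t>::max())) {
        return std::nullopt;
    }
    return static_cast<uint32_t>(v);
}
```

Unlike std::atoi, this reports failure explicitly, so a launcher can print a usage error and exit instead of passing a wrapped-around value such as 4294967295 for an input of -1.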

In `@tools/cann-examples/aicore-notification-perf/README.md`:
- Around line 123-125: The `target_core_idx` parameter description in the
README.md is incorrect and misleading. The current text states it selects which
AIC core to drive the producer on, but the actual behavior is that the launcher
uses it to choose which COND register the consumer polls. Correct the
description of the `target_core_idx` parameter to accurately reflect that it
determines which COND register the consumer polls, removing the incorrect
reference to the producer core.

---

Nitpick comments:
In `@tools/cann-examples/aicpu-mmio-probes/host/launch.cpp`:
- Around line 450-458: The `init_handle` is retrieved using `rtsFuncGetByName`
but is never invoked before the run function; instead it is marked as unused
with `(void)init_handle;`. The kernel framework expects `simpler_aicpu_init` to
be called before `simpler_aicpu_run` to properly initialize the kernel. Remove
the unused variable cast and add an invocation of the `init_handle` function
before the `run_handle` is called, following the proper
initialization-then-execution pattern.

In `@tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h`:
- Around line 46-71: Add a static_assert after the MmioProbeResult struct
definition to verify its size and guard against accidental layout changes.
Following the pattern used in handshake.h for NotifPerfHandshake, insert a
static_assert that checks the sizeof(MmioProbeResult) equals the expected size
(you will need to determine the correct expected size based on the struct's
current layout). This protects against cross-compiler ABI drift since
MmioProbeResult is shared between the host and AICPU device.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3e32bef6-7e8b-41d5-b393-2e3045b76318

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and 3f220fd.

📒 Files selected for processing (25)
  • .claude/rules/ascend.md
  • docs/aicore-kernel-programming.md
  • docs/hardware/cache-coherency.md
  • docs/hardware/cann-source-references.md
  • docs/hardware/chip-architecture.md
  • docs/hardware/mmio-performance.md
  • docs/investigations/2026-06-aicore-mmio-to-spr.md
  • docs/investigations/2026-06-cond-vs-gm-notification.md
  • docs/investigations/README.md
  • src/a2a3/platform/shared/aicpu/platform_regs.cpp
  • tools/README.md
  • tools/cann-examples/aicore-notification-perf/README.md
  • tools/cann-examples/aicore-notification-perf/device-aicore/CMakeLists.txt
  • tools/cann-examples/aicore-notification-perf/device-aicore/producer.cce
  • tools/cann-examples/aicore-notification-perf/device-aicpu/CMakeLists.txt
  • tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp
  • tools/cann-examples/aicore-notification-perf/host/CMakeLists.txt
  • tools/cann-examples/aicore-notification-perf/host/launch.cpp
  • tools/cann-examples/aicore-notification-perf/shared/handshake.h
  • tools/cann-examples/aicpu-mmio-probes/README.md
  • tools/cann-examples/aicpu-mmio-probes/device/CMakeLists.txt
  • tools/cann-examples/aicpu-mmio-probes/device/probes.cpp
  • tools/cann-examples/aicpu-mmio-probes/host/CMakeLists.txt
  • tools/cann-examples/aicpu-mmio-probes/host/launch.cpp
  • tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h

Comment thread tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp Outdated
Comment thread tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp Outdated
Comment thread tools/cann-examples/aicore-notification-perf/host/launch.cpp
Comment thread tools/cann-examples/aicore-notification-perf/README.md
@hw-native-sys-bot hw-native-sys-bot force-pushed the docs/mmio-hardware-and-cann-refs branch from 3f220fd to 97dcb19 Compare June 16, 2026 01:54
@hw-native-sys-bot

Collaborator Author

Second CI run swapped failure modes — instead of PA highperf, every distributed/HCCL test fails with comm_alloc_domain_windows failed with code -1. That is the documented runner-contention pattern (wedged device state from a prior CI job, per feedback-no-chip-shared-contention-model lineage); not PR-related. Force-push from the CodeRabbit follow-up just re-queued the same shared runner.

Still recommending we ignore the st-onboard-a2a3 CI signal on this PR — none of the failures touch code under tools/cann-examples/ or docs/.

@hw-native-sys-bot hw-native-sys-bot force-pushed the docs/mmio-hardware-and-cann-refs branch from 97dcb19 to 1c0db0c Compare June 16, 2026 02:13
@hw-native-sys-bot

Collaborator Author

CI failure root-cause

Both failing jobs (st-onboard-a2a3, ut-a2a3) hit the identical error:

RuntimeError: comm_alloc_domain_windows failed with code -1

on devices 5 / 8 / 9 / 10 / 11. Every failing case is an HCCL multi-chip test (alloc_domain -> control_alloc_domain -> comm_alloc_domain_windows).

Not caused by this PR

  • This PR touches docs, tools/cann-examples/ (standalone CMake, not built by CI), and one 3-line comment in src/a2a3/platform/shared/aicpu/platform_regs.cpp. None of that can affect HCCL setup.
  • Most recently merged PRs on main pass both jobs:
    • #1051 ut-a2a3 SUCCESS, st-onboard-a2a3 SUCCESS
    • #1055 ut-a2a3 SUCCESS, st-onboard-a2a3 SKIPPED
  • The 2 commits I was behind upstream (#1051, #1056) are tensor-arg refactors — neither touches comm, hccl, alloc_domain, or distributed code.

Most likely cause: runner-contention / wedged device state

The failing chips (8–11) come from the shared onboard runner. The HCCL comm_alloc_domain_windows -1 failure pattern is consistent with the runner having stale comm-domain state from a prior CI job (similar to the documented 507899 runner-contention pattern but on a different error code).

Action

Just rebased onto fresh upstream/main and force-pushed (1c0db0c0). If the next run still hits the same comm_alloc_domain_windows -1 on the same chips, the runner needs a reset rather than a code fix.

New docs/hardware/mmio-performance.md is the authoritative reference
for AIC_CTRL register-window behavior:

- Memory attribute proven Device-nGnRE (not nGnRnE) by driver source
  trace (gitcode.com/cann/driver) cross-validated against measured
  posted-STR / nR-LDR / non-gather tear behavior.
- Cost table: STR posted 5 ns/burst, LDR strict-ordered 95-105 ns,
  E2E write-to-read 140 ns.
- Concurrency model: single AICPU thread LDR COND is strictly serial
  (nR attribute); multi-thread cross-core is fully parallel — fixes
  the common "polling COND is sequential" miscommunication.
- DATA_MAIN_BASE is hardware-unidirectional: AICPU writes, AICore
  SPR-reads. No write path exists on the AICore side at all.

Two new tools/cann-examples/ artifacts make Phase 4 + 12 (AICPU side
microbench) and Phase 13 + 14 (GM vs COND notification comparison)
reproducible without depending on the runtime in src/:

- tools/cann-examples/aicpu-mmio-probes/ — single AICPU SO + host
  launcher. Phase 4 burst STR + STR/LDR round trip, Phase 12 single +
  multi-thread LDR COND scaling.
- tools/cann-examples/aicore-notification-perf/ — AICore producer +
  AICPU consumer + host launcher. Phase 13 idle LDR rate, Phase 14
  per-event E2E latency for both notification paths.

Two new docs/investigations/ entries close out the negative findings:

- 2026-06-aicore-mmio-to-spr.md — AICore-side LSU cannot reach SPR
  MMIO (hangs chip); CCEC has no destination encoding for
  MOV DATA_MAIN_BASE.
- 2026-06-cond-vs-gm-notification.md — COND keeps the FIN-signal
  role (1.7x E2E latency win); GM-coherent open for future polling-
  rate-bound hint channels.

Existing files updated for cross-linking and to fix prior wrong
attribute claims:

- docs/hardware/cache-coherency.md — 3 places nGnRnE -> nGnRE.
- src/a2a3/platform/shared/aicpu/platform_regs.cpp — comment fix.
- docs/hardware/chip-architecture.md — "what to read next" rows.
- docs/aicore-kernel-programming.md — new section 4 listing AICore
  hardware constraints; section 4.1 captures one-run AIC-vs-AIV
  SPR microarchitecture surprises (AIC and AIV equal on hot
  set_cond+read_reg; AIC rotation no penalty, AIV rotation 5x).
- .claude/rules/ascend.md — always-loaded hard-constraint section.
- tools/README.md + docs/investigations/README.md — index entries.

A complementary docs/hardware/cann-source-references.md catalogs the
gitcode.com/cann/{driver,hccl,hcomm} opensource trees we hit most
often during simpler dev, with the clone-to-build/ convention from
.claude/skills/multi-repo-setup/.
@hw-native-sys-bot hw-native-sys-bot force-pushed the docs/mmio-hardware-and-cann-refs branch from 1c0db0c to 3dfd68a Compare June 16, 2026 12:45
@hw-native-sys-bot

Collaborator Author

CI re-run after rebase — mostly clean

Rebased onto upstream/main (which now has #1068 skipping the failing PA + sdma_async tests). Both previously failing a2a3 jobs now PASS:

  • st-onboard-a2a3 ✅ (was failing on comm_alloc_domain_windows)
  • ut-a2a3 ✅ (same)

3 remaining failures, all transient network glitches

profiling-flags-smoke, st-sim-a2a3 (ubuntu), st-sim-a2a3 (macos) all failed at the same setup step — "Set up C++ compiler":

ValueError: invalid literal for int() with base 16: b
  File "/usr/lib/python3.12/http/client.py", line 600, in _read_chunked

This is the Python stdlib HTTP chunked-transfer parser failing while downloading compiler binaries, which points to a transient runner network glitch or an upstream artifact-server flake. The same "Set up C++ compiler" step succeeded on 13 other jobs in this exact run (st-onboard-a5, st-sim-a5 ubuntu+macos, ut x4, packaging x2, etc.), so the failure is not deterministic.

Recommend a rerun-failed of just the 3 affected jobs once the runner network is healthy.


2 participants