docs(hardware): MMIO performance reference + 2 cann-example probe tools#1057
docs(hardware): MMIO performance reference + 2 cann-example probe tools#1057hw-native-sys-bot wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR corrects the ARM64 MMIO memory-type label from ChangesHardware Constraint Documentation + Memory-Type Correction
AICPU MMIO Probes and AICore Notification-Perf Benchmark Tools
Sequence DiagramsequenceDiagram
participant Host as launch_notif_perf
participant ACL as ACL runtime
participant AICoreStream as AICore stream
participant AICPUStream as AICPU stream
participant GM as NotifPerfHandshake
participant COND as COND register
Host->>ACL: aclInit, alloc GM buffers
Host->>AICPUStream: bootstrap rtsLaunchCpuKernel init
Host->>ACL: rtsBinaryLoadFromFile consumer SO
Host->>ACL: rtRegisterAllKernel producer object
Host->>AICoreStream: launch notif_perf_producer
Host->>AICPUStream: launch simpler_aicpu_run
AICPUStream->>GM: set mode=GM, go=1
AICoreStream->>GM: dcci invalidate, write p_tw + p_seq, dcci flush
AICPUStream->>GM: poll p_seq change, record tick delta
AICPUStream->>GM: set mode=COND, go=1
AICoreStream->>GM: dcci invalidate, write p_tw
AICoreStream->>COND: set_cond with encoded seq
AICPUStream->>COND: poll cond_addr change, record tick delta
Host->>ACL: synchronize both streams
Host->>Host: D2H NotifPerfResult, PrintResultTable
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request adds extensive documentation and standalone microbenchmarking tools (aicpu-mmio-probes and aicore-notification-perf) detailing the hardware constraints, memory attributes, and performance characteristics of the Ascend chip architecture. It corrects the memory attribute of the AIC_CTRL MMIO region from Device-nGnRnE to Device-nGnRE and documents critical constraints, such as the AICore's inability to write to DATA_MAIN_BASE or access the SPR MMIO window. The code review feedback suggests optimizing a hot-path modulo operation in the MMIO probes tool by using a power-of-two buffer size and a bitwise AND operation to avoid expensive integer division.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
18ca388 to
3f220fd
Compare
|
The 3 CI failures ( Verification: same test fails on recent main commits Not blocking the PR on these; they should clear when #1022 is fixed. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h (1)
46-71: 💤 Low valueConsider adding
static_assertfor ABI stability.The sibling header
handshake.husesstatic_assert(sizeof(NotifPerfHandshake) == 128, ...)to guard against cross-compiler layout drift.MmioProbeResultlacks this protection, yet it is shared between the host (x86-64 or aarch64) and the AICPU device (aarch64). Adding a size assertion would catch accidental layout changes during maintenance.Suggested addition after line 71
}; +static_assert(sizeof(MmioProbeResult) == 176, "MmioProbeResult layout must match across compilers"); constexpr uint32_t kMmioProbeResultMagic = 0xABCD1234u;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h` around lines 46 - 71, Add a static_assert after the MmioProbeResult struct definition to verify its size and guard against accidental layout changes. Following the pattern used in handshake.h for NotifPerfHandshake, insert a static_assert that checks the sizeof(MmioProbeResult) equals the expected size (you will need to determine the correct expected size based on the struct's current layout). This protects against cross-compiler ABI drift since MmioProbeResult is shared between the host and AICPU device.tools/cann-examples/aicpu-mmio-probes/host/launch.cpp (1)
450-458: 💤 Low value
init_handleis retrieved but never invoked.The kernel framework typically expects
simpler_aicpu_initto be called beforesimpler_aicpu_run. While the currentsimpler_aicpu_initis a no-op (returns 0), skipping it entirely may cause issues if the init function is extended later or if the runtime framework expects the init callback to have been registered/invoked.Consider invoking init before run
RT_CHECK(rtsFuncGetByName(binary_handle, init_op, &init_handle), "rtsFuncGetByName init"); RT_CHECK(rtsFuncGetByName(binary_handle, run_op, &run_handle), "rtsFuncGetByName run"); } -(void)init_handle; + +// ---- Launch init (no-op, but keeps contract consistent) ---- +{ + struct LaunchArgs { + uint64_t _pad[5] = {0}; + uint64_t device_args_ptr = 0; + uint64_t reserved[20] = {0}; + } la = {}; + la.device_args_ptr = reinterpret_cast<uint64_t>(dev_args.ptr); + rtCpuKernelArgs_t cpu_args = {}; + cpu_args.baseArgs.args = &la; + cpu_args.baseArgs.argsSize = sizeof(la); + rtLaunchKernelAttr_t attr = {}; + rtKernelLaunchCfg_t cfg = {&attr, 0}; + RT_CHECK(rtsLaunchCpuKernel(init_handle, 1u, stream, &cfg, &cpu_args), "rtsLaunchCpuKernel init"); +} +ACL_CHECK(aclrtSynchronizeStream(stream), "sync after init");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/cann-examples/aicpu-mmio-probes/host/launch.cpp` around lines 450 - 458, The `init_handle` is retrieved using `rtsFuncGetByName` but is never invoked before the run function; instead it is marked as unused with `(void)init_handle;`. The kernel framework expects `simpler_aicpu_init` to be called before `simpler_aicpu_run` to properly initialize the kernel. Remove the unused variable cast and add an invocation of the `init_handle` function before the `run_handle` is called, following the proper initialization-then-execution pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpp`:
- Around line 62-67: The WaitForChange function contains an unbounded polling
loop that can spin indefinitely if the poll condition is never satisfied,
blocking host stream synchronization. Add a timeout or deadline parameter to the
function and implement a bounded wait mechanism that exits after a reasonable
time interval. When the timeout is exceeded without a successful poll result,
propagate the failure by setting an appropriate error code to
result->consumer_rc instead of returning an arbitrary value, ensuring callers
can detect and handle this timeout condition gracefully.
- Around line 108-110: The consumer is setting hank->go = 0 at multiple
locations (at the start of each subtest phase) which signals the producer to
exit, but the producer checks go == 0 as its exit condition and will terminate
before the COND test phase completes, causing deadlock. Refactor to maintain a
single producer lifetime across both E2E subtests by removing the go = 0
assignments before the subtest phases at all three affected sites (lines
108-110, 160-161, and 217-218 in consumer.cpp), and instead set go = 0 only once
at the very end after all subtests complete. Alternatively, introduce explicit
pause and terminate state variables to distinguish between temporarily stopping
the producer and permanently exiting it, rather than overloading the go flag for
both purposes.
In `@tools/cann-examples/aicore-notification-perf/host/launch.cpp`:
- Around line 332-334: The code casts results from std::atoi directly to
uint32_t for device_id, target_core_idx, and n_samples without validating that
the parsed integers are non-negative. Negative values from std::atoi become
large unsigned values when cast to uint32_t, causing invalid COND MMIO
addressing and pathological runtime behavior. Add validation after each
std::atoi call to check that the parsed integer is non-negative before casting
to uint32_t, and handle invalid input appropriately (e.g., exit with an error or
use a default value).
In `@tools/cann-examples/aicore-notification-perf/README.md`:
- Around line 123-125: The `target_core_idx` parameter description in the
README.md is incorrect and misleading. The current text states it selects which
AIC core to drive the producer on, but the actual behavior is that the launcher
uses it to choose which COND register the consumer polls. Correct the
description of the `target_core_idx` parameter to accurately reflect that it
determines which COND register the consumer polls, removing the incorrect
reference to the producer core.
---
Nitpick comments:
In `@tools/cann-examples/aicpu-mmio-probes/host/launch.cpp`:
- Around line 450-458: The `init_handle` is retrieved using `rtsFuncGetByName`
but is never invoked before the run function; instead it is marked as unused
with `(void)init_handle;`. The kernel framework expects `simpler_aicpu_init` to
be called before `simpler_aicpu_run` to properly initialize the kernel. Remove
the unused variable cast and add an invocation of the `init_handle` function
before the `run_handle` is called, following the proper
initialization-then-execution pattern.
In `@tools/cann-examples/aicpu-mmio-probes/shared/probes_types.h`:
- Around line 46-71: Add a static_assert after the MmioProbeResult struct
definition to verify its size and guard against accidental layout changes.
Following the pattern used in handshake.h for NotifPerfHandshake, insert a
static_assert that checks the sizeof(MmioProbeResult) equals the expected size
(you will need to determine the correct expected size based on the struct's
current layout). This protects against cross-compiler ABI drift since
MmioProbeResult is shared between the host and AICPU device.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3e32bef6-7e8b-41d5-b393-2e3045b76318
📒 Files selected for processing (25)
.claude/rules/ascend.mddocs/aicore-kernel-programming.mddocs/hardware/cache-coherency.mddocs/hardware/cann-source-references.mddocs/hardware/chip-architecture.mddocs/hardware/mmio-performance.mddocs/investigations/2026-06-aicore-mmio-to-spr.mddocs/investigations/2026-06-cond-vs-gm-notification.mddocs/investigations/README.mdsrc/a2a3/platform/shared/aicpu/platform_regs.cpptools/README.mdtools/cann-examples/aicore-notification-perf/README.mdtools/cann-examples/aicore-notification-perf/device-aicore/CMakeLists.txttools/cann-examples/aicore-notification-perf/device-aicore/producer.ccetools/cann-examples/aicore-notification-perf/device-aicpu/CMakeLists.txttools/cann-examples/aicore-notification-perf/device-aicpu/consumer.cpptools/cann-examples/aicore-notification-perf/host/CMakeLists.txttools/cann-examples/aicore-notification-perf/host/launch.cpptools/cann-examples/aicore-notification-perf/shared/handshake.htools/cann-examples/aicpu-mmio-probes/README.mdtools/cann-examples/aicpu-mmio-probes/device/CMakeLists.txttools/cann-examples/aicpu-mmio-probes/device/probes.cpptools/cann-examples/aicpu-mmio-probes/host/CMakeLists.txttools/cann-examples/aicpu-mmio-probes/host/launch.cpptools/cann-examples/aicpu-mmio-probes/shared/probes_types.h
3f220fd to
97dcb19
Compare
|
Second CI run swapped failure modes — instead of PA highperf, every distributed/HCCL test fails with Still recommending we ignore the |
97dcb19 to
1c0db0c
Compare
CI failure root-causeBoth failing jobs ( on devices 5 / 8 / 9 / 10 / 11. Every failing case is an HCCL multi-chip test ( Not caused by this PR
Most likely cause: runner-contention / wedged device stateThe failing chips (8–11) come from the shared onboard runner. The HCCL ActionJust rebased onto fresh |
New docs/hardware/mmio-performance.md is the authoritative reference
for AIC_CTRL register-window behavior:
- Memory attribute proven Device-nGnRE (not nGnRnE) by driver source
trace (gitcode.com/cann/driver) cross-validated against measured
posted-STR / nR-LDR / non-gather tear behavior.
- Cost table: STR posted 5 ns/burst, LDR strict-ordered 95-105 ns,
E2E write-to-read 140 ns.
- Concurrency model: single AICPU thread LDR COND is strictly serial
(nR attribute); multi-thread cross-core is fully parallel — fixes
the common "polling COND is sequential" miscommunication.
- DATA_MAIN_BASE is hardware-unidirectional: AICPU writes, AICore
SPR-reads. No write path exists on the AICore side at all.
Two new tools/cann-examples/ artifacts make Phase 4 + 12 (AICPU side
microbench) and Phase 13 + 14 (GM vs COND notification comparison)
reproducible without depending on the runtime in src/:
- tools/cann-examples/aicpu-mmio-probes/ — single AICPU SO + host
launcher. Phase 4 burst STR + STR/LDR round trip, Phase 12 single +
multi-thread LDR COND scaling.
- tools/cann-examples/aicore-notification-perf/ — AICore producer +
AICPU consumer + host launcher. Phase 13 idle LDR rate, Phase 14
per-event E2E latency for both notification paths.
Two new docs/investigations/ entries close out the negative findings:
- 2026-06-aicore-mmio-to-spr.md — AICore-side LSU cannot reach SPR
MMIO (hangs chip); CCEC has no destination encoding for
MOV DATA_MAIN_BASE.
- 2026-06-cond-vs-gm-notification.md — COND keeps the FIN-signal
role (1.7x E2E latency win); GM-coherent open for future polling-
rate-bound hint channels.
Existing files updated for cross-linking and to fix prior wrong
attribute claims:
- docs/hardware/cache-coherency.md — 3 places nGnRnE -> nGnRE.
- src/a2a3/platform/shared/aicpu/platform_regs.cpp — comment fix.
- docs/hardware/chip-architecture.md — "what to read next" rows.
- docs/aicore-kernel-programming.md — new section 4 listing AICore
hardware constraints; section 4.1 captures one-run AIC-vs-AIV
SPR microarchitecture surprises (AIC and AIV equal on hot
set_cond+read_reg; AIC rotation no penalty, AIV rotation 5x).
- .claude/rules/ascend.md — always-loaded hard-constraint section.
- tools/README.md + docs/investigations/README.md — index entries.
A complementary docs/hardware/cann-source-references.md catalogs the
gitcode.com/cann/{driver,hccl,hcomm} opensource trees we hit most
often during simpler dev, with the clone-to-build/ convention from
.claude/skills/multi-repo-setup/.
1c0db0c to
3dfd68a
Compare
CI re-run after rebase — mostly cleanRebased onto
3 remaining failures — all transient network glitch
This is Python stdlib HTTP chunked-transfer parsing failing while downloading compiler binaries — transient runner network glitch / upstream artifact-server flake. The same "Set up C++ compiler" step succeeded on 13 other jobs in this exact run (st-onboard-a5, st-sim-a5 ubuntu+macos, ut x4, packaging x2, etc.) — so it is not deterministic. Recommend a rerun-failed of just the 3 affected jobs once the runner network is healthy. |
Summary
docs/hardware/mmio-performance.mdis the authoritative reference for AIC_CTRL register-window behavior: memory attribute (provenDevice-nGnREfrom driver source, notnGnRnEas previously claimed in 4 places), single STR/LDR cost numbers, single-thread serial vs multi-thread parallel LDR scaling, and the DMB-is-hardware-unidirectional constraint.tools/cann-examples/probe tools make the measurements reproducible without depending on the runtime insrc/:aicpu-mmio-probes/— AICPU SO + host launcher, Phase 4 (STR DMB burst + round trip) + Phase 12 (single + multi-thread LDR COND scaling).aicore-notification-perf/— AICore producer + AICPU consumer + host launcher, Phase 13 (idle LDR rate) + Phase 14 (per-event E2E latency for GM-dcci vs COND MMIO paths).docs/investigations/entries close out the negative findings (AICore can't reach SPR MMIO from inside; CCEC has noMOV DATA_MAIN_BASE, xencoding; GM vs COND notification tradeoff matrix).docs/hardware/cann-source-references.mdcatalogs thegitcode.com/cann/{driver,hccl,hcomm}opensource trees we hit during simpler dev with the clone-to-build/convention.Device-nGnRnE->Device-nGnREclaim indocs/hardware/cache-coherency.md(3 places) +src/a2a3/platform/shared/aicpu/platform_regs.cpp(1 place); the prior claim was contradicted by burst-STR posted-write behavior measured here..claude/rules/ascend.mdgets an always-loaded hard-constraint section so future proposals to make AICore write DMB / reach SPR MMIO are short-circuited.How the measurements line up with the new docs
Test plan
pre-commit run— clang-format / clang-tidy / cpplint / markdownlint all pass.ASCEND_HOME_PATHset, run viatask-submitagainst a free NPU, verify the numbers reproduce within ±10%.docs/hardware/cache-coherency.mdanddocs/aicore-kernel-programming.mdto ensure terminology stays consistent.Notes
platform_regs.cpp. The experimental probe code that produced the numbers stays onexperiment/dmb-64bit-probeand is not part of this PR; the tools in this PR are the standalone, branch-independent reproduction path.docs/aicore-kernel-programming.md§4.1 with an explicit "single-run, verify before relying" caveat.