Refactor: drop per-run AICPU init launch; per-thread run-wall capture by hw-native-sys-bot · Pull Request #1061 · hw-native-sys/simpler

hw-native-sys-bot · 2026-06-16T02:01:06Z

Summary

The onboard runtime launched a single-threaded simpler_aicpu_init kernel (block_dim=1) on every run whose only jobs were: snapshot CANN log severity, capture the run-wall start cycle, and zero the wall buffer. This removes that extra rtsLaunchCpuKernel round-trip and rehomes its work:

init_log_switch() now runs at the top of simpler_aicpu_exec (idempotent across the concurrent exec threads).
Run-wall reworked to per-thread buffers, no atomics. The wall buffer now holds one {start_cycle, end_cycle} pair per launched AICPU thread; each surviving thread plain-stores into its own slot (indexed by the affinity gate's exec index — now exposed for the a2a3 legacy gate too), and the host reduces max(end) - min(start) → ns on readback. The host resets the buffer each run. No cross-thread contention; visibility is the same post-stream-sync path the old single-store wall used.
simpler_aicpu_init is kept as a stub so host bootstrap's rtsFuncGetByName(InitName) still resolves.

Also in this PR:

Init phase timing: an INIT BREAKDOWN device-log line (PTO2_PROFILING, V9) reporting the orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/final_prep) — useful for diagnosing AICPU startup cost.
Doc fix: a stale comment claimed the init kernel populated workers[].core_type; it is actually written by the AICore handshake, so the host read reflects the prior run's values. Comment corrected.
Agent rule (.claude/rules/running-onboard.md): auto-detect arch capability + task-submit and invoke device work without prompting.

Mirrored across a2a3 and a5 onboard.

Testing

a2a3 onboard: 4-card distributed replay, exit 0; init launch gone, all chip processes ready, replay finished
a2a3 onboard: device_wall_ns reduced correctly (~1.51–1.55 s, sane min_start < max_end)
Builds clean on both a2a3 and a5; pre-commit hooks pass
a5 onboard: build-verified only — no a5 silicon on this host

coderabbitai · 2026-06-16T02:01:16Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb0a68a7-c93c-4d63-b728-c7be74ef6a08

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Run-wall measurement is reworked across a2a3 and a5 platforms: each AICPU exec thread now writes its own {start_cycle, end_cycle} pair into a multi-slot device buffer indexed by a unified tl_exec_idx (from platform_aicpu_affinity), and the host reduces these to a single wall duration via max(end) - min(start). simpler_aicpu_init becomes a no-op. A PTO2_PROFILING init-phase breakdown is added to the a2a3 orchestrator. A .claude/rules doc adds autonomous NPU task invocation logic.

Changes

Per-thread run-wall timing rework

Layer / File(s)	Summary
Unified per-thread slot index `src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp`	Introduces shared `tl_exec_idx` thread-local replacing `tl_filter_exec_idx`; both legacy and filter gates assign survivor slot or `-1`; `platform_aicpu_affinity_thread_idx()` returns `tl_exec_idx`.
Host multi-slot wall buffer allocation and reduction `src/common/platform/onboard/host/device_runner_base.cpp`, `src/a2a3/platform/include/common/kernel_args.h`	`ensure_device_wall_buffer()` allocates a `PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH`-slot `{start,end}` buffer with sentinel reset each run; `read_device_wall_ns()` reduces all slots via `max(end) - min(start)`; `kernel_args.h` comment updated to document new layout.
a2a3 AICPU kernel per-thread recording `src/a2a3/platform/onboard/aicpu/kernel.cpp`, `src/a2a3/platform/onboard/host/device_runner.cpp`	`simpler_aicpu_init` becomes a no-op stub; `simpler_aicpu_exec` calls `init_log_switch()` per thread and writes `start`/`end` sys-counter cycles into the affinity-indexed wall slot; comment in `device_runner.cpp` updated for `core_type` residency.
a5 AICPU kernel per-thread recording `src/a5/platform/onboard/aicpu/kernel.cpp`, `src/a5/platform/onboard/host/device_runner.cpp`	Mirrors a2a3 changes: `simpler_aicpu_init` reduced to `(void)arg` no-op; `simpler_aicpu_exec` gains `init_log_switch()` and guarded per-slot `start`/`end` writes; `device_runner.cpp` comment updated.
PTO2_PROFILING orchestrator init breakdown `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`	Under `PTO2_PROFILING` guard, adds timestamp captures at SO load, config validation, arena/SM init, scheduler wait, and pre-loop phases; emits an INIT BREAKDOWN log line with per-phase microsecond durations.

Autonomous NPU invocation rule

Layer / File(s)	Summary
Autonomous invocation rule `.claude/rules/running-onboard.md`	Adds a section instructing detection of onboard silicon via `onboard-arch-precheck`, conditional `task-submit` invocation without user prompting, and fallback when precheck or `task-submit` is unavailable.

Sequence Diagram(s)

sequenceDiagram
    participant Host as DeviceRunnerBase (host)
    participant Affinity as platform_aicpu_affinity
    participant Exec as simpler_aicpu_exec (per thread)
    participant Buf as device_wall_data_base

    Host->>Buf: ensure_device_wall_buffer() — alloc + sentinel reset all slots
    Host->>Exec: launch N AICPU exec threads
    Exec->>Affinity: platform_aicpu_affinity_thread_idx() → tl_exec_idx
    Exec->>Buf: slot[tl_exec_idx].start = get_sys_cnt_aicpu()
    Note over Exec: kernel work executes
    Exec->>Buf: slot[tl_exec_idx].end = get_sys_cnt_aicpu()
    Host->>Buf: read_device_wall_ns() — D2H copy full buffer
    Host->>Host: reduce: max(end) - min(start) → device_wall_ns_

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

hw-native-sys/simpler#896: Both PRs modify .claude/rules/running-onboard.md to govern autonomous task-submit invocation for onboard NPU tasks, with overlapping logic for onboard-arch-precheck detection and fallback behavior.

Poem

🐇 Hop hop, no more global clock race,
Each thread now timestamps its own space!
Start and end, a pair so neat,
Host reduces — max minus min's the feat.
From sentinel slots to nanoseconds bright,
The bunny measures wall time right! ⏱️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 38.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: eliminating the per-run AICPU init launch and refactoring run-wall capture to a per-thread mechanism.
Description check	✅ Passed	The description provides comprehensive context for all major changes including init launch removal, per-thread wall capture, init phase timing, documentation fixes, and agent rules, all directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request removes the single-threaded simpler_aicpu_init kernel launch and transitions run-wall time measurement to a per-thread slot buffer, allowing the host to reduce the overall duration as max(end) - min(start). It also introduces detailed pre-orchestrator cycle initialization profiling and updates autonomous execution rules. The review feedback points out a potential issue where the static variable s_reported in platform_aicpu_affinity.cpp persists across AICPU launches, and suggests implementing an explicit reset function to prevent state leakage between runs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

hw-native-sys-bot · 2026-06-16T02:29:05Z

CI triage: `st-onboard-a2a3` / `ut-a2a3` reds are known issue #1037, not this PR

Both a2a3-onboard reds fail in the cross-rank comm IPC path, not anything this PR touches:

test_two_rank_allocate_release_round_trip / sdma_async_completion_demo
RuntimeError: comm_alloc_domain_windows failed with code -1
domain_alloc_via_ipc: [comm_hccl.cpp:810] alloc_domain: ImportByKey(...) -> 507899

This is #1037 — aclrtIpcMemImportByKey returning 507899 because the runner's driver has support_shmem_map_exbus=0. Evidence it's the shared runner (npu-runner-2), not code:

Docs-only PR docs(hardware): MMIO performance reference + 2 cann-example probe tools #1057 (no runtime code) fails with the byte-identical signature on the same runner (ut-a2a3 run).
main's latest CI run passes both jobs.
This PR's diff is entirely in the AICPU compute-launch path (init launch removal, run-wall buffer, affinity-gate index) — it does not touch comm_hccl / HCCL IPC / alloc_domain. Those comm-domain tests don't go through DeviceRunner::run where the init launch lived.

Re-running won't clear it (it's persistent driver/device state, reproduced across two consecutive runs here). Needs the #1037 driver fix or a runner with support_shmem_map_exbus=1. The one real review comment (Gemini, s_reported early-path leak) is fixed in 052dbde1 and resolved. st-sim-a2a3/a5, ut, pre-commit, packaging are all green.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/common/platform/onboard/host/device_runner_base.cpp (1)

916-922: ⚡ Quick win

Update stale comment to reflect the multi-slot wall buffer.

The comment still describes an “8-byte device buffer,” but this path now reads an array of per-thread {start_cycle,end_cycle} pairs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/platform/onboard/host/device_runner_base.cpp` around lines 916 -
922, Update the stale comment starting with "Pull the platform-level device
wall" to accurately describe the current implementation. The comment incorrectly
refers to an "8-byte device buffer," but the code now reads an array of
per-thread {start_cycle,end_cycle} pairs instead. Revise the comment to reflect
this multi-slot wall buffer structure while preserving the context about
KernelArgs::device_wall_data_base, CANN's rtAicpuKernelLaunchExWithArgs, and the
soft warn failure path behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 815-822: The copy_to_device call that resets the device wall is
currently cast to void, ignoring any potential failure. This can cause stale
slot data to persist into the next run's reduction, corrupting device_wall_ns_.
Instead of discarding the return value in the copy_to_device call that
initializes the device wall slots (the call that copies the init array to
device_wall_dev_ptr_), capture the return value and explicitly check for
failure. If the copy fails, log an error message indicating the device wall
reset failure and either handle the error appropriately or prevent the operation
from proceeding.

---

Nitpick comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 916-922: Update the stale comment starting with "Pull the
platform-level device wall" to accurately describe the current implementation.
The comment incorrectly refers to an "8-byte device buffer," but the code now
reads an array of per-thread {start_cycle,end_cycle} pairs instead. Revise the
comment to reflect this multi-slot wall buffer structure while preserving the
context about KernelArgs::device_wall_data_base, CANN's
rtAicpuKernelLaunchExWithArgs, and the soft warn failure path behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a21a9156-6707-47bc-8031-164e0b6fe3b0

📥 Commits

Reviewing files that changed from the base of the PR and between 7b4989a and 5ec0624.

📒 Files selected for processing (9)

.claude/rules/running-onboard.md
src/a2a3/platform/include/common/kernel_args.h
src/a2a3/platform/onboard/aicpu/kernel.cpp
src/a2a3/platform/onboard/host/device_runner.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
src/a5/platform/onboard/aicpu/kernel.cpp
src/a5/platform/onboard/host/device_runner.cpp
src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp
src/common/platform/onboard/host/device_runner_base.cpp

The onboard runtime launched a single-threaded simpler_aicpu_init kernel (block_dim=1) on every run whose only jobs were: snapshot the CANN log severity, capture the run-wall start cycle, and zero the wall buffer. That extra rtsLaunchCpuKernel round-trip is removed and its work rehomed: - init_log_switch() now runs at the top of simpler_aicpu_exec (idempotent across the concurrent exec threads). - Run-wall start/end are captured per surviving exec thread. The wall buffer now holds one {start_cycle, end_cycle} pair per launched AICPU thread; each thread plain-stores into its own slot (indexed by the affinity gate's exec index, now exposed for the a2a3 legacy gate too) with no cross-thread atomics, and the host reduces max(end) - min(start) -> ns on readback. The host resets the buffer each run. - simpler_aicpu_init is kept as a stub so host bootstrap's rtsFuncGetByName(InitName) still resolves. Also: - Add an INIT BREAKDOWN device-log line (PTO2_PROFILING, V9) reporting the orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/ final_prep) for diagnosing AICPU startup cost. - Fix a stale comment claiming the init kernel populated workers[].core_type; it is written by the AICore handshake, so the host read reflects the prior run's values. - Add an agent rule (running-onboard.md): auto-detect arch capability + task-submit and invoke device work without prompting. Mirrored across a2a3 and a5 onboard. Verified on a2a3 (4-card replay, exit 0; device_wall_ns ~1.5s). a5 build-verified only (no a5 silicon on this host).

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp

hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 0223da7 to 052dbde Compare June 16, 2026 02:19

hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 052dbde to 5ec0624 Compare June 16, 2026 12:42

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/common/platform/onboard/host/device_runner_base.cpp

hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 5ec0624 to 85eff0e Compare June 16, 2026 13:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061

Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicpu-init-launch-removal

hw-native-sys-bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

hw-native-sys-bot commented Jun 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hw-native-sys-bot commented Jun 16, 2026

Summary

Testing

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

hw-native-sys-bot commented Jun 16, 2026

CI triage: st-onboard-a2a3 / ut-a2a3 reds are known issue #1037, not this PR

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

CI triage: `st-onboard-a2a3` / `ut-a2a3` reds are known issue #1037, not this PR