Skip to content

Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061

Open
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicpu-init-launch-removal
Open

Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061
hw-native-sys-bot wants to merge 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/aicpu-init-launch-removal

Conversation

@hw-native-sys-bot

Copy link
Copy Markdown
Collaborator

Summary

The onboard runtime launched a single-threaded simpler_aicpu_init kernel (block_dim=1) on every run whose only jobs were: snapshot CANN log severity, capture the run-wall start cycle, and zero the wall buffer. This removes that extra rtsLaunchCpuKernel round-trip and rehomes its work:

  • init_log_switch() now runs at the top of simpler_aicpu_exec (idempotent across the concurrent exec threads).
  • Run-wall reworked to per-thread buffers, no atomics. The wall buffer now holds one {start_cycle, end_cycle} pair per launched AICPU thread; each surviving thread plain-stores into its own slot (indexed by the affinity gate's exec index — now exposed for the a2a3 legacy gate too), and the host reduces max(end) - min(start) → ns on readback. The host resets the buffer each run. No cross-thread contention; visibility is the same post-stream-sync path the old single-store wall used.
  • simpler_aicpu_init is kept as a stub so host bootstrap's rtsFuncGetByName(InitName) still resolves.

Also in this PR:

  • Init phase timing: an INIT BREAKDOWN device-log line (PTO2_PROFILING, V9) reporting the orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/final_prep) — useful for diagnosing AICPU startup cost.
  • Doc fix: a stale comment claimed the init kernel populated workers[].core_type; it is actually written by the AICore handshake, so the host read reflects the prior run's values. Comment corrected.
  • Agent rule (.claude/rules/running-onboard.md): auto-detect arch capability + task-submit and invoke device work without prompting.

Mirrored across a2a3 and a5 onboard.

Testing

  • a2a3 onboard: 4-card distributed replay, exit 0; init launch gone, all chip processes ready, replay finished
  • a2a3 onboard: device_wall_ns reduced correctly (~1.51–1.55 s, sane min_start < max_end)
  • Builds clean on both a2a3 and a5; pre-commit hooks pass
  • a5 onboard: build-verified only — no a5 silicon on this host

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb0a68a7-c93c-4d63-b728-c7be74ef6a08

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Run-wall measurement is reworked across a2a3 and a5 platforms: each AICPU exec thread now writes its own {start_cycle, end_cycle} pair into a multi-slot device buffer indexed by a unified tl_exec_idx (from platform_aicpu_affinity), and the host reduces these to a single wall duration via max(end) - min(start). simpler_aicpu_init becomes a no-op. A PTO2_PROFILING init-phase breakdown is added to the a2a3 orchestrator. A .claude/rules doc adds autonomous NPU task invocation logic.

Changes

Per-thread run-wall timing rework

Layer / File(s) Summary
Unified per-thread slot index
src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp
Introduces shared tl_exec_idx thread-local replacing tl_filter_exec_idx; both legacy and filter gates assign survivor slot or -1; platform_aicpu_affinity_thread_idx() returns tl_exec_idx.
Host multi-slot wall buffer allocation and reduction
src/common/platform/onboard/host/device_runner_base.cpp, src/a2a3/platform/include/common/kernel_args.h
ensure_device_wall_buffer() allocates a PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH-slot {start,end} buffer with sentinel reset each run; read_device_wall_ns() reduces all slots via max(end) - min(start); kernel_args.h comment updated to document new layout.
a2a3 AICPU kernel per-thread recording
src/a2a3/platform/onboard/aicpu/kernel.cpp, src/a2a3/platform/onboard/host/device_runner.cpp
simpler_aicpu_init becomes a no-op stub; simpler_aicpu_exec calls init_log_switch() per thread and writes start/end sys-counter cycles into the affinity-indexed wall slot; comment in device_runner.cpp updated for core_type residency.
a5 AICPU kernel per-thread recording
src/a5/platform/onboard/aicpu/kernel.cpp, src/a5/platform/onboard/host/device_runner.cpp
Mirrors a2a3 changes: simpler_aicpu_init reduced to (void)arg no-op; simpler_aicpu_exec gains init_log_switch() and guarded per-slot start/end writes; device_runner.cpp comment updated.
PTO2_PROFILING orchestrator init breakdown
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Under PTO2_PROFILING guard, adds timestamp captures at SO load, config validation, arena/SM init, scheduler wait, and pre-loop phases; emits an INIT BREAKDOWN log line with per-phase microsecond durations.

Autonomous NPU invocation rule

Layer / File(s) Summary
Autonomous invocation rule
.claude/rules/running-onboard.md
Adds a section instructing detection of onboard silicon via onboard-arch-precheck, conditional task-submit invocation without user prompting, and fallback when precheck or task-submit is unavailable.

Sequence Diagram(s)

sequenceDiagram
    participant Host as DeviceRunnerBase (host)
    participant Affinity as platform_aicpu_affinity
    participant Exec as simpler_aicpu_exec (per thread)
    participant Buf as device_wall_data_base

    Host->>Buf: ensure_device_wall_buffer() — alloc + sentinel reset all slots
    Host->>Exec: launch N AICPU exec threads
    Exec->>Affinity: platform_aicpu_affinity_thread_idx() → tl_exec_idx
    Exec->>Buf: slot[tl_exec_idx].start = get_sys_cnt_aicpu()
    Note over Exec: kernel work executes
    Exec->>Buf: slot[tl_exec_idx].end = get_sys_cnt_aicpu()
    Host->>Buf: read_device_wall_ns() — D2H copy full buffer
    Host->>Host: reduce: max(end) - min(start) → device_wall_ns_
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/simpler#896: Both PRs modify .claude/rules/running-onboard.md to govern autonomous task-submit invocation for onboard NPU tasks, with overlapping logic for onboard-arch-precheck detection and fallback behavior.

Poem

🐇 Hop hop, no more global clock race,
Each thread now timestamps its own space!
Start and end, a pair so neat,
Host reduces — max minus min's the feat.
From sentinel slots to nanoseconds bright,
The bunny measures wall time right! ⏱️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: eliminating the per-run AICPU init launch and refactoring run-wall capture to a per-thread mechanism.
Description check ✅ Passed The description provides comprehensive context for all major changes including init launch removal, per-thread wall capture, init phase timing, documentation fixes, and agent rules, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the single-threaded simpler_aicpu_init kernel launch and transitions run-wall time measurement to a per-thread slot buffer, allowing the host to reduce the overall duration as max(end) - min(start). It also introduces detailed pre-orchestrator cycle initialization profiling and updates autonomous execution rules. The review feedback points out a potential issue where the static variable s_reported in platform_aicpu_affinity.cpp persists across AICPU launches, and suggests implementing an explicit reset function to prevent state leakage between runs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 0223da7 to 052dbde Compare June 16, 2026 02:19
@hw-native-sys-bot

Copy link
Copy Markdown
Collaborator Author

CI triage: st-onboard-a2a3 / ut-a2a3 reds are known issue #1037, not this PR

Both a2a3-onboard reds fail in the cross-rank comm IPC path, not anything this PR touches:

test_two_rank_allocate_release_round_trip / sdma_async_completion_demo
RuntimeError: comm_alloc_domain_windows failed with code -1
domain_alloc_via_ipc: [comm_hccl.cpp:810] alloc_domain: ImportByKey(...) -> 507899

This is #1037aclrtIpcMemImportByKey returning 507899 because the runner's driver has support_shmem_map_exbus=0. Evidence it's the shared runner (npu-runner-2), not code:

  • Docs-only PR docs(hardware): MMIO performance reference + 2 cann-example probe tools #1057 (no runtime code) fails with the byte-identical signature on the same runner (ut-a2a3 run).
  • main's latest CI run passes both jobs.
  • This PR's diff is entirely in the AICPU compute-launch path (init launch removal, run-wall buffer, affinity-gate index) — it does not touch comm_hccl / HCCL IPC / alloc_domain. Those comm-domain tests don't go through DeviceRunner::run where the init launch lived.

Re-running won't clear it (it's persistent driver/device state, reproduced across two consecutive runs here). Needs the #1037 driver fix or a runner with support_shmem_map_exbus=1. The one real review comment (Gemini, s_reported early-path leak) is fixed in 052dbde1 and resolved. st-sim-a2a3/a5, ut, pre-commit, packaging are all green.

@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 052dbde to 5ec0624 Compare June 16, 2026 12:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/common/platform/onboard/host/device_runner_base.cpp (1)

916-922: ⚡ Quick win

Update stale comment to reflect the multi-slot wall buffer.

The comment still describes an “8-byte device buffer,” but this path now reads an array of per-thread {start_cycle,end_cycle} pairs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/common/platform/onboard/host/device_runner_base.cpp` around lines 916 -
922, Update the stale comment starting with "Pull the platform-level device
wall" to accurately describe the current implementation. The comment incorrectly
refers to an "8-byte device buffer," but the code now reads an array of
per-thread {start_cycle,end_cycle} pairs instead. Revise the comment to reflect
this multi-slot wall buffer structure while preserving the context about
KernelArgs::device_wall_data_base, CANN's rtAicpuKernelLaunchExWithArgs, and the
soft warn failure path behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 815-822: The copy_to_device call that resets the device wall is
currently cast to void, ignoring any potential failure. This can cause stale
slot data to persist into the next run's reduction, corrupting device_wall_ns_.
Instead of discarding the return value in the copy_to_device call that
initializes the device wall slots (the call that copies the init array to
device_wall_dev_ptr_), capture the return value and explicitly check for
failure. If the copy fails, log an error message indicating the device wall
reset failure and either handle the error appropriately or prevent the operation
from proceeding.

---

Nitpick comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 916-922: Update the stale comment starting with "Pull the
platform-level device wall" to accurately describe the current implementation.
The comment incorrectly refers to an "8-byte device buffer," but the code now
reads an array of per-thread {start_cycle,end_cycle} pairs instead. Revise the
comment to reflect this multi-slot wall buffer structure while preserving the
context about KernelArgs::device_wall_data_base, CANN's
rtAicpuKernelLaunchExWithArgs, and the soft warn failure path behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a21a9156-6707-47bc-8031-164e0b6fe3b0

📥 Commits

Reviewing files that changed from the base of the PR and between 7b4989a and 5ec0624.

📒 Files selected for processing (9)
  • .claude/rules/running-onboard.md
  • src/a2a3/platform/include/common/kernel_args.h
  • src/a2a3/platform/onboard/aicpu/kernel.cpp
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/platform/onboard/aicpu/kernel.cpp
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/common/platform/onboard/aicpu/platform_aicpu_affinity.cpp
  • src/common/platform/onboard/host/device_runner_base.cpp

Comment thread src/common/platform/onboard/host/device_runner_base.cpp
@hw-native-sys-bot hw-native-sys-bot force-pushed the refactor/aicpu-init-launch-removal branch from 5ec0624 to 85eff0e Compare June 16, 2026 13:14
The onboard runtime launched a single-threaded simpler_aicpu_init kernel
(block_dim=1) on every run whose only jobs were: snapshot the CANN log
severity, capture the run-wall start cycle, and zero the wall buffer. That
extra rtsLaunchCpuKernel round-trip is removed and its work rehomed:

- init_log_switch() now runs at the top of simpler_aicpu_exec (idempotent
  across the concurrent exec threads).
- Run-wall start/end are captured per surviving exec thread. The wall buffer
  now holds one {start_cycle, end_cycle} pair per launched AICPU thread; each
  thread plain-stores into its own slot (indexed by the affinity gate's exec
  index, now exposed for the a2a3 legacy gate too) with no cross-thread
  atomics, and the host reduces max(end) - min(start) -> ns on readback. The
  host resets the buffer each run.
- simpler_aicpu_init is kept as a stub so host bootstrap's
  rtsFuncGetByName(InitName) still resolves.

Also:
- Add an INIT BREAKDOWN device-log line (PTO2_PROFILING, V9) reporting the
  orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/
  final_prep) for diagnosing AICPU startup cost.
- Fix a stale comment claiming the init kernel populated workers[].core_type;
  it is written by the AICore handshake, so the host read reflects the prior
  run's values.
- Add an agent rule (running-onboard.md): auto-detect arch capability +
  task-submit and invoke device work without prompting.

Mirrored across a2a3 and a5 onboard. Verified on a2a3 (4-card replay, exit 0;
device_wall_ns ~1.5s). a5 build-verified only (no a5 silicon on this host).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants