Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061
Refactor: drop per-run AICPU init launch; per-thread run-wall capture#1061hw-native-sys-bot wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughRun-wall measurement is reworked across a2a3 and a5 platforms: each AICPU exec thread now writes its own ChangesPer-thread run-wall timing rework
Autonomous NPU invocation rule
Sequence Diagram(s)sequenceDiagram
participant Host as DeviceRunnerBase (host)
participant Affinity as platform_aicpu_affinity
participant Exec as simpler_aicpu_exec (per thread)
participant Buf as device_wall_data_base
Host->>Buf: ensure_device_wall_buffer() — alloc + sentinel reset all slots
Host->>Exec: launch N AICPU exec threads
Exec->>Affinity: platform_aicpu_affinity_thread_idx() → tl_exec_idx
Exec->>Buf: slot[tl_exec_idx].start = get_sys_cnt_aicpu()
Note over Exec: kernel work executes
Exec->>Buf: slot[tl_exec_idx].end = get_sys_cnt_aicpu()
Host->>Buf: read_device_wall_ns() — D2H copy full buffer
Host->>Host: reduce: max(end) - min(start) → device_wall_ns_
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request removes the single-threaded simpler_aicpu_init kernel launch and transitions run-wall time measurement to a per-thread slot buffer, allowing the host to reduce the overall duration as max(end) - min(start). It also introduces detailed pre-orchestrator cycle initialization profiling and updates autonomous execution rules. The review feedback points out a potential issue where the static variable s_reported in platform_aicpu_affinity.cpp persists across AICPU launches, and suggests implementing an explicit reset function to prevent state leakage between runs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
0223da7 to
052dbde
Compare
CI triage:
|
052dbde to
5ec0624
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/common/platform/onboard/host/device_runner_base.cpp (1)
916-922: ⚡ Quick winUpdate stale comment to reflect the multi-slot wall buffer.
The comment still describes an “8-byte device buffer,” but this path now reads an array of per-thread
{start_cycle,end_cycle}pairs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/platform/onboard/host/device_runner_base.cpp` around lines 916 - 922, Update the stale comment starting with "Pull the platform-level device wall" to accurately describe the current implementation. The comment incorrectly refers to an "8-byte device buffer," but the code now reads an array of per-thread {start_cycle,end_cycle} pairs instead. Revise the comment to reflect this multi-slot wall buffer structure while preserving the context about KernelArgs::device_wall_data_base, CANN's rtAicpuKernelLaunchExWithArgs, and the soft warn failure path behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 815-822: The copy_to_device call that resets the device wall is
currently cast to void, ignoring any potential failure. This can cause stale
slot data to persist into the next run's reduction, corrupting device_wall_ns_.
Instead of discarding the return value in the copy_to_device call that
initializes the device wall slots (the call that copies the init array to
device_wall_dev_ptr_), capture the return value and explicitly check for
failure. If the copy fails, log an error message indicating the device wall
reset failure and either handle the error appropriately or prevent the operation
from proceeding.
---
Nitpick comments:
In `@src/common/platform/onboard/host/device_runner_base.cpp`:
- Around line 916-922: Update the stale comment starting with "Pull the
platform-level device wall" to accurately describe the current implementation.
The comment incorrectly refers to an "8-byte device buffer," but the code now
reads an array of per-thread {start_cycle,end_cycle} pairs instead. Revise the
comment to reflect this multi-slot wall buffer structure while preserving the
context about KernelArgs::device_wall_data_base, CANN's
rtAicpuKernelLaunchExWithArgs, and the soft warn failure path behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a21a9156-6707-47bc-8031-164e0b6fe3b0
📒 Files selected for processing (9)
.claude/rules/running-onboard.mdsrc/a2a3/platform/include/common/kernel_args.hsrc/a2a3/platform/onboard/aicpu/kernel.cppsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a5/platform/onboard/aicpu/kernel.cppsrc/a5/platform/onboard/host/device_runner.cppsrc/common/platform/onboard/aicpu/platform_aicpu_affinity.cppsrc/common/platform/onboard/host/device_runner_base.cpp
5ec0624 to
85eff0e
Compare
The onboard runtime launched a single-threaded simpler_aicpu_init kernel
(block_dim=1) on every run whose only jobs were: snapshot the CANN log
severity, capture the run-wall start cycle, and zero the wall buffer. That
extra rtsLaunchCpuKernel round-trip is removed and its work rehomed:
- init_log_switch() now runs at the top of simpler_aicpu_exec (idempotent
across the concurrent exec threads).
- Run-wall start/end are captured per surviving exec thread. The wall buffer
now holds one {start_cycle, end_cycle} pair per launched AICPU thread; each
thread plain-stores into its own slot (indexed by the affinity gate's exec
index, now exposed for the a2a3 legacy gate too) with no cross-thread
atomics, and the host reduces max(end) - min(start) -> ns on readback. The
host resets the buffer each run.
- simpler_aicpu_init is kept as a stub so host bootstrap's
rtsFuncGetByName(InitName) still resolves.
Also:
- Add an INIT BREAKDOWN device-log line (PTO2_PROFILING, V9) reporting the
orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/
final_prep) for diagnosing AICPU startup cost.
- Fix a stale comment claiming the init kernel populated workers[].core_type;
it is written by the AICore handshake, so the host read reflects the prior
run's values.
- Add an agent rule (running-onboard.md): auto-detect arch capability +
task-submit and invoke device work without prompting.
Mirrored across a2a3 and a5 onboard. Verified on a2a3 (4-card replay, exit 0;
device_wall_ns ~1.5s). a5 build-verified only (no a5 silicon on this host).
Summary
The onboard runtime launched a single-threaded
simpler_aicpu_initkernel (block_dim=1) on every run whose only jobs were: snapshot CANN log severity, capture the run-wall start cycle, and zero the wall buffer. This removes that extrartsLaunchCpuKernelround-trip and rehomes its work:init_log_switch()now runs at the top ofsimpler_aicpu_exec(idempotent across the concurrent exec threads).{start_cycle, end_cycle}pair per launched AICPU thread; each surviving thread plain-stores into its own slot (indexed by the affinity gate's exec index — now exposed for the a2a3 legacy gate too), and the host reducesmax(end) - min(start)→ ns on readback. The host resets the buffer each run. No cross-thread contention; visibility is the same post-stream-sync path the old single-store wall used.simpler_aicpu_initis kept as a stub so host bootstrap'srtsFuncGetByName(InitName)still resolves.Also in this PR:
INIT BREAKDOWNdevice-log line (PTO2_PROFILING, V9) reporting the orchestrator's pre-cycle init phases (so_load/config/arena/sched_wait/final_prep) — useful for diagnosing AICPU startup cost.workers[].core_type; it is actually written by the AICore handshake, so the host read reflects the prior run's values. Comment corrected..claude/rules/running-onboard.md): auto-detect arch capability +task-submitand invoke device work without prompting.Mirrored across a2a3 and a5 onboard.
Testing
device_wall_nsreduced correctly (~1.51–1.55 s, sanemin_start < max_end)