Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense#1060
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughFixes a race condition in ChangesAICore launch order fix and GM model documentation
Sequence Diagram(s)sequenceDiagram
participant run as DeviceRunner::run
participant aicore as AICore stream
participant aicpu as AICPU stream
participant sync as sync_run_streams()
rect rgba(255, 80, 80, 0.5)
Note over run,sync: Old order (both a2a3 & a5)
run->>aicpu: launch_aicpu_kernel (RunName)
run->>aicore: launch_aicore_kernel
run->>sync: sync_run_streams()
end
rect rgba(80, 180, 80, 0.5)
Note over run,sync: New order (both a2a3 & a5)
run->>run: init_device_kernel_args(mem_alloc_)
run->>aicore: launch_aicore_kernel(stream_aicore_, device_k_args_)
alt AICore launch fails
run->>run: recover_device_or_mark_unusable(rc)
end
run->>aicpu: launch_aicpu_kernel (RunName)
run->>sync: sync_run_streams()
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the chip architecture documentation to clarify GM sharing and modifies the kernel launch sequence in both a2a3 and a5 device runners. Specifically, it launches the AICore worker before the AICPU Run task to avoid task-submission path contention and handshake timeouts. Feedback suggests adding device recovery logic in the a2a3 runner if the AICPU kernel launch fails after the AICore kernel has already been launched, preventing the device context from being poisoned.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a5/platform/onboard/host/device_runner.cpp (1)
382-395:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRecover when AICPU Run launch fails after AICore is already submitted.
The shared root cause is the new AICore-first order: once
launch_aicore_kernelsucceeds, the AICore worker can be pending onaicpu_ready. If the subsequent AICPU Run launch fails,run()skipssync_run_streams()and returns without draining or marking the device unusable.
src/a5/platform/onboard/host/device_runner.cpp#L382-L395: callrecover_device_or_mark_unusable(rc)before returning from the main Run launch error path.src/a2a3/platform/onboard/host/device_runner.cpp#L396-L403: add the same recovery/poison-marking before returning.Suggested recovery hook
diff --git a/src/a5/platform/onboard/host/device_runner.cpp b/src/a5/platform/onboard/host/device_runner.cpp @@ rc = launch_aicpu_kernel(stream_aicpu_, &kernel_args_.args, host::KernelNames::RunName, aicpu_launch_n); if (rc != 0) { LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc); + recover_device_or_mark_unusable(rc); return rc; } diff --git a/src/a2a3/platform/onboard/host/device_runner.cpp b/src/a2a3/platform/onboard/host/device_runner.cpp @@ if (rc != 0) { LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc); + recover_device_or_mark_unusable(rc); return rc; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/onboard/host/device_runner.cpp` around lines 382 - 395, When launch_aicpu_kernel fails after AICore is already submitted, the device must be recovered or marked unusable to prevent deadlock. In src/a5/platform/onboard/host/device_runner.cpp at the error handling path for launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc) before the return statement. Apply the same fix at the corresponding error handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines 396-403) to ensure consistent recovery behavior across both device runner implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/a5/platform/onboard/host/device_runner.cpp`:
- Around line 382-395: When launch_aicpu_kernel fails after AICore is already
submitted, the device must be recovered or marked unusable to prevent deadlock.
In src/a5/platform/onboard/host/device_runner.cpp at the error handling path for
launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc)
before the return statement. Apply the same fix at the corresponding error
handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines
396-403) to ensure consistent recovery behavior across both device runner
implementations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9c6710b2-3373-42a5-a9b4-e5ddad95c09d
📒 Files selected for processing (3)
docs/hardware/chip-architecture.mdsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a5/platform/onboard/host/device_runner.cpp
3a12281 to
738a448
Compare
738a448 to
aee74f8
Compare
…cy + op-timeout defense hw-native-sys#1019 (PagedAttentionUnrollManualScope intermittent 207001) is already fixed upstream by hw-native-sys#1035 (op-execute timeout 1s -> 3s, a side effect of a tensor-dump feature); no code change is needed to close hw-native-sys#1019 itself. This commit keeps a separately-justified launch reorder and records the full investigation. Reorder (a5 + a2a3 platform/onboard/host/device_runner.cpp): launch the AICore worker before the AICPU Run task (kept symmetric across both arches). - Latency: with the AICPU Run task launched first it occupies the device (spinning in handshake_all_cores), and the first AICore launch — which lazily loads the kernel binary onto the device inside rtKernelLaunchWithHandleV2 — then takes ~1.4s vs ~0.4ms (measured a5, robust). Launching AICore first does that load on an idle device. The handshake is launch-order-independent. - Defense-in-depth: that slow launch is what trips the op-execute timeout (hw-native-sys#1019 family). hw-native-sys#1035 widened the window; this removes the slow launch. - Orphan recovery: the reorder means the AICore is already launched (spinning in the handshake) when the AICPU Run task launches. If that Run launch fails, the AICore is orphaned and spins to the op-timeout, poisoning the device — so the Run-launch failure path now calls recover_device_or_mark_unusable (matches the AICore-launch failure path). Addresses review feedback. a2a3 mirrors a5 to keep the launch order from splitting per-arch; both have the same Init -> core_type-publish -> Run -> AICore structure. The a2a3 reorder is UNVERIFIED on a2a3 silicon (a5-only dev box) — relying on CI. Also fixes a pre-existing duplicate CMake target that broke the `ut` build on current main: test_scope_stats_collector was defined for both the common test (add_executable) and the a2a3 runtime test (add_a2a3_runtime_test). Rename the a2a3 target to test_a2a3_scope_stats_collector (matches the test_a2a3_* naming already used for the tensormap test). - docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md (+ index): full write-up — not OOM; bisect to hw-native-sys#1035's 1s->3s timeout; the spin-throttle experiment that ruled OUT poll-frequency as the cause; the ~1.4s vs ~0.4ms first-launch measurement. - docs/hardware/chip-architecture.md: GM is exclusive per device_id.
aee74f8 to
aecf3e4
Compare
Summary
#1019(TestPagedAttentionUnrollManualScopeintermittent207001) isalready fixed upstream by #1035 — which raised the AICore op-execute
timeout 1 s → 3 s (a side effect of its tensor-dump-on-hang feature). No code
change is needed to close #1019 itself. This PR keeps a separately-justified
launch reorder and lands the investigation write-up.
The reorder (a5 + a2a3) — kept on its own merits, not as the #1019 fix
Launch the AICore worker before the AICPU Run task (same order on both
arches — the launch order is not split per-arch).
the device (spinning in
handshake_all_cores), and the first AICore launch —which lazily loads the kernel binary onto the device inside
rtKernelLaunchWithHandleV2— then takes ~1.4 s vs ~0.4 ms (measured a5,robust, far outside noise). Launching AICore first does that load on an idle
device. Zero functional risk: the handshake is launch-order-independent.
timeout (the [Bug] PagedAttentionUnrollManualScope intermittent 207001 is NOT OOM — AICore launch wedge (op-timeout family, #1016) #1019 family). Add: tensor dump on the abnormal path (hang / op-timeout) (a5 + a2a3) #1035 widened the window; this removes the slow
launch, so the wedge cannot return if the timeout is later tightened or a
device runs slower.
How #1019 was actually explained (bisect)
Only the timeout differs. The ~1.4 s slow launch exceeds the old 1 s op-timeout
→ STARS reaps the op → cascade; it fits inside the new 3 s → no failure. The
branch "stopped failing" simply because rebasing onto upstream pulled in #1035.
A spin-throttle experiment ruled out the handshake poll frequency as the
cause; the exact internal cause of the ~1.4 s stays unpinned but no longer
matters. Full trail in
docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md.Changes
src/{a5,a2a3}/platform/onboard/host/device_runner.cpp: reorder.docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md(+ index).docs/hardware/chip-architecture.md: GM is exclusive perdevice_id.Testing
earlier ~39/39).
(dev box is a5-only) — relying on CI.
Refs #1019 (root cause is #1035; this PR is latency + defense-in-depth, not the
fix).