Skip to content

Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense#1060

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:fix/a5-aicore-launch-order-1019
Jun 17, 2026
Merged

Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense#1060
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:fix/a5-aicore-launch-order-1019

Conversation

@ChaoWao

@ChaoWao ChaoWao commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

#1019 (TestPagedAttentionUnrollManualScope intermittent 207001) is
already fixed upstream by #1035 — which raised the AICore op-execute
timeout 1 s → 3 s (a side effect of its tensor-dump-on-hang feature). No code
change is needed to close #1019 itself.
This PR keeps a separately-justified
launch reorder and lands the investigation write-up.

Earlier revisions of this PR described it as the #1019 root-cause fix
(various "module-load / task-submit contention" mechanisms). Those were
wrong — see the investigation doc. The bisect below is what actually
explains #1019.

The reorder (a5 + a2a3) — kept on its own merits, not as the #1019 fix

Launch the AICore worker before the AICPU Run task (same order on both
arches — the launch order is not split per-arch).

How #1019 was actually explained (bisect)

OLD order, 1 s timeout (pre-#1035) OLD order, 3 s timeout (post-#1035)
repro on a clean device 1/15 (op-timeout cascade: 207001 / 507046) 0/~40

Only the timeout differs. The ~1.4 s slow launch exceeds the old 1 s op-timeout
→ STARS reaps the op → cascade; it fits inside the new 3 s → no failure. The
branch "stopped failing" simply because rebasing onto upstream pulled in #1035.
A spin-throttle experiment ruled out the handshake poll frequency as the
cause; the exact internal cause of the ~1.4 s stays unpinned but no longer
matters. Full trail in
docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md.

Changes

  • src/{a5,a2a3}/platform/onboard/host/device_runner.cpp: reorder.
  • docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md (+ index).
  • docs/hardware/chip-architecture.md: GM is exclusive per device_id.

Testing

  • a5 hardware: reorder builds + runs clean (5/5 sanity on the rebased base;
    earlier ~39/39).
  • a2a3 hardware: mirror builds clean but is unverified on a2a3 silicon
    (dev box is a5-only) — relying on CI.

Refs #1019 (root cause is #1035; this PR is latency + defense-in-depth, not the
fix).

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b8517b25-03b5-48da-b04d-6d2d799e2ed2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Fixes a race condition in DeviceRunner::run for both a2a3 and a5 architectures by reordering kernel dispatch: AICore worker launch now precedes the AICPU Run kernel launch in both codepaths. The a5 path also adds explicit error recovery on AICore launch failure. The chip-architecture.md documentation is updated to define GM exclusivity per device_id and clarify per-architecture behavior.

Changes

AICore launch order fix and GM model documentation

Layer / File(s) Summary
AICore-before-AICPU launch order in a2a3 and a5
src/a2a3/platform/onboard/host/device_runner.cpp, src/a5/platform/onboard/host/device_runner.cpp
In both DeviceRunner::run implementations, the AICore worker kernel is now submitted before the AICPU Run kernel. In a2a3, a comment block documents the ordering invariant and the AICPU launch_aicpu_kernel call is moved after launch_aicore_kernel. In a5, init_device_kernel_args and launch_aicore_kernel are called earlier with recover_device_or_mark_unusable on failure, and the old pre-sync AICore launch block is removed.
GM sharing model docs
docs/hardware/chip-architecture.md
Adds a subsection defining GM exclusivity per device_id: a5 uses a single device_id owning all dies/clusters; a2a3 uses two independent device_ids per card. References PR #990 for removal of the prior chip-shared contention model.

Sequence Diagram(s)

sequenceDiagram
  participant run as DeviceRunner::run
  participant aicore as AICore stream
  participant aicpu as AICPU stream
  participant sync as sync_run_streams()

  rect rgba(255, 80, 80, 0.5)
    Note over run,sync: Old order (both a2a3 & a5)
    run->>aicpu: launch_aicpu_kernel (RunName)
    run->>aicore: launch_aicore_kernel
    run->>sync: sync_run_streams()
  end

  rect rgba(80, 180, 80, 0.5)
    Note over run,sync: New order (both a2a3 & a5)
    run->>run: init_device_kernel_args(mem_alloc_)
    run->>aicore: launch_aicore_kernel(stream_aicore_, device_k_args_)
    alt AICore launch fails
      run->>run: recover_device_or_mark_unusable(rc)
    end
    run->>aicpu: launch_aicpu_kernel (RunName)
    run->>sync: sync_run_streams()
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hw-native-sys/simpler#990: Directly referenced in chip-architecture.md as the PR that removed the older chip-shared contention model, which this PR's documentation explicitly builds on.
  • hw-native-sys/simpler#913: Modifies the same DeviceRunner::run AICPU/AICore kernel dispatch path, relocating launch_aicpu_kernel and adjusting the surrounding launch flow in both architectures.
  • hw-native-sys/simpler#1035: Changes DeviceRunner::run() in both src/a5 and src/a2a3 device runner files around hang/op-timeout handling, directly overlapping with the kernel dispatch path modified here.

Poem

🐇 The AICore must go first, that's the law of the land,
Before AICPU Run can take its stand.
No more wedge at submit, no 207001 surprise,
The handshake completes under clear HBM skies.
GM stays exclusive, per device_id alone —
Every die knows its place, every HBM its own! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed All coding requirements from issue #1019 are met: kernel launches are reordered (a5 and a2a3), AICore is launched before AICPU Run, and documentation on GM memory exclusivity per device_id is added.
Out of Scope Changes check ✅ Passed All changes are in scope: kernel launch reordering directly addresses the race condition in #1019, and the documentation update clarifies GM memory model relevant to device_id behavior.
Title check ✅ Passed The title accurately describes the main change: reordering kernel launches (AICore before AICPU Run) on both a5 and a2a3 architectures, with clear mention of the dual objectives (first-launch latency and op-timeout defense).
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the kernel launch reordering, the rationale (first-launch latency and defense-in-depth), root-cause analysis via bisect, and testing status.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the chip architecture documentation to clarify GM sharing and modifies the kernel launch sequence in both a2a3 and a5 device runners. Specifically, it launches the AICore worker before the AICPU Run task to avoid task-submission path contention and handshake timeouts. Feedback suggests adding device recovery logic in the a2a3 runner if the AICPU kernel launch fails after the AICore kernel has already been launched, preventing the device context from being poisoned.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a5/platform/onboard/host/device_runner.cpp (1)

382-395: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recover when AICPU Run launch fails after AICore is already submitted.

The shared root cause is the new AICore-first order: once launch_aicore_kernel succeeds, the AICore worker can be pending on aicpu_ready. If the subsequent AICPU Run launch fails, run() skips sync_run_streams() and returns without draining or marking the device unusable.

  • src/a5/platform/onboard/host/device_runner.cpp#L382-L395: call recover_device_or_mark_unusable(rc) before returning from the main Run launch error path.
  • src/a2a3/platform/onboard/host/device_runner.cpp#L396-L403: add the same recovery/poison-marking before returning.
Suggested recovery hook
diff --git a/src/a5/platform/onboard/host/device_runner.cpp b/src/a5/platform/onboard/host/device_runner.cpp
@@
     rc = launch_aicpu_kernel(stream_aicpu_, &kernel_args_.args, host::KernelNames::RunName, aicpu_launch_n);
     if (rc != 0) {
         LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc);
+        recover_device_or_mark_unusable(rc);
         return rc;
     }
diff --git a/src/a2a3/platform/onboard/host/device_runner.cpp b/src/a2a3/platform/onboard/host/device_runner.cpp
@@
     if (rc != 0) {
         LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc);
+        recover_device_or_mark_unusable(rc);
         return rc;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/onboard/host/device_runner.cpp` around lines 382 - 395, When
launch_aicpu_kernel fails after AICore is already submitted, the device must be
recovered or marked unusable to prevent deadlock. In
src/a5/platform/onboard/host/device_runner.cpp at the error handling path for
launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc)
before the return statement. Apply the same fix at the corresponding error
handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines
396-403) to ensure consistent recovery behavior across both device runner
implementations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/a5/platform/onboard/host/device_runner.cpp`:
- Around line 382-395: When launch_aicpu_kernel fails after AICore is already
submitted, the device must be recovered or marked unusable to prevent deadlock.
In src/a5/platform/onboard/host/device_runner.cpp at the error handling path for
launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc)
before the return statement. Apply the same fix at the corresponding error
handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines
396-403) to ensure consistent recovery behavior across both device runner
implementations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9c6710b2-3373-42a5-a9b4-e5ddad95c09d

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and 3a12281.

📒 Files selected for processing (3)
  • docs/hardware/chip-architecture.md
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a5/platform/onboard/host/device_runner.cpp

@ChaoWao ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from 3a12281 to 738a448 Compare June 16, 2026 01:41
@ChaoWao ChaoWao changed the title Fix: launch AICore before AICPU Run to avoid submit contention (#1019) Fix: launch AICore before AICPU Run to avoid module-load on a jammed device (#1019) Jun 16, 2026
@ChaoWao ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from 738a448 to aee74f8 Compare June 17, 2026 07:33
@ChaoWao ChaoWao changed the title Fix: launch AICore before AICPU Run to avoid module-load on a jammed device (#1019) Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense Jun 17, 2026
…cy + op-timeout defense

hw-native-sys#1019 (PagedAttentionUnrollManualScope intermittent 207001) is already fixed
upstream by hw-native-sys#1035 (op-execute timeout 1s -> 3s, a side effect of a tensor-dump
feature); no code change is needed to close hw-native-sys#1019 itself. This commit keeps a
separately-justified launch reorder and records the full investigation.

Reorder (a5 + a2a3 platform/onboard/host/device_runner.cpp): launch the AICore
worker before the AICPU Run task (kept symmetric across both arches).
- Latency: with the AICPU Run task launched first it occupies the device
  (spinning in handshake_all_cores), and the first AICore launch — which lazily
  loads the kernel binary onto the device inside rtKernelLaunchWithHandleV2 —
  then takes ~1.4s vs ~0.4ms (measured a5, robust). Launching AICore first does
  that load on an idle device. The handshake is launch-order-independent.
- Defense-in-depth: that slow launch is what trips the op-execute timeout
  (hw-native-sys#1019 family). hw-native-sys#1035 widened the window; this removes the slow launch.
- Orphan recovery: the reorder means the AICore is already launched (spinning
  in the handshake) when the AICPU Run task launches. If that Run launch fails,
  the AICore is orphaned and spins to the op-timeout, poisoning the device — so
  the Run-launch failure path now calls recover_device_or_mark_unusable (matches
  the AICore-launch failure path). Addresses review feedback.

a2a3 mirrors a5 to keep the launch order from splitting per-arch; both have the
same Init -> core_type-publish -> Run -> AICore structure. The a2a3 reorder is
UNVERIFIED on a2a3 silicon (a5-only dev box) — relying on CI.

Also fixes a pre-existing duplicate CMake target that broke the `ut` build on
current main: test_scope_stats_collector was defined for both the common test
(add_executable) and the a2a3 runtime test (add_a2a3_runtime_test). Rename the
a2a3 target to test_a2a3_scope_stats_collector (matches the test_a2a3_* naming
already used for the tensormap test).

- docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md (+ index):
  full write-up — not OOM; bisect to hw-native-sys#1035's 1s->3s timeout; the spin-throttle
  experiment that ruled OUT poll-frequency as the cause; the ~1.4s vs ~0.4ms
  first-launch measurement.
- docs/hardware/chip-architecture.md: GM is exclusive per device_id.
@ChaoWao ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from aee74f8 to aecf3e4 Compare June 17, 2026 07:52
@ChaoWao ChaoWao merged commit 3734589 into hw-native-sys:main Jun 17, 2026
30 of 31 checks passed
@ChaoWao ChaoWao deleted the fix/a5-aicore-launch-order-1019 branch June 17, 2026 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] PagedAttentionUnrollManualScope intermittent 207001 is NOT OOM — AICore launch wedge (op-timeout family, #1016)

1 participant