Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense by ChaoWao · Pull Request #1060 · hw-native-sys/simpler

ChaoWao · 2026-06-16T01:04:55Z

Summary

#1019 (TestPagedAttentionUnrollManualScope intermittent 207001) is
already fixed upstream by #1035 — which raised the AICore op-execute
timeout 1 s → 3 s (a side effect of its tensor-dump-on-hang feature). No code
change is needed to close #1019 itself. This PR keeps a separately-justified
launch reorder and lands the investigation write-up.

Earlier revisions of this PR described it as the #1019 root-cause fix
(various "module-load / task-submit contention" mechanisms). Those were
wrong — see the investigation doc. The bisect below is what actually
explains #1019.

The reorder (a5 + a2a3) — kept on its own merits, not as the #1019 fix

Launch the AICore worker before the AICPU Run task (same order on both
arches — the launch order is not split per-arch).

First-launch latency: with the AICPU Run task launched first it occupies
the device (spinning in handshake_all_cores), and the first AICore launch —
which lazily loads the kernel binary onto the device inside
rtKernelLaunchWithHandleV2 — then takes ~1.4 s vs ~0.4 ms (measured a5,
robust, far outside noise). Launching AICore first does that load on an idle
device. Zero functional risk: the handshake is launch-order-independent.
Defense-in-depth: that ~1.4 s slow launch is what trips the op-execute
timeout (the [Bug] PagedAttentionUnrollManualScope intermittent 207001 is NOT OOM — AICore launch wedge (op-timeout family, #1016) #1019 family). Add: tensor dump on the abnormal path (hang / op-timeout) (a5 + a2a3) #1035 widened the window; this removes the slow
launch, so the wedge cannot return if the timeout is later tightened or a
device runs slower.

How #1019 was actually explained (bisect)

	OLD order, 1 s timeout (pre-#1035)	OLD order, 3 s timeout (post-#1035)
repro on a clean device	1/15 (op-timeout cascade: 207001 / 507046)	0/~40

Only the timeout differs. The ~1.4 s slow launch exceeds the old 1 s op-timeout
→ STARS reaps the op → cascade; it fits inside the new 3 s → no failure. The
branch "stopped failing" simply because rebasing onto upstream pulled in #1035.
A spin-throttle experiment ruled out the handshake poll frequency as the
cause; the exact internal cause of the ~1.4 s stays unpinned but no longer
matters. Full trail in
docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md.

Changes

src/{a5,a2a3}/platform/onboard/host/device_runner.cpp: reorder.
docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md (+ index).
docs/hardware/chip-architecture.md: GM is exclusive per device_id.

Testing

a5 hardware: reorder builds + runs clean (5/5 sanity on the rebased base;
earlier ~39/39).
a2a3 hardware: mirror builds clean but is unverified on a2a3 silicon
(dev box is a5-only) — relying on CI.

Refs #1019 (root cause is #1035; this PR is latency + defense-in-depth, not the
fix).

coderabbitai · 2026-06-16T01:05:10Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b8517b25-03b5-48da-b04d-6d2d799e2ed2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Fixes a race condition in DeviceRunner::run for both a2a3 and a5 architectures by reordering kernel dispatch: AICore worker launch now precedes the AICPU Run kernel launch in both codepaths. The a5 path also adds explicit error recovery on AICore launch failure. The chip-architecture.md documentation is updated to define GM exclusivity per device_id and clarify per-architecture behavior.

Changes

AICore launch order fix and GM model documentation

Layer / File(s)	Summary
AICore-before-AICPU launch order in a2a3 and a5 `src/a2a3/platform/onboard/host/device_runner.cpp`, `src/a5/platform/onboard/host/device_runner.cpp`	In both `DeviceRunner::run` implementations, the AICore worker kernel is now submitted before the AICPU Run kernel. In `a2a3`, a comment block documents the ordering invariant and the AICPU `launch_aicpu_kernel` call is moved after `launch_aicore_kernel`. In `a5`, `init_device_kernel_args` and `launch_aicore_kernel` are called earlier with `recover_device_or_mark_unusable` on failure, and the old pre-sync AICore launch block is removed.
GM sharing model docs `docs/hardware/chip-architecture.md`	Adds a subsection defining GM exclusivity per `device_id`: `a5` uses a single `device_id` owning all dies/clusters; `a2a3` uses two independent `device_id`s per card. References PR `#990` for removal of the prior chip-shared contention model.

Sequence Diagram(s)

sequenceDiagram
  participant run as DeviceRunner::run
  participant aicore as AICore stream
  participant aicpu as AICPU stream
  participant sync as sync_run_streams()

  rect rgba(255, 80, 80, 0.5)
    Note over run,sync: Old order (both a2a3 & a5)
    run->>aicpu: launch_aicpu_kernel (RunName)
    run->>aicore: launch_aicore_kernel
    run->>sync: sync_run_streams()
  end

  rect rgba(80, 180, 80, 0.5)
    Note over run,sync: New order (both a2a3 & a5)
    run->>run: init_device_kernel_args(mem_alloc_)
    run->>aicore: launch_aicore_kernel(stream_aicore_, device_k_args_)
    alt AICore launch fails
      run->>run: recover_device_or_mark_unusable(rc)
    end
    run->>aicpu: launch_aicpu_kernel (RunName)
    run->>sync: sync_run_streams()
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

hw-native-sys/simpler#990: Directly referenced in chip-architecture.md as the PR that removed the older chip-shared contention model, which this PR's documentation explicitly builds on.
hw-native-sys/simpler#913: Modifies the same DeviceRunner::run AICPU/AICore kernel dispatch path, relocating launch_aicpu_kernel and adjusting the surrounding launch flow in both architectures.
hw-native-sys/simpler#1035: Changes DeviceRunner::run() in both src/a5 and src/a2a3 device runner files around hang/op-timeout handling, directly overlapping with the kernel dispatch path modified here.

Poem

🐇 The AICore must go first, that's the law of the land,
Before AICPU Run can take its stand.
No more wedge at submit, no 207001 surprise,
The handshake completes under clear HBM skies.
GM stays exclusive, per device_id alone —
Every die knows its place, every HBM its own! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	All coding requirements from issue `#1019` are met: kernel launches are reordered (a5 and a2a3), AICore is launched before AICPU Run, and documentation on GM memory exclusivity per device_id is added.
Out of Scope Changes check	✅ Passed	All changes are in scope: kernel launch reordering directly addresses the race condition in `#1019`, and the documentation update clarifies GM memory model relevant to device_id behavior.
Title check	✅ Passed	The title accurately describes the main change: reordering kernel launches (AICore before AICPU Run) on both a5 and a2a3 architectures, with clear mention of the dual objectives (first-launch latency and op-timeout defense).
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the kernel launch reordering, the rationale (first-launch latency and defense-in-depth), root-cause analysis via bisect, and testing status.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request updates the chip architecture documentation to clarify GM sharing and modifies the kernel launch sequence in both a2a3 and a5 device runners. Specifically, it launches the AICore worker before the AICPU Run task to avoid task-submission path contention and handshake timeouts. Feedback suggests adding device recovery logic in the a2a3 runner if the AICPU kernel launch fails after the AICore kernel has already been launched, preventing the device context from being poisoned.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/a5/platform/onboard/host/device_runner.cpp (1)
382-395: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Recover when AICPU Run launch fails after AICore is already submitted.

The shared root cause is the new AICore-first order: once launch_aicore_kernel succeeds, the AICore worker can be pending on aicpu_ready. If the subsequent AICPU Run launch fails, run() skips sync_run_streams() and returns without draining or marking the device unusable.

src/a5/platform/onboard/host/device_runner.cpp#L382-L395: call recover_device_or_mark_unusable(rc) before returning from the main Run launch error path.

src/a2a3/platform/onboard/host/device_runner.cpp#L396-L403: add the same recovery/poison-marking before returning.
Suggested recovery hook
diff --git a/src/a5/platform/onboard/host/device_runner.cpp b/src/a5/platform/onboard/host/device_runner.cpp
@@
     rc = launch_aicpu_kernel(stream_aicpu_, &kernel_args_.args, host::KernelNames::RunName, aicpu_launch_n);
     if (rc != 0) {
         LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc);
+        recover_device_or_mark_unusable(rc);
         return rc;
     }
diff --git a/src/a2a3/platform/onboard/host/device_runner.cpp b/src/a2a3/platform/onboard/host/device_runner.cpp
@@
     if (rc != 0) {
         LOG_ERROR("launch_aicpu_kernel (main) failed: %d", rc);
+        recover_device_or_mark_unusable(rc);
         return rc;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/onboard/host/device_runner.cpp` around lines 382 - 395, When
launch_aicpu_kernel fails after AICore is already submitted, the device must be
recovered or marked unusable to prevent deadlock. In
src/a5/platform/onboard/host/device_runner.cpp at the error handling path for
launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc)
before the return statement. Apply the same fix at the corresponding error
handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines
396-403) to ensure consistent recovery behavior across both device runner
implementations.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/a5/platform/onboard/host/device_runner.cpp`:
- Around line 382-395: When launch_aicpu_kernel fails after AICore is already
submitted, the device must be recovered or marked unusable to prevent deadlock.
In src/a5/platform/onboard/host/device_runner.cpp at the error handling path for
launch_aicpu_kernel (lines 382-395), call recover_device_or_mark_unusable(rc)
before the return statement. Apply the same fix at the corresponding error
handling location in src/a2a3/platform/onboard/host/device_runner.cpp (lines
396-403) to ensure consistent recovery behavior across both device runner
implementations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9c6710b2-3373-42a5-a9b4-e5ddad95c09d

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and 3a12281.

📒 Files selected for processing (3)

docs/hardware/chip-architecture.md
src/a2a3/platform/onboard/host/device_runner.cpp
src/a5/platform/onboard/host/device_runner.cpp

…cy + op-timeout defense hw-native-sys#1019 (PagedAttentionUnrollManualScope intermittent 207001) is already fixed upstream by hw-native-sys#1035 (op-execute timeout 1s -> 3s, a side effect of a tensor-dump feature); no code change is needed to close hw-native-sys#1019 itself. This commit keeps a separately-justified launch reorder and records the full investigation. Reorder (a5 + a2a3 platform/onboard/host/device_runner.cpp): launch the AICore worker before the AICPU Run task (kept symmetric across both arches). - Latency: with the AICPU Run task launched first it occupies the device (spinning in handshake_all_cores), and the first AICore launch — which lazily loads the kernel binary onto the device inside rtKernelLaunchWithHandleV2 — then takes ~1.4s vs ~0.4ms (measured a5, robust). Launching AICore first does that load on an idle device. The handshake is launch-order-independent. - Defense-in-depth: that slow launch is what trips the op-execute timeout (hw-native-sys#1019 family). hw-native-sys#1035 widened the window; this removes the slow launch. - Orphan recovery: the reorder means the AICore is already launched (spinning in the handshake) when the AICPU Run task launches. If that Run launch fails, the AICore is orphaned and spins to the op-timeout, poisoning the device — so the Run-launch failure path now calls recover_device_or_mark_unusable (matches the AICore-launch failure path). Addresses review feedback. a2a3 mirrors a5 to keep the launch order from splitting per-arch; both have the same Init -> core_type-publish -> Run -> AICore structure. The a2a3 reorder is UNVERIFIED on a2a3 silicon (a5-only dev box) — relying on CI. Also fixes a pre-existing duplicate CMake target that broke the `ut` build on current main: test_scope_stats_collector was defined for both the common test (add_executable) and the a2a3 runtime test (add_a2a3_runtime_test). Rename the a2a3 target to test_a2a3_scope_stats_collector (matches the test_a2a3_* naming already used for the tensormap test). - docs/investigations/2026-06-pa-unroll-207001-optimeout-window.md (+ index): full write-up — not OOM; bisect to hw-native-sys#1035's 1s->3s timeout; the spin-throttle experiment that ruled OUT poll-frequency as the cause; the ~1.4s vs ~0.4ms first-launch measurement. - docs/hardware/chip-architecture.md: GM is exclusive per device_id.

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from 3a12281 to 738a448 Compare June 16, 2026 01:41

ChaoWao changed the title ~~Fix: launch AICore before AICPU Run to avoid submit contention (#1019)~~ Fix: launch AICore before AICPU Run to avoid module-load on a jammed device (#1019) Jun 16, 2026

ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from 738a448 to aee74f8 Compare June 17, 2026 07:33

ChaoWao changed the title ~~Fix: launch AICore before AICPU Run to avoid module-load on a jammed device (#1019)~~ Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense Jun 17, 2026

ChaoWao force-pushed the fix/a5-aicore-launch-order-1019 branch from aee74f8 to aecf3e4 Compare June 17, 2026 07:52

ChaoWao merged commit 3734589 into hw-native-sys:main Jun 17, 2026
30 of 31 checks passed

ChaoWao deleted the fix/a5-aicore-launch-order-1019 branch June 17, 2026 08:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense#1060

Launch AICore before AICPU Run (a5+a2a3): first-launch latency + op-timeout defense#1060
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
ChaoWao:fix/a5-aicore-launch-order-1019

ChaoWao commented Jun 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChaoWao commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The reorder (a5 + a2a3) — kept on its own merits, not as the #1019 fix

How #1019 was actually explained (bisect)

Changes

Testing

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChaoWao commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading