[investigate][draft] #1070 highperf onboard golden mismatch — repro + per-head diag by ChaoZheng109 · Pull Request #1076 · hw-native-sys/simpler

ChaoZheng109 · 2026-06-17T06:51:08Z

Draft / investigation PR — CI st-onboard-a2a3 is EXPECTED to go red. Not for merge.

Goal

Reproduce the flaky spmd_paged_attention_highperf (b1_h32_kv8_s128_bs128_fp16) onboard a2a3 'out' golden mismatch (#1070) in CI. It does not reproduce at ≤2-card local concurrency, and the local dev box's onboard env is not CI-faithful (PTO_ISA pin / load). CI runs the case under genuine multi-core concurrency with the pinned --pto-isa-commit, which is where the race surfaces.

What this changes

Re-enable b1 on a2a3 so st-onboard-a2a3 runs it. It was sim-only after #1063 ("Fix: scheduler timeout per platform variant (sim 5s, onboard 2s)").
_compare_outputs: on a golden mismatch, print a localizing breakdown: the worst element's multi-dim index (batch, head, dim) and a per-row over-tolerance count, where a row is the flattened leading dims, i.e. one head. Best-effort, never raises.
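For readers who want to see the shape of that breakdown, here is a minimal sketch. It is not the code in this PR: the function name mismatch_breakdown, the flat row-major inputs, and the atol + rtol * |golden| tolerance rule are all assumptions made for illustration, and the real harness most likely works on tensors rather than Python lists.

```python
import math


def mismatch_breakdown(actual, golden, shape, rtol=1e-3, atol=1e-3):
    """Hypothetical helper: localize a golden mismatch.

    actual / golden are flat row-major sequences; shape is e.g. (batch, head, dim).
    Returns None instead of raising, so it can never hide the real assertion.
    """
    try:
        if len(actual) != math.prod(shape) or len(golden) != len(actual):
            return None
        row_len = shape[-1]  # one row = all leading dims flattened
        worst_flat, worst_err = 0, -1.0
        rows_over_tol = {}
        for i, (a, g) in enumerate(zip(actual, golden)):
            err = abs(a - g)
            if err > worst_err:
                worst_flat, worst_err = i, err
            # Assumed tolerance rule (allclose-style), not taken from the PR.
            if err > atol + rtol * abs(g):
                row = i // row_len
                rows_over_tol[row] = rows_over_tol.get(row, 0) + 1
        # Unravel the flat index of the worst element into a multi-dim index.
        idx, rem = [], worst_flat
        for dim in reversed(shape):
            idx.append(rem % dim)
            rem //= dim
        return {
            "worst_index": tuple(reversed(idx)),
            "worst_err": worst_err,
            "rows_over_tol": rows_over_tol,
        }
    except Exception:
        return None  # best-effort: diagnostics must not raise


if __name__ == "__main__":
    report = mismatch_breakdown([0, 0, 0, 0, 0.5, 0], [0] * 6, shape=(1, 2, 3))
    print(report)
```

The keys of rows_over_tol are the row (per-head) indices to compare across reruns.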

How to read the CI failure

Random set of rows_over_tol across reruns ⇒ concurrency race. This is the expected outcome: the payload-size change in #1056 ("Update: raise CORE_MAX_TENSOR_ARGS to 32, lower scalars to 16") perturbed dispatch timing, exposing the kernel's AIC↔AIV cross-core sync; see the analysis on #1070 ("[Bug] spmd_paged_attention_highperf b1_h32_kv8_s128_bs128_fp16 regressed: sim scheduler stall (-100) + onboard golden mismatch").
Fixed set of bad rows ⇒ deterministic per-head bug instead.
The failing head indices also show which core/cluster produced the bad rows (b1: 16 cores, head_split=2, kv_split_core_num=1).
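The race-versus-deterministic rule above can be applied mechanically to the rows_over_tol sets collected from several CI reruns. A minimal sketch, assuming a hypothetical helper classify_reruns that takes one set of bad row indices per rerun; ignoring passing reruns (empty sets) is an illustrative choice, not something this PR specifies.

```python
def classify_reruns(rows_per_run):
    """Hypothetical helper: compare the bad-row sets of several reruns.

    rows_per_run: iterable of sets of row (head) indices that were over tolerance.
    """
    # Assumption: reruns that passed (no bad rows) carry no signal and are skipped.
    failing = [frozenset(rows) for rows in rows_per_run if rows]
    if len(failing) < 2:
        return "inconclusive"  # need at least two failing reruns to compare
    if all(rows == failing[0] for rows in failing):
        return "deterministic"  # same bad rows every time: per-head bug
    return "race"  # bad rows move between reruns: concurrency race


if __name__ == "__main__":
    print(classify_reruns([{1, 5}, {1, 5}, {1, 5}]))
    print(classify_reruns([{1, 5}, {2}, {0, 7, 9}]))
```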

…head mismatch diag

Debug PR to reproduce the flaky spmd_paged_attention_highperf onboard 'out' golden mismatch under CI multi-core concurrency (it does not reproduce at <=2-card local concurrency). CI st-onboard-a2a3 is EXPECTED to go red; this PR is for diagnosis, not merge.

- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
- _compare_outputs: on mismatch, print the worst element's multi-dim index and a per-row (flattened leading dims, e.g. per-head) over-tolerance breakdown. Random bad rows across reruns => concurrency race; a fixed set => deterministic per-row bug. Best-effort, never raises, so it cannot mask the real AssertionError.


… aliasing flag 3)

The highperf paged-attention decode kernel reused FFTS flag id 3 for two distinct cross-core barriers: QK_READY_DECODER (constexpr 3) in the decode pipeline, and the hardcoded reduce_flag_id=3 in the split-KV CombineScale reduce. FFTS semaphores are saturating counters with no identity, so the two uses alias onto one hardware flag: the reduce's wait_flag_dev can be released early by the decode path's set, before the per-core partials' GM writes are globally visible -> flaky all-heads 'out' golden mismatch with run-to-run varying magnitude.

Latent since the kernel was written; exposed onboard once simpler#1056 enlarged the dispatch payload and tightened AICPU->AICore dispatch timing enough to overlap the two flag-3 uses.

Two flush experiments (host device sync, AICore exit pipe_barrier) did NOT fix it, precisely because the data is wrongly computed (early release), not unflushed.

Move the reduce to a free flag (9; 0-8 used by the pipeline, 11-15 reserved by PTO-ISA) and wait on the same id rather than a hardcoded 3.
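The aliasing described in this commit message can be illustrated with a toy model. This is a sketch in plain Python, not the FFTS or PTO-ISA API: the FlagBank class, its set_flag / try_wait methods, and the saturation limit of 15 are all hypothetical. The only properties carried over from the message are that a flag is a counter with no notion of who set it, that QK_READY_DECODER is 3, and that the reduce moves from flag 3 to flag 9.

```python
class FlagBank:
    """Toy model of identity-free counting flags (NOT the real hardware API)."""

    def __init__(self, num_flags=16, max_count=15):  # max_count is an assumption
        self.counts = [0] * num_flags
        self.max_count = max_count

    def set_flag(self, flag_id):
        # Saturating increment: the flag does not record which barrier set it.
        self.counts[flag_id] = min(self.counts[flag_id] + 1, self.max_count)

    def try_wait(self, flag_id):
        # Non-blocking stand-in for a wait: consume one count if available.
        if self.counts[flag_id] > 0:
            self.counts[flag_id] -= 1
            return True
        return False


QK_READY_DECODER = 3  # flag id used by the decode pipeline, per the message


def reduce_released_early(reduce_flag_id):
    """Return True if the reduce wait passes before the partials are written."""
    bank = FlagBank()
    partials_written = False  # in this scenario the reduce set never happens
    bank.set_flag(QK_READY_DECODER)  # decode pipeline signals "QK ready"
    released = bank.try_wait(reduce_flag_id)  # reduce waits for its partials
    return released and not partials_written


if __name__ == "__main__":
    print(reduce_released_early(3))  # shared id: decode's set releases the reduce
    print(reduce_released_early(9))  # separate id: the reduce keeps waiting
```

With a shared id the reduce consumes a count that was meant for the decode barrier, which is why flushing does not help: the wait was satisfied by the wrong event.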

ChaoZheng109 · 2026-06-17T11:21:06Z

Investigation done — onboard golden mismatch root-caused to a latent cross-core race in the highperf decode kernel exposed by #1056's dispatch-timing change; full analysis + 4 ruled-out hypotheses posted on #1070. Closing this draft (no fix; not a merge candidate). Branch investigate-1070-highperf-acc kept for repro + the flag-id latent-bug fix.

ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch 2 times, most recently from 8cc2a74 to 68c938b Compare June 17, 2026 09:34

ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch from 1654840 to 68c938b Compare June 17, 2026 11:20

ChaoZheng109 closed this Jun 17, 2026

ChaoZheng109 wants to merge 2 commits into hw-native-sys:main from ChaoZheng109:investigate-1070-highperf-acc
