
[investigate][draft] #1070 highperf onboard golden mismatch — repro + per-head diag#1076

Closed
ChaoZheng109 wants to merge 2 commits into hw-native-sys:main from ChaoZheng109:investigate-1070-highperf-acc

@ChaoZheng109

Collaborator

Draft / investigation PR — CI st-onboard-a2a3 is EXPECTED to go red. Not for merge.

Goal

Reproduce the flaky 'out' golden mismatch (#1070) of spmd_paged_attention_highperf (b1_h32_kv8_s128_bs128_fp16) on onboard a2a3 in CI. It does not reproduce at ≤2-card local concurrency, and the local dev box's onboard environment is not CI-faithful (PTO_ISA pin, load). CI runs the case under genuine multi-core concurrency with the pinned --pto-isa-commit, which is where the race surfaces.

What this changes

  1. Re-enable b1 on a2a3 (it was sim-only after #1063, "Fix: scheduler timeout per platform variant (sim 5s, onboard 2s)") so that st-onboard-a2a3 runs it.
  2. _compare_outputs: on a golden mismatch, print a localizing breakdown: the worst element's multi-dim index, e.g. (batch, head, dim), and a per-row over-tolerance count, where rows are the flattened leading dims (i.e. per head). Best-effort; it never raises.
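A minimal sketch of the kind of breakdown item 2 describes, not the PR's actual _compare_outputs code. The function names, the flat row-major input layout, and the tolerance rule (|actual - golden| <= atol + rtol * |golden|) are assumptions made for illustration; the real harness may use tensors and a different tolerance.

```python
from math import prod


def localize_mismatch(actual, golden, shape, rtol=1e-3, atol=1e-3):
    """Hypothetical helper: actual/golden are flat row-major float sequences,
    shape is e.g. (batch, head, dim). Returns (worst_index, worst_err, bad_per_row)."""
    assert len(actual) == len(golden) == prod(shape)
    row_len = shape[-1]  # one row per flattened leading index, e.g. per head
    worst_err, worst_flat = 0.0, None
    bad_per_row = {}
    for flat, (a, g) in enumerate(zip(actual, golden)):
        err = abs(a - g)
        if err > atol + rtol * abs(g):  # assumed tolerance rule
            row = flat // row_len
            bad_per_row[row] = bad_per_row.get(row, 0) + 1
            if err > worst_err:
                worst_err, worst_flat = err, flat
    worst_index = None
    if worst_flat is not None:
        # Convert the flat offset back into a multi-dim index.
        idx, rem = [], worst_flat
        for extent in reversed(shape):
            rem, i = divmod(rem, extent)
            idx.append(i)
        worst_index = tuple(reversed(idx))
    return worst_index, worst_err, bad_per_row


def report_mismatch(actual, golden, shape):
    """Best-effort printing: any failure here is swallowed so it cannot
    replace the AssertionError raised by the real comparison."""
    try:
        worst_index, worst_err, bad_per_row = localize_mismatch(actual, golden, shape)
        print(f"[diag] worst element at {worst_index}, abs err {worst_err:.4g}")
        for row, count in sorted(bad_per_row.items()):
            print(f"[diag] row {row}: {count} element(s) over tolerance")
    except Exception as exc:
        print(f"[diag] breakdown unavailable: {exc}")
```

Calling report_mismatch just before the comparison's own assert keeps the diagnostic purely additive: the test still fails with the original error.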

How to read the CI failure

Next

Cross-reference the per-head diag with the device log from the failing onboard job to localize the racy FFTS handoff, then decide the kernel-side sync fix.

Refs #1070

…head mismatch diag

Debug PR to reproduce the flaky spmd_paged_attention_highperf onboard
'out' golden mismatch under CI multi-core concurrency (it does not
reproduce at <=2-card local concurrency). CI st-onboard-a2a3 is EXPECTED
to go red — this PR is for diagnosis, not merge.

- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
- _compare_outputs: on mismatch, print the worst element's multi-dim
  index and a per-row (flattened leading dims, e.g. per-head)
  over-tolerance breakdown. Random bad rows across reruns => concurrency
  race; a fixed set => deterministic per-row bug. Best-effort, never
  raises, so it cannot mask the real AssertionError.
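The reading rule in the commit message above (random bad rows across reruns suggest a race, a fixed set suggests a deterministic per-row bug) can be sketched as a small classifier. The function name and the string labels are hypothetical, and treating fewer than two failing reruns as inconclusive is an added assumption.

```python
def classify_bad_rows(runs):
    """Hypothetical helper: runs is a list of sets of bad row indices,
    one set per rerun (an empty set means that rerun passed)."""
    failing = [frozenset(r) for r in runs if r]
    if len(failing) < 2:
        return "inconclusive"  # need at least two failing reruns to compare
    # Identical bad-row sets on every failing rerun point at a deterministic bug;
    # differing sets point at a concurrency race.
    return "deterministic" if len(set(failing)) == 1 else "race-like"


if __name__ == "__main__":
    print(classify_bad_rows([{3, 7}, {3, 7}, {3, 7}]))
    print(classify_bad_rows([{1}, {12, 30}, set()]))
```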
@gemini-code-assist


Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 17, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch 2 times, most recently from 8cc2a74 to 68c938b (June 17, 2026 09:34)
… aliasing flag 3)

The highperf paged-attention decode kernel reused FFTS flag id 3 for two
distinct cross-core barriers: QK_READY_DECODER (constexpr 3) in the decode
pipeline, and the hardcoded reduce_flag_id=3 in the split-KV CombineScale
reduce. FFTS semaphores are saturating counters with no identity, so the
two uses alias onto one hardware flag: the reduce's wait_flag_dev can be
released early by the decode path's set, before the per-core partials' GM
writes are globally visible -> flaky all-heads 'out' golden mismatch with
run-to-run varying magnitude.
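The aliasing described above can be illustrated with a toy host-side model. This is not the FFTS or PTO-ISA API: the class, its methods, and the saturation limit are hypothetical, and only the value of QK_READY_DECODER (3) and the two reduce flag ids come from the commit message.

```python
class ToyFlags:
    """Toy model of identity-free saturating counter flags (assumed semantics)."""

    MAX_COUNT = 15  # hypothetical saturation limit

    def __init__(self):
        self.counts = {}

    def set(self, flag_id):
        # A set only bumps a counter; it records nothing about who set it or why.
        self.counts[flag_id] = min(self.counts.get(flag_id, 0) + 1, self.MAX_COUNT)

    def try_wait(self, flag_id):
        # A wait is satisfied by any pending set on the same id.
        if self.counts.get(flag_id, 0) > 0:
            self.counts[flag_id] -= 1
            return True
        return False


QK_READY_DECODER = 3  # decode-pipeline barrier id, per the commit message


def reduce_released_early(reduce_flag_id):
    """Return True if the reduce's wait passes before any partial was written."""
    flags = ToyFlags()
    partials_written = False
    flags.set(QK_READY_DECODER)                # decode path signals its own barrier
    released = flags.try_wait(reduce_flag_id)  # reduce polls before producers finish
    return released and not partials_written


if __name__ == "__main__":
    print("reduce on flag 3 released early:", reduce_released_early(3))
    print("reduce on flag 9 released early:", reduce_released_early(9))
```

In this model the reduce on flag 3 consumes the decode path's set and proceeds with no partials written, while the reduce on a separate id keeps waiting. That is consistent with the observation that flushing does not help: the consumer reads data that has not been produced yet.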

Latent since the kernel was written; exposed onboard once simpler#1056
enlarged the dispatch payload and tightened AICPU->AICore dispatch timing
enough to overlap the two flag-3 uses. Two flush experiments (host device
sync, AICore exit pipe_barrier) did NOT fix it, precisely because the data
is wrongly computed (early release), not unflushed.

Move the reduce to a free flag (9; 0-8 used by the pipeline, 11-15
reserved by PTO-ISA) and wait on the same id rather than a hardcoded 3.
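A small sketch of how the flag-id choice above could be sanity-checked. The ranges 0-8 (pipeline) and 11-15 (reserved by PTO-ISA) are taken from the commit message; the total of 16 flag ids and all names here are assumptions for illustration.

```python
PIPELINE_FLAG_IDS = set(range(0, 9))     # 0-8, used by the decode pipeline
PTO_ISA_RESERVED_IDS = set(range(11, 16))  # 11-15, reserved by PTO-ISA
TOTAL_FLAG_IDS = 16                      # assumption: ids run from 0 to 15


def free_flag_ids():
    """Ids that neither the pipeline nor PTO-ISA claims."""
    taken = PIPELINE_FLAG_IDS | PTO_ISA_RESERVED_IDS
    return [i for i in range(TOTAL_FLAG_IDS) if i not in taken]


def check_reduce_flag(reduce_flag_id):
    """Raise if the reduce barrier would alias an id that is already in use."""
    if reduce_flag_id not in free_flag_ids():
        raise ValueError(f"flag id {reduce_flag_id} is already claimed")
    return reduce_flag_id


if __name__ == "__main__":
    print(free_flag_ids())
    check_reduce_flag(9)
```

Under these assumptions the old hardcoded value 3 fails the check, and 9 is one of the two ids left unclaimed.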
@ChaoZheng109 force-pushed the investigate-1070-highperf-acc branch from 1654840 to 68c938b (June 17, 2026 11:20)
@ChaoZheng109

Collaborator Author

Investigation done — onboard golden mismatch root-caused to a latent cross-core race in the highperf decode kernel exposed by #1056's dispatch-timing change; full analysis + 4 ruled-out hypotheses posted on #1070. Closing this draft (no fix; not a merge candidate). Branch investigate-1070-highperf-acc kept for repro + the flag-id latent-bug fix.
