[investigate][draft] #1070 highperf onboard golden mismatch - repro + per-head diag #1076
ChaoZheng109 wants to merge 2 commits into
…head mismatch diag

Debug PR to reproduce the flaky spmd_paged_attention_highperf onboard 'out' golden mismatch under CI multi-core concurrency (it does not reproduce at <=2-card local concurrency). CI st-onboard-a2a3 is EXPECTED to go red: this PR is for diagnosis, not merge.

- Re-enable b1_h32_kv8_s128_bs128_fp16 on a2a3 (was sim-only after hw-native-sys#1063).
- _compare_outputs: on mismatch, print the worst element's multi-dim index and a per-row (flattened leading dims, e.g. per-head) over-tolerance breakdown. Random bad rows across reruns => concurrency race; a fixed set => deterministic per-row bug. Best-effort, never raises, so it cannot mask the real AssertionError.
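The per-row breakdown described in this commit can be sketched as follows. This is a minimal, stdlib-only illustration and not the actual _compare_outputs code: the function names mismatch_breakdown and unravel, the flat row-major input layout, and the atol + rtol * |golden| tolerance rule are all assumptions made for the example.

```python
import math


def unravel(flat_index, shape):
    """Convert a row-major flat index into a multi-dim index tuple."""
    idx = []
    for n in reversed(shape):
        flat_index, r = divmod(flat_index, n)
        idx.append(r)
    return tuple(reversed(idx))


def mismatch_breakdown(actual, golden, shape, rtol=1e-3, atol=1e-3):
    """Hypothetical diag helper: localize where `actual` departs from `golden`.

    `actual` and `golden` are flat row-major sequences; `shape` is e.g.
    (batch, head, dim). A "row" is one slice along the last dim, so for
    (batch, head, dim) the row index is batch * num_heads + head.
    Returns None when everything is within tolerance (or on any internal
    error, so the caller's real AssertionError is never masked).
    """
    try:
        row_len = shape[-1]
        worst_err, worst_flat = 0.0, None
        rows_over_tol = {}
        for i, (a, g) in enumerate(zip(actual, golden)):
            err = abs(a - g)
            if math.isnan(err):
                err = math.inf  # treat NaN output as the worst possible miss
            # Assumed tolerance rule (same form as numpy.isclose).
            if err > atol + rtol * abs(g):
                row = i // row_len
                rows_over_tol[row] = rows_over_tol.get(row, 0) + 1
                if worst_flat is None or err > worst_err:
                    worst_err, worst_flat = err, i
        if worst_flat is None:
            return None
        return {
            "worst_index": unravel(worst_flat, shape),
            "worst_abs_err": worst_err,
            "rows_over_tol": rows_over_tol,
        }
    except Exception:
        return None  # best-effort: never raise from the diag path


# Two bad elements, both in head 1 of a (batch=1, head=2, dim=4) tensor.
report = mismatch_breakdown(
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0], [0.0] * 8, (1, 2, 4)
)
print(report)
```

Comparing the rows_over_tol keys from several failing reruns is what separates the two cases named in the commit message: a changing set of rows points to a race, a stable set to a deterministic per-row bug.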
8cc2a74 to 68c938b
… aliasing flag 3)

The highperf paged-attention decode kernel reused FFTS flag id 3 for two distinct cross-core barriers:

- QK_READY_DECODER (constexpr 3) in the decode pipeline, and
- the hardcoded reduce_flag_id=3 in the split-KV CombineScale reduce.

FFTS semaphores are saturating counters with no identity, so the two uses alias onto one hardware flag: the reduce's wait_flag_dev can be released early by the decode path's set, before the per-core partials' GM writes are globally visible -> flaky all-heads 'out' golden mismatch with run-to-run varying magnitude.

Latent since the kernel was written; exposed onboard once simpler#1056 enlarged the dispatch payload and tightened AICPU->AICore dispatch timing enough to overlap the two flag-3 uses. Two flush experiments (host device sync, AICore exit pipe_barrier) did NOT fix it, precisely because the data is wrongly computed (early release), not unflushed.

Move the reduce to a free flag (9; 0-8 used by the pipeline, 11-15 reserved by PTO-ISA) and wait on the same id rather than a hardcoded 3.
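The aliasing mechanism above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code or the real FFTS API: FlagBank, set_flag, try_wait, the 16-flag bank and the saturation limit of 15 are hypothetical names and values chosen for the example; only the flag ids 3 and 9 and the "counter with no identity" behavior come from the commit message.

```python
class FlagBank:
    """Toy model of counting flags: a set increments, a wait consumes one.

    The counter carries no record of who set it, so any waiter on the same
    id can consume any setter's signal.
    """

    def __init__(self, num_flags=16, max_count=15):  # sizes are assumptions
        self.counts = [0] * num_flags
        self.max_count = max_count

    def set_flag(self, flag_id):
        # Saturating increment.
        self.counts[flag_id] = min(self.counts[flag_id] + 1, self.max_count)

    def try_wait(self, flag_id):
        """Non-blocking stand-in for a wait: True means the waiter is released."""
        if self.counts[flag_id] > 0:
            self.counts[flag_id] -= 1
            return True
        return False


QK_READY_DECODER = 3  # flag id used by the decode pipeline (from the commit)


def reduce_released_early(reduce_flag_id):
    """Replay one bad interleaving and report whether the reduce runs too soon."""
    flags = FlagBank()
    partials_written = False  # the per-core partials are not in GM yet

    # Decode pipeline signals "QK ready" on its own barrier.
    flags.set_flag(QK_READY_DECODER)

    # The reduce polls its barrier before any producer has signalled it.
    released = flags.try_wait(reduce_flag_id)
    return released and not partials_written


print(reduce_released_early(3))  # shared id: decode's set releases the reduce
print(reduce_released_early(9))  # distinct id: the reduce keeps waiting
```

With a shared id the reduce is released by a signal that says nothing about its own inputs, which is consistent with the observation that flushing did not help: the partials were read before they were produced, not before they were flushed.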
1654840 to 68c938b
Investigation done: onboard golden mismatch root-caused to a latent cross-core race in the highperf decode kernel exposed by #1056's dispatch-timing change; full analysis + 4 ruled-out hypotheses posted on #1070. Closing this draft (no fix; not a merge candidate). Branch
Draft / investigation PR: CI st-onboard-a2a3 is EXPECTED to go red. Not for merge.

Goal

Reproduce the flaky spmd_paged_attention_highperf (b1_h32_kv8_s128_bs128_fp16) onboard a2a3 out golden mismatch (#1070) in CI, because it does not reproduce at ≤2-card local concurrency and the local dev box's onboard env is not CI-faithful (PTO_ISA pin / load). CI runs the case under genuine multi-core concurrency with the pinned --pto-isa-commit, which is where the race surfaces.

What this changes
- Re-enable b1 on a2a3 (was sim-only after #1063, "Fix: scheduler timeout per platform variant (sim 5s, onboard 2s)") so st-onboard-a2a3 runs it.
- _compare_outputs: on a golden mismatch, print a localizing breakdown: the worst element's multi-dim index ((batch, head, dim)) and a per-row (flattened leading dims = per-head) over-tolerance count. Best-effort, never raises.

How to read the CI failure
- Random rows_over_tol across reruns ⇒ concurrency race (expected: the payload-size change in #1056, "Update: raise CORE_MAX_TENSOR_ARGS to 32, lower scalars to 16", perturbed dispatch timing, exposing the kernel's AIC↔AIV cross-core sync; see the analysis on #1070, "[Bug] spmd_paged_attention_highperf b1_h32_kv8_s128_bs128_fp16 regressed: sim scheduler stall (-100) + onboard golden mismatch").
- A fixed set of rows ⇒ deterministic per-row bug (… head_split=2, kv_split_core_num=1).

Next
Cross-reference the per-head diag with the device log from the failing onboard job to localize the racy FFTS handoff, then decide the kernel-side sync fix.
Refs #1070
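
The reading rule from "How to read the CI failure" can be written down as a small check over the bad-row sets collected from several failing reruns. This is an illustrative sketch only: classify_reruns and its return strings are hypothetical and are not part of the test harness.

```python
def classify_reruns(bad_rows_per_run):
    """Classify failures from the set of over-tolerance rows seen in each rerun.

    `bad_rows_per_run` holds one collection of row indices per failing rerun
    (for example, the keys of a rows_over_tol breakdown).
    """
    runs = [frozenset(rows) for rows in bad_rows_per_run if rows]
    if len(runs) < 2:
        return "inconclusive"  # one failing run cannot separate the two cases
    if all(run == runs[0] for run in runs):
        return "deterministic per-row bug"
    return "concurrency race"


print(classify_reruns([{1, 5}, {1, 5}, {1, 5}]))
print(classify_reruns([{1, 5}, {0, 7, 12}]))
```

A stable set across only a few reruns is weak evidence, since a race can hit the same rows by chance; more reruns make the "deterministic" verdict more trustworthy.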