Skip to content

feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)#1079

Open
poursoul wants to merge 18 commits into
mainfrom
feat/early-dispatch
Open

feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)#1079
poursoul wants to merge 18 commits into
mainfrom
feat/early-dispatch

Conversation

@poursoul

Copy link
Copy Markdown
Collaborator

Summary

Speculative early-dispatch hides the ~7μs scheduler dispatch-loop latency by
pre-staging a ready consumer onto idle AICore(s) while its producer still runs,
gating execution on a per-core DMB doorbell, and ringing it the instant the
producer completes (skipping the ready-queue round trip).

  • Core: pre-stage + not_ready gate + DMB doorbell; release at producer
    completion (try_speculative_release).
  • Coverage: MIX / SPMD-as-producer / SPMD-as-consumer; 3 ST suites
    (spec_mix, spec_spmd_producer, spec_spmd_consumer).
  • Data structures: per-core doorbell bitmask + global table (replaces the
    per-task 3-doorbell cap); early_dispatch_queue backed by PTO2ReadyQueue;
    event-driven candidate detection (dispatch_fanin) replacing the PULL scan.
  • Concurrency: lock-free cross-thread staging + re-push; gated-pending
    deadlock fix; deferred dispatch_fanin propagation off the hot paths.
  • Payload init: early-dispatch spec fields initialized in
    PTO2TaskPayload::init (single init point; reset_for_reuse no longer
    touches the payload).

Testing

  • qwen3 decode_layer single-layer golden (a2a3, MAX_SEQ): PASS.
  • 20-round onboard batch at LEVEL=1: golden parity + AICore runtime unchanged
    vs main (median ~1027us).
  • Rebased onto current main; clean rebuild + NORMAL codegen verified e2e.
  • Simulation tests pass
  • Hardware tests pass

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a speculative early-dispatch mechanism for the A2A3 runtime scheduler, enabling consumer tasks to be pre-staged on idle cores and gated on a hardware doorbell register until their producers complete. It also adds documentation on cache coherency constraints, updates profiling phases, and introduces new test cases. The review feedback identifies two critical concurrency issues in scheduler_completion.cpp where loading spec_state with std::memory_order_relaxed could result in stale reads, potentially causing deadlocks or increased latency; using std::memory_order_acquire is recommended to ensure correct visibility across threads.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

ChaoWao and others added 5 commits June 17, 2026 16:36
…ell)

Stage a consumer task on an idle AICore before its producer finishes, gate
execution with the DATA_MAIN_BASE high-32 doorbell, so build_payload +
dispatch-write overlap the producer's run. The dependency engine is
untouched: producers run to completion, deps resolve normally; the only
additions are a pre-stage hook and a completion-path doorbell.

Mechanism:
- pto2_dispatch_payload.h: not_ready gate flag (reuses reserved pad).
- aicore inner_kernel.h (onboard+sim): read_dmb_high32().
- aicore_executor.cpp: gate spins on read_dmb_high32()==task_id, EXIT-honoring;
  COND ACK deferred until AFTER the gate so the core stays off-limits.
- scheduler_dispatch.cpp: Hook 1 try_speculative_prestage — after normal
  dispatch, pre-stage a RUNNING flagged producer's consumers onto spare idle
  cores (eligibility: single-block, AIC/AIV, remaining-producers==1).
- pto_scheduler.h: try_speculative_release on the completion path rings the
  doorbell the instant the last producer's FIN satisfies fanin (skips the
  ready-queue round-trip). Lock-free spec_state CAS coordinates stager/router.
- pto_types.h: Arg::set_allow_early_resolve() codegen hint (a2a3-local Arg,
  shared task_args.h ABI untouched), exposed via ArgWithDeps.

v1 scope: single-block + AIC/AIV (not MIX) + fully-idle cores. MIX, SPMD, and
dual-issue are follow-ups. vector_example carries the inflated-kernel test
harness (flag currently commented for a clean no-regression baseline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A MIX consumer occupies a full cluster (1-3 subtask cores, each with its own
dispatch token + reg_addr), so the release must ring one doorbell per gated
core. Generalize the single-doorbell staged metadata into a fixed array:

- pto_runtime2_types.h: staged_reg_addr/staged_reg_task_id become arrays of
  PTO2_SPEC_MAX_DOORBELLS (3) plus staged_count; still fits the 40B payload
  padding (guarded by the existing offsetof(tensors)==576 static_assert).
- scheduler_dispatch.cpp (Hook 1): allow shape == MIX; pre-stage onto a
  fully-idle cluster via get_idle_core_offset_states(MIX); record all n handles.
- pto_scheduler.h (try_speculative_release): ring one doorbell per staged core.
- scheduler_context.h: drop the now-dead ring_speculative_doorbell helper
  (release rings inline).

Single-block MIX only (set_sync_start fires only for block_num>1, so the
pre-stage bypass is safe). New st-sim test spec_mix validates a flagged AIV
producer feeding a 2-subtask MIX consumer: speculation fires (staged_count==2)
and the result is correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ispatch

A multi-block (SPMD) flagged producer already works with the existing logic:
its single fanin edge into the consumer is released only when ALL blocks
complete, so the completion-path doorbell fires the pre-staged consumer exactly
once, after the last block. New st-sim test spec_spmd_producer pins this — a
block_num=4 producer where block i writes out_p[i]=i+1 feeds a single-block
consumer out_c[i]=2*out_p[i]; the partitioned output makes any premature
release mismatch the golden. Speculation fires (staged_count==1) and the result
is correct across repeated runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…spatch

Pre-stage a multi-block (SPMD) consumer block-by-block — not all N blocks at
once. In a single atomic Hook-1 pass (bracketed by the STAGING->STAGED claim, so
it never races the completion release), stage min(idle cores now, remaining
blocks, doorbell budget) blocks; each gated block records its doorbell(s). Any
blocks that don't fit dispatch normally:

- scheduler_dispatch.cpp (Hook 1): drop the single-block restriction; loop over
  blocks_to_stage claiming one idle core/cluster per block, advancing
  next_block_idx. subtasks_per_block (popcount of core_mask) bounds how many
  blocks fit the PTO2_SPEC_MAX_DOORBELLS array. Exclude sync_start tasks (their
  blocks must launch atomically — partial staging would break that).
- pto_scheduler.h (try_speculative_release): after ringing the staged doorbells,
  return false when next_block_idx < logical_block_num so the caller pushes C to
  the ready queue; dispatch_shape resumes from next_block_idx for the rest.

New st-sim test spec_spmd_consumer: a flagged single-block producer feeds a
block_num=4 SPMD consumer. Three blocks fit the doorbell budget (gated +
doorbell-released) and the fourth dispatches normally; speculation fires
(blocks=3/4) and the result is correct across repeated runs. Single-block,
MIX, and SPMD-producer paths all remain a special case of the unified loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The spec_spmd_{producer,consumer} kernels had each SPMD block write one
float into a shared 64B cache line and dcci-flush the whole line. On
silicon each core writes back its full line copy (other slots stale 0),
so the last block to flush clobbers the rest -- false sharing,
last-writer-wins. Passed on sim only (sim models dcci as a no-op);
failed on device with max_diff 8.0. Not a speculative-dispatch bug --
reproduces with the early-resolve flag off.

- Isolate each block on its own cache line (stride 16 floats = 64B):
  kernel_spmd_fill / kernel_spmd_addk write out[block_idx*16];
  kernel_double reads the strided producer output; producer orchestration
  intermediate and consumer golden updated to the strided layout.
- kernel_fill_all: cast i+1 via int32 -- ccec rejects float<-uint32 in
  aicore functions; the test had only ever run from a stale cached .o.
- Document the limitation: aicore-kernel-programming.md "Each block must
  write to its own cache line" and cache-coherency.md "dcci is
  whole-cache-line".

All 3 spec tests now pass on both a2a3sim and a2a3 (device).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b2edcadc-69c9-46ad-98f4-f2229646b04c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Implements speculative early-dispatch (pre-stage + DATA_MAIN_BASE doorbell) in the a2a3 tensormap_and_ringbuffer scheduler: adds allow_early_resolve API, a not_ready dispatch payload gate, PTO2SpecState state machine, AICore executor doorbell spin-wait, scheduler prestage/release machinery with per-core doorbell table and early_dispatch_queue, extended L2 swimlane profiling phases, three new system test suites, and hardware cache-line isolation documentation.

Changes

Speculative Early-Dispatch Feature

Layer / File(s) Summary
Cache-line isolation docs
docs/aicore-kernel-programming.md, docs/hardware/cache-coherency.md
Adds sections documenting that dcci operates at 64-byte whole-cache-line granularity, showing a broken SPMD write pattern and a corrected cache-line-isolated pattern.
Public allow_early_resolve API and dispatch payload gate
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h, .../orchestration/pto_arg_with_deps.h, .../runtime/pto2_dispatch_payload.h
Adds Arg::allow_early_resolve_ flag with accessors propagated through ArgWithDeps. Adds volatile uint32_t not_ready to PTO2DispatchPayload, reducing ABI padding from 8 to 4 bytes.
Speculative dispatch data model and runtime init
.../runtime/pto_runtime2_types.h, .../runtime/pto_orchestrator.cpp, .../runtime/shared/pto_runtime2_init.cpp
Introduces PTO2SpecState enum, PTO2_EARLY_DISPATCH_QUEUE_SIZE, staged-core mask atomics, dispatch_fanin, spec_state, chaining fields, and prefetch() on PTO2TaskPayload. Changes next_block_idx to std::atomic<int16_t>. Wires early_dispatch_queue through scheduler arena lifecycle.
AICore read_dmb_high32 and executor doorbell spin-wait
src/a2a3/platform/onboard/aicore/inner_kernel.h, .../sim/aicore/inner_kernel.h, .../aicore/aicore_executor.cpp
Adds read_dmb_high32() on both onboard and sim paths. Updates the executor to spin-poll not_ready via read_dmb_high32() until the doorbell task ID matches, with AICORE_EXIT_SIGNAL teardown detection. Moves FIN write before swimlane recording.
Scheduler speculative prestage, release, and dispatch wiring
.../runtime/scheduler/pto_scheduler.h, scheduler_context.h, scheduler_dispatch.cpp, scheduler_completion.cpp, scheduler_cold_path.cpp
Adds per-core doorbell table, early_dispatch_queue, try_speculative_release (CAS spec_state, ring doorbells, defer propagation), stage_consumer_blocks, and try_speculative_prestage. Updates build_payload to gate not_ready for STAGING slots. Defers dispatch_fanin propagation via prop_list. Adds pending_gated to decide_slot_transition to avoid STAGED-slot deadlock. Skips STAGED cores in completion polling.
Profiling: new swimlane phases and activity-fill tracking
src/a2a3/platform/include/common/l2_swimlane_profiling.h, .../scheduler/scheduler_types.h, scheduler_dispatch.cpp, scheduler_completion.cpp, src/a2a3/platform/shared/host/l2_swimlane_collector.cpp, simpler_setup/tools/swimlane_converter.py
Extends L2SwimlaneSchedPhaseKind with Poll, Release, Fanout, Scan, Prestage. Adds phase_subretire_count and activity-fill fields to SchedL2SwimlaneCounters. Reworks dispatch loop activity-fill attribution (Poll/Release/Prestage). Records Fanout and Scan phases with timing. Updates JSON name mapping and Perfetto color map/filter.
System test suites: spec_mix, spec_spmd_consumer, spec_spmd_producer
tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/*, tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/*, tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/*
Three new test suites covering speculative early-dispatch scenarios: MIX consumer (producer c=a+b, MIX d=c+b / e=c*b), SPMD consumer (single-block fill producer + 4-block SPMD addk consumer), and SPMD producer (4-block fill producer + single-block double consumer). Each includes C++ AIV kernels, C++ orchestration, and Python SceneTestCase.
vector_example kernel latency inflation and block_dim update
examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add.cpp, kernel_add_scalar.cpp, kernel_mul.cpp, kernels/orchestration/example_orchestration.cpp, test_vector_example.py
Wraps TADD, TADDS, TMUL in 32-iteration loops. Increases block_dim from 3 to 24. Adds commented-out set_allow_early_resolve() calls in the orchestration example.

Sequence Diagram(s)

sequenceDiagram
  participant Orch as Orchestrator Thread
  participant Sched as PTO2SchedulerState
  participant EarlyQ as early_dispatch_queue
  participant Doorbell as per-core doorbell table
  participant Payload as PTO2DispatchPayload
  participant AICore as AICore executor

  Orch->>Sched: submit task with allow_early_resolve=true
  Sched->>Sched: wire_task (seed dispatch_fanin for finished producers)
  Sched->>EarlyQ: push consumer slot
  Sched->>Sched: try_speculative_prestage → stage_consumer_blocks
  Note over Sched: CAS next_block_idx, set spec_state=STAGING
  Sched->>Payload: write not_ready=1, reg_task_id
  Sched->>Doorbell: store task_id for core
  Sched->>AICore: dispatch payload
  AICore->>AICore: observe not_ready=1, spin on read_dmb_high32()
  Note over Sched: producer task completes → on_mixed_task_complete
  Sched->>Sched: try_speculative_release (CAS STAGING→DISPATCHED)
  Sched->>Doorbell: ring_doorbell (write task_id to DATA_MAIN_BASE)
  AICore->>AICore: read_dmb_high32() matches task_id, exit spin
  AICore->>AICore: execute consumer kernel
  AICore->>Sched: write FIN to COND
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

  • hw-native-sys/simpler#942: Both PRs modify L2SwimlaneSchedPhaseKind and the host-side sched-phase JSON name mapping in l2_swimlane_collector.cpp, with this PR adding five new phase kinds atop the infrastructure introduced there.
  • hw-native-sys/simpler#989: Both PRs modify scheduler_dispatch.cpp and the drain_worker_dispatch block-claiming loop; this PR changes that same loop to use atomic next_block_idx loads/stores and adds the deferred prop_list publish step.
  • hw-native-sys/simpler#994: Both PRs extend SchedL2SwimlaneCounters and the profiling instrumentation in scheduler_dispatch.cpp and scheduler_types.h, with this PR adding phase_subretire_count and activity-fill fields in the same struct.

Poem

🐇 Hop hop, the doorbell rings at last!
A staged core waits, then runs so fast.
Each block claims its own cache line,
No false-share clobber — all aligned!
The swimlane fills with Poll and Scan,
Speculative dispatch: the runtime's plan. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.21% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)' clearly and specifically describes the main feature added: a speculative early-dispatch mechanism with pre-staging and doorbell components.
Description check ✅ Passed The description comprehensively explains the speculative early-dispatch feature, covering the core mechanism, data structures, coverage, concurrency considerations, and testing approach, all of which align with the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ChaoWao and others added 9 commits June 17, 2026 16:50
…profiling

Speculative early-dispatch completion path:
- Inline the pre-staged consumer's doorbell during the producer fanout
  walk (pto_scheduler.h on_mixed_task_complete) instead of batching every
  doorbell to after the O(fanout-degree) walk. A flagged consumer near the
  front of the list starts immediately and overlaps the AICPU releasing
  the rest. Safe: the producer slot is COMPLETED (all SPMD blocks FIN'd +
  dcci-flushed) before the walk. On qwen3-14b online->out_proj the
  producer-end -> consumer-start gap drops ~11.8us -> ~8.2us (fanout half
  ~5us -> ~1.3us).
- Completion poll skips STAGED cores (gated; cannot ACK/FIN until their
  doorbell) and re-polls once after dispatch so a producer block that FINs
  mid-iteration is observed the same iteration.

Scheduler activity profiling (the sched lane no longer has blank gaps):
- New SCHED phase kinds Poll / Release / Fanout / Scan
  (l2_swimlane_profiling.h + host dumper + converter render/colors).
- scheduler_dispatch.cpp: per-iteration activity-fill classifies each
  otherwise-blank iteration as Poll (scan that retired nothing) and emits
  the deferred-release drain as its own Release bar.
- scheduler_completion.cpp: a Scan bar per check_running_cores pass (cores'
  MMIO COND-read cost, ~95ns/LDR) and a Fanout child nested in Complete
  (fanin + doorbell walk); phase_subretire_count surfaces SPMD sub-block
  retires.
- aicore_executor.cpp: sample the op-event end AFTER the FIN write so the
  bar end marks FIN-write (when the AICPU can first observe completion),
  separating AICore-side from AICPU-side latency.

Device-validated by spec_mix / spec_spmd_producer / spec_spmd_consumer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pre-staged-consumer doorbell storage capped at PTO2_SPEC_MAX_DOORBELLS=3
per task (sized to fit a MIX block's 1-3 subtask cores in the payload). That
cap also bounded how many blocks a wide SPMD consumer could pre-stage in one
pass, leaving idle cores unused.

- Replace payload's staged_count + staged_reg_task_id[3] + staged_reg_addr[3]
  with a staged_core_mask bitmask (PTO2_SPEC_CORE_MASK_WORDS=2 words = 128
  bits >= RUNTIME_MAX_WORKER=72), 1 bit per global core_id.
- Move the (reg_addr, token) pair into a scheduler-global spec_doorbell_table
  indexed by core_id. Gated cores in flight are bounded by the chip core count
  (no two-level pre-dispatch), so the table is the natural capacity and the
  per-task 3-cap is gone.
- try_speculative_release iterates set bits in the mask, ringing each core's
  doorbell from the table. Drop the SpecDoorbellBatch struct and the now-unused
  db parameter threaded through release_fanin_and_check_ready.
- Add static_assert(RUNTIME_MAX_WORKER <= PTO2_SPEC_CORE_MASK_WORDS * 64).

offsetof(tensors)==576 still holds (bitmask is smaller than the old arrays).
3 spec st-sim tests (mix, spmd_consumer, spmd_producer) pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ores

Speculative pre-stage previously only used IDLE cores in the producer's
owner thread. For a producer like fa_fused (qwen attention) that holds every
core until it finishes — its 72 blocks all complete in the last ~20us, no
core goes idle mid-run — that staged nothing, so the consumer (online_softmax,
48 SPMD-AIV blocks) gained zero overlap. Two problems had to be solved.

1. Stage onto the PENDING slot of cores RUNNING the producer (the only cores
   available when no core is idle). This hit a deadlock (507018,
   SCHEDULER_TIMEOUT): decide_slot_transition Case 3.1 ignores a running-task
   FIN whenever a pending task exists, assuming the AICore immediately runs the
   pending one and the scheduler observes ITS ack. A speculatively-gated pending
   task spins on its doorbell and never acks — and the producer's completion
   depends on collecting that very running FIN — so it wedged forever. Fix:
   detect a gated (STAGED) pending in the completion poll and complete the
   running FIN + promote the gated task (Case 3.3) instead of waiting. The
   promoted task keeps pending_occupied set so no second gated block stacks and
   the per-core doorbell entry is never overwritten.

2. A consumer has ONE spec_state but its SPMD blocks span cores owned by
   several AICPU threads. Only the first thread to CAS NONE->STAGING staged its
   share (16/48); the others saw spec_state != NONE and skipped. Allow a thread
   to re-claim a STAGED consumer (CAS STAGED->STAGING), appending its own cores
   to the shared mask and advancing the shared next_block_idx. The CAS
   serializes the staging critical section across threads and races cleanly with
   release (re-claim fails once release sets DISPATCHED).

Device-verified on qwen fa_fused (task-submit --device 1): all 48 blocks now
pre-stage across 3 threads (was 16); online starts +2.4us after fa_end (was
+6.0), online END -3.9us, downstream out_proj END -7.4us. 3 spec st-onboard +
st-sim tests pass; fa repro 5/5 deadlock-free. A V9 [SPEC] diagnostic logs each
thread's staging contribution.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reworks Hook 1 (early-dispatch pre-staging) so a flagged producer's consumer
is staged fully and concurrently across all AICPU threads, and fixes the
completion-path interaction that made pending-slot staging deadlock.

- Concurrent staging: spec_state simplifies to NONE -> STAGING (stable gated
  state) -> DISPATCHED. next_block_idx and staged_core_mask become atomic so
  every thread claims a distinct block range and OR-s its cores into the mask
  in parallel (no serialized staircase). Release flips STAGING->DISPATCHED and
  rings the mask; a block staged after that self-rings its own doorbell
  (seq_cst order between mask-OR/spec-load here and spec-store/mask-read in
  release guarantees no doorbell is missed).

- Gated-pending (Case 3.3): decide_slot_transition previously ignored a
  running-task FIN whenever a pending task existed (Case 3.1), assuming the
  AICore immediately runs pending and the scheduler observes its ack. A gated
  pending never acks until its doorbell, and the producer's completion depends
  on that very running FIN -> deadlock (507018). Now a gated (STAGING) pending
  is detected and the running FIN is completed + the gated task promoted.

- Cross-thread extend list: a thread-local producer (e.g. rope_qkv, all blocks
  on one thread) only lets that thread reach its consumer's fanout. The first
  claimer publishes the consumer to a shared list; every thread's prestage pass
  drains it and stages the consumer's remaining blocks onto ITS OWN cores.

- Swimlane: add a Prestage phase kind so a staging-dominated loop iteration
  shows as prestage(N), not mislabeled poll(0) (kind + record + collector +
  converter wired end-to-end).

Device-verified (task-submit --device 1): 3 spec st-onboard tests pass; qwen
fa_fused fully pre-dispatches its online consumer (online end -3.8us, out_proj
-5.9us); deadlock-free. sim spec tests pass.
Replace the O(n) broadcast extend-list with an inline Vyukov MPMC queue
and restructure staging to mirror the normal SPMD dispatch path, so a
flagged producer's consumer is pre-staged across all idle threads in
parallel instead of a serial winner-then-peer daisy chain.

- pto_scheduler.h: inline bounded MPMC spec queue (push/pop/init),
  replacing spec_extend_list[] + publish/remove.
- scheduler_dispatch.cpp:
  - pass 1 (discovery) CAS-claims NONE->STAGING and pushes the consumer
    ONCE; it no longer stages (no front-loading).
  - pass 2 (drain) pops, CAS-claims a range sized to this thread's free
    cores, RE-PUSHES before the expensive prepare+publish so peers claim
    and stage concurrently. CAS (not store) keeps next_block_idx atomic
    against normal dispatch, which also claims it when release routes a
    partially-staged consumer to the ready queue.
  - stage_consumer_blocks now stages an explicit already-claimed range.
  - remove the per-stage [SPEC] debug log from the hot path: its device
    -log write (~9us, contended across threads) dominated the prestage
    bar and was the real source of the short-producer regression.
  - gate the prestage trigger on no-ready-work (idle only).
- pto_runtime2_init.cpp: init the spec queue.
Replace the per-iteration PULL scan in try_speculative_prestage with an
event-driven PUSH dual of fanin_refcount/ready:

- Add dispatch_fanin/dispatch_propagated/spec_chain_active/spec_chain_depth
  to PTO2TaskPayload (reset in reset_for_reuse).
- propagate_dispatch_fanin(): when a FLAGGED producer dispatches (normal or
  doorbell-released), bump each consumer's dispatch_fanin once (guarded). A
  consumer reaching fanin_actual_count is an early-dispatch candidate:
  CAS NONE->STAGING and push to spec_queue. Auto-chains by inheriting the
  flag (depth-capped).
- Seed dispatch_fanin at wire_task with early_finished (producers already
  complete when the consumer was wired). Such producers never dispatch at
  runtime, so without the seed the candidate compare is unreachable whenever
  any producer is pre-completed (e.g. a buffer-creator). Mirrors the
  early_finished seed ready_fanin already gets.
- try_speculative_prestage now only DRAINS the queue; the O(running cores x
  fanout) pass-1 scan is gone.

Verified on a2a3: qwen3-14b decode online_softmax now pre-stages (df 2==2,
48/96 blocks staged across 3 scheduler threads); was 0 prestage before the
seed. Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.
… paths

propagate_dispatch_fanin was being called inline at two latency-critical
points, where its fanout walk delayed the very thing early-dispatch exists
to make fast:

- dispatch_shape: called between claiming cores and publishing a task's
  blocks. For a flagged SPMD producer, the thread that won the once-guard
  ran the walk before publishing its block range, while peer threads
  co-dispatching the same SPMD task published immediately — misaligning the
  task's block starts. Now recorded in a per-pop prop_list and replayed
  after flush_publish.

- on_mixed_task_complete fanout walk: try_speculative_release rang a
  released consumer's doorbell and then propagated inline, delaying the
  doorbells of later consumers in the same producer's fanout. Now collected
  in a SpecReleaseSink and replayed after the walk; every doorbell fires
  inline first, propagation runs last.

Propagation is pure bookkeeping for future candidates, so deferring it by
one walk is semantically free. Measured on qwen3-14b decode: qk_gamma/
qk_recip/rope handoffs dropped from ~7-8us back to ~2us at full chain depth.
Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.
The cross-thread early-dispatch work queue was a hand-rolled inline
Vyukov MPMC (spec_queue_) that duplicated PTO2ReadyQueue's algorithm,
element type, and push/pop semantics verbatim. Replace it with a
PTO2ReadyQueue instance, early_dispatch_queue, allocated/wired via the
same arena helpers as ready_queues and dummy_ready_queue.

- Drop SpecQueueSlot, spec_queue_, spec_queue_init/push/pop (~60 lines)
- Add PTO2_EARLY_DISPATCH_QUEUE_SIZE (64); reserve/init/wire/destroy the
  new queue in PTO2SchedulerState, plus off_early_dispatch_queue_slots
- Rename spec_queue_* call sites and PTO2_SPEC_QUEUE_DRAIN_MAX

Behavior-preserving (same MPMC algorithm): 20 onboard decode rounds show
unchanged median AICore runtime and golden parity (the lone golden miss
is the pre-existing early-dispatch ULP race, not this change).
The early-dispatch spec fields (allow_early_resolve, spec_state,
staged_core_mask, dispatch_fanin, dispatch_propagated, spec_chain_active,
spec_chain_depth) were reset across two sites: prepare_task wrote two of
them, and reset_for_reuse reached into the payload for the other five — a
cold cross-structure write on the scheduler's advance-ring path, which
also violated reset_for_reuse's documented "skips payload" contract.

Consolidate all of them into PTO2TaskPayload::init, the single payload
init point (runs every submit, before the slot is visible to the
scheduler). prepare_task now does no payload writes; reset_for_reuse no
longer touches the payload. The fields must be (re)initialized for every
task regardless of its own allow_early_resolve, because spec_state
(try_speculative_release CASes it for every task), spec_chain_active
(read by every dispatched task's propagate) and dispatch_fanin (seeded
via fetch_add at wiring) are touched on per-task paths driven by other
tasks' flags — so no per-task gating.

Also pack the spec block by descending alignment (drops 6B internal
padding; sizeof and tensors[]@576 unchanged) and turn prefetch_payload
into a PTO2TaskPayload::prefetch method that also warms the spec cache
line. Behavior-preserving: 20 onboard rounds show unchanged median
AICore runtime and golden parity (the golden misses are the pre-existing
early-dispatch ULP race, not this change).
@poursoul poursoul force-pushed the feat/early-dispatch branch from f3b928f to 0c44cf0 Compare June 17, 2026 08:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (4)
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py (1)

18-22: ⚡ Quick win

Use ClassVar for mutable test metadata fields (CALLABLE, CASES).

Line 31 and Line 55 use mutable class attributes; annotating them as class vars prevents accidental instance-level mutation patterns and resolves the RUF012 warning.

Proposed fix
+from typing import ClassVar
+
 import torch
 from simpler.task_interface import ArgDirection as D
 
 from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test
@@
-    CALLABLE = {
+    CALLABLE: ClassVar = {
@@
-    CASES = [
+    CASES: ClassVar = [

Also applies to: 31-62

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py`
around lines 18 - 22, Add `ClassVar` to the imports from the `typing` module at
the top of the file. Then locate the `CALLABLE` class attribute (around line 31)
and the `CASES` class attribute (around line 55) in the test class, and annotate
both with `ClassVar` type hints to properly indicate they are class-level
mutable attributes rather than instance attributes, which will resolve the
RUF012 warning.

Source: Linters/SAST tools

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py (1)

19-23: ⚡ Quick win

Annotate mutable class configs as ClassVar to avoid shared-state surprises.

Line 36 and Line 60 define mutable class attributes (dict / list). If any test utility mutates them, state can leak between runs; Ruff already flags this (RUF012).

Proposed fix
+from typing import ClassVar
+
 import torch
 from simpler.task_interface import ArgDirection as D
 
 from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test
@@
-    CALLABLE = {
+    CALLABLE: ClassVar = {
@@
-    CASES = [
+    CASES: ClassVar = [

Also applies to: 36-67

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py`
around lines 19 - 23, Import ClassVar from the typing module at the top of the
file, then locate all mutable class attributes (dicts and lists) defined at the
class level around lines 36 and 60 in the class definitions. Annotate each of
these mutable class attributes with the ClassVar type annotation (e.g., change a
class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent
Ruff's RUF012 warning about mutable class variables that can cause shared state
between test runs.

Source: Linters/SAST tools

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp (1)

18-22: ⚡ Quick win

Update the stale per-task doorbell-cap assumption in test docs.

This comment still asserts a fixed PTO2_SPEC_MAX_DOORBELLS == 3 behavior, but the scheduler model in this PR stack moved to per-core bitmask/table doorbell tracking. Keeping the old claim makes the intended coverage path misleading.

Suggested comment update
- * t1 is a multi-block consumer. Block-by-block pre-staging stages as many of its
- * blocks as fit on idle cores and in the doorbell budget (PTO2_SPEC_MAX_DOORBELLS
- * == 3); with B == 4, three blocks are gated + released by doorbell and the
- * remaining block dispatches normally off the ready queue. Golden: out_c[i] =
- * i + 11.
+ * t1 is a multi-block consumer. Block-by-block pre-staging stages as many blocks
+ * as possible on currently idle cores, and gated blocks are released by doorbell
+ * once t0 completes. Golden: out_c[i] = i + 11.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp`
around lines 18 - 22, Update the multi-block consumer documentation comment
(lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale
assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate
description of the current scheduler behavior using per-core bitmask/table
doorbell tracking instead of per-task tracking. Ensure the updated comment still
clearly explains the intended coverage path for block staging and dispatch
behavior.
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp (1)

48-52: ⚡ Quick win

Remove the hardcoded stride literal from the producer/consumer memory-layout contract.

16 here is coupled to CL in both kernel_spmd_fill.cpp and kernel_double.cpp. If one side changes independently, the orchestration tensor shape can become inconsistent with kernel indexing.

Localized improvement
-    uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * 16};
+    constexpr uint32_t kCacheLineFloats = 64 / sizeof(float);
+    uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * kCacheLineFloats};

(Preferably, move this constant into a shared test header used by all three files.)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp`
around lines 48 - 52, The hardcoded value `16` in the inter_shapes array
calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values
in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout
inconsistency if one file is modified independently. Extract this cache line
stride constant into a shared test header file that is included by all three
files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and
kernel_double.cpp), then replace the hardcoded `16` literal with a reference to
this shared constant definition. This ensures all three files reference the same
value and prevents accidental divergence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`:
- Around line 153-166: The exit signal check inside the while loop that waits
for read_dmb_high32() to match task_id can be bypassed if the shutdown signal is
written after the doorbell condition is already satisfied. This causes the code
to skip the loop entirely and execute the gated task before shutting down. Add
an additional check for the AICORE_EXIT_SIGNAL (by reading RegId::DATA_MAIN_BASE
and comparing to AICORE_EXIT_SIGNAL) immediately after the while loop exits but
before the subsequent if (exiting) block to ensure the exit signal is properly
detected even when it arrives after the doorbell match, preventing execution of
the gated task during shutdown.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`:
- Line 978: The `early_dispatch_queue.push(c)` call at the specified location
can fail by returning false, but the code does not handle this failure case.
When the push fails, the `spec_state` remains incorrectly set to
`PTO2_SPEC_STAGING` for a consumer that was never actually added to the queue,
causing incorrect behavior in the ready release path. Check the return value of
`early_dispatch_queue.push(c)`, and if it returns false, restore the speculative
claim by resetting `spec_state` to its previous value before the attempted push
operation.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 197-201: The expanded anomaly diagnostic line now includes
additional token fields (cond_tok, running_tok, pending_tok) that can cause the
output to exceed the current buffer capacity. Locate the buffer declarations for
`aic_buf` and `aiv*_buf` (currently sized at 128 bytes) and increase their sizes
to accommodate the longer formatted string that now includes the three new token
fields along with the existing parameters. Apply the same buffer size increase
to both the location around line 197 and the related location around line 333
mentioned in the comment.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Line 1102: The try_speculative_prestage function call in the dispatch logic
currently bypasses the PMU policy by prestaging work even when PMU monitoring is
active. Modify the condition that guards the try_speculative_prestage call to
also check for pmu_active status, ensuring that speculative prestaging is
disabled whenever PMU monitoring is active (treat it the same way as when
any_ready_work is true). This will prevent phase 4b from prestaging gated work
into idle or pending slots during PMU runs, maintaining the single-issue PMU
policy.
- Around line 332-333: The prop_list array is declared with a fixed size of
CoreTracker::MAX_CLUSTERS, but the batch buffer can contain up to
CoreTracker::MAX_CLUSTERS * 3 entries, causing extra flagged producers to
silently skip dispatch_fanin propagation. Resize the prop_list array declaration
to accommodate the full batch capacity (CoreTracker::MAX_CLUSTERS * 3) instead
of CoreTracker::MAX_CLUSTERS to ensure all entries in the popped batch are
properly processed and propagated.
- Around line 631-633: The publish_subtask_to_core function writes to a register
that signals AICore, but there is no write memory barrier before this call to
ensure the payload and gate data written by prepare_block_for_dispatch are
visible before the register write. Insert a wmb() call (write memory barrier) in
the loop before each publish_subtask_to_core call to guarantee that AICore
observes the payload and not_ready gate updates before seeing the token.
- Around line 1107-1113: After emitting the Prestage bar by calling
l2_swimlane_aicpu_record_sched_phase in the block where prestage_record &&
prestaged > 0, the phase_dispatch_count needs to be cleared by resetting it to
zero. This prevents the dispatch count that was incremented during
prepare_block_for_dispatch from leaking into the next Dispatch phase iteration.
- Around line 1129-1131: In the second completion poll section where
completed_2nd > 0, the code updates completed_tasks_ but fails to increment
sched_->tasks_completed like the first poll does under PTO2_SCHED_PROFILING. Add
the same profiling instrumentation to the second poll by incrementing
sched_->tasks_completed with completed_2nd within the same PTO2_SCHED_PROFILING
guard condition used in the first poll, ensuring scheduler profiling totals
remain consistent across both polls.

---

Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp`:
- Around line 18-22: Update the multi-block consumer documentation comment
(lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale
assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate
description of the current scheduler behavior using per-core bitmask/table
doorbell tracking instead of per-task tracking. Ensure the updated comment still
clearly explains the intended coverage path for block staging and dispatch
behavior.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py`:
- Around line 19-23: Import ClassVar from the typing module at the top of the
file, then locate all mutable class attributes (dicts and lists) defined at the
class level around lines 36 and 60 in the class definitions. Annotate each of
these mutable class attributes with the ClassVar type annotation (e.g., change a
class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent
Ruff's RUF012 warning about mutable class variables that can cause shared state
between test runs.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp`:
- Around line 48-52: The hardcoded value `16` in the inter_shapes array
calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values
in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout
inconsistency if one file is modified independently. Extract this cache line
stride constant into a shared test header file that is included by all three
files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and
kernel_double.cpp), then replace the hardcoded `16` literal with a reference to
this shared constant definition. This ensures all three files reference the same
value and prevents accidental divergence.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py`:
- Around line 18-22: Add `ClassVar` to the imports from the `typing` module at
the top of the file. Then locate the `CALLABLE` class attribute (around line 31)
and the `CASES` class attribute (around line 55) in the test class, and annotate
both with `ClassVar` type hints to properly indicate they are class-level
mutable attributes rather than instance attributes, which will resolve the
RUF012 warning.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c2d7f4c-5c11-412f-a4fd-40eb98a80f80

📥 Commits

Reviewing files that changed from the base of the PR and between c4b0aac and f3b928f.

📒 Files selected for processing (37)
  • docs/aicore-kernel-programming.md
  • docs/hardware/cache-coherency.md
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add.cpp
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add_scalar.cpp
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_mul.cpp
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/orchestration/example_orchestration.cpp
  • examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
  • simpler_setup/tools/swimlane_converter.py
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/onboard/aicore/inner_kernel.h
  • src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
  • src/a2a3/platform/sim/aicore/inner_kernel.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_arg_with_deps.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_add_standalone.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_mul_standalone.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/orchestration/spec_mix_orch.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/test_spec_mix.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_fill_all.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_spmd_addk.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_double.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_spmd_fill.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py

poursoul added 4 commits June 17, 2026 17:26
Two CI breakages on the early-dispatch branch:

- test_scheduler_state (cpput) failed to link: the inline
  PTO2SchedulerState::ring_one_doorbell pulls in get_reg_ptr, which the
  host UT object set (a2a3_rt_objs / a5_rt_objs) does not provide. Add a
  get_reg_ptr stub to test_stubs.cpp returning writable static storage
  (no MMIO on host), mirroring the existing get_sys_cnt_aicpu stub.

- profiling-flags-smoke (a2a3sim/pto2-off) failed to build: `prestaged`
  in scheduler_dispatch.cpp is only read under #if PTO2_PROFILING, so
  PTO2_PROFILING=0 + -Werror=unused-variable killed the build. Mark it
  [[maybe_unused]]; the try_speculative_prestage() side effect must stay.
stage_consumer_blocks wrote a block's payload via prepare_block_for_dispatch
then immediately published its DATA_MAIN_BASE token, with no wmb() in between
— unlike the normal dispatch path (flush_publish), which barriers before
publishing. Without it a gated AICore can observe the token, dcci its payload,
and read a stale not_ready gate / args. Hoist one wmb() per stage_from pass
(prepare all claimed blocks, barrier, then publish), mirroring flush_publish.

Clears the intermittent ~5-8% bf16 ULP golden failures on qwen3 decode:
0/50 onboard replay rounds failed after the fix (was ~1-2 per 20 before).
…uard

test_scheduler_state / test_task_state / test_wiring build minimal
PTO2TaskSlotStates via init_slot, which memsets the slot and never sets
slot.payload. That was fine until the speculative-dispatch helpers
(try_speculative_release, propagate_dispatch_fanin) began dereferencing
slot.payload->spec_state — release_fanin_and_check_ready then SEGV'd on these
payload-less slots, surfaced once the host UT link was fixed and the tests
actually ran (ut macos/ubuntu).

Production slots always have a payload (orch::prepare_task -> bind_buffers sets
it; reset_for_reuse never clears it), so the fix is test-side: init_slot now
binds each slot a distinct zeroed payload from a small per-fixture pool,
mirroring bind_buffers. With that invariant restored, the defensive
`c->payload == nullptr` skip in propagate_dispatch_fanin's fanout walk is dead
code — removed for consistency with the unguarded payload derefs in the same
path.
…own EXIT, diag buffer)

- scheduler_dispatch.cpp: skip speculative prestage when pmu_active, matching
  dispatch_ready_tasks' single-issue PMU policy — a gated prestage would perturb
  the same per-task PMU windows.
- scheduler_dispatch.cpp: clear phase_dispatch_count after emitting a Prestage
  swimlane bar so blocks staged this iteration don't leak into the next Dispatch
  bar's count.
- scheduler_dispatch.cpp: the second completion poll now also bumps
  sched_->tasks_completed under PTO2_SCHED_PROFILING, matching the first poll so
  scheduler profiling no longer undercounts.
- scheduler_cold_path.cpp: grow per-core stall-status buffers 128->192B so the
  expanded ANOMALY diagnostic (added token fields) isn't truncated.
- aicore_executor.cpp: in the speculative doorbell gate, check AICORE_EXIT_SIGNAL
  on the doorbell-match iteration too, so a teardown EXIT racing in right after a
  matching doorbell still wins over executing the gated task.

Onboard-verified unchanged: qwen3 decode 20/20 replay rounds pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants