feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell) by poursoul · Pull Request #1079 · hw-native-sys/simpler

poursoul · 2026-06-17T08:16:44Z

Summary

Speculative early-dispatch hides the ~7μs scheduler dispatch-loop latency by
pre-staging a ready consumer onto idle AICore(s) while its producer still runs,
gating execution on a per-core DMB doorbell, and ringing it the instant the
producer completes (skipping the ready-queue round trip).

Core: pre-stage + not_ready gate + DMB doorbell; release at producer
completion (try_speculative_release).
Coverage: MIX / SPMD-as-producer / SPMD-as-consumer; 3 ST suites
(spec_mix, spec_spmd_producer, spec_spmd_consumer).
Data structures: per-core doorbell bitmask + global table (replaces the
per-task 3-doorbell cap); early_dispatch_queue backed by PTO2ReadyQueue;
event-driven candidate detection (dispatch_fanin) replacing the PULL scan.
Concurrency: lock-free cross-thread staging + re-push; gated-pending
deadlock fix; deferred dispatch_fanin propagation off the hot paths.
Payload init: early-dispatch spec fields initialized in
PTO2TaskPayload::init (single init point; reset_for_reuse no longer
touches the payload).

Testing

qwen3 decode_layer single-layer golden (a2a3, MAX_SEQ): PASS.
20-round onboard batch at LEVEL=1: golden parity + AICore runtime unchanged
vs main (median ~1027us).
Rebased onto current main; clean rebuild + NORMAL codegen verified e2e.
Simulation tests pass
Hardware tests pass

gemini-code-assist

Code Review

This pull request introduces a speculative early-dispatch mechanism for the A2A3 runtime scheduler, enabling consumer tasks to be pre-staged on idle cores and gated on a hardware doorbell register until their producers complete. It also adds documentation on cache coherency constraints, updates profiling phases, and introduces new test cases. The review feedback identifies two critical concurrency issues in scheduler_completion.cpp where loading spec_state with std::memory_order_relaxed could result in stale reads, potentially causing deadlocks or increased latency; using std::memory_order_acquire is recommended to ensure correct visibility across threads.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…ell) Stage a consumer task on an idle AICore before its producer finishes, gate execution with the DATA_MAIN_BASE high-32 doorbell, so build_payload + dispatch-write overlap the producer's run. The dependency engine is untouched: producers run to completion, deps resolve normally; the only additions are a pre-stage hook and a completion-path doorbell. Mechanism: - pto2_dispatch_payload.h: not_ready gate flag (reuses reserved pad). - aicore inner_kernel.h (onboard+sim): read_dmb_high32(). - aicore_executor.cpp: gate spins on read_dmb_high32()==task_id, EXIT-honoring; COND ACK deferred until AFTER the gate so the core stays off-limits. - scheduler_dispatch.cpp: Hook 1 try_speculative_prestage — after normal dispatch, pre-stage a RUNNING flagged producer's consumers onto spare idle cores (eligibility: single-block, AIC/AIV, remaining-producers==1). - pto_scheduler.h: try_speculative_release on the completion path rings the doorbell the instant the last producer's FIN satisfies fanin (skips the ready-queue round-trip). Lock-free spec_state CAS coordinates stager/router. - pto_types.h: Arg::set_allow_early_resolve() codegen hint (a2a3-local Arg, shared task_args.h ABI untouched), exposed via ArgWithDeps. v1 scope: single-block + AIC/AIV (not MIX) + fully-idle cores. MIX, SPMD, and dual-issue are follow-ups. vector_example carries the inflated-kernel test harness (flag currently commented for a clean no-regression baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A MIX consumer occupies a full cluster (1-3 subtask cores, each with its own dispatch token + reg_addr), so the release must ring one doorbell per gated core. Generalize the single-doorbell staged metadata into a fixed array: - pto_runtime2_types.h: staged_reg_addr/staged_reg_task_id become arrays of PTO2_SPEC_MAX_DOORBELLS (3) plus staged_count; still fits the 40B payload padding (guarded by the existing offsetof(tensors)==576 static_assert). - scheduler_dispatch.cpp (Hook 1): allow shape == MIX; pre-stage onto a fully-idle cluster via get_idle_core_offset_states(MIX); record all n handles. - pto_scheduler.h (try_speculative_release): ring one doorbell per staged core. - scheduler_context.h: drop the now-dead ring_speculative_doorbell helper (release rings inline). Single-block MIX only (set_sync_start fires only for block_num>1, so the pre-stage bypass is safe). New st-sim test spec_mix validates a flagged AIV producer feeding a 2-subtask MIX consumer: speculation fires (staged_count==2) and the result is correct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ispatch A multi-block (SPMD) flagged producer already works with the existing logic: its single fanin edge into the consumer is released only when ALL blocks complete, so the completion-path doorbell fires the pre-staged consumer exactly once, after the last block. New st-sim test spec_spmd_producer pins this — a block_num=4 producer where block i writes out_p[i]=i+1 feeds a single-block consumer out_c[i]=2*out_p[i]; the partitioned output makes any premature release mismatch the golden. Speculation fires (staged_count==1) and the result is correct across repeated runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…spatch Pre-stage a multi-block (SPMD) consumer block-by-block — not all N blocks at once. In a single atomic Hook-1 pass (bracketed by the STAGING->STAGED claim, so it never races the completion release), stage min(idle cores now, remaining blocks, doorbell budget) blocks; each gated block records its doorbell(s). Any blocks that don't fit dispatch normally: - scheduler_dispatch.cpp (Hook 1): drop the single-block restriction; loop over blocks_to_stage claiming one idle core/cluster per block, advancing next_block_idx. subtasks_per_block (popcount of core_mask) bounds how many blocks fit the PTO2_SPEC_MAX_DOORBELLS array. Exclude sync_start tasks (their blocks must launch atomically — partial staging would break that). - pto_scheduler.h (try_speculative_release): after ringing the staged doorbells, return false when next_block_idx < logical_block_num so the caller pushes C to the ready queue; dispatch_shape resumes from next_block_idx for the rest. New st-sim test spec_spmd_consumer: a flagged single-block producer feeds a block_num=4 SPMD consumer. Three blocks fit the doorbell budget (gated + doorbell-released) and the fourth dispatches normally; speculation fires (blocks=3/4) and the result is correct across repeated runs. Single-block, MIX, and SPMD-producer paths all remain a special case of the unified loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The spec_spmd_{producer,consumer} kernels had each SPMD block write one float into a shared 64B cache line and dcci-flush the whole line. On silicon each core writes back its full line copy (other slots stale 0), so the last block to flush clobbers the rest -- false sharing, last-writer-wins. Passed on sim only (sim models dcci as a no-op); failed on device with max_diff 8.0. Not a speculative-dispatch bug -- reproduces with the early-resolve flag off. - Isolate each block on its own cache line (stride 16 floats = 64B): kernel_spmd_fill / kernel_spmd_addk write out[block_idx*16]; kernel_double reads the strided producer output; producer orchestration intermediate and consumer golden updated to the strided layout. - kernel_fill_all: cast i+1 via int32 -- ccec rejects float<-uint32 in aicore functions; the test had only ever run from a stale cached .o. - Document the limitation: aicore-kernel-programming.md "Each block must write to its own cache line" and cache-coherency.md "dcci is whole-cache-line". All 3 spec tests now pass on both a2a3sim and a2a3 (device). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-17T08:37:06Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b2edcadc-69c9-46ad-98f4-f2229646b04c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Implements speculative early-dispatch (pre-stage + DATA_MAIN_BASE doorbell) in the a2a3 tensormap_and_ringbuffer scheduler: adds allow_early_resolve API, a not_ready dispatch payload gate, PTO2SpecState state machine, AICore executor doorbell spin-wait, scheduler prestage/release machinery with per-core doorbell table and early_dispatch_queue, extended L2 swimlane profiling phases, three new system test suites, and hardware cache-line isolation documentation.

Changes

Speculative Early-Dispatch Feature

Layer / File(s)	Summary
Cache-line isolation docs `docs/aicore-kernel-programming.md`, `docs/hardware/cache-coherency.md`	Adds sections documenting that `dcci` operates at 64-byte whole-cache-line granularity, showing a broken SPMD write pattern and a corrected cache-line-isolated pattern.
Public `allow_early_resolve` API and dispatch payload gate `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h`, `.../orchestration/pto_arg_with_deps.h`, `.../runtime/pto2_dispatch_payload.h`	Adds `Arg::allow_early_resolve_` flag with accessors propagated through `ArgWithDeps`. Adds `volatile uint32_t not_ready` to `PTO2DispatchPayload`, reducing ABI padding from 8 to 4 bytes.
Speculative dispatch data model and runtime init `.../runtime/pto_runtime2_types.h`, `.../runtime/pto_orchestrator.cpp`, `.../runtime/shared/pto_runtime2_init.cpp`	Introduces `PTO2SpecState` enum, `PTO2_EARLY_DISPATCH_QUEUE_SIZE`, staged-core mask atomics, `dispatch_fanin`, `spec_state`, chaining fields, and `prefetch()` on `PTO2TaskPayload`. Changes `next_block_idx` to `std::atomic<int16_t>`. Wires `early_dispatch_queue` through scheduler arena lifecycle.
AICore `read_dmb_high32` and executor doorbell spin-wait `src/a2a3/platform/onboard/aicore/inner_kernel.h`, `.../sim/aicore/inner_kernel.h`, `.../aicore/aicore_executor.cpp`	Adds `read_dmb_high32()` on both onboard and sim paths. Updates the executor to spin-poll `not_ready` via `read_dmb_high32()` until the doorbell task ID matches, with `AICORE_EXIT_SIGNAL` teardown detection. Moves FIN write before swimlane recording.
Scheduler speculative prestage, release, and dispatch wiring `.../runtime/scheduler/pto_scheduler.h`, `scheduler_context.h`, `scheduler_dispatch.cpp`, `scheduler_completion.cpp`, `scheduler_cold_path.cpp`	Adds per-core doorbell table, `early_dispatch_queue`, `try_speculative_release` (CAS `spec_state`, ring doorbells, defer propagation), `stage_consumer_blocks`, and `try_speculative_prestage`. Updates `build_payload` to gate `not_ready` for STAGING slots. Defers `dispatch_fanin` propagation via `prop_list`. Adds `pending_gated` to `decide_slot_transition` to avoid STAGED-slot deadlock. Skips STAGED cores in completion polling.
Profiling: new swimlane phases and activity-fill tracking `src/a2a3/platform/include/common/l2_swimlane_profiling.h`, `.../scheduler/scheduler_types.h`, `scheduler_dispatch.cpp`, `scheduler_completion.cpp`, `src/a2a3/platform/shared/host/l2_swimlane_collector.cpp`, `simpler_setup/tools/swimlane_converter.py`	Extends `L2SwimlaneSchedPhaseKind` with `Poll`, `Release`, `Fanout`, `Scan`, `Prestage`. Adds `phase_subretire_count` and activity-fill fields to `SchedL2SwimlaneCounters`. Reworks dispatch loop activity-fill attribution (Poll/Release/Prestage). Records Fanout and Scan phases with timing. Updates JSON name mapping and Perfetto color map/filter.
System test suites: spec_mix, spec_spmd_consumer, spec_spmd_producer `tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/`, `tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/`, `tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/*`	Three new test suites covering speculative early-dispatch scenarios: MIX consumer (producer `c=a+b`, MIX `d=c+b` / `e=c*b`), SPMD consumer (single-block fill producer + 4-block SPMD addk consumer), and SPMD producer (4-block fill producer + single-block double consumer). Each includes C++ AIV kernels, C++ orchestration, and Python `SceneTestCase`.
vector_example kernel latency inflation and block_dim update `examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add.cpp`, `kernel_add_scalar.cpp`, `kernel_mul.cpp`, `kernels/orchestration/example_orchestration.cpp`, `test_vector_example.py`	Wraps `TADD`, `TADDS`, `TMUL` in 32-iteration loops. Increases `block_dim` from 3 to 24. Adds commented-out `set_allow_early_resolve()` calls in the orchestration example.

Sequence Diagram(s)

sequenceDiagram
  participant Orch as Orchestrator Thread
  participant Sched as PTO2SchedulerState
  participant EarlyQ as early_dispatch_queue
  participant Doorbell as per-core doorbell table
  participant Payload as PTO2DispatchPayload
  participant AICore as AICore executor

  Orch->>Sched: submit task with allow_early_resolve=true
  Sched->>Sched: wire_task (seed dispatch_fanin for finished producers)
  Sched->>EarlyQ: push consumer slot
  Sched->>Sched: try_speculative_prestage → stage_consumer_blocks
  Note over Sched: CAS next_block_idx, set spec_state=STAGING
  Sched->>Payload: write not_ready=1, reg_task_id
  Sched->>Doorbell: store task_id for core
  Sched->>AICore: dispatch payload
  AICore->>AICore: observe not_ready=1, spin on read_dmb_high32()
  Note over Sched: producer task completes → on_mixed_task_complete
  Sched->>Sched: try_speculative_release (CAS STAGING→DISPATCHED)
  Sched->>Doorbell: ring_doorbell (write task_id to DATA_MAIN_BASE)
  AICore->>AICore: read_dmb_high32() matches task_id, exit spin
  AICore->>AICore: execute consumer kernel
  AICore->>Sched: write FIN to COND

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Speculative early-dispatch (pre-stage + DMB doorbell) — feat/early-dispatch #1065 — This PR directly implements the "Speculative early-dispatch (pre-stage + DMB doorbell)" feature specified in that issue, including read_dmb_high32(), not_ready gate, dispatch_fanin, spec_state/staged_core_mask, try_speculative_prestage, early_dispatch_queue, set_allow_early_resolve() API, and all described swimlane phases.

Possibly related PRs

hw-native-sys/simpler#942: Both PRs modify L2SwimlaneSchedPhaseKind and the host-side sched-phase JSON name mapping in l2_swimlane_collector.cpp, with this PR adding five new phase kinds atop the infrastructure introduced there.
hw-native-sys/simpler#989: Both PRs modify scheduler_dispatch.cpp and the drain_worker_dispatch block-claiming loop; this PR changes that same loop to use atomic next_block_idx loads/stores and adds the deferred prop_list publish step.
hw-native-sys/simpler#994: Both PRs extend SchedL2SwimlaneCounters and the profiling instrumentation in scheduler_dispatch.cpp and scheduler_types.h, with this PR adding phase_subretire_count and activity-fill fields in the same struct.

Poem

🐇 Hop hop, the doorbell rings at last!
A staged core waits, then runs so fast.
Each block claims its own cache line,
No false-share clobber — all aligned!
The swimlane fills with Poll and Scan,
Speculative dispatch: the runtime's plan. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.21% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)' clearly and specifically describes the main feature added: a speculative early-dispatch mechanism with pre-staging and doorbell components.
Description check	✅ Passed	The description comprehensively explains the speculative early-dispatch feature, covering the core mechanism, data structures, coverage, concurrency considerations, and testing approach, all of which align with the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…profiling Speculative early-dispatch completion path: - Inline the pre-staged consumer's doorbell during the producer fanout walk (pto_scheduler.h on_mixed_task_complete) instead of batching every doorbell to after the O(fanout-degree) walk. A flagged consumer near the front of the list starts immediately and overlaps the AICPU releasing the rest. Safe: the producer slot is COMPLETED (all SPMD blocks FIN'd + dcci-flushed) before the walk. On qwen3-14b online->out_proj the producer-end -> consumer-start gap drops ~11.8us -> ~8.2us (fanout half ~5us -> ~1.3us). - Completion poll skips STAGED cores (gated; cannot ACK/FIN until their doorbell) and re-polls once after dispatch so a producer block that FINs mid-iteration is observed the same iteration. Scheduler activity profiling (the sched lane no longer has blank gaps): - New SCHED phase kinds Poll / Release / Fanout / Scan (l2_swimlane_profiling.h + host dumper + converter render/colors). - scheduler_dispatch.cpp: per-iteration activity-fill classifies each otherwise-blank iteration as Poll (scan that retired nothing) and emits the deferred-release drain as its own Release bar. - scheduler_completion.cpp: a Scan bar per check_running_cores pass (cores' MMIO COND-read cost, ~95ns/LDR) and a Fanout child nested in Complete (fanin + doorbell walk); phase_subretire_count surfaces SPMD sub-block retires. - aicore_executor.cpp: sample the op-event end AFTER the FIN write so the bar end marks FIN-write (when the AICPU can first observe completion), separating AICore-side from AICPU-side latency. Device-validated by spec_mix / spec_spmd_producer / spec_spmd_consumer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The pre-staged-consumer doorbell storage capped at PTO2_SPEC_MAX_DOORBELLS=3 per task (sized to fit a MIX block's 1-3 subtask cores in the payload). That cap also bounded how many blocks a wide SPMD consumer could pre-stage in one pass, leaving idle cores unused. - Replace payload's staged_count + staged_reg_task_id[3] + staged_reg_addr[3] with a staged_core_mask bitmask (PTO2_SPEC_CORE_MASK_WORDS=2 words = 128 bits >= RUNTIME_MAX_WORKER=72), 1 bit per global core_id. - Move the (reg_addr, token) pair into a scheduler-global spec_doorbell_table indexed by core_id. Gated cores in flight are bounded by the chip core count (no two-level pre-dispatch), so the table is the natural capacity and the per-task 3-cap is gone. - try_speculative_release iterates set bits in the mask, ringing each core's doorbell from the table. Drop the SpecDoorbellBatch struct and the now-unused db parameter threaded through release_fanin_and_check_ready. - Add static_assert(RUNTIME_MAX_WORKER <= PTO2_SPEC_CORE_MASK_WORDS * 64). offsetof(tensors)==576 still holds (bitmask is smaller than the old arrays). 3 spec st-sim tests (mix, spmd_consumer, spmd_producer) pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ores Speculative pre-stage previously only used IDLE cores in the producer's owner thread. For a producer like fa_fused (qwen attention) that holds every core until it finishes — its 72 blocks all complete in the last ~20us, no core goes idle mid-run — that staged nothing, so the consumer (online_softmax, 48 SPMD-AIV blocks) gained zero overlap. Two problems had to be solved. 1. Stage onto the PENDING slot of cores RUNNING the producer (the only cores available when no core is idle). This hit a deadlock (507018, SCHEDULER_TIMEOUT): decide_slot_transition Case 3.1 ignores a running-task FIN whenever a pending task exists, assuming the AICore immediately runs the pending one and the scheduler observes ITS ack. A speculatively-gated pending task spins on its doorbell and never acks — and the producer's completion depends on collecting that very running FIN — so it wedged forever. Fix: detect a gated (STAGED) pending in the completion poll and complete the running FIN + promote the gated task (Case 3.3) instead of waiting. The promoted task keeps pending_occupied set so no second gated block stacks and the per-core doorbell entry is never overwritten. 2. A consumer has ONE spec_state but its SPMD blocks span cores owned by several AICPU threads. Only the first thread to CAS NONE->STAGING staged its share (16/48); the others saw spec_state != NONE and skipped. Allow a thread to re-claim a STAGED consumer (CAS STAGED->STAGING), appending its own cores to the shared mask and advancing the shared next_block_idx. The CAS serializes the staging critical section across threads and races cleanly with release (re-claim fails once release sets DISPATCHED). Device-verified on qwen fa_fused (task-submit --device 1): all 48 blocks now pre-stage across 3 threads (was 16); online starts +2.4us after fa_end (was +6.0), online END -3.9us, downstream out_proj END -7.4us. 3 spec st-onboard + st-sim tests pass; fa repro 5/5 deadlock-free. A V9 [SPEC] diagnostic logs each thread's staging contribution. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reworks Hook 1 (early-dispatch pre-staging) so a flagged producer's consumer is staged fully and concurrently across all AICPU threads, and fixes the completion-path interaction that made pending-slot staging deadlock. - Concurrent staging: spec_state simplifies to NONE -> STAGING (stable gated state) -> DISPATCHED. next_block_idx and staged_core_mask become atomic so every thread claims a distinct block range and OR-s its cores into the mask in parallel (no serialized staircase). Release flips STAGING->DISPATCHED and rings the mask; a block staged after that self-rings its own doorbell (seq_cst order between mask-OR/spec-load here and spec-store/mask-read in release guarantees no doorbell is missed). - Gated-pending (Case 3.3): decide_slot_transition previously ignored a running-task FIN whenever a pending task existed (Case 3.1), assuming the AICore immediately runs pending and the scheduler observes its ack. A gated pending never acks until its doorbell, and the producer's completion depends on that very running FIN -> deadlock (507018). Now a gated (STAGING) pending is detected and the running FIN is completed + the gated task promoted. - Cross-thread extend list: a thread-local producer (e.g. rope_qkv, all blocks on one thread) only lets that thread reach its consumer's fanout. The first claimer publishes the consumer to a shared list; every thread's prestage pass drains it and stages the consumer's remaining blocks onto ITS OWN cores. - Swimlane: add a Prestage phase kind so a staging-dominated loop iteration shows as prestage(N), not mislabeled poll(0) (kind + record + collector + converter wired end-to-end). Device-verified (task-submit --device 1): 3 spec st-onboard tests pass; qwen fa_fused fully pre-dispatches its online consumer (online end -3.8us, out_proj -5.9us); deadlock-free. sim spec tests pass.

Replace the O(n) broadcast extend-list with an inline Vyukov MPMC queue and restructure staging to mirror the normal SPMD dispatch path, so a flagged producer's consumer is pre-staged across all idle threads in parallel instead of a serial winner-then-peer daisy chain. - pto_scheduler.h: inline bounded MPMC spec queue (push/pop/init), replacing spec_extend_list[] + publish/remove. - scheduler_dispatch.cpp: - pass 1 (discovery) CAS-claims NONE->STAGING and pushes the consumer ONCE; it no longer stages (no front-loading). - pass 2 (drain) pops, CAS-claims a range sized to this thread's free cores, RE-PUSHES before the expensive prepare+publish so peers claim and stage concurrently. CAS (not store) keeps next_block_idx atomic against normal dispatch, which also claims it when release routes a partially-staged consumer to the ready queue. - stage_consumer_blocks now stages an explicit already-claimed range. - remove the per-stage [SPEC] debug log from the hot path: its device -log write (~9us, contended across threads) dominated the prestage bar and was the real source of the short-producer regression. - gate the prestage trigger on no-ready-work (idle only). - pto_runtime2_init.cpp: init the spec queue.

Replace the per-iteration PULL scan in try_speculative_prestage with an event-driven PUSH dual of fanin_refcount/ready: - Add dispatch_fanin/dispatch_propagated/spec_chain_active/spec_chain_depth to PTO2TaskPayload (reset in reset_for_reuse). - propagate_dispatch_fanin(): when a FLAGGED producer dispatches (normal or doorbell-released), bump each consumer's dispatch_fanin once (guarded). A consumer reaching fanin_actual_count is an early-dispatch candidate: CAS NONE->STAGING and push to spec_queue. Auto-chains by inheriting the flag (depth-capped). - Seed dispatch_fanin at wire_task with early_finished (producers already complete when the consumer was wired). Such producers never dispatch at runtime, so without the seed the candidate compare is unreachable whenever any producer is pre-completed (e.g. a buffer-creator). Mirrors the early_finished seed ready_fanin already gets. - try_speculative_prestage now only DRAINS the queue; the O(running cores x fanout) pass-1 scan is gone. Verified on a2a3: qwen3-14b decode online_softmax now pre-stages (df 2==2, 48/96 blocks staged across 3 scheduler threads); was 0 prestage before the seed. Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.

… paths propagate_dispatch_fanin was being called inline at two latency-critical points, where its fanout walk delayed the very thing early-dispatch exists to make fast: - dispatch_shape: called between claiming cores and publishing a task's blocks. For a flagged SPMD producer, the thread that won the once-guard ran the walk before publishing its block range, while peer threads co-dispatching the same SPMD task published immediately — misaligning the task's block starts. Now recorded in a per-pop prop_list and replayed after flush_publish. - on_mixed_task_complete fanout walk: try_speculative_release rang a released consumer's doorbell and then propagated inline, delaying the doorbells of later consumers in the same producer's fanout. Now collected in a SpecReleaseSink and replayed after the walk; every doorbell fires inline first, propagation runs last. Propagation is pure bookkeeping for future candidates, so deferring it by one walk is semantically free. Measured on qwen3-14b decode: qk_gamma/ qk_recip/rope handoffs dropped from ~7-8us back to ~2us at full chain depth. Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.

The cross-thread early-dispatch work queue was a hand-rolled inline Vyukov MPMC (spec_queue_) that duplicated PTO2ReadyQueue's algorithm, element type, and push/pop semantics verbatim. Replace it with a PTO2ReadyQueue instance, early_dispatch_queue, allocated/wired via the same arena helpers as ready_queues and dummy_ready_queue. - Drop SpecQueueSlot, spec_queue_, spec_queue_init/push/pop (~60 lines) - Add PTO2_EARLY_DISPATCH_QUEUE_SIZE (64); reserve/init/wire/destroy the new queue in PTO2SchedulerState, plus off_early_dispatch_queue_slots - Rename spec_queue_* call sites and PTO2_SPEC_QUEUE_DRAIN_MAX Behavior-preserving (same MPMC algorithm): 20 onboard decode rounds show unchanged median AICore runtime and golden parity (the lone golden miss is the pre-existing early-dispatch ULP race, not this change).

@576

The early-dispatch spec fields (allow_early_resolve, spec_state, staged_core_mask, dispatch_fanin, dispatch_propagated, spec_chain_active, spec_chain_depth) were reset across two sites: prepare_task wrote two of them, and reset_for_reuse reached into the payload for the other five — a cold cross-structure write on the scheduler's advance-ring path, which also violated reset_for_reuse's documented "skips payload" contract. Consolidate all of them into PTO2TaskPayload::init, the single payload init point (runs every submit, before the slot is visible to the scheduler). prepare_task now does no payload writes; reset_for_reuse no longer touches the payload. The fields must be (re)initialized for every task regardless of its own allow_early_resolve, because spec_state (try_speculative_release CASes it for every task), spec_chain_active (read by every dispatched task's propagate) and dispatch_fanin (seeded via fetch_add at wiring) are touched on per-task paths driven by other tasks' flags — so no per-task gating. Also pack the spec block by descending alignment (drops 6B internal padding; sizeof and tensors[]@576 unchanged) and turn prefetch_payload into a PTO2TaskPayload::prefetch method that also warms the spec cache line. Behavior-preserving: 20 onboard rounds show unchanged median AICore runtime and golden parity (the golden misses are the pre-existing early-dispatch ULP race, not this change).

coderabbitai

Actionable comments posted: 8

🧹 Nitpick comments (4)

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py (1)

18-22: ⚡ Quick win

Use ClassVar for mutable test metadata fields (CALLABLE, CASES).

Line 31 and Line 55 use mutable class attributes; annotating them as class vars prevents accidental instance-level mutation patterns and resolves the RUF012 warning.

Proposed fix

+from typing import ClassVar
+
 import torch
 from simpler.task_interface import ArgDirection as D
 
 from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test
@@
-    CALLABLE = {
+    CALLABLE: ClassVar = {
@@
-    CASES = [
+    CASES: ClassVar = [

Also applies to: 31-62

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py`
around lines 18 - 22, Add `ClassVar` to the imports from the `typing` module at
the top of the file. Then locate the `CALLABLE` class attribute (around line 31)
and the `CASES` class attribute (around line 55) in the test class, and annotate
both with `ClassVar` type hints to properly indicate they are class-level
mutable attributes rather than instance attributes, which will resolve the
RUF012 warning.

Source: Linters/SAST tools

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py (1)

19-23: ⚡ Quick win

Annotate mutable class configs as ClassVar to avoid shared-state surprises.

Line 36 and Line 60 define mutable class attributes (dict / list). If any test utility mutates them, state can leak between runs; Ruff already flags this (RUF012).

Proposed fix

+from typing import ClassVar
+
 import torch
 from simpler.task_interface import ArgDirection as D
 
 from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test
@@
-    CALLABLE = {
+    CALLABLE: ClassVar = {
@@
-    CASES = [
+    CASES: ClassVar = [

Also applies to: 36-67

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py`
around lines 19 - 23, Import ClassVar from the typing module at the top of the
file, then locate all mutable class attributes (dicts and lists) defined at the
class level around lines 36 and 60 in the class definitions. Annotate each of
these mutable class attributes with the ClassVar type annotation (e.g., change a
class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent
Ruff's RUF012 warning about mutable class variables that can cause shared state
between test runs.

Source: Linters/SAST tools

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp (1)

18-22: ⚡ Quick win

Update the stale per-task doorbell-cap assumption in test docs.

This comment still asserts a fixed PTO2_SPEC_MAX_DOORBELLS == 3 behavior, but the scheduler model in this PR stack moved to per-core bitmask/table doorbell tracking. Keeping the old claim makes the intended coverage path misleading.

Suggested comment update

- * t1 is a multi-block consumer. Block-by-block pre-staging stages as many of its
- * blocks as fit on idle cores and in the doorbell budget (PTO2_SPEC_MAX_DOORBELLS
- * == 3); with B == 4, three blocks are gated + released by doorbell and the
- * remaining block dispatches normally off the ready queue. Golden: out_c[i] =
- * i + 11.
+ * t1 is a multi-block consumer. Block-by-block pre-staging stages as many blocks
+ * as possible on currently idle cores, and gated blocks are released by doorbell
+ * once t0 completes. Golden: out_c[i] = i + 11.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp`
around lines 18 - 22, Update the multi-block consumer documentation comment
(lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale
assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate
description of the current scheduler behavior using per-core bitmask/table
doorbell tracking instead of per-task tracking. Ensure the updated comment still
clearly explains the intended coverage path for block staging and dispatch
behavior.

tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp (1)

48-52: ⚡ Quick win

Remove the hardcoded stride literal from the producer/consumer memory-layout contract.

16 here is coupled to CL in both kernel_spmd_fill.cpp and kernel_double.cpp. If one side changes independently, the orchestration tensor shape can become inconsistent with kernel indexing.

Localized improvement

-    uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * 16};
+    constexpr uint32_t kCacheLineFloats = 64 / sizeof(float);
+    uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * kCacheLineFloats};

(Preferably, move this constant into a shared test header used by all three files.)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp`
around lines 48 - 52, The hardcoded value `16` in the inter_shapes array
calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values
in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout
inconsistency if one file is modified independently. Extract this cache line
stride constant into a shared test header file that is included by all three
files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and
kernel_double.cpp), then replace the hardcoded `16` literal with a reference to
this shared constant definition. This ensures all three files reference the same
value and prevents accidental divergence.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`:
- Around line 153-166: The exit signal check inside the while loop that waits
for read_dmb_high32() to match task_id can be bypassed if the shutdown signal is
written after the doorbell condition is already satisfied. This causes the code
to skip the loop entirely and execute the gated task before shutting down. Add
an additional check for the AICORE_EXIT_SIGNAL (by reading RegId::DATA_MAIN_BASE
and comparing to AICORE_EXIT_SIGNAL) immediately after the while loop exits but
before the subsequent if (exiting) block to ensure the exit signal is properly
detected even when it arrives after the doorbell match, preventing execution of
the gated task during shutdown.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`:
- Line 978: The `early_dispatch_queue.push(c)` call at the specified location
can fail by returning false, but the code does not handle this failure case.
When the push fails, the `spec_state` remains incorrectly set to
`PTO2_SPEC_STAGING` for a consumer that was never actually added to the queue,
causing incorrect behavior in the ready release path. Check the return value of
`early_dispatch_queue.push(c)`, and if it returns false, restore the speculative
claim by resetting `spec_state` to its previous value before the attempted push
operation.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 197-201: The expanded anomaly diagnostic line now includes
additional token fields (cond_tok, running_tok, pending_tok) that can cause the
output to exceed the current buffer capacity. Locate the buffer declarations for
`aic_buf` and `aiv*_buf` (currently sized at 128 bytes) and increase their sizes
to accommodate the longer formatted string that now includes the three new token
fields along with the existing parameters. Apply the same buffer size increase
to both the location around line 197 and the related location around line 333
mentioned in the comment.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Line 1102: The try_speculative_prestage function call in the dispatch logic
currently bypasses the PMU policy by prestaging work even when PMU monitoring is
active. Modify the condition that guards the try_speculative_prestage call to
also check for pmu_active status, ensuring that speculative prestaging is
disabled whenever PMU monitoring is active (treat it the same way as when
any_ready_work is true). This will prevent phase 4b from prestaging gated work
into idle or pending slots during PMU runs, maintaining the single-issue PMU
policy.
- Around line 332-333: The prop_list array is declared with a fixed size of
CoreTracker::MAX_CLUSTERS, but the batch buffer can contain up to
CoreTracker::MAX_CLUSTERS * 3 entries, causing extra flagged producers to
silently skip dispatch_fanin propagation. Resize the prop_list array declaration
to accommodate the full batch capacity (CoreTracker::MAX_CLUSTERS * 3) instead
of CoreTracker::MAX_CLUSTERS to ensure all entries in the popped batch are
properly processed and propagated.
- Around line 631-633: The publish_subtask_to_core function writes to a register
that signals AICore, but there is no write memory barrier before this call to
ensure the payload and gate data written by prepare_block_for_dispatch are
visible before the register write. Insert a wmb() call (write memory barrier) in
the loop before each publish_subtask_to_core call to guarantee that AICore
observes the payload and not_ready gate updates before seeing the token.
- Around line 1107-1113: After emitting the Prestage bar by calling
l2_swimlane_aicpu_record_sched_phase in the block where prestage_record &&
prestaged > 0, the phase_dispatch_count needs to be cleared by resetting it to
zero. This prevents the dispatch count that was incremented during
prepare_block_for_dispatch from leaking into the next Dispatch phase iteration.
- Around line 1129-1131: In the second completion poll section where
completed_2nd > 0, the code updates completed_tasks_ but fails to increment
sched_->tasks_completed like the first poll does under PTO2_SCHED_PROFILING. Add
the same profiling instrumentation to the second poll by incrementing
sched_->tasks_completed with completed_2nd within the same PTO2_SCHED_PROFILING
guard condition used in the first poll, ensuring scheduler profiling totals
remain consistent across both polls.

---

Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp`:
- Around line 18-22: Update the multi-block consumer documentation comment
(lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale
assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate
description of the current scheduler behavior using per-core bitmask/table
doorbell tracking instead of per-task tracking. Ensure the updated comment still
clearly explains the intended coverage path for block staging and dispatch
behavior.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py`:
- Around line 19-23: Import ClassVar from the typing module at the top of the
file, then locate all mutable class attributes (dicts and lists) defined at the
class level around lines 36 and 60 in the class definitions. Annotate each of
these mutable class attributes with the ClassVar type annotation (e.g., change a
class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent
Ruff's RUF012 warning about mutable class variables that can cause shared state
between test runs.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp`:
- Around line 48-52: The hardcoded value `16` in the inter_shapes array
calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values
in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout
inconsistency if one file is modified independently. Extract this cache line
stride constant into a shared test header file that is included by all three
files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and
kernel_double.cpp), then replace the hardcoded `16` literal with a reference to
this shared constant definition. This ensures all three files reference the same
value and prevents accidental divergence.

In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py`:
- Around line 18-22: Add `ClassVar` to the imports from the `typing` module at
the top of the file. Then locate the `CALLABLE` class attribute (around line 31)
and the `CASES` class attribute (around line 55) in the test class, and annotate
both with `ClassVar` type hints to properly indicate they are class-level
mutable attributes rather than instance attributes, which will resolve the
RUF012 warning.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c2d7f4c-5c11-412f-a4fd-40eb98a80f80

📥 Commits

Reviewing files that changed from the base of the PR and between c4b0aac and f3b928f.

📒 Files selected for processing (37)

docs/aicore-kernel-programming.md
docs/hardware/cache-coherency.md
examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add.cpp
examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add_scalar.cpp
examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_mul.cpp
examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/orchestration/example_orchestration.cpp
examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py
simpler_setup/tools/swimlane_converter.py
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/onboard/aicore/inner_kernel.h
src/a2a3/platform/shared/host/l2_swimlane_collector.cpp
src/a2a3/platform/sim/aicore/inner_kernel.h
src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_arg_with_deps.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_add_standalone.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_mul_standalone.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/orchestration/spec_mix_orch.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_mix/test_spec_mix.py
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_fill_all.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_spmd_addk.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_double.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_spmd_fill.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py

Two CI breakages on the early-dispatch branch: - test_scheduler_state (cpput) failed to link: the inline PTO2SchedulerState::ring_one_doorbell pulls in get_reg_ptr, which the host UT object set (a2a3_rt_objs / a5_rt_objs) does not provide. Add a get_reg_ptr stub to test_stubs.cpp returning writable static storage (no MMIO on host), mirroring the existing get_sys_cnt_aicpu stub. - profiling-flags-smoke (a2a3sim/pto2-off) failed to build: `prestaged` in scheduler_dispatch.cpp is only read under #if PTO2_PROFILING, so PTO2_PROFILING=0 + -Werror=unused-variable killed the build. Mark it [[maybe_unused]]; the try_speculative_prestage() side effect must stay.

stage_consumer_blocks wrote a block's payload via prepare_block_for_dispatch then immediately published its DATA_MAIN_BASE token, with no wmb() in between — unlike the normal dispatch path (flush_publish), which barriers before publishing. Without it a gated AICore can observe the token, dcci its payload, and read a stale not_ready gate / args. Hoist one wmb() per stage_from pass (prepare all claimed blocks, barrier, then publish), mirroring flush_publish. Clears the intermittent ~5-8% bf16 ULP golden failures on qwen3 decode: 0/50 onboard replay rounds failed after the fix (was ~1-2 per 20 before).

…uard test_scheduler_state / test_task_state / test_wiring build minimal PTO2TaskSlotStates via init_slot, which memsets the slot and never sets slot.payload. That was fine until the speculative-dispatch helpers (try_speculative_release, propagate_dispatch_fanin) began dereferencing slot.payload->spec_state — release_fanin_and_check_ready then SEGV'd on these payload-less slots, surfaced once the host UT link was fixed and the tests actually ran (ut macos/ubuntu). Production slots always have a payload (orch::prepare_task -> bind_buffers sets it; reset_for_reuse never clears it), so the fix is test-side: init_slot now binds each slot a distinct zeroed payload from a small per-fixture pool, mirroring bind_buffers. With that invariant restored, the defensive `c->payload == nullptr` skip in propagate_dispatch_fanin's fanout walk is dead code — removed for consistency with the unguarded payload derefs in the same path.

…own EXIT, diag buffer) - scheduler_dispatch.cpp: skip speculative prestage when pmu_active, matching dispatch_ready_tasks' single-issue PMU policy — a gated prestage would perturb the same per-task PMU windows. - scheduler_dispatch.cpp: clear phase_dispatch_count after emitting a Prestage swimlane bar so blocks staged this iteration don't leak into the next Dispatch bar's count. - scheduler_dispatch.cpp: the second completion poll now also bumps sched_->tasks_completed under PTO2_SCHED_PROFILING, matching the first poll so scheduler profiling no longer undercounts. - scheduler_cold_path.cpp: grow per-core stall-status buffers 128->192B so the expanded ANOMALY diagnostic (added token fields) isn't truncated. - aicore_executor.cpp: in the speculative doorbell gate, check AICORE_EXIT_SIGNAL on the doorbell-match iteration too, so a teardown EXIT racing in right after a matching doorbell still wins over executing the gated task. Onboard-verified unchanged: qwen3 decode 20/20 replay rounds pass.

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp

ChaoWao and others added 5 commits June 17, 2026 16:36

ChaoWao and others added 9 commits June 17, 2026 16:50

poursoul force-pushed the feat/early-dispatch branch from f3b928f to 0c44cf0 Compare June 17, 2026 08:53

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

poursoul added 4 commits June 17, 2026 17:26

Conversation

poursoul commented Jun 17, 2026

Summary

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading