feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)#1079
feat(a2a3/runtime): speculative early-dispatch (pre-stage + doorbell)#1079poursoul wants to merge 18 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a speculative early-dispatch mechanism for the A2A3 runtime scheduler, enabling consumer tasks to be pre-staged on idle cores and gated on a hardware doorbell register until their producers complete. It also adds documentation on cache coherency constraints, updates profiling phases, and introduces new test cases. The review feedback identifies two critical concurrency issues in scheduler_completion.cpp where loading spec_state with std::memory_order_relaxed could result in stale reads, potentially causing deadlocks or increased latency; using std::memory_order_acquire is recommended to ensure correct visibility across threads.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…ell) Stage a consumer task on an idle AICore before its producer finishes, gate execution with the DATA_MAIN_BASE high-32 doorbell, so build_payload + dispatch-write overlap the producer's run. The dependency engine is untouched: producers run to completion, deps resolve normally; the only additions are a pre-stage hook and a completion-path doorbell. Mechanism: - pto2_dispatch_payload.h: not_ready gate flag (reuses reserved pad). - aicore inner_kernel.h (onboard+sim): read_dmb_high32(). - aicore_executor.cpp: gate spins on read_dmb_high32()==task_id, EXIT-honoring; COND ACK deferred until AFTER the gate so the core stays off-limits. - scheduler_dispatch.cpp: Hook 1 try_speculative_prestage — after normal dispatch, pre-stage a RUNNING flagged producer's consumers onto spare idle cores (eligibility: single-block, AIC/AIV, remaining-producers==1). - pto_scheduler.h: try_speculative_release on the completion path rings the doorbell the instant the last producer's FIN satisfies fanin (skips the ready-queue round-trip). Lock-free spec_state CAS coordinates stager/router. - pto_types.h: Arg::set_allow_early_resolve() codegen hint (a2a3-local Arg, shared task_args.h ABI untouched), exposed via ArgWithDeps. v1 scope: single-block + AIC/AIV (not MIX) + fully-idle cores. MIX, SPMD, and dual-issue are follow-ups. vector_example carries the inflated-kernel test harness (flag currently commented for a clean no-regression baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A MIX consumer occupies a full cluster (1-3 subtask cores, each with its own dispatch token + reg_addr), so the release must ring one doorbell per gated core. Generalize the single-doorbell staged metadata into a fixed array: - pto_runtime2_types.h: staged_reg_addr/staged_reg_task_id become arrays of PTO2_SPEC_MAX_DOORBELLS (3) plus staged_count; still fits the 40B payload padding (guarded by the existing offsetof(tensors)==576 static_assert). - scheduler_dispatch.cpp (Hook 1): allow shape == MIX; pre-stage onto a fully-idle cluster via get_idle_core_offset_states(MIX); record all n handles. - pto_scheduler.h (try_speculative_release): ring one doorbell per staged core. - scheduler_context.h: drop the now-dead ring_speculative_doorbell helper (release rings inline). Single-block MIX only (set_sync_start fires only for block_num>1, so the pre-stage bypass is safe). New st-sim test spec_mix validates a flagged AIV producer feeding a 2-subtask MIX consumer: speculation fires (staged_count==2) and the result is correct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ispatch A multi-block (SPMD) flagged producer already works with the existing logic: its single fanin edge into the consumer is released only when ALL blocks complete, so the completion-path doorbell fires the pre-staged consumer exactly once, after the last block. New st-sim test spec_spmd_producer pins this — a block_num=4 producer where block i writes out_p[i]=i+1 feeds a single-block consumer out_c[i]=2*out_p[i]; the partitioned output makes any premature release mismatch the golden. Speculation fires (staged_count==1) and the result is correct across repeated runs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…spatch Pre-stage a multi-block (SPMD) consumer block-by-block — not all N blocks at once. In a single atomic Hook-1 pass (bracketed by the STAGING->STAGED claim, so it never races the completion release), stage min(idle cores now, remaining blocks, doorbell budget) blocks; each gated block records its doorbell(s). Any blocks that don't fit dispatch normally: - scheduler_dispatch.cpp (Hook 1): drop the single-block restriction; loop over blocks_to_stage claiming one idle core/cluster per block, advancing next_block_idx. subtasks_per_block (popcount of core_mask) bounds how many blocks fit the PTO2_SPEC_MAX_DOORBELLS array. Exclude sync_start tasks (their blocks must launch atomically — partial staging would break that). - pto_scheduler.h (try_speculative_release): after ringing the staged doorbells, return false when next_block_idx < logical_block_num so the caller pushes C to the ready queue; dispatch_shape resumes from next_block_idx for the rest. New st-sim test spec_spmd_consumer: a flagged single-block producer feeds a block_num=4 SPMD consumer. Three blocks fit the doorbell budget (gated + doorbell-released) and the fourth dispatches normally; speculation fires (blocks=3/4) and the result is correct across repeated runs. Single-block, MIX, and SPMD-producer paths all remain a special case of the unified loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The spec_spmd_{producer,consumer} kernels had each SPMD block write one
float into a shared 64B cache line and dcci-flush the whole line. On
silicon each core writes back its full line copy (other slots stale 0),
so the last block to flush clobbers the rest -- false sharing,
last-writer-wins. Passed on sim only (sim models dcci as a no-op);
failed on device with max_diff 8.0. Not a speculative-dispatch bug --
reproduces with the early-resolve flag off.
- Isolate each block on its own cache line (stride 16 floats = 64B):
kernel_spmd_fill / kernel_spmd_addk write out[block_idx*16];
kernel_double reads the strided producer output; producer orchestration
intermediate and consumer golden updated to the strided layout.
- kernel_fill_all: cast i+1 via int32 -- ccec rejects float<-uint32 in
aicore functions; the test had only ever run from a stale cached .o.
- Document the limitation: aicore-kernel-programming.md "Each block must
write to its own cache line" and cache-coherency.md "dcci is
whole-cache-line".
All 3 spec tests now pass on both a2a3sim and a2a3 (device).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughImplements speculative early-dispatch (pre-stage + DATA_MAIN_BASE doorbell) in the a2a3 tensormap_and_ringbuffer scheduler: adds ChangesSpeculative Early-Dispatch Feature
Sequence Diagram(s)sequenceDiagram
participant Orch as Orchestrator Thread
participant Sched as PTO2SchedulerState
participant EarlyQ as early_dispatch_queue
participant Doorbell as per-core doorbell table
participant Payload as PTO2DispatchPayload
participant AICore as AICore executor
Orch->>Sched: submit task with allow_early_resolve=true
Sched->>Sched: wire_task (seed dispatch_fanin for finished producers)
Sched->>EarlyQ: push consumer slot
Sched->>Sched: try_speculative_prestage → stage_consumer_blocks
Note over Sched: CAS next_block_idx, set spec_state=STAGING
Sched->>Payload: write not_ready=1, reg_task_id
Sched->>Doorbell: store task_id for core
Sched->>AICore: dispatch payload
AICore->>AICore: observe not_ready=1, spin on read_dmb_high32()
Note over Sched: producer task completes → on_mixed_task_complete
Sched->>Sched: try_speculative_release (CAS STAGING→DISPATCHED)
Sched->>Doorbell: ring_doorbell (write task_id to DATA_MAIN_BASE)
AICore->>AICore: read_dmb_high32() matches task_id, exit spin
AICore->>AICore: execute consumer kernel
AICore->>Sched: write FIN to COND
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…profiling Speculative early-dispatch completion path: - Inline the pre-staged consumer's doorbell during the producer fanout walk (pto_scheduler.h on_mixed_task_complete) instead of batching every doorbell to after the O(fanout-degree) walk. A flagged consumer near the front of the list starts immediately and overlaps the AICPU releasing the rest. Safe: the producer slot is COMPLETED (all SPMD blocks FIN'd + dcci-flushed) before the walk. On qwen3-14b online->out_proj the producer-end -> consumer-start gap drops ~11.8us -> ~8.2us (fanout half ~5us -> ~1.3us). - Completion poll skips STAGED cores (gated; cannot ACK/FIN until their doorbell) and re-polls once after dispatch so a producer block that FINs mid-iteration is observed the same iteration. Scheduler activity profiling (the sched lane no longer has blank gaps): - New SCHED phase kinds Poll / Release / Fanout / Scan (l2_swimlane_profiling.h + host dumper + converter render/colors). - scheduler_dispatch.cpp: per-iteration activity-fill classifies each otherwise-blank iteration as Poll (scan that retired nothing) and emits the deferred-release drain as its own Release bar. - scheduler_completion.cpp: a Scan bar per check_running_cores pass (cores' MMIO COND-read cost, ~95ns/LDR) and a Fanout child nested in Complete (fanin + doorbell walk); phase_subretire_count surfaces SPMD sub-block retires. - aicore_executor.cpp: sample the op-event end AFTER the FIN write so the bar end marks FIN-write (when the AICPU can first observe completion), separating AICore-side from AICPU-side latency. Device-validated by spec_mix / spec_spmd_producer / spec_spmd_consumer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pre-staged-consumer doorbell storage capped at PTO2_SPEC_MAX_DOORBELLS=3 per task (sized to fit a MIX block's 1-3 subtask cores in the payload). That cap also bounded how many blocks a wide SPMD consumer could pre-stage in one pass, leaving idle cores unused. - Replace payload's staged_count + staged_reg_task_id[3] + staged_reg_addr[3] with a staged_core_mask bitmask (PTO2_SPEC_CORE_MASK_WORDS=2 words = 128 bits >= RUNTIME_MAX_WORKER=72), 1 bit per global core_id. - Move the (reg_addr, token) pair into a scheduler-global spec_doorbell_table indexed by core_id. Gated cores in flight are bounded by the chip core count (no two-level pre-dispatch), so the table is the natural capacity and the per-task 3-cap is gone. - try_speculative_release iterates set bits in the mask, ringing each core's doorbell from the table. Drop the SpecDoorbellBatch struct and the now-unused db parameter threaded through release_fanin_and_check_ready. - Add static_assert(RUNTIME_MAX_WORKER <= PTO2_SPEC_CORE_MASK_WORDS * 64). offsetof(tensors)==576 still holds (bitmask is smaller than the old arrays). 3 spec st-sim tests (mix, spmd_consumer, spmd_producer) pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ores Speculative pre-stage previously only used IDLE cores in the producer's owner thread. For a producer like fa_fused (qwen attention) that holds every core until it finishes — its 72 blocks all complete in the last ~20us, no core goes idle mid-run — that staged nothing, so the consumer (online_softmax, 48 SPMD-AIV blocks) gained zero overlap. Two problems had to be solved. 1. Stage onto the PENDING slot of cores RUNNING the producer (the only cores available when no core is idle). This hit a deadlock (507018, SCHEDULER_TIMEOUT): decide_slot_transition Case 3.1 ignores a running-task FIN whenever a pending task exists, assuming the AICore immediately runs the pending one and the scheduler observes ITS ack. A speculatively-gated pending task spins on its doorbell and never acks — and the producer's completion depends on collecting that very running FIN — so it wedged forever. Fix: detect a gated (STAGED) pending in the completion poll and complete the running FIN + promote the gated task (Case 3.3) instead of waiting. The promoted task keeps pending_occupied set so no second gated block stacks and the per-core doorbell entry is never overwritten. 2. A consumer has ONE spec_state but its SPMD blocks span cores owned by several AICPU threads. Only the first thread to CAS NONE->STAGING staged its share (16/48); the others saw spec_state != NONE and skipped. Allow a thread to re-claim a STAGED consumer (CAS STAGED->STAGING), appending its own cores to the shared mask and advancing the shared next_block_idx. The CAS serializes the staging critical section across threads and races cleanly with release (re-claim fails once release sets DISPATCHED). Device-verified on qwen fa_fused (task-submit --device 1): all 48 blocks now pre-stage across 3 threads (was 16); online starts +2.4us after fa_end (was +6.0), online END -3.9us, downstream out_proj END -7.4us. 3 spec st-onboard + st-sim tests pass; fa repro 5/5 deadlock-free. A V9 [SPEC] diagnostic logs each thread's staging contribution. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reworks Hook 1 (early-dispatch pre-staging) so a flagged producer's consumer is staged fully and concurrently across all AICPU threads, and fixes the completion-path interaction that made pending-slot staging deadlock. - Concurrent staging: spec_state simplifies to NONE -> STAGING (stable gated state) -> DISPATCHED. next_block_idx and staged_core_mask become atomic so every thread claims a distinct block range and OR-s its cores into the mask in parallel (no serialized staircase). Release flips STAGING->DISPATCHED and rings the mask; a block staged after that self-rings its own doorbell (seq_cst order between mask-OR/spec-load here and spec-store/mask-read in release guarantees no doorbell is missed). - Gated-pending (Case 3.3): decide_slot_transition previously ignored a running-task FIN whenever a pending task existed (Case 3.1), assuming the AICore immediately runs pending and the scheduler observes its ack. A gated pending never acks until its doorbell, and the producer's completion depends on that very running FIN -> deadlock (507018). Now a gated (STAGING) pending is detected and the running FIN is completed + the gated task promoted. - Cross-thread extend list: a thread-local producer (e.g. rope_qkv, all blocks on one thread) only lets that thread reach its consumer's fanout. The first claimer publishes the consumer to a shared list; every thread's prestage pass drains it and stages the consumer's remaining blocks onto ITS OWN cores. - Swimlane: add a Prestage phase kind so a staging-dominated loop iteration shows as prestage(N), not mislabeled poll(0) (kind + record + collector + converter wired end-to-end). Device-verified (task-submit --device 1): 3 spec st-onboard tests pass; qwen fa_fused fully pre-dispatches its online consumer (online end -3.8us, out_proj -5.9us); deadlock-free. sim spec tests pass.
Replace the O(n) broadcast extend-list with an inline Vyukov MPMC queue
and restructure staging to mirror the normal SPMD dispatch path, so a
flagged producer's consumer is pre-staged across all idle threads in
parallel instead of a serial winner-then-peer daisy chain.
- pto_scheduler.h: inline bounded MPMC spec queue (push/pop/init),
replacing spec_extend_list[] + publish/remove.
- scheduler_dispatch.cpp:
- pass 1 (discovery) CAS-claims NONE->STAGING and pushes the consumer
ONCE; it no longer stages (no front-loading).
- pass 2 (drain) pops, CAS-claims a range sized to this thread's free
cores, RE-PUSHES before the expensive prepare+publish so peers claim
and stage concurrently. CAS (not store) keeps next_block_idx atomic
against normal dispatch, which also claims it when release routes a
partially-staged consumer to the ready queue.
- stage_consumer_blocks now stages an explicit already-claimed range.
- remove the per-stage [SPEC] debug log from the hot path: its device
-log write (~9us, contended across threads) dominated the prestage
bar and was the real source of the short-producer regression.
- gate the prestage trigger on no-ready-work (idle only).
- pto_runtime2_init.cpp: init the spec queue.
Replace the per-iteration PULL scan in try_speculative_prestage with an event-driven PUSH dual of fanin_refcount/ready: - Add dispatch_fanin/dispatch_propagated/spec_chain_active/spec_chain_depth to PTO2TaskPayload (reset in reset_for_reuse). - propagate_dispatch_fanin(): when a FLAGGED producer dispatches (normal or doorbell-released), bump each consumer's dispatch_fanin once (guarded). A consumer reaching fanin_actual_count is an early-dispatch candidate: CAS NONE->STAGING and push to spec_queue. Auto-chains by inheriting the flag (depth-capped). - Seed dispatch_fanin at wire_task with early_finished (producers already complete when the consumer was wired). Such producers never dispatch at runtime, so without the seed the candidate compare is unreachable whenever any producer is pre-completed (e.g. a buffer-creator). Mirrors the early_finished seed ready_fanin already gets. - try_speculative_prestage now only DRAINS the queue; the O(running cores x fanout) pass-1 scan is gone. Verified on a2a3: qwen3-14b decode online_softmax now pre-stages (df 2==2, 48/96 blocks staged across 3 scheduler threads); was 0 prestage before the seed. Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.
… paths propagate_dispatch_fanin was being called inline at two latency-critical points, where its fanout walk delayed the very thing early-dispatch exists to make fast: - dispatch_shape: called between claiming cores and publishing a task's blocks. For a flagged SPMD producer, the thread that won the once-guard ran the walk before publishing its block range, while peer threads co-dispatching the same SPMD task published immediately — misaligning the task's block starts. Now recorded in a per-pop prop_list and replayed after flush_publish. - on_mixed_task_complete fanout walk: try_speculative_release rang a released consumer's doorbell and then propagated inline, delaying the doorbells of later consumers in the same producer's fanout. Now collected in a SpecReleaseSink and replayed after the walk; every doorbell fires inline first, propagation runs last. Propagation is pure bookkeeping for future candidates, so deferring it by one walk is semantically free. Measured on qwen3-14b decode: qk_gamma/ qk_recip/rope handoffs dropped from ~7-8us back to ~2us at full chain depth. Spec st (mix/spmd-producer/spmd-consumer) pass on a2a3sim.
The cross-thread early-dispatch work queue was a hand-rolled inline Vyukov MPMC (spec_queue_) that duplicated PTO2ReadyQueue's algorithm, element type, and push/pop semantics verbatim. Replace it with a PTO2ReadyQueue instance, early_dispatch_queue, allocated/wired via the same arena helpers as ready_queues and dummy_ready_queue. - Drop SpecQueueSlot, spec_queue_, spec_queue_init/push/pop (~60 lines) - Add PTO2_EARLY_DISPATCH_QUEUE_SIZE (64); reserve/init/wire/destroy the new queue in PTO2SchedulerState, plus off_early_dispatch_queue_slots - Rename spec_queue_* call sites and PTO2_SPEC_QUEUE_DRAIN_MAX Behavior-preserving (same MPMC algorithm): 20 onboard decode rounds show unchanged median AICore runtime and golden parity (the lone golden miss is the pre-existing early-dispatch ULP race, not this change).
The early-dispatch spec fields (allow_early_resolve, spec_state, staged_core_mask, dispatch_fanin, dispatch_propagated, spec_chain_active, spec_chain_depth) were reset across two sites: prepare_task wrote two of them, and reset_for_reuse reached into the payload for the other five — a cold cross-structure write on the scheduler's advance-ring path, which also violated reset_for_reuse's documented "skips payload" contract. Consolidate all of them into PTO2TaskPayload::init, the single payload init point (runs every submit, before the slot is visible to the scheduler). prepare_task now does no payload writes; reset_for_reuse no longer touches the payload. The fields must be (re)initialized for every task regardless of its own allow_early_resolve, because spec_state (try_speculative_release CASes it for every task), spec_chain_active (read by every dispatched task's propagate) and dispatch_fanin (seeded via fetch_add at wiring) are touched on per-task paths driven by other tasks' flags — so no per-task gating. Also pack the spec block by descending alignment (drops 6B internal padding; sizeof and tensors[]@576 unchanged) and turn prefetch_payload into a PTO2TaskPayload::prefetch method that also warms the spec cache line. Behavior-preserving: 20 onboard rounds show unchanged median AICore runtime and golden parity (the golden misses are the pre-existing early-dispatch ULP race, not this change).
f3b928f to
0c44cf0
Compare
There was a problem hiding this comment.
Actionable comments posted: 8
🧹 Nitpick comments (4)
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py (1)
18-22: ⚡ Quick winUse
ClassVarfor mutable test metadata fields (CALLABLE,CASES).Line 31 and Line 55 use mutable class attributes; annotating them as class vars prevents accidental instance-level mutation patterns and resolves the RUF012 warning.
Proposed fix
+from typing import ClassVar + import torch from simpler.task_interface import ArgDirection as D from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test @@ - CALLABLE = { + CALLABLE: ClassVar = { @@ - CASES = [ + CASES: ClassVar = [Also applies to: 31-62
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py` around lines 18 - 22, Add `ClassVar` to the imports from the `typing` module at the top of the file. Then locate the `CALLABLE` class attribute (around line 31) and the `CASES` class attribute (around line 55) in the test class, and annotate both with `ClassVar` type hints to properly indicate they are class-level mutable attributes rather than instance attributes, which will resolve the RUF012 warning.Source: Linters/SAST tools
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py (1)
19-23: ⚡ Quick winAnnotate mutable class configs as
ClassVarto avoid shared-state surprises.Line 36 and Line 60 define mutable class attributes (
dict/list). If any test utility mutates them, state can leak between runs; Ruff already flags this (RUF012).Proposed fix
+from typing import ClassVar + import torch from simpler.task_interface import ArgDirection as D from simpler_setup import SceneTestCase, TaskArgsBuilder, Tensor, scene_test @@ - CALLABLE = { + CALLABLE: ClassVar = { @@ - CASES = [ + CASES: ClassVar = [Also applies to: 36-67
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py` around lines 19 - 23, Import ClassVar from the typing module at the top of the file, then locate all mutable class attributes (dicts and lists) defined at the class level around lines 36 and 60 in the class definitions. Annotate each of these mutable class attributes with the ClassVar type annotation (e.g., change a class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent Ruff's RUF012 warning about mutable class variables that can cause shared state between test runs.Source: Linters/SAST tools
tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp (1)
18-22: ⚡ Quick winUpdate the stale per-task doorbell-cap assumption in test docs.
This comment still asserts a fixed
PTO2_SPEC_MAX_DOORBELLS == 3behavior, but the scheduler model in this PR stack moved to per-core bitmask/table doorbell tracking. Keeping the old claim makes the intended coverage path misleading.Suggested comment update
- * t1 is a multi-block consumer. Block-by-block pre-staging stages as many of its - * blocks as fit on idle cores and in the doorbell budget (PTO2_SPEC_MAX_DOORBELLS - * == 3); with B == 4, three blocks are gated + released by doorbell and the - * remaining block dispatches normally off the ready queue. Golden: out_c[i] = - * i + 11. + * t1 is a multi-block consumer. Block-by-block pre-staging stages as many blocks + * as possible on currently idle cores, and gated blocks are released by doorbell + * once t0 completes. Golden: out_c[i] = i + 11.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp` around lines 18 - 22, Update the multi-block consumer documentation comment (lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate description of the current scheduler behavior using per-core bitmask/table doorbell tracking instead of per-task tracking. Ensure the updated comment still clearly explains the intended coverage path for block staging and dispatch behavior.tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp (1)
48-52: ⚡ Quick winRemove the hardcoded stride literal from the producer/consumer memory-layout contract.
16here is coupled toCLin bothkernel_spmd_fill.cppandkernel_double.cpp. If one side changes independently, the orchestration tensor shape can become inconsistent with kernel indexing.Localized improvement
- uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * 16}; + constexpr uint32_t kCacheLineFloats = 64 / sizeof(float); + uint32_t inter_shapes[1] = {static_cast<uint32_t>(BLOCK_NUM) * kCacheLineFloats};(Preferably, move this constant into a shared test header used by all three files.)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp` around lines 48 - 52, The hardcoded value `16` in the inter_shapes array calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout inconsistency if one file is modified independently. Extract this cache line stride constant into a shared test header file that is included by all three files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and kernel_double.cpp), then replace the hardcoded `16` literal with a reference to this shared constant definition. This ensures all three files reference the same value and prevents accidental divergence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`:
- Around line 153-166: The exit signal check inside the while loop that waits
for read_dmb_high32() to match task_id can be bypassed if the shutdown signal is
written after the doorbell condition is already satisfied. This causes the code
to skip the loop entirely and execute the gated task before shutting down. Add
an additional check for the AICORE_EXIT_SIGNAL (by reading RegId::DATA_MAIN_BASE
and comparing to AICORE_EXIT_SIGNAL) immediately after the while loop exits but
before the subsequent if (exiting) block to ensure the exit signal is properly
detected even when it arrives after the doorbell match, preventing execution of
the gated task during shutdown.
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`:
- Line 978: The `early_dispatch_queue.push(c)` call at the specified location
can fail by returning false, but the code does not handle this failure case.
When the push fails, the `spec_state` remains incorrectly set to
`PTO2_SPEC_STAGING` for a consumer that was never actually added to the queue,
causing incorrect behavior in the ready release path. Check the return value of
`early_dispatch_queue.push(c)`, and if it returns false, restore the speculative
claim by resetting `spec_state` to its previous value before the attempted push
operation.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 197-201: The expanded anomaly diagnostic line now includes
additional token fields (cond_tok, running_tok, pending_tok) that can cause the
output to exceed the current buffer capacity. Locate the buffer declarations for
`aic_buf` and `aiv*_buf` (currently sized at 128 bytes) and increase their sizes
to accommodate the longer formatted string that now includes the three new token
fields along with the existing parameters. Apply the same buffer size increase
to both the location around line 197 and the related location around line 333
mentioned in the comment.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Line 1102: The try_speculative_prestage function call in the dispatch logic
currently bypasses the PMU policy by prestaging work even when PMU monitoring is
active. Modify the condition that guards the try_speculative_prestage call to
also check for pmu_active status, ensuring that speculative prestaging is
disabled whenever PMU monitoring is active (treat it the same way as when
any_ready_work is true). This will prevent phase 4b from prestaging gated work
into idle or pending slots during PMU runs, maintaining the single-issue PMU
policy.
- Around line 332-333: The prop_list array is declared with a fixed size of
CoreTracker::MAX_CLUSTERS, but the batch buffer can contain up to
CoreTracker::MAX_CLUSTERS * 3 entries, causing extra flagged producers to
silently skip dispatch_fanin propagation. Resize the prop_list array declaration
to accommodate the full batch capacity (CoreTracker::MAX_CLUSTERS * 3) instead
of CoreTracker::MAX_CLUSTERS to ensure all entries in the popped batch are
properly processed and propagated.
- Around line 631-633: The publish_subtask_to_core function writes to a register
that signals AICore, but there is no write memory barrier before this call to
ensure the payload and gate data written by prepare_block_for_dispatch are
visible before the register write. Insert a wmb() call (write memory barrier) in
the loop before each publish_subtask_to_core call to guarantee that AICore
observes the payload and not_ready gate updates before seeing the token.
- Around line 1107-1113: After emitting the Prestage bar by calling
l2_swimlane_aicpu_record_sched_phase in the block where prestage_record &&
prestaged > 0, the phase_dispatch_count needs to be cleared by resetting it to
zero. This prevents the dispatch count that was incremented during
prepare_block_for_dispatch from leaking into the next Dispatch phase iteration.
- Around line 1129-1131: In the second completion poll section where
completed_2nd > 0, the code updates completed_tasks_ but fails to increment
sched_->tasks_completed like the first poll does under PTO2_SCHED_PROFILING. Add
the same profiling instrumentation to the second poll by incrementing
sched_->tasks_completed with completed_2nd within the same PTO2_SCHED_PROFILING
guard condition used in the first poll, ensuring scheduler profiling totals
remain consistent across both polls.
---
Nitpick comments:
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpp`:
- Around line 18-22: Update the multi-block consumer documentation comment
(lines 18-22) in the spec_spmd_consumer_orch.cpp file. Replace the stale
assertion about fixed PTO2_SPEC_MAX_DOORBELLS == 3 doorbell cap with an accurate
description of the current scheduler behavior using per-core bitmask/table
doorbell tracking instead of per-task tracking. Ensure the updated comment still
clearly explains the intended coverage path for block staging and dispatch
behavior.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.py`:
- Around line 19-23: Import ClassVar from the typing module at the top of the
file, then locate all mutable class attributes (dicts and lists) defined at the
class level around lines 36 and 60 in the class definitions. Annotate each of
these mutable class attributes with the ClassVar type annotation (e.g., change a
class attribute like my_dict = {} to my_dict: ClassVar[dict] = {}) to prevent
Ruff's RUF012 warning about mutable class variables that can cause shared state
between test runs.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpp`:
- Around line 48-52: The hardcoded value `16` in the inter_shapes array
calculation in spec_spmd_producer_orch.cpp is tightly coupled to matching values
in kernel_spmd_fill.cpp and kernel_double.cpp, creating a risk of memory-layout
inconsistency if one file is modified independently. Extract this cache line
stride constant into a shared test header file that is included by all three
files (spec_spmd_producer_orch.cpp, kernel_spmd_fill.cpp, and
kernel_double.cpp), then replace the hardcoded `16` literal with a reference to
this shared constant definition. This ensures all three files reference the same
value and prevents accidental divergence.
In
`@tests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py`:
- Around line 18-22: Add `ClassVar` to the imports from the `typing` module at
the top of the file. Then locate the `CALLABLE` class attribute (around line 31)
and the `CASES` class attribute (around line 55) in the test class, and annotate
both with `ClassVar` type hints to properly indicate they are class-level
mutable attributes rather than instance attributes, which will resolve the
RUF012 warning.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8c2d7f4c-5c11-412f-a4fd-40eb98a80f80
📒 Files selected for processing (37)
docs/aicore-kernel-programming.mddocs/hardware/cache-coherency.mdexamples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add.cppexamples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_add_scalar.cppexamples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/aiv/kernel_mul.cppexamples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/orchestration/example_orchestration.cppexamples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.pysimpler_setup/tools/swimlane_converter.pysrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/onboard/aicore/inner_kernel.hsrc/a2a3/platform/shared/host/l2_swimlane_collector.cppsrc/a2a3/platform/sim/aicore/inner_kernel.hsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_arg_with_deps.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/shared/pto_runtime2_init.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_add_standalone.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/aiv/kernel_mul_standalone.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_mix/kernels/orchestration/spec_mix_orch.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_mix/test_spec_mix.pytests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_fill_all.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/aiv/kernel_spmd_addk.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/kernels/orchestration/spec_spmd_consumer_orch.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_consumer/test_spec_spmd_consumer.pytests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_double.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/aiv/kernel_spmd_fill.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/kernels/orchestration/spec_spmd_producer_orch.cpptests/st/a2a3/tensormap_and_ringbuffer/spec_spmd_producer/test_spec_spmd_producer.py
Two CI breakages on the early-dispatch branch: - test_scheduler_state (cpput) failed to link: the inline PTO2SchedulerState::ring_one_doorbell pulls in get_reg_ptr, which the host UT object set (a2a3_rt_objs / a5_rt_objs) does not provide. Add a get_reg_ptr stub to test_stubs.cpp returning writable static storage (no MMIO on host), mirroring the existing get_sys_cnt_aicpu stub. - profiling-flags-smoke (a2a3sim/pto2-off) failed to build: `prestaged` in scheduler_dispatch.cpp is only read under #if PTO2_PROFILING, so PTO2_PROFILING=0 + -Werror=unused-variable killed the build. Mark it [[maybe_unused]]; the try_speculative_prestage() side effect must stay.
stage_consumer_blocks wrote a block's payload via prepare_block_for_dispatch then immediately published its DATA_MAIN_BASE token, with no wmb() in between — unlike the normal dispatch path (flush_publish), which barriers before publishing. Without it a gated AICore can observe the token, dcci its payload, and read a stale not_ready gate / args. Hoist one wmb() per stage_from pass (prepare all claimed blocks, barrier, then publish), mirroring flush_publish. Clears the intermittent ~5-8% bf16 ULP golden failures on qwen3 decode: 0/50 onboard replay rounds failed after the fix (was ~1-2 per 20 before).
…uard test_scheduler_state / test_task_state / test_wiring build minimal PTO2TaskSlotStates via init_slot, which memsets the slot and never sets slot.payload. That was fine until the speculative-dispatch helpers (try_speculative_release, propagate_dispatch_fanin) began dereferencing slot.payload->spec_state — release_fanin_and_check_ready then SEGV'd on these payload-less slots, surfaced once the host UT link was fixed and the tests actually ran (ut macos/ubuntu). Production slots always have a payload (orch::prepare_task -> bind_buffers sets it; reset_for_reuse never clears it), so the fix is test-side: init_slot now binds each slot a distinct zeroed payload from a small per-fixture pool, mirroring bind_buffers. With that invariant restored, the defensive `c->payload == nullptr` skip in propagate_dispatch_fanin's fanout walk is dead code — removed for consistency with the unguarded payload derefs in the same path.
…own EXIT, diag buffer) - scheduler_dispatch.cpp: skip speculative prestage when pmu_active, matching dispatch_ready_tasks' single-issue PMU policy — a gated prestage would perturb the same per-task PMU windows. - scheduler_dispatch.cpp: clear phase_dispatch_count after emitting a Prestage swimlane bar so blocks staged this iteration don't leak into the next Dispatch bar's count. - scheduler_dispatch.cpp: the second completion poll now also bumps sched_->tasks_completed under PTO2_SCHED_PROFILING, matching the first poll so scheduler profiling no longer undercounts. - scheduler_cold_path.cpp: grow per-core stall-status buffers 128->192B so the expanded ANOMALY diagnostic (added token fields) isn't truncated. - aicore_executor.cpp: in the speculative doorbell gate, check AICORE_EXIT_SIGNAL on the doorbell-match iteration too, so a teardown EXIT racing in right after a matching doorbell still wins over executing the gated task. Onboard-verified unchanged: qwen3 decode 20/20 replay rounds pass.
Summary
Speculative early-dispatch hides the ~7μs scheduler dispatch-loop latency by
pre-staging a ready consumer onto idle AICore(s) while its producer still runs,
gating execution on a per-core DMB doorbell, and ringing it the instant the
producer completes (skipping the ready-queue round trip).
not_readygate + DMB doorbell; release at producercompletion (
try_speculative_release).(
spec_mix,spec_spmd_producer,spec_spmd_consumer).per-task 3-doorbell cap);
early_dispatch_queuebacked byPTO2ReadyQueue;event-driven candidate detection (
dispatch_fanin) replacing the PULL scan.deadlock fix; deferred
dispatch_faninpropagation off the hot paths.PTO2TaskPayload::init(single init point;reset_for_reuseno longertouches the payload).
Testing
vs
main(median ~1027us).main; clean rebuild + NORMAL codegen verified e2e.