Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel) by indigo1973 · Pull Request #1058 · hw-native-sys/simpler

indigo1973 · 2026-06-15T12:22:06Z

Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the
test passes without the flag. Both are a5-only.

1) Host SIGSEGV in collector init. a5 profiling memory is pure device HBM
   (no halHostRegister), so the host cannot write it directly — but
   L2SwimlaneCollector::initialize fell back to host_ptr = dev_ptr and wrote
   the header + SPSC pool slots straight to that raw device pointer. The
   sibling a5 collectors (dep_gen, pmu) already use a host-shadow model; this
   one did not. Route allocation through ProfilerBase::alloc_paired_buffer,
   install the real copy callbacks in set_memory_context, push the
   host-initialized region to device once at end of init, refresh from device
   on the reconcile/read_phase paths, and add clear_mappings() in finalize to
   free the host shadows. Also bring the collector in line with its siblings'
   init hardening: zero the host shadow right after allocation (don't assume
   the allocator returns zeroed memory / clean padding), add an
   InitRollbackGuard so a failure on any post-allocation init path releases
   the buffers + shadows already allocated (shm_host_ is set only at the end,
   so finalize() cannot clean up a partial init), and validate the alloc/free
   callbacks up front.

2) AICore hang -> scheduler timeout -> op-timeout 507000. The AICore record
   path read the shared L2SwimlaneActiveHead line (current_buf_ptr) every task
   to learn its rotating record buffer. On a5, an AICore read of a GM line the
   AICPU concurrently writes stalls the AICore pipeline so it never signals
   FIN; the scheduler then spins to its idle timeout and the AICPU exec op
   exceeds the 3s CANN op-execute timeout (a2a3 tolerates the same code).
   Deliver the current buffer pointer through the per-task dispatch payload
   instead (reusing the former ABI pad; struct stays 512B): the scheduler
   stamps PTO2DispatchPayload::l2_swimlane_cur_buf_ptr right after the AICPU
   rotation hook, and AICore reads it from the payload it already dcci's each
   task — no cross-core shared-head read. AICPU rotation + host recycling are
   unchanged, so unbounded per-core records are preserved. host_build_graph
   keeps its head model, adapted to the buf-ptr record_task signature.

a5-only; a2a3 untouched.

coderabbitai · 2026-06-15T12:22:20Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e63a5510-0430-4a72-9c3b-01ad06d7919d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The PR replaces the AICore L2 swimlane buffer-rotation mechanism: instead of each AICore core polling a shared AICPU-written L2SwimlaneActiveHead channel pointer, the scheduler reads the post-rotation buffer pointer via a new AICPU getter and stamps it directly into the per-task PTO2DispatchPayload.l2_swimlane_cur_buf_ptr field before dispatching. The host collector simultaneously migrates all buffer allocation from a device-only path to a paired device+host shadow workflow with explicit copy synchronization.

Changes

L2 Swimlane: payload-pointer dispatch and host collector paired-buffer migration

Layer / File(s)	Summary
AICore recording contract `src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h`	Removes `cached_buf_seq` from `L2SwimlaneAicoreLocalState`; rewrites `l2_swimlane_aicore_record_task` to accept `uint64_t cur_buf_ptr`, early-return on 0, detect rotation by pointer comparison, and write records into the locally resolved buffer.
AICPU getter and `PTO2DispatchPayload` field `src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h`	Declares and implements `l2_swimlane_aicpu_current_aicore_buf(core_id)` returning `head.current_buf_ptr`; adds `uint64_t l2_swimlane_cur_buf_ptr` to `PTO2DispatchPayload` replacing the prior 8-byte reserved padding, preserving the 512B size assert.
Scheduler stamps `cur_buf_ptr` into payload `src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`	After the AICore rotation hook in `dispatch_subtask_to_core`, writes `payload.l2_swimlane_cur_buf_ptr` before the memory barrier and `write_reg` that publishes `reg_task_id`.
AICore executor call sites `src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`, `src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp`	Drops lazy `L2SwimlaneActiveHead` resolution; local state initialized to `{nullptr, 0}`; recording call passes `exec_payload->l2_swimlane_cur_buf_ptr` (or a `dcci`-read `current_buf_ptr` for host_build_graph path).
Host collector: paired-buffer allocation and device↔host sync `src/a5/platform/include/host/l2_swimlane_collector.h`, `src/a5/platform/shared/host/l2_swimlane_collector.cpp`	Removes `alloc_single_buffer`; all SHM/pool buffers allocated via `alloc_paired_buffer`; `initialize()` pushes primed state to device; `reconcile_counters()` and `read_phase_header_metadata()` pull from device before reading; `finalize()` calls `manager_.clear_mappings()` to free host shadows.

Sequence Diagram(s)

sequenceDiagram
  participant Sched as SchedulerContext
  participant AICPU as l2_swimlane_aicpu
  participant Payload as PTO2DispatchPayload
  participant AICore as AICore Executor
  participant RecordFn as l2_swimlane_aicore_record_task

  Sched->>AICPU: l2_swimlane_aicpu_on_aicore_dispatch(core_id) [rotate buffer]
  Sched->>AICPU: l2_swimlane_aicpu_current_aicore_buf(core_id)
  AICPU-->>Sched: head.current_buf_ptr
  Sched->>Payload: payload.l2_swimlane_cur_buf_ptr = current_buf_ptr
  Sched->>AICore: write_reg(reg_task_id) [publish task]
  AICore->>Payload: read exec_payload->l2_swimlane_cur_buf_ptr
  AICore->>RecordFn: l2_swimlane_aicore_record_task(cur_buf_ptr, local_state, ...)
  RecordFn->>RecordFn: compare cur_buf_ptr vs cached_buf → reset slot if rotated
  RecordFn->>RecordFn: write record, cache flush, DSB

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/simpler#939: Introduced the L2SwimlaneActiveHead cache-line model and modified L2SwimlaneAicoreLocalState / l2_swimlane_aicore_record_task — the same contract this PR further replaces with a payload-pointer approach.
hw-native-sys/simpler#942: Refactored the AICore L2 swimlane recording path to use L2SwimlaneActiveHead with cached_buf_seq, which is the exact state structure this PR removes in favor of cur_buf_ptr.
hw-native-sys/simpler#983: Touches the same l2_swimlane_collector_aicore.h recording/rotation path and dispatch→executor plumbing that this PR modifies to consume a payload-delivered buffer pointer.

Poem

🐇 No more polling the shared head in vain,
The buffer pointer rides the payload lane!
Scheduler stamps it, AICore reads true,
No FIN-handshake wedge to muddle the queue.
Paired shadows sync device and host with care—
This bunny hops fast with clean mem to spare! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description clearly identifies two specific a5-only bugs preventing l2-swimlane from running, provides detailed explanations of root causes, and describes the implemented fixes that align with the changeset.
Title check	✅ Passed	The title clearly and concisely summarizes the main changes: fixing a5 L2 swimlane by implementing host-shadow allocation and using a payload buffer channel instead of shared polling.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request refactors the L2 swimlane profiling mechanism on the a5 platform to avoid cross-core reads of the shared L2SwimlaneActiveHead line, which previously caused pipeline stalls and wedged the AIC-to-AICPU FIN handshake. Instead, the current rotating record buffer pointer is delivered directly to the AICore via the per-task dispatch payload. Additionally, the host-side memory allocation is updated to use paired buffers with host shadows to support non-SVM environments on a5, and proper device-to-host synchronization is added during counter reconciliation. Feedback was provided to explicitly zero-initialize the allocated shared memory region on the host shadow to prevent potential garbage values from causing undefined behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/a5/platform/shared/host/l2_swimlane_collector.cpp (1)

570-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reconcile AICore task counters too.

l2_swimlane_aicpu_on_aicore_dispatch() bumps the AICore pool’s total_record_count, and AICore drops can show up as silent loss, but reconcile_counters() only checks PERF/SCHED/ORCH pools. Add an AICORE reconciliation pass using get_aicore_buffer_state(...) and collected_aicore_records_ so lost AICore records are reported.

Proposed addition

+    uint64_t total_aicore_collected = 0;
+    for (const auto &records : collected_aicore_records_) {
+        total_aicore_collected += records.size();
+    }
+
+    reconcile_one(
+        "AICORE", "core", num_aicore_,
+        [this](int core_index) {
+            return get_aicore_buffer_state(shm_host_, num_aicore_, core_index);
+        },
+        [](void *host_ptr) {
+            return reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(host_ptr)->count;
+        },
+        sizeof(L2SwimlaneAicoreTaskBuffer), total_aicore_collected, /*optional=*/false
+    );
+
     reconcile_one(
         "SCHED_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp` around lines 570 -
601, The reconcile_counters() method currently only reconciles PERF,
SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool, which is
incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another
reconcile_one() call after the existing three calls to reconcile the AICORE pool
using get_aicore_buffer_state() for buffer state retrieval and
collected_aicore_records_ to track collected AICore records. Follow the same
pattern as the existing reconcile_one() invocations, specifying the appropriate
buffer type cast and size.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 108-113: After each successful alloc_paired_buffer call in
l2_swimlane_collector.cpp, subsequent return -1 paths leak device buffers
because shm_host_ remains unset and finalize() cannot clean them up. Add a local
unwind/guard pattern at each allocation site (lines 108-113 anchor, and siblings
at 159-163, 187-190, 223-227, 254-258, 289-303) that captures the allocated
perf_dev_ptr and perf_host_ptr, and before any return -1 following allocation,
add cleanup code to release the device buffer (via the appropriate deallocation
function) and clear the paired memory mappings. Ensure the cleanup is performed
consistently across all seven affected locations before returning failure.
- Line 317: Check the return codes from all profiling_copy_to_device calls
before proceeding with their results, and abort or log errors appropriately
instead of consuming potentially stale shadow data. This fix must be applied
consistently wherever profiling_copy_to_device is called in the paired-buffer
workflow paths, including initial push operations and pull operations, to ensure
that device-copy failures do not lead to AICPU processing unprimed queues or the
code reading stale host state during reconciliation and metadata operations.

---

Outside diff comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 570-601: The reconcile_counters() method currently only reconciles
PERF, SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool,
which is incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another
reconcile_one() call after the existing three calls to reconcile the AICORE pool
using get_aicore_buffer_state() for buffer state retrieval and
collected_aicore_records_ to track collected AICore records. Follow the same
pattern as the existing reconcile_one() invocations, specifying the appropriate
buffer type cast and size.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99549a8f-e4f9-4939-b1ad-949f565f71ff

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and ebb4067.

📒 Files selected for processing (9)

src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a5/platform/include/host/l2_swimlane_collector.h
src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
src/a5/platform/shared/host/l2_swimlane_collector.cpp
src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp

💤 Files with no reviewable changes (1)

src/a5/platform/include/host/l2_swimlane_collector.h

…ffer channel) Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the test passes without the flag. Both are a5-only. 1) Host SIGSEGV in collector init. a5 profiling memory is pure device HBM (no halHostRegister), so the host cannot write it directly — but L2SwimlaneCollector::initialize fell back to host_ptr = dev_ptr and wrote the header + SPSC pool slots straight to that raw device pointer. The sibling a5 collectors (dep_gen, pmu) already use a host-shadow model; this one did not. Route allocation through ProfilerBase::alloc_paired_buffer, install the real copy callbacks in set_memory_context, push the host-initialized region to device once at end of init, refresh from device on the reconcile/read_phase paths, and add clear_mappings() in finalize to free the host shadows. Also bring the collector in line with its siblings' init hardening: zero the host shadow right after allocation (don't assume the allocator returns zeroed memory / clean padding), add an InitRollbackGuard so a failure on any post-allocation init path releases the buffers + shadows already allocated (shm_host_ is set only at the end, so finalize() cannot clean up a partial init), and validate the alloc/free callbacks up front. 2) AICore hang -> scheduler timeout -> op-timeout 507000. The AICore record path read the shared L2SwimlaneActiveHead line (current_buf_ptr) every task to learn its rotating record buffer. On a5, an AICore read of a GM line the AICPU concurrently writes stalls the AICore pipeline so it never signals FIN; the scheduler then spins to its idle timeout and the AICPU exec op exceeds the 3s CANN op-execute timeout (a2a3 tolerates the same code). Deliver the current buffer pointer through the per-task dispatch payload instead (reusing the former ABI pad; struct stays 512B): the scheduler stamps PTO2DispatchPayload::l2_swimlane_cur_buf_ptr right after the AICPU rotation hook, and AICore reads it from the payload it already dcci's each task — no cross-core shared-head read. AICPU rotation + host recycling are unchanged, so unbounded per-core records are preserved. host_build_graph keeps its head model, adapted to the buf-ptr record_task signature. a5-only; a2a3 untouched.

gemini-code-assist Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp

Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp

indigo1973 force-pushed the a5_0615 branch from ebb4067 to 9bd7be4 Compare June 16, 2026 06:29

indigo1973 changed the title ~~[WIP] Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)~~ Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel) Jun 16, 2026

indigo1973 force-pushed the a5_0615 branch from 9bd7be4 to 2069787 Compare June 17, 2026 01:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058

Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:a5_0615

indigo1973 commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

indigo1973 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

indigo1973 commented Jun 15, 2026 •

edited

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading