Skip to content

Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058

Open
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:a5_0615
Open

Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:a5_0615

Conversation

@indigo1973

@indigo1973 indigo1973 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the
test passes without the flag. Both are a5-only.

1) Host SIGSEGV in collector init. a5 profiling memory is pure device HBM
   (no halHostRegister), so the host cannot write it directly — but
   L2SwimlaneCollector::initialize fell back to host_ptr = dev_ptr and wrote
   the header + SPSC pool slots straight to that raw device pointer. The
   sibling a5 collectors (dep_gen, pmu) already use a host-shadow model; this
   one did not. Route allocation through ProfilerBase::alloc_paired_buffer,
   install the real copy callbacks in set_memory_context, push the
   host-initialized region to device once at end of init, refresh from device
   on the reconcile/read_phase paths, and add clear_mappings() in finalize to
   free the host shadows. Also bring the collector in line with its siblings'
   init hardening: zero the host shadow right after allocation (don't assume
   the allocator returns zeroed memory / clean padding), add an
   InitRollbackGuard so a failure on any post-allocation init path releases
   the buffers + shadows already allocated (shm_host_ is set only at the end,
   so finalize() cannot clean up a partial init), and validate the alloc/free
   callbacks up front.

2) AICore hang -> scheduler timeout -> op-timeout 507000. The AICore record
   path read the shared L2SwimlaneActiveHead line (current_buf_ptr) every task
   to learn its rotating record buffer. On a5, an AICore read of a GM line the
   AICPU concurrently writes stalls the AICore pipeline so it never signals
   FIN; the scheduler then spins to its idle timeout and the AICPU exec op
   exceeds the 3s CANN op-execute timeout (a2a3 tolerates the same code).
   Deliver the current buffer pointer through the per-task dispatch payload
   instead (reusing the former ABI pad; struct stays 512B): the scheduler
   stamps PTO2DispatchPayload::l2_swimlane_cur_buf_ptr right after the AICPU
   rotation hook, and AICore reads it from the payload it already dcci's each
   task — no cross-core shared-head read. AICPU rotation + host recycling are
   unchanged, so unbounded per-core records are preserved. host_build_graph
   keeps its head model, adapted to the buf-ptr record_task signature.

a5-only; a2a3 untouched.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e63a5510-0430-4a72-9c3b-01ad06d7919d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR replaces the AICore L2 swimlane buffer-rotation mechanism: instead of each AICore core polling a shared AICPU-written L2SwimlaneActiveHead channel pointer, the scheduler reads the post-rotation buffer pointer via a new AICPU getter and stamps it directly into the per-task PTO2DispatchPayload.l2_swimlane_cur_buf_ptr field before dispatching. The host collector simultaneously migrates all buffer allocation from a device-only path to a paired device+host shadow workflow with explicit copy synchronization.

Changes

L2 Swimlane: payload-pointer dispatch and host collector paired-buffer migration

Layer / File(s) Summary
AICore recording contract
src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
Removes cached_buf_seq from L2SwimlaneAicoreLocalState; rewrites l2_swimlane_aicore_record_task to accept uint64_t cur_buf_ptr, early-return on 0, detect rotation by pointer comparison, and write records into the locally resolved buffer.
AICPU getter and PTO2DispatchPayload field
src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h, src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
Declares and implements l2_swimlane_aicpu_current_aicore_buf(core_id) returning head.current_buf_ptr; adds uint64_t l2_swimlane_cur_buf_ptr to PTO2DispatchPayload replacing the prior 8-byte reserved padding, preserving the 512B size assert.
Scheduler stamps cur_buf_ptr into payload
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
After the AICore rotation hook in dispatch_subtask_to_core, writes payload.l2_swimlane_cur_buf_ptr before the memory barrier and write_reg that publishes reg_task_id.
AICore executor call sites
src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp, src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
Drops lazy L2SwimlaneActiveHead resolution; local state initialized to {nullptr, 0}; recording call passes exec_payload->l2_swimlane_cur_buf_ptr (or a dcci-read current_buf_ptr for host_build_graph path).
Host collector: paired-buffer allocation and device↔host sync
src/a5/platform/include/host/l2_swimlane_collector.h, src/a5/platform/shared/host/l2_swimlane_collector.cpp
Removes alloc_single_buffer; all SHM/pool buffers allocated via alloc_paired_buffer; initialize() pushes primed state to device; reconcile_counters() and read_phase_header_metadata() pull from device before reading; finalize() calls manager_.clear_mappings() to free host shadows.

Sequence Diagram(s)

sequenceDiagram
  participant Sched as SchedulerContext
  participant AICPU as l2_swimlane_aicpu
  participant Payload as PTO2DispatchPayload
  participant AICore as AICore Executor
  participant RecordFn as l2_swimlane_aicore_record_task

  Sched->>AICPU: l2_swimlane_aicpu_on_aicore_dispatch(core_id) [rotate buffer]
  Sched->>AICPU: l2_swimlane_aicpu_current_aicore_buf(core_id)
  AICPU-->>Sched: head.current_buf_ptr
  Sched->>Payload: payload.l2_swimlane_cur_buf_ptr = current_buf_ptr
  Sched->>AICore: write_reg(reg_task_id) [publish task]
  AICore->>Payload: read exec_payload->l2_swimlane_cur_buf_ptr
  AICore->>RecordFn: l2_swimlane_aicore_record_task(cur_buf_ptr, local_state, ...)
  RecordFn->>RecordFn: compare cur_buf_ptr vs cached_buf → reset slot if rotated
  RecordFn->>RecordFn: write record, cache flush, DSB
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#939: Introduced the L2SwimlaneActiveHead cache-line model and modified L2SwimlaneAicoreLocalState / l2_swimlane_aicore_record_task — the same contract this PR further replaces with a payload-pointer approach.
  • hw-native-sys/simpler#942: Refactored the AICore L2 swimlane recording path to use L2SwimlaneActiveHead with cached_buf_seq, which is the exact state structure this PR removes in favor of cur_buf_ptr.
  • hw-native-sys/simpler#983: Touches the same l2_swimlane_collector_aicore.h recording/rotation path and dispatch→executor plumbing that this PR modifies to consume a payload-delivered buffer pointer.

Poem

🐇 No more polling the shared head in vain,
The buffer pointer rides the payload lane!
Scheduler stamps it, AICore reads true,
No FIN-handshake wedge to muddle the queue.
Paired shadows sync device and host with care—
This bunny hops fast with clean mem to spare! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly identifies two specific a5-only bugs preventing l2-swimlane from running, provides detailed explanations of root causes, and describes the implemented fixes that align with the changeset.
Title check ✅ Passed The title clearly and concisely summarizes the main changes: fixing a5 L2 swimlane by implementing host-shadow allocation and using a payload buffer channel instead of shared polling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the L2 swimlane profiling mechanism on the a5 platform to avoid cross-core reads of the shared L2SwimlaneActiveHead line, which previously caused pipeline stalls and wedged the AIC-to-AICPU FIN handshake. Instead, the current rotating record buffer pointer is delivered directly to the AICore via the per-task dispatch payload. Additionally, the host-side memory allocation is updated to use paired buffers with host shadows to support non-SVM environments on a5, and proper device-to-host synchronization is added during counter reconciliation. Feedback was provided to explicitly zero-initialize the allocated shared memory region on the host shadow to prevent potential garbage values from causing undefined behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a5/platform/shared/host/l2_swimlane_collector.cpp (1)

570-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reconcile AICore task counters too.

l2_swimlane_aicpu_on_aicore_dispatch() bumps the AICore pool’s total_record_count, and AICore drops can show up as silent loss, but reconcile_counters() only checks PERF/SCHED/ORCH pools. Add an AICORE reconciliation pass using get_aicore_buffer_state(...) and collected_aicore_records_ so lost AICore records are reported.

Proposed addition
+    uint64_t total_aicore_collected = 0;
+    for (const auto &records : collected_aicore_records_) {
+        total_aicore_collected += records.size();
+    }
+
+    reconcile_one(
+        "AICORE", "core", num_aicore_,
+        [this](int core_index) {
+            return get_aicore_buffer_state(shm_host_, num_aicore_, core_index);
+        },
+        [](void *host_ptr) {
+            return reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(host_ptr)->count;
+        },
+        sizeof(L2SwimlaneAicoreTaskBuffer), total_aicore_collected, /*optional=*/false
+    );
+
     reconcile_one(
         "SCHED_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp` around lines 570 -
601, The reconcile_counters() method currently only reconciles PERF,
SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool, which is
incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another
reconcile_one() call after the existing three calls to reconcile the AICORE pool
using get_aicore_buffer_state() for buffer state retrieval and
collected_aicore_records_ to track collected AICore records. Follow the same
pattern as the existing reconcile_one() invocations, specifying the appropriate
buffer type cast and size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 108-113: After each successful alloc_paired_buffer call in
l2_swimlane_collector.cpp, subsequent return -1 paths leak device buffers
because shm_host_ remains unset and finalize() cannot clean them up. Add a local
unwind/guard pattern at each allocation site (lines 108-113 anchor, and siblings
at 159-163, 187-190, 223-227, 254-258, 289-303) that captures the allocated
perf_dev_ptr and perf_host_ptr, and before any return -1 following allocation,
add cleanup code to release the device buffer (via the appropriate deallocation
function) and clear the paired memory mappings. Ensure the cleanup is performed
consistently across all seven affected locations before returning failure.
- Line 317: Check the return codes from all profiling_copy_to_device calls
before proceeding with their results, and abort or log errors appropriately
instead of consuming potentially stale shadow data. This fix must be applied
consistently wherever profiling_copy_to_device is called in the paired-buffer
workflow paths, including initial push operations and pull operations, to ensure
that device-copy failures do not lead to AICPU processing unprimed queues or the
code reading stale host state during reconciliation and metadata operations.

---

Outside diff comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 570-601: The reconcile_counters() method currently only reconciles
PERF, SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool,
which is incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another
reconcile_one() call after the existing three calls to reconcile the AICORE pool
using get_aicore_buffer_state() for buffer state retrieval and
collected_aicore_records_ to track collected AICore records. Follow the same
pattern as the existing reconcile_one() invocations, specifying the appropriate
buffer type cast and size.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99549a8f-e4f9-4939-b1ad-949f565f71ff

📥 Commits

Reviewing files that changed from the base of the PR and between ea4d204 and ebb4067.

📒 Files selected for processing (9)
  • src/a5/platform/include/aicore/l2_swimlane_collector_aicore.h
  • src/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a5/platform/include/host/l2_swimlane_collector.h
  • src/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a5/platform/shared/host/l2_swimlane_collector.cpp
  • src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
💤 Files with no reviewable changes (1)
  • src/a5/platform/include/host/l2_swimlane_collector.h

Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp
Comment thread src/a5/platform/shared/host/l2_swimlane_collector.cpp
@indigo1973 indigo1973 changed the title [WIP] Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel) Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel) Jun 16, 2026
…ffer channel)

  Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the
  test passes without the flag. Both are a5-only.

    1) Host SIGSEGV in collector init. a5 profiling memory is pure device HBM
       (no halHostRegister), so the host cannot write it directly — but
       L2SwimlaneCollector::initialize fell back to host_ptr = dev_ptr and wrote
       the header + SPSC pool slots straight to that raw device pointer. The
       sibling a5 collectors (dep_gen, pmu) already use a host-shadow model; this
       one did not. Route allocation through ProfilerBase::alloc_paired_buffer,
       install the real copy callbacks in set_memory_context, push the
       host-initialized region to device once at end of init, refresh from device
       on the reconcile/read_phase paths, and add clear_mappings() in finalize to
       free the host shadows. Also bring the collector in line with its siblings'
       init hardening: zero the host shadow right after allocation (don't assume
       the allocator returns zeroed memory / clean padding), add an
       InitRollbackGuard so a failure on any post-allocation init path releases
       the buffers + shadows already allocated (shm_host_ is set only at the end,
       so finalize() cannot clean up a partial init), and validate the alloc/free
       callbacks up front.

    2) AICore hang -> scheduler timeout -> op-timeout 507000. The AICore record
       path read the shared L2SwimlaneActiveHead line (current_buf_ptr) every task
       to learn its rotating record buffer. On a5, an AICore read of a GM line the
       AICPU concurrently writes stalls the AICore pipeline so it never signals
       FIN; the scheduler then spins to its idle timeout and the AICPU exec op
       exceeds the 3s CANN op-execute timeout (a2a3 tolerates the same code).
       Deliver the current buffer pointer through the per-task dispatch payload
       instead (reusing the former ABI pad; struct stays 512B): the scheduler
       stamps PTO2DispatchPayload::l2_swimlane_cur_buf_ptr right after the AICPU
       rotation hook, and AICore reads it from the payload it already dcci's each
       task — no cross-core shared-head read. AICPU rotation + host recycling are
       unchanged, so unbounded per-core records are preserved. host_build_graph
       keeps its head model, adapted to the buf-ptr record_task signature.

    a5-only; a2a3 untouched.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant