Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058
Fix: make a5 L2 swimlane work onboard (host-shadow alloc + payload buffer channel)#1058indigo1973 wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR replaces the AICore L2 swimlane buffer-rotation mechanism: instead of each AICore core polling a shared AICPU-written ChangesL2 Swimlane: payload-pointer dispatch and host collector paired-buffer migration
Sequence Diagram(s)sequenceDiagram
participant Sched as SchedulerContext
participant AICPU as l2_swimlane_aicpu
participant Payload as PTO2DispatchPayload
participant AICore as AICore Executor
participant RecordFn as l2_swimlane_aicore_record_task
Sched->>AICPU: l2_swimlane_aicpu_on_aicore_dispatch(core_id) [rotate buffer]
Sched->>AICPU: l2_swimlane_aicpu_current_aicore_buf(core_id)
AICPU-->>Sched: head.current_buf_ptr
Sched->>Payload: payload.l2_swimlane_cur_buf_ptr = current_buf_ptr
Sched->>AICore: write_reg(reg_task_id) [publish task]
AICore->>Payload: read exec_payload->l2_swimlane_cur_buf_ptr
AICore->>RecordFn: l2_swimlane_aicore_record_task(cur_buf_ptr, local_state, ...)
RecordFn->>RecordFn: compare cur_buf_ptr vs cached_buf → reset slot if rotated
RecordFn->>RecordFn: write record, cache flush, DSB
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the L2 swimlane profiling mechanism on the a5 platform to avoid cross-core reads of the shared L2SwimlaneActiveHead line, which previously caused pipeline stalls and wedged the AIC-to-AICPU FIN handshake. Instead, the current rotating record buffer pointer is delivered directly to the AICore via the per-task dispatch payload. Additionally, the host-side memory allocation is updated to use paired buffers with host shadows to support non-SVM environments on a5, and proper device-to-host synchronization is added during counter reconciliation. Feedback was provided to explicitly zero-initialize the allocated shared memory region on the host shadow to prevent potential garbage values from causing undefined behavior.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a5/platform/shared/host/l2_swimlane_collector.cpp (1)
570-601:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReconcile AICore task counters too.
l2_swimlane_aicpu_on_aicore_dispatch()bumps the AICore pool’stotal_record_count, and AICore drops can show up as silent loss, butreconcile_counters()only checks PERF/SCHED/ORCH pools. Add anAICOREreconciliation pass usingget_aicore_buffer_state(...)andcollected_aicore_records_so lost AICore records are reported.Proposed addition
+ uint64_t total_aicore_collected = 0; + for (const auto &records : collected_aicore_records_) { + total_aicore_collected += records.size(); + } + + reconcile_one( + "AICORE", "core", num_aicore_, + [this](int core_index) { + return get_aicore_buffer_state(shm_host_, num_aicore_, core_index); + }, + [](void *host_ptr) { + return reinterpret_cast<L2SwimlaneAicoreTaskBuffer *>(host_ptr)->count; + }, + sizeof(L2SwimlaneAicoreTaskBuffer), total_aicore_collected, /*optional=*/false + ); + reconcile_one( "SCHED_PHASE", "thread", PLATFORM_MAX_AICPU_THREADS,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp` around lines 570 - 601, The reconcile_counters() method currently only reconciles PERF, SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool, which is incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another reconcile_one() call after the existing three calls to reconcile the AICORE pool using get_aicore_buffer_state() for buffer state retrieval and collected_aicore_records_ to track collected AICore records. Follow the same pattern as the existing reconcile_one() invocations, specifying the appropriate buffer type cast and size.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 108-113: After each successful alloc_paired_buffer call in
l2_swimlane_collector.cpp, subsequent return -1 paths leak device buffers
because shm_host_ remains unset and finalize() cannot clean them up. Add a local
unwind/guard pattern at each allocation site (lines 108-113 anchor, and siblings
at 159-163, 187-190, 223-227, 254-258, 289-303) that captures the allocated
perf_dev_ptr and perf_host_ptr, and before any return -1 following allocation,
add cleanup code to release the device buffer (via the appropriate deallocation
function) and clear the paired memory mappings. Ensure the cleanup is performed
consistently across all seven affected locations before returning failure.
- Line 317: Check the return codes from all profiling_copy_to_device calls
before proceeding with their results, and abort or log errors appropriately
instead of consuming potentially stale shadow data. This fix must be applied
consistently wherever profiling_copy_to_device is called in the paired-buffer
workflow paths, including initial push operations and pull operations, to ensure
that device-copy failures do not lead to AICPU processing unprimed queues or the
code reading stale host state during reconciliation and metadata operations.
---
Outside diff comments:
In `@src/a5/platform/shared/host/l2_swimlane_collector.cpp`:
- Around line 570-601: The reconcile_counters() method currently only reconciles
PERF, SCHED_PHASE, and ORCH_PHASE pools but does not check the AICore pool,
which is incremented by l2_swimlane_aicpu_on_aicore_dispatch(). Add another
reconcile_one() call after the existing three calls to reconcile the AICORE pool
using get_aicore_buffer_state() for buffer state retrieval and
collected_aicore_records_ to track collected AICore records. Follow the same
pattern as the existing reconcile_one() invocations, specifying the appropriate
buffer type cast and size.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 99549a8f-e4f9-4939-b1ad-949f565f71ff
📒 Files selected for processing (9)
src/a5/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a5/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a5/platform/include/host/l2_swimlane_collector.hsrc/a5/platform/shared/aicpu/l2_swimlane_collector_aicpu.cppsrc/a5/platform/shared/host/l2_swimlane_collector.cppsrc/a5/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
💤 Files with no reviewable changes (1)
- src/a5/platform/include/host/l2_swimlane_collector.h
…ffer channel)
Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the
test passes without the flag. Both are a5-only.
1) Host SIGSEGV in collector init. a5 profiling memory is pure device HBM
(no halHostRegister), so the host cannot write it directly — but
L2SwimlaneCollector::initialize fell back to host_ptr = dev_ptr and wrote
the header + SPSC pool slots straight to that raw device pointer. The
sibling a5 collectors (dep_gen, pmu) already use a host-shadow model; this
one did not. Route allocation through ProfilerBase::alloc_paired_buffer,
install the real copy callbacks in set_memory_context, push the
host-initialized region to device once at end of init, refresh from device
on the reconcile/read_phase paths, and add clear_mappings() in finalize to
free the host shadows. Also bring the collector in line with its siblings'
init hardening: zero the host shadow right after allocation (don't assume
the allocator returns zeroed memory / clean padding), add an
InitRollbackGuard so a failure on any post-allocation init path releases
the buffers + shadows already allocated (shm_host_ is set only at the end,
so finalize() cannot clean up a partial init), and validate the alloc/free
callbacks up front.
2) AICore hang -> scheduler timeout -> op-timeout 507000. The AICore record
path read the shared L2SwimlaneActiveHead line (current_buf_ptr) every task
to learn its rotating record buffer. On a5, an AICore read of a GM line the
AICPU concurrently writes stalls the AICore pipeline so it never signals
FIN; the scheduler then spins to its idle timeout and the AICPU exec op
exceeds the 3s CANN op-execute timeout (a2a3 tolerates the same code).
Deliver the current buffer pointer through the per-task dispatch payload
instead (reusing the former ABI pad; struct stays 512B): the scheduler
stamps PTO2DispatchPayload::l2_swimlane_cur_buf_ptr right after the AICPU
rotation hook, and AICore reads it from the payload it already dcci's each
task — no cross-core shared-head read. AICPU rotation + host recycling are
unchanged, so unbounded per-core records are preserved. host_build_graph
keeps its head model, adapted to the buf-ptr record_task signature.
a5-only; a2a3 untouched.
Two separate bugs kept --enable-l2-swimlane from running on a5 silicon; the
test passes without the flag. Both are a5-only.