[WIP] feat(dfx): add l0_swimlane intra-core pipeline trace tool by indigo1973 · Pull Request #1053 · hw-native-sys/simpler

indigo1973 · 2026-06-15T07:30:15Z

Generate an AICore intra-core swimlane trace.json for a single kernel:
read the test's CALLABLE to resolve the kernel by func_id, run a JSON-only
tensor dump to capture the real per-task args, reconstruct them filtered by
func_id (zero hand-guessing), emit a 5-file msprof-op-simulator replay
workspace, and collect the camodel trace. The dump's golden PASS is the
gate that the captured args are trustworthy.

Handles five kernel shapes: AIC-only, AIV-only, SPMD AIV, cooperative SPMD
mix, and offset subtasks (an independent kernel packed into a mix dispatch,
whose args start at a non-zero slot). --set-arg shrinks a replay loop count
(scalar n_blocks or the context_lens control tensor) without distorting the
pipeline structure.

Supporting changes:

tensor dump: stamp each record with the originating kernel's func_id (task->kernel_id[slot]) so a multi-kernel dump can be filtered per kernel; new func_id field in TensorDumpRecord/TensorDumpInfo and the collector JSON (-1 when unknown). Required by l0_swimlane's func_id reconstruction.
tests: declare the incore tensor signatures the dump needs (paged_attention_unroll, spmd_multiblock_aiv SPMD_WRITE_AIV, spmd_paged_attention PA_AIC full 9-tensor layout) and add a small manual SmallCase1 to spmd_paged_attention as an onboard mix trace target.
docs: new docs/dfx/l0-swimlane-profiling.md (usage, kernel-shape table, --set-arg loop shrinking, the cooperative-mix signature rule — declare on exactly one of the pair); tensor-dump.md documents the func_id field; insight-trace SKILL.md records the manual recipe the tool automates.
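
The func_id stamping listed above is what makes per-kernel filtering of a multi-kernel dump straightforward. A minimal sketch, assuming the manifest can be read as a list of records that each carry `func_id` and `arg_index` keys; the helper name `records_for_kernel` and the sample data are hypothetical and are not taken from the tool:

```python
def records_for_kernel(records, func_id):
    """Keep only dump records stamped with the requested func_id.

    Records with func_id == -1 (unknown origin) never match a real kernel.
    """
    if func_id < 0:
        raise ValueError("func_id must be a non-negative kernel id")
    return [r for r in records if r.get("func_id", -1) == func_id]


# Hypothetical multi-kernel dump: two kernels plus one record of unknown origin.
sample = [
    {"func_id": 0, "arg_index": 0},
    {"func_id": 2, "arg_index": 0},
    {"func_id": 0, "arg_index": 1},
    {"func_id": -1, "arg_index": 0},
]
slots = [r["arg_index"] for r in records_for_kernel(sample, 0)]
```

Treating a missing `func_id` key as -1 is a choice made for this sketch, so that dumps written before the field existed would simply match no kernel.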

coderabbitai · 2026-06-15T07:30:31Z

Walkthrough

Adds func_id to TensorDumpRecord, TensorDumpInfo, and DumpedTensor structs and propagates it through the AICPU dump runtime and JSON manifest. Introduces a new l0_swimlane CLI tool that uses func_id-filtered tensor dumps to reconstruct kernel args, code-generate a replay workspace, smoke-build it, run msprof op simulator, and emit both Insight and Perfetto-friendly trace files. Updates test kernel signatures to expose tensors for dump capture, and adds full documentation.

Changes

L0 Swimlane profiling tool and func_id pipeline

Layer / File(s)	Summary
func_id field added to TensorDumpRecord/TensorDumpInfo/DumpedTensor `src/a2a3/platform/include/common/tensor_dump.h`, `src/a5/platform/include/common/tensor_dump.h`, `src/common/platform/include/host/tensor_dump_collector.h`	`TensorDumpRecord` gains `uint16_t func_id` (pad0 resized to preserve 128B layout) in both a2a3 and a5 headers; `TensorDumpInfo` gains `int32_t func_id` (-1 unknown) in both; `DumpedTensor` gains `uint16_t func_id` (0xFFFF unknown).
func_id population in AICPU runtime, record writer, and JSON export `src/common/platform/include/aicpu/tensor_dump_aicpu.h`, `src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp`, `src/common/platform/shared/host/tensor_dump_collector.cpp`	`dump_tensors_for_task` derives `scalar_func_id` from the first active subtask's `kernel_id` and sets `TensorDumpInfo.func_id` for tensor and scalar records; the host-build-graph overload sets `func_id=-1`. `dump_tensor_record` casts to `uint16_t`; the collector copies the field and maps `0xFFFF` back to `-1` in JSON.
Kernel callable signatures updated for tensor dump capture `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py`, `tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py`, `tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py`, `tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py`	QK/PV incore signatures gain an extra `D.IN` on both platforms; `SPMD_WRITE_AIV` gains `signature:[D.INOUT]`; `spmd_paged_attention` orchestration gets a full 9-tensor signature and a new `SmallCase1` test case.
l0_swimlane: constants, kernel metadata, and dump acquisition `simpler_setup/tools/l0_swimlane.py` (lines 1–281)	Defines dtype/size/arch constants; `load_kernel_meta` loads SceneTest metadata and detects cooperative mix kernels; `get_or_run_dump` validates/reuses or executes `--dump-tensor 3` and returns the `tensor_dump.json` path.
l0_swimlane: arg reconstruction from tensor dump `simpler_setup/tools/l0_swimlane.py` (lines 287–364)	`reconstruct_task_args` filters records by `func_id`, merges tensor metadata across dump stages, validates slot uniqueness, computes strides, and produces the final tensor descriptor + scalar arg list.
l0_swimlane: replay workspace codegen and build `simpler_setup/tools/l0_swimlane.py` (lines 383–1056)	Single-core and SPMD mix replay code generators (kernel, launch stubs, host main with synthesized SPMD context slots); `run_collect.sh` template; `generate_workspace` writes the 5-file workspace; `smoke_build` compiles via CMake/Ninja and validates exported symbols.
l0_swimlane: collect, Perfetto transform, overrides, and CLI main `simpler_setup/tools/l0_swimlane.py` (lines 1058–1453)	`_to_perfetto` repacks overlapping intervals into sub-lanes and merges B/E pairs into `ph:X` slices; `collect` manages device locking, locates `trace.json`, and produces `_perfetto.json`; `apply_arg_overrides` validates `--set-arg` specs; `main` wires the full CLI workflow.
Documentation: l0-swimlane-profiling, tensor-dump func_id, Perfetto SKILL `docs/dfx/l0-swimlane-profiling.md`, `docs/dfx/tensor-dump.md`, `.claude/skills/insight-trace/SKILL.md`	New 616-line profiling doc covers workflow, CLI flags, workspace internals, SPMD synthesis, fidelity rules, findings, FAQ, and related docs. `tensor-dump.md` adds `func_id` to the example JSON and key-fields. `SKILL.md` adds a Perfetto post-processing section.
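
The table describes two encodings of "unknown": `0xFFFF` in the `uint16_t` record field and `-1` in `TensorDumpInfo` and the JSON manifest. A minimal sketch of that round trip, assuming a plain narrowing on the way in; the function names and the range check are illustrative additions, not the runtime's code:

```python
UNKNOWN_U16 = 0xFFFF  # sentinel stored in the uint16_t record field
UNKNOWN_JSON = -1     # value reported in the JSON manifest


def to_record_func_id(func_id: int) -> int:
    """Narrow an int32 func_id (-1 = unknown) to the uint16 record field."""
    if func_id == UNKNOWN_JSON:
        return UNKNOWN_U16
    # Assumption for this sketch: reject ids that would collide with the sentinel.
    if not 0 <= func_id < UNKNOWN_U16:
        raise ValueError(f"func_id {func_id} does not fit the uint16 field")
    return func_id


def to_json_func_id(raw: int) -> int:
    """Map the uint16 record value back to the JSON convention."""
    return UNKNOWN_JSON if raw == UNKNOWN_U16 else raw
```

One consequence of the sentinel is that a real kernel id of 65535 cannot be represented, which is why the sketch rejects it rather than letting it read back as unknown.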

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant l0_swimlane_main as l0_swimlane main()
    participant get_or_run_dump
    participant reconstruct_task_args
    participant generate_workspace
    participant smoke_build
    participant collect
    participant _to_perfetto

    User->>l0_swimlane_main: --test, --func-id, --platform, --case, --set-arg
    l0_swimlane_main->>get_or_run_dump: run SceneTest --dump-tensor 3
    get_or_run_dump-->>l0_swimlane_main: tensor_dump.json path
    l0_swimlane_main->>reconstruct_task_args: filter by func_id, merge arg stages
    reconstruct_task_args-->>l0_swimlane_main: tensor descriptors + scalar args
    l0_swimlane_main->>generate_workspace: emit 5-file replay workspace
    generate_workspace-->>l0_swimlane_main: workspace directory
    l0_swimlane_main->>smoke_build: cmake/ninja + symbol check
    smoke_build-->>l0_swimlane_main: build OK
    l0_swimlane_main->>collect: run msprof op simulator, locate trace.json
    collect->>_to_perfetto: repack intervals, merge B/E→X, rewrite tid
    _to_perfetto-->>collect: trace_perfetto.json
    collect-->>User: trace.json + trace_perfetto.json
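
The "repack intervals" step in the diagram places overlapping slices on separate sub-lanes so a viewer does not mis-nest them. One common way to do this is greedy first-fit packing, sketched below; this illustrates the general technique and is not the body of `_to_perfetto`, and the function name `assign_sublanes` is hypothetical:

```python
def assign_sublanes(intervals):
    """Greedy first-fit: place each (start, end) in the first lane that is free.

    Returns a list of lane indices aligned with the input order.
    """
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    lane_end = []  # lane_end[k] = end time of the last slice placed in lane k
    lanes = [0] * len(intervals)
    for i in order:
        start, end = intervals[i]
        for k, busy_until in enumerate(lane_end):
            if busy_until <= start:  # touching intervals may share a lane
                lane_end[k] = end
                lanes[i] = k
                break
        else:
            # No existing lane is free at `start`: open a new one.
            lane_end.append(end)
            lanes[i] = len(lane_end) - 1
    return lanes


lanes = assign_sublanes([(0, 10), (5, 15), (10, 20), (12, 14)])
# lanes == [0, 1, 0, 2]
```

Processing intervals in start order keeps each lane's slices non-overlapping and, for interval data, uses as few lanes as the maximum overlap depth requires.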

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description comprehensively explains the l0_swimlane tool's functionality, supported kernel shapes, the --set-arg flag, and all supporting changes (tensor dump func_id field, test updates, and documentation additions), clearly relating to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly describes the main change: introducing a new `l0_swimlane` intra-core pipeline trace tool, which is the primary feature across all modified files.

gemini-code-assist

Code Review

This pull request introduces the l0_swimlane.py tool to automate AICore intra-core pipeline trace generation, along with supporting documentation and runtime updates to track func_id in tensor dumps. The review feedback is highly constructive and should be addressed: specifically, a critical bug in l0_swimlane.py where FLOAT32 scalars are truncated to integers during reconstruction, and another bug where sequential B/E trace events overwrite each other in loops instead of being matched via a stack. Additionally, replacing angle-bracket placeholders in the Markdown bash snippets with safe shell variables will prevent syntax and redirection errors.

gemini-code-assist · 2026-06-15T07:33:51Z

+    for s in scalars:
+        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})


For FLOAT32 scalars, the tensor dump collector writes the actual float value (e.g., 1.5) into the JSON manifest. Simply casting this value to an integer using int(s["value"]) will truncate the float and result in an incorrect integer value (e.g., 1 instead of the IEEE 754 binary representation of 1.5f, which is 1069547520). When the kernel later interprets this value as a float, it will read a completely wrong value (almost zero).

We should check if the scalar's dtype is FLOAT32 and convert it to its 32-bit float binary representation as an integer using struct.

Suggested change

-    for s in scalars:
-        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
+    for s in scalars:
+        val = s["value"]
+        if s.get("dtype", "").upper() == "FLOAT32":
+            import struct
+            val = struct.unpack("<I", struct.pack("<f", float(val)))[0]
+        else:
+            val = int(val)
+        args.append({"kind": "scalar", "slot": s["arg_index"], "value": val})
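
The difference between truncating a float and preserving its bit pattern can be checked directly with the standard `struct` module. A small sketch; the helper names `f32_bits` and `bits_as_f32` are introduced here for illustration only:

```python
import struct


def f32_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_as_f32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


correct = f32_bits(1.5)            # 0x3FC00000 == 1069547520
truncated = bits_as_f32(int(1.5))  # pattern 0x00000001, a tiny denormal
```

`correct` round-trips to 1.5 through `bits_as_f32`, whereas `truncated` is the smallest positive single-precision denormal, which matches the "almost zero" behaviour described in the comment.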

gemini-code-assist · 2026-06-15T07:33:51Z

+    intervals = defaultdict(list)
+    be = defaultdict(dict)
+    for e in evs:
+        if not is_core(e):
+            continue
+        ph = e.get("ph")
+        key = (e["pid"], e["tid"])
+        if ph == "X":
+            intervals[key].append(
+                {
+                    "ts": e["ts"],
+                    "end": e["ts"] + e.get("dur", 0.0),
+                    "name": e["name"],
+                    "args": e.get("args", {}),
+                }
+            )
+        elif ph in ("B", "E"):
+            slot = be[(e["pid"], e["tid"], e.get("id"))]
+            slot[ph] = e["ts"]
+            slot.setdefault("src", e)
+    for (pid, tid, eid), slot in be.items():
+        s = slot.get("B", slot.get("E"))
+        en = slot.get("E", slot.get("B"))
+        if s is not None and en is not None and s > en:
+            s, en = en, s
+        src = slot["src"]
+        intervals[(pid, tid)].append(
+            {
+                "ts": s,
+                "end": en,
+                "name": src["name"],
+                "args": src.get("args", {}),
+                "id": eid,
+            }
+        )


The current logic for pairing B and E events uses a single dictionary slot per (pid, tid, id) key. If a flag is set and waited multiple times (which is extremely common in loops), or if multiple events share the same id (or have None as id), subsequent B or E events will overwrite the previous ones in the be dictionary. This results in silent loss of all intermediate events and corrupted intervals (e.g., pairing the first event's metadata with the last event's timestamps).

We should use a stack-based matching approach to correctly pair nested or sequential B and E events on the same thread/id.

    intervals = defaultdict(list)
    be_stacks = defaultdict(list)
    for e in evs:
        if not is_core(e):
            continue
        ph = e.get("ph")
        if ph == "X":
            intervals[(e["pid"], e["tid"])].append(
                {
                    "ts": e["ts"],
                    "end": e["ts"] + e.get("dur", 0.0),
                    "name": e["name"],
                    "args": e.get("args", {}),
                }
            )
        elif ph == "B":
            be_stacks[(e["pid"], e["tid"], e.get("id"))].append(e)
        elif ph == "E":
            key = (e["pid"], e["tid"], e.get("id"))
            stack = be_stacks[key]
            if stack:
                b_ev = stack.pop()
                s = b_ev["ts"]
                en = e["ts"]
                if s > en:
                    s, en = en, s
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": s,
                        "end": en,
                        "name": b_ev["name"],
                        "args": b_ev.get("args", {}),
                        "id": e.get("id"),
                    }
                )
            else:
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": e["ts"],
                        "end": e["ts"],
                        "name": e["name"],
                        "args": e.get("args", {}),
                        "id": e.get("id"),
                    }
                )
    for key, stack in be_stacks.items():
        pid, tid, eid = key
        for b_ev in stack:
            intervals[(pid, tid)].append(
                {
                    "ts": b_ev["ts"],
                    "end": b_ev["ts"],
                    "name": b_ev["name"],
                    "args": b_ev.get("args", {}),
                    "id": eid,
                }
            )
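
The overwrite problem can be reproduced on a few synthetic events. The sketch below contrasts a single-slot pairing with a stack-based one; both functions are simplified stand-ins written for this illustration (they drop `pid`, names, and args), and the timestamps are made up:

```python
from collections import defaultdict


def pair_overwrite(events):
    """One slot per key: later B/E timestamps overwrite earlier ones."""
    slots = defaultdict(dict)
    for e in events:
        slots[(e["tid"], e.get("id"))][e["ph"]] = e["ts"]
    return sorted((s["B"], s["E"]) for s in slots.values() if "B" in s and "E" in s)


def pair_stack(events):
    """Stack per key: each E closes the most recent open B."""
    stacks = defaultdict(list)
    out = []
    for e in events:
        key = (e["tid"], e.get("id"))
        if e["ph"] == "B":
            stacks[key].append(e["ts"])
        elif e["ph"] == "E" and stacks[key]:
            out.append((stacks[key].pop(), e["ts"]))
    return sorted(out)


# The same flag set and waited three times in a loop (hypothetical timestamps).
loop = [
    {"tid": 1, "id": 7, "ph": ph, "ts": ts}
    for ts, ph in [(0, "B"), (2, "E"), (10, "B"), (12, "E"), (20, "B"), (22, "E")]
]
```

On `loop`, `pair_overwrite` returns only `[(20, 22)]` while `pair_stack` returns all three intervals, which is the silent loss of intermediate events the comment describes.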

gemini-code-assist · 2026-06-15T07:33:51Z

+```bash
+python -m simpler_setup.tools.l0_swimlane \
+    --test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
+```


Avoid using angle brackets for literal placeholders (e.g., <case>, <name>, <N>) in Markdown bash code blocks to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders or safe, quoted placeholders instead.

Suggested change

-python -m simpler_setup.tools.l0_swimlane \
-    --test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
+python -m simpler_setup.tools.l0_swimlane \
+    --test tests/st/"$CASE_DIR"/test_"$NAME".py --func-id "$FUNC_ID" --platform a2a3sim

References

In Markdown bash code blocks, avoid using angle brackets for literal placeholders to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders like $ISSUE_NUMBER instead.

gemini-code-assist · 2026-06-15T07:33:52Z

+```bash
+# Environment (once per shell): activate the venv and source CANN.
+source .venv/bin/activate
+export ASCEND_HOME_PATH=<your CANN install>     # e.g. .../cann-9.0.0


Avoid using angle brackets for placeholders (e.g., <your CANN install>) in Markdown bash code blocks to prevent them from being parsed as input redirection. Use safe, quoted placeholders instead.

Suggested change

-export ASCEND_HOME_PATH=<your CANN install>     # e.g. .../cann-9.0.0
+export ASCEND_HOME_PATH="path/to/cann/install" # e.g. .../cann-9.0.0

References

In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.

gemini-code-assist · 2026-06-15T07:33:52Z

+# First kernel: runs the sim dump.
+python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim
+
+# Subsequent kernels: point at the manifest the first run produced.
+python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
+    --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
+```


Avoid using angle brackets for placeholders (e.g., <file>, <ClassName>, <Case>, <ts>) in Markdown bash code blocks to prevent shell syntax errors or input redirection. Use safe, quoted placeholders or standard shell variables instead.

Suggested change

-# First kernel: runs the sim dump.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim
-
-# Subsequent kernels: point at the manifest the first run produced.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
-    --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
+# First kernel: runs the sim dump.
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 0 --platform a2a3sim
+
+# Subsequent kernels: point at the manifest the first run produced.
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 2 --platform a2a3sim \
+    --dump-json outputs/"$CLASS_NAME"_"$CASE"_"$TIMESTAMP"/tensor_dump/tensor_dump.json

References

In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/l0_swimlane.py`:
- Around line 193-198: The `inc["source"]` value is embedded directly into the
generated `#include` directive without validation or escaping. A path that
contains quotes or backslashes produces invalid C++, and an absolute or
out-of-repo path can compile an unintended kernel. In
simpler_setup/tools/l0_swimlane.py at lines 193-198 (the dictionary construction
where "source" is set to Path(inc["source"])), resolve the path relative to the
test file/repo root and reject out-of-repo paths. At lines 417-419 and lines
766-768, escape the value with a helper such as the proposed
_cpp_string_literal_path() before embedding it in the include path.
- Around line 719-741: Clear the msprof_collect and insight_export directories
before running the collection to avoid selecting stale OPPROF_* artifacts from
previous failed or partial runs. Add an rm -rf of COLLECT_DIR and EXPORT_ROOT
immediately before they are re-created with mkdir -p, so that both directories
are empty when the msprof op simulator command executes. Apply the same cleanup
at the sibling location (lines 1225-1231 in the same file).
- Around line 340-349: Add validation to fail fast when tensor rank exceeds the
descriptor capacity of 7 dimensions. The descriptor layout has a fixed capacity
for shapes (byte 44) and strides (byte 72), so any tensor with len(shape) > 7
will silently corrupt adjacent descriptor fields in make_desc. Insert a check
that len(shape) <= 7 at the three affected sites in
simpler_setup/tools/l0_swimlane.py (around lines 340-349 where shape is
initialized, around lines 493-496, and around lines 863-866) and raise a clear
error message immediately if the rank exceeds this limit, preventing corrupt
replay generation.
- Around line 460-496: The issue is that buf_bytes includes start_offset and is
used for both memory allocation (aclrtMalloc) and as the buffer.size parameter
in make_desc, which causes the descriptor to overstate the tensor's logical span
when start_offset is non-zero. Calculate the logical tensor size separately
(without start_offset) and use buf_bytes for the allocation size (aclrtMalloc
with t{ti}Bytes), but pass only the logical tensor span (calculated as
_extent_elem(shape, strides) * esz) to the make_desc function call instead of
t{ti}Bytes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d308910-0bc0-4ec7-bc12-28f236c11b0e

📥 Commits

Reviewing files that changed from the base of the PR and between c5ded40 and 812d8a7.

📒 Files selected for processing (14)

.claude/skills/insight-trace/SKILL.md
docs/dfx/l0-swimlane-profiling.md
docs/dfx/tensor-dump.md
simpler_setup/tools/l0_swimlane.py
src/a2a3/platform/include/common/tensor_dump.h
src/a5/platform/include/common/tensor_dump.h
src/common/platform/include/aicpu/tensor_dump_aicpu.h
src/common/platform/include/host/tensor_dump_collector.h
src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
src/common/platform/shared/host/tensor_dump_collector.cpp
tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py

coderabbitai · 2026-06-15T07:38:30Z

+            same_src = [i for _, i in incores if i["source"] == inc["source"]]
+            core_types = {i["core_type"] for i in same_src}
+            is_mix = "aic" in core_types and "aiv" in core_types
+            return {
+                "source": Path(inc["source"]),
+                "core_type": inc["core_type"],


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate and escape the generated #include path.

inc["source"] comes from the imported test metadata and is emitted verbatim into #include "{source}". A source path containing quotes/backslashes will generate invalid C++, and an unexpected absolute/out-of-repo path can make the replay compile a different kernel than the selected test intended. Resolve the path relative to the test file/repo root, reject paths outside the allowed tree, and escape it before codegen.

🛡️ Proposed fix direction

+def _cpp_string_literal_path(p: Path) -> str:
+    return str(p).replace("\\", "\\\\").replace('"', '\\"')
+
 def load_kernel_meta(test_path: Path, func_id: int, platform: str):
@@
-            return {
-                "source": Path(inc["source"]),
+            src = Path(inc["source"])
+            if not src.is_absolute():
+                src = (test_path.parent / src).resolve()
+            else:
+                src = src.resolve()
+            try:
+                src.relative_to(PROJECT_ROOT)
+            except ValueError as exc:
+                raise ValueError(f"CALLABLE source must stay under repo root: {src}") from exc
+            return {
+                "source": src,

Then call _cpp_string_literal_path(source) when embedding the include path.

-{_prologue(cfg)}#include "{source}"
+{_prologue(cfg)}#include "{_cpp_string_literal_path(source)}"

Also applies to: 417-419, 766-768
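
The two concerns in this comment, containment and escaping, can be tried in isolation. A minimal sketch under stated assumptions: `escape_cpp` and `cpp_include_literal` are hypothetical names, and the repo root is passed in explicitly rather than taken from the tool's `PROJECT_ROOT`:

```python
from pathlib import Path


def escape_cpp(text: str) -> str:
    """Escape backslashes first, then double quotes, for a C++ string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def cpp_include_literal(source: str, test_dir: Path, repo_root: Path) -> str:
    """Resolve source against test_dir, require it under repo_root, escape for C++."""
    path = Path(source)
    if not path.is_absolute():
        path = test_dir / path
    path = path.resolve()
    try:
        path.relative_to(repo_root.resolve())
    except ValueError as exc:
        raise ValueError(f"source escapes repo root: {path}") from exc
    return escape_cpp(str(path))
```

Escaping backslashes before quotes matters: doing it the other way round would double the backslash that was just added in front of each quote.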

coderabbitai · 2026-06-15T07:38:30Z

+        shape = list(t["shape"])
+        strides = list(t.get("strides") or _row_major(shape))
+        args.append(
+            {
+                "kind": "tensor",
+                "slot": t["arg_index"],
+                "dtype": dt,
+                "shape": shape,
+                "strides": strides,
+                "start_offset": int(t.get("start_offset", 0)),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when tensor rank exceeds descriptor capacity.

The generated descriptor layout stores shapes at byte 44 and strides at byte 72, which leaves room for 7 dimensions. A dump with len(shape) > 7 will silently overwrite adjacent descriptor fields in make_desc; validate the rank before codegen instead of emitting a corrupt replay.

🐛 Proposed guard

         dt = t["dtype"].upper()
         shape = list(t["shape"])
+        if len(shape) > 7:
+            raise ValueError(f"tensor arg {t['arg_index']} rank {len(shape)} exceeds descriptor capacity (7)")
         strides = list(t.get("strides") or _row_major(shape))
+        if len(strides) != len(shape):
+            raise ValueError(f"tensor arg {t['arg_index']} has shape/strides rank mismatch")

Also applies to: 493-496, 863-866
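
The limit of 7 follows from the two byte offsets quoted in the comment. A short worked sketch, assuming the shape entries are packed 4-byte values (consistent with the `uint32_t` arrays the generator emits) and that the stride array starts right where the shape array ends; the constant and function names are illustrative:

```python
SHAPE_OFFSET = 44   # byte offset of the shape array (from the review comment)
STRIDE_OFFSET = 72  # byte offset of the stride array (from the review comment)
ELEM_BYTES = 4      # assumed entry width: uint32_t

# (72 - 44) / 4 = 7 shape entries fit before the stride array begins.
MAX_RANK = (STRIDE_OFFSET - SHAPE_OFFSET) // ELEM_BYTES


def check_rank(shape, strides):
    """Raise before codegen if the descriptor cannot hold this tensor."""
    if len(shape) > MAX_RANK:
        raise ValueError(f"rank {len(shape)} exceeds descriptor capacity ({MAX_RANK})")
    if len(strides) != len(shape):
        raise ValueError("shape/strides rank mismatch")
```

Deriving the limit from the offsets, rather than hard-coding 7, keeps the guard in step with the layout if either offset ever changes.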

coderabbitai · 2026-06-15T07:38:30Z

+        buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
+        contig = 1 if _is_contiguous(shape, strides, a["start_offset"]) else 0
+        ndims = len(shape)
+        shp = ", ".join(str(x) for x in shape)
+        strd = ", ".join(str(x) for x in strides)
+        # Default: data memset to 0 (only descriptor metadata is real). When
+        # --set-arg fills this tensor, write VALUE into every element instead —
+        # for control tensors whose CONTENT drives the kernel (e.g. paged
+        # attention reads n_blocks from the context_lens tensor). The low `esz`
+        # bytes of the int64 VALUE are copied per element (correct for any
+        # integer width, little-endian).
+        fill = a.get("fill")
+        if fill is None:
+            init = f"    ACL_CHECK(aclrtMemset(d_t{ti}, t{ti}Bytes, 0, t{ti}Bytes));"
+        else:
+            init = (
+                f"    {{\n"
+                f"        std::vector<unsigned char> hbuf{ti}(t{ti}Bytes, 0);\n"
+                f"        const int64_t fillv{ti} = {fill}LL;\n"
+                f"        for (size_t off = 0; off + {esz} <= t{ti}Bytes; off += {esz})\n"
+                f"            memcpy(hbuf{ti}.data() + off, &fillv{ti}, {esz});\n"
+                f"        ACL_CHECK(aclrtMemcpy(d_t{ti}, t{ti}Bytes, hbuf{ti}.data(), t{ti}Bytes,\n"
+                f"                              ACL_MEMCPY_HOST_TO_DEVICE));\n"
+                f"    }}"
+            )
+        alloc.append(
+            f"    void *d_t{ti} = nullptr;\n"
+            f"    const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
+            f"    ACL_CHECK(aclrtMalloc(&d_t{ti}, t{ti}Bytes, ACL_MEM_MALLOC_HUGE_FIRST));\n"
+            f"{init}"
+        )
+        descs.append(
+            f"    {{\n"
+            f"        const uint32_t shp[] = {{{shp}}};\n"
+            f"        const uint32_t strd[] = {{{strd}}};\n"
+            f"        make_desc(h_tensors.data() + {ti} * 128, (uint64_t)(uintptr_t)d_t{ti},\n"
+            f"                  t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decouple allocation size from descriptor buffer.size.

buf_bytes includes start_offset and is also written into the descriptor as buffer.size, while start_offset is written separately. For non-zero-offset dump records this overstates the tensor’s logical span by start_offset elements; allocate enough bytes for the offset view, but pass the captured/logical tensor span to make_desc.

🐛 Proposed fix direction

-        buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
+        desc_bytes = _extent_elem(shape, strides) * esz
+        alloc_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
@@
-            f"    const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
+            f"    const size_t t{ti}Bytes = {alloc_bytes}ULL;\n"
+            f"    const size_t t{ti}DescBytes = {desc_bytes}ULL;\n"
@@
-            f"                  t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
+            f"                  t{ti}DescBytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
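
A small numeric example makes the gap between the two sizes concrete. The sketch assumes the extent of a strided view is `1 + sum((dim - 1) * stride)` with non-negative strides; the tool's `_extent_elem` is not shown in this thread, so `extent_elems` below is a hypothetical stand-in and may differ from it:

```python
def extent_elems(shape, strides):
    """Elements spanned by a strided view: 1 + sum((dim - 1) * stride)."""
    if any(d == 0 for d in shape):
        return 0
    return 1 + sum((d - 1) * s for d, s in zip(shape, strides))


def sizes(shape, strides, start_offset, elem_bytes):
    """Return (descriptor bytes, allocation bytes) for an offset view."""
    span = extent_elems(shape, strides)
    desc_bytes = span * elem_bytes                    # logical span only
    alloc_bytes = (start_offset + span) * elem_bytes  # must also cover the offset
    return desc_bytes, alloc_bytes


# A 4x8 view with row stride 16, starting 32 elements into a 2-byte-element buffer.
desc_bytes, alloc_bytes = sizes([4, 8], [16, 1], start_offset=32, elem_bytes=2)
# desc_bytes == 112, alloc_bytes == 176
```

Here the view spans 1 + 3*16 + 7*1 = 56 elements, so the two sizes differ by exactly `start_offset * elem_bytes` = 64 bytes, the overstatement the comment points at.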

coderabbitai · 2026-06-15T07:38:30Z

+BUILD_DIR="$WS/build"
+COLLECT_DIR="$WS/msprof_collect"
+EXPORT_ROOT="$WS/insight_export"
+
+source "$CANN_HOME/set_env.sh"
+export ASCEND_HOME_PATH="$CANN_HOME"
+SIM_LIB_DIR="$CANN_HOME/aarch64-linux/simulator/$SOC_VERSION/lib"
+LD_LIBS="$BUILD_DIR:$SIM_LIB_DIR:$CANN_HOME/lib64"
+LD_LIBS="$LD_LIBS:$CANN_HOME/aarch64-linux/devlib:$CANN_HOME/devlib"
+export LD_LIBRARY_PATH="$LD_LIBS:${LD_LIBRARY_PATH:-}"
+export ACL_DEVICE_ID="$DEVICE_ID"
+mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"
+
+cmake -G Ninja -S "$WS" -B "$BUILD_DIR" \\
+    -DSOC_VERSION="$SOC_VERSION" -DPTO_ISA_ROOT="$PTO_ISA_ROOT" -DREPO_ROOT="$REPO_ROOT"
+cmake --build "$BUILD_DIR" --target replay_host
+
+msprof op simulator \\
+    --application="$BUILD_DIR/replay_host" --kernel-name="replay_entry" \\
+    --launch-count=1 --soc-version="$SOC_VERSION" --timeout=120 \\
+    --output="$COLLECT_DIR/out" 2>&1 | tee "$COLLECT_DIR/msprof_collect.log"
+
+OPPROF_DIR="$(find "$COLLECT_DIR/out" -maxdepth 1 -mindepth 1 -type d -name 'OPPROF_*' | sort | tail -n 1)"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear per-run collect/export directories before selecting OPPROF_*.

run_collect.sh reuses msprof_collect and insight_export, then both the shell script and Python pick the newest matching export. If a rerun leaves stale OPPROF_* artifacts after a failed or partial collect, the tool can return an old trace for the current kernel. Remove those directories or create a unique run subdirectory before each collection.

🧹 Proposed fix

 BUILD_DIR="$WS/build"
 COLLECT_DIR="$WS/msprof_collect"
 EXPORT_ROOT="$WS/insight_export"
@@
-mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"
+rm -rf "$COLLECT_DIR" "$EXPORT_ROOT"
+mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

Also applies to: 1225-1231

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@simpler_setup/tools/l0_swimlane.py` around lines 719 - 741, clear the msprof_collect and insight_export directories before running the collection to avoid selecting stale OPPROF_* artifacts from previous failed or partial runs. Add an rm -rf of COLLECT_DIR and EXPORT_ROOT immediately before the mkdir -p that recreates them, so both directories are empty when the msprof op simulator command executes. Apply the same cleanup at the sibling location (lines 1225-1231 in the same file).
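For the Python side of the same problem, a minimal sketch of the cleanup might look like the following. The helper names `reset_run_dirs` and `newest_opprof` are hypothetical and are not taken from `l0_swimlane.py`; only the directory names and the `OPPROF_*` pattern come from the script above.

```python
import shutil
from pathlib import Path


def reset_run_dirs(workspace, names=("msprof_collect", "insight_export")):
    """Delete and recreate the per-run output directories.

    If a directory cannot be removed, mkdir raises FileExistsError,
    so a stale tree is never silently reused.
    """
    fresh = []
    for name in names:
        run_dir = Path(workspace) / name
        shutil.rmtree(run_dir, ignore_errors=True)
        run_dir.mkdir(parents=True)
        fresh.append(run_dir)
    return fresh


def newest_opprof(out_dir):
    """Return the last OPPROF_* directory in sorted order, or None.

    Mirrors the `find ... | sort | tail -n 1` selection in the script.
    """
    candidates = sorted(p for p in Path(out_dir).glob("OPPROF_*") if p.is_dir())
    return candidates[-1] if candidates else None
```

Calling `reset_run_dirs` before each collection means `newest_opprof` can only see output from the current run; returning `None` when nothing matches lets the caller report a failed collect instead of loading an old trace.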

gemini-code-assist Bot reviewed Jun 15, 2026

coderabbitai Bot reviewed Jun 15, 2026

indigo1973 changed the title ~~feat(dfx): add l0_swimlane intra-core pipeline trace tool~~ [WIP] feat(dfx): add l0_swimlane intra-core pipeline trace tool Jun 16, 2026

	for s in scalars:
	    args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})

-    for s in scalars:
-        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
+    for s in scalars:
+        val = s["value"]
+        if s.get("dtype", "").upper() == "FLOAT32":
+            import struct
+            val = struct.unpack("<I", struct.pack("<f", float(val)))[0]
+        else:
+            val = int(val)
+        args.append({"kind": "scalar", "slot": s["arg_index"], "value": val})
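The reason for the suggested change is that `int()` truncates a float scalar, whereas the kernel argument slot needs the IEEE 754 bit pattern. A standalone sketch of the conversion, using a hypothetical helper name `scalar_to_raw` and assuming the dump marks such scalars with a `dtype` of `FLOAT32` as in the suggestion:

```python
import struct


def scalar_to_raw(value, dtype=""):
    """Return the integer to store in a scalar argument slot.

    FLOAT32 values are reinterpreted as their little-endian 32-bit
    pattern; every other dtype is converted with int().
    """
    if dtype.upper() == "FLOAT32":
        return struct.unpack("<I", struct.pack("<f", float(value)))[0]
    return int(value)


print(hex(scalar_to_raw(1.0, "FLOAT32")))  # 0x3f800000
print(scalar_to_raw(0.5))                  # 0, the truncation the suggestion avoids
```

Importing `struct` once at module level, rather than inside the loop as in the suggestion, keeps the loop body to the conversion itself.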

	export ASCEND_HOME_PATH=<your CANN install> # e.g. .../cann-9.0.0
	export ASCEND_HOME_PATH="path/to/cann/install" # e.g. .../cann-9.0.0

indigo1973 commented Jun 15, 2026
