
[WIP] feat(dfx): add l0_swimlane intra-core pipeline trace tool#1053

Open
indigo1973 wants to merge 1 commit into
hw-native-sys:mainfrom
indigo1973:l0_swim_0611

Conversation

@indigo1973

Contributor

Generate an AICore intra-core swimlane trace.json for a single kernel:
read the test's CALLABLE to resolve the kernel by func_id, run a JSON-only
tensor dump to capture the real per-task args, reconstruct them filtered by
func_id (zero hand-guessing), emit a 5-file msprof-op-simulator replay
workspace, and collect the camodel trace. The dump's golden PASS is the
gate that the captured args are trustworthy.

Handles five kernel shapes: AIC-only, AIV-only, SPMD AIV, cooperative SPMD
mix, and offset subtasks (an independent kernel packed into a mix dispatch,
whose args start at a non-zero slot). --set-arg shrinks a replay loop count
(scalar n_blocks or the context_lens control tensor) without distorting the
pipeline structure.

Supporting changes:

  • tensor dump: stamp each record with the originating kernel's func_id (task->kernel_id[slot]) so a multi-kernel dump can be filtered per kernel; new func_id field in TensorDumpRecord/TensorDumpInfo and the collector JSON (-1 when unknown). Required by l0_swimlane's func_id reconstruction.
  • tests: declare the incore tensor signatures the dump needs (paged_attention_unroll, spmd_multiblock_aiv SPMD_WRITE_AIV, spmd_paged_attention PA_AIC full 9-tensor layout) and add a small manual SmallCase1 to spmd_paged_attention as an onboard mix trace target.
  • docs: new docs/dfx/l0-swimlane-profiling.md (usage, kernel-shape table, --set-arg loop shrinking, the cooperative-mix signature rule — declare on exactly one of the pair); tensor-dump.md documents the func_id field; insight-trace SKILL.md records the manual recipe the tool automates.
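To make the func_id filtering step concrete, here is a minimal sketch of how records from a multi-kernel dump might be narrowed down to a single kernel. The flat list-of-dicts record layout and the helper name `records_for_kernel` are assumptions made for illustration; only the `func_id` and `arg_index` field names come from this PR.

```python
def records_for_kernel(records, func_id):
    """Keep only the dump records stamped with the requested func_id.

    Records without a func_id are treated as unknown (-1), mirroring the
    "-1 when unknown" convention described for the collector JSON.
    """
    matching = [r for r in records if r.get("func_id", -1) == func_id]
    # Order by argument slot so the result reads like the kernel's arg list.
    return sorted(matching, key=lambda r: r["arg_index"])


# Hypothetical sample data: two kernels plus one unknown record.
sample_records = [
    {"func_id": 0, "arg_index": 1, "dtype": "FLOAT16"},
    {"func_id": 2, "arg_index": 0, "dtype": "INT32"},
    {"func_id": 0, "arg_index": 0, "dtype": "FLOAT16"},
    {"func_id": -1, "arg_index": 0, "dtype": "FLOAT32"},
]
kernel0 = records_for_kernel(sample_records, 0)
```

In this sketch, records with an unknown func_id are never matched by a real kernel id, so they cannot leak into a reconstructed argument list.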

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Walkthrough

Adds func_id to TensorDumpRecord, TensorDumpInfo, and DumpedTensor structs and propagates it through the AICPU dump runtime and JSON manifest. Introduces a new l0_swimlane CLI tool that uses func_id-filtered tensor dumps to reconstruct kernel args, code-generate a replay workspace, smoke-build it, run msprof op simulator, and emit both Insight and Perfetto-friendly trace files. Updates test kernel signatures to expose tensors for dump capture, and adds full documentation.

Changes

L0 Swimlane profiling tool and func_id pipeline

| Layer | File(s) | Summary |
|---|---|---|
| func_id field added to TensorDumpRecord/TensorDumpInfo/DumpedTensor | src/a2a3/platform/include/common/tensor_dump.h, src/a5/platform/include/common/tensor_dump.h, src/common/platform/include/host/tensor_dump_collector.h | TensorDumpRecord gains uint16_t func_id (pad0 resized to preserve 128B layout) in both a2a3 and a5 headers; TensorDumpInfo gains int32_t func_id (-1 unknown) in both; DumpedTensor gains uint16_t func_id (0xFFFF unknown). |
| func_id population in AICPU runtime, record writer, and JSON export | src/common/platform/include/aicpu/tensor_dump_aicpu.h, src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp, src/common/platform/shared/host/tensor_dump_collector.cpp | dump_tensors_for_task derives scalar_func_id from the first active subtask's kernel_id and sets TensorDumpInfo.func_id for tensor and scalar records; the host-build-graph overload sets func_id=-1. dump_tensor_record casts to uint16_t; the collector copies the field and maps 0xFFFF back to -1 in JSON. |
| Kernel callable signatures updated for tensor dump capture | tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py, tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py, tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py, tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py | QK/PV incore signatures gain an extra D.IN on both platforms; SPMD_WRITE_AIV gains signature:[D.INOUT]; spmd_paged_attention orchestration gets a full 9-tensor signature and a new SmallCase1 test case. |
| l0_swimlane: constants, kernel metadata, and dump acquisition | simpler_setup/tools/l0_swimlane.py (lines 1–281) | Defines dtype/size/arch constants; load_kernel_meta loads SceneTest metadata and detects cooperative mix kernels; get_or_run_dump validates/reuses or executes --dump-tensor 3 and returns the tensor_dump.json path. |
| l0_swimlane: arg reconstruction from tensor dump | simpler_setup/tools/l0_swimlane.py (lines 287–364) | reconstruct_task_args filters records by func_id, merges tensor metadata across dump stages, validates slot uniqueness, computes strides, and produces the final tensor descriptor + scalar arg list. |
| l0_swimlane: replay workspace codegen and build | simpler_setup/tools/l0_swimlane.py (lines 383–1056) | Single-core and SPMD mix replay code generators (kernel, launch stubs, host main with synthesized SPMD context slots); run_collect.sh template; generate_workspace writes the 5-file workspace; smoke_build compiles via CMake/Ninja and validates exported symbols. |
| l0_swimlane: collect, Perfetto transform, overrides, and CLI main | simpler_setup/tools/l0_swimlane.py (lines 1058–1453) | _to_perfetto repacks overlapping intervals into sub-lanes and merges B/E pairs into ph:X slices; collect manages device locking, locates trace.json, and produces _perfetto.json; apply_arg_overrides validates --set-arg specs; main wires the full CLI workflow. |
| Documentation: l0-swimlane-profiling, tensor-dump func_id, Perfetto SKILL | docs/dfx/l0-swimlane-profiling.md, docs/dfx/tensor-dump.md, .claude/skills/insight-trace/SKILL.md | New 616-line profiling doc covers workflow, CLI flags, workspace internals, SPMD synthesis, fidelity rules, findings, FAQ, and related docs. tensor-dump.md adds func_id to the example JSON and key-fields. SKILL.md adds a Perfetto post-processing section. |
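The func_id sentinel round trip summarized above (int32 -1 in TensorDumpInfo, uint16 0xFFFF in the record, -1 again in JSON) can be sketched in a few lines. The function names below are hypothetical and the real logic lives in the C++ runtime and collector; this only mirrors the arithmetic.

```python
UNKNOWN_U16 = 0xFFFF  # uint16 "unknown" sentinel named in the walkthrough


def info_to_record_func_id(info_func_id: int) -> int:
    """Mimic casting an int32 func_id to uint16_t (two's-complement wrap)."""
    return info_func_id & 0xFFFF


def record_to_json_func_id(record_func_id: int) -> int:
    """Map the uint16 sentinel back to -1 for the JSON manifest."""
    return -1 if record_func_id == UNKNOWN_U16 else record_func_id


unknown_round_trip = record_to_json_func_id(info_to_record_func_id(-1))
known_round_trip = record_to_json_func_id(info_to_record_func_id(3))
```

A consequence of this scheme, under the assumptions above, is that a real func_id of 65535 would be indistinguishable from "unknown".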

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant l0_swimlane_main as l0_swimlane main()
    participant get_or_run_dump
    participant reconstruct_task_args
    participant generate_workspace
    participant smoke_build
    participant collect
    participant _to_perfetto

    User->>l0_swimlane_main: --test, --func-id, --platform, --case, --set-arg
    l0_swimlane_main->>get_or_run_dump: run SceneTest --dump-tensor 3
    get_or_run_dump-->>l0_swimlane_main: tensor_dump.json path
    l0_swimlane_main->>reconstruct_task_args: filter by func_id, merge arg stages
    reconstruct_task_args-->>l0_swimlane_main: tensor descriptors + scalar args
    l0_swimlane_main->>generate_workspace: emit 5-file replay workspace
    generate_workspace-->>l0_swimlane_main: workspace directory
    l0_swimlane_main->>smoke_build: cmake/ninja + symbol check
    smoke_build-->>l0_swimlane_main: build OK
    l0_swimlane_main->>collect: run msprof op simulator, locate trace.json
    collect->>_to_perfetto: repack intervals, merge B/E→X, rewrite tid
    _to_perfetto-->>collect: trace_perfetto.json
    collect-->>User: trace.json + trace_perfetto.json
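The "repack intervals" step in the diagram can be illustrated with a greedy lane assignment: each interval goes into the first sub-lane that is free at its start time. This is a sketch of the general technique, assuming intervals are plain `(ts, end)` tuples; it is not taken from `_to_perfetto`, whose actual strategy may differ.

```python
def assign_sublanes(intervals):
    """Greedily place (ts, end) intervals into non-overlapping sub-lanes.

    Returns a list of (ts, end, lane) tuples in start-time order.
    """
    lane_ends = []  # lane_ends[i] = end time of the last interval in lane i
    placed = []
    for ts, end in sorted(intervals):
        for lane, lane_end in enumerate(lane_ends):
            if ts >= lane_end:  # lane is free again at this start time
                lane_ends[lane] = end
                placed.append((ts, end, lane))
                break
        else:
            # Every existing lane is still busy: open a new one.
            lane_ends.append(end)
            placed.append((ts, end, len(lane_ends) - 1))
    return placed


lanes = assign_sublanes([(0, 10), (5, 15), (10, 20)])
```

Here the second interval overlaps the first and is pushed to lane 1, while the third starts exactly when the first ends and reuses lane 0.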

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description comprehensively explains the l0_swimlane tool's functionality, supported kernel shapes, the --set-arg flag, and all supporting changes (tensor dump func_id field, test updates, and documentation additions), clearly relating to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly describes the main change: introducing a new l0_swimlane intra-core pipeline trace tool, which is the primary feature across all modified files. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the l0_swimlane.py tool to automate AICore intra-core pipeline trace generation, along with supporting documentation and runtime updates to track func_id in tensor dumps. The review raises two high-severity bugs in l0_swimlane.py that should be addressed: FLOAT32 scalars are truncated to integers during reconstruction, and sequential B/E trace events overwrite each other in loops instead of being matched via a stack. Additionally, replacing angle-bracket placeholders in the Markdown bash snippets with safe shell variables will prevent syntax and redirection errors.

Comment on lines +352 to +353
    for s in scalars:
        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})

high

For FLOAT32 scalars, the tensor dump collector writes the actual float value (e.g., 1.5) into the JSON manifest. Simply casting this value to an integer using int(s["value"]) will truncate the float and result in an incorrect integer value (e.g., 1 instead of the IEEE 754 binary representation of 1.5f, which is 1069547520). When the kernel later interprets this value as a float, it will read a completely wrong value (almost zero).

We should check if the scalar's dtype is FLOAT32 and convert it to its 32-bit float binary representation as an integer using struct.

Suggested change
-    for s in scalars:
-        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
+    for s in scalars:
+        val = s["value"]
+        if s.get("dtype", "").upper() == "FLOAT32":
+            import struct
+            val = struct.unpack("<I", struct.pack("<f", float(val)))[0]
+        else:
+            val = int(val)
+        args.append({"kind": "scalar", "slot": s["arg_index"], "value": val})
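The numbers quoted in this comment can be checked with a short standalone sketch that uses only the standard `struct` module. The helper names are illustrative and not part of l0_swimlane.py.

```python
import struct


def float32_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_as_float32(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


correct_bits = float32_bits(1.5)       # bit pattern the kernel should receive
truncated = int(1.5)                   # what int(s["value"]) produces
misread = bits_as_float32(truncated)   # how the kernel would read that integer
```

With these helpers, `correct_bits` is 1069547520 (0x3FC00000), while the truncated integer 1 reinterpreted as a float is a denormal on the order of 1e-45, which matches the "almost zero" description above.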

Comment on lines +1080 to +1114
    intervals = defaultdict(list)
    be = defaultdict(dict)
    for e in evs:
        if not is_core(e):
            continue
        ph = e.get("ph")
        key = (e["pid"], e["tid"])
        if ph == "X":
            intervals[key].append(
                {
                    "ts": e["ts"],
                    "end": e["ts"] + e.get("dur", 0.0),
                    "name": e["name"],
                    "args": e.get("args", {}),
                }
            )
        elif ph in ("B", "E"):
            slot = be[(e["pid"], e["tid"], e.get("id"))]
            slot[ph] = e["ts"]
            slot.setdefault("src", e)
    for (pid, tid, eid), slot in be.items():
        s = slot.get("B", slot.get("E"))
        en = slot.get("E", slot.get("B"))
        if s is not None and en is not None and s > en:
            s, en = en, s
        src = slot["src"]
        intervals[(pid, tid)].append(
            {
                "ts": s,
                "end": en,
                "name": src["name"],
                "args": src.get("args", {}),
                "id": eid,
            }
        )

high

The current logic for pairing B and E events uses a single dictionary slot per (pid, tid, id) key. If a flag is set and waited multiple times (which is extremely common in loops), or if multiple events share the same id (or have None as id), subsequent B or E events will overwrite the previous ones in the be dictionary. This results in silent loss of all intermediate events and corrupted intervals (e.g., pairing the first event's metadata with the last event's timestamps).

We should use a stack-based matching approach to correctly pair nested or sequential B and E events on the same thread/id.

    intervals = defaultdict(list)
    be_stacks = defaultdict(list)
    for e in evs:
        if not is_core(e):
            continue
        ph = e.get("ph")
        if ph == "X":
            intervals[(e["pid"], e["tid"])].append(
                {
                    "ts": e["ts"],
                    "end": e["ts"] + e.get("dur", 0.0),
                    "name": e["name"],
                    "args": e.get("args", {}),
                }
            )
        elif ph == "B":
            be_stacks[(e["pid"], e["tid"], e.get("id"))].append(e)
        elif ph == "E":
            key = (e["pid"], e["tid"], e.get("id"))
            stack = be_stacks[key]
            if stack:
                b_ev = stack.pop()
                s = b_ev["ts"]
                en = e["ts"]
                if s > en:
                    s, en = en, s
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": s,
                        "end": en,
                        "name": b_ev["name"],
                        "args": b_ev.get("args", {}),
                        "id": e.get("id"),
                    }
                )
            else:
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": e["ts"],
                        "end": e["ts"],
                        "name": e["name"],
                        "args": e.get("args", {}),
                        "id": e.get("id"),
                    }
                )
    for key, stack in be_stacks.items():
        pid, tid, eid = key
        for b_ev in stack:
            intervals[(pid, tid)].append(
                {
                    "ts": b_ev["ts"],
                    "end": b_ev["ts"],
                    "name": b_ev["name"],
                    "args": b_ev.get("args", {}),
                    "id": eid,
                }
            )
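The difference between the two pairing strategies is easiest to see on a tiny trace where the same flag is set and waited twice. The following sketch uses made-up event data and simplified helpers (timestamps only, no names or args) to contrast the single-slot dictionary with stack matching.

```python
from collections import defaultdict

# Hypothetical trace: the same (pid, tid, id) emits two sequential B/E pairs.
events = [
    {"ph": "B", "pid": 0, "tid": 1, "id": 7, "ts": 10},
    {"ph": "E", "pid": 0, "tid": 1, "id": 7, "ts": 12},
    {"ph": "B", "pid": 0, "tid": 1, "id": 7, "ts": 20},
    {"ph": "E", "pid": 0, "tid": 1, "id": 7, "ts": 25},
]


def pair_with_dict(evs):
    """One slot per key: later B/E timestamps overwrite earlier ones."""
    slots = defaultdict(dict)
    for e in evs:
        slots[(e["pid"], e["tid"], e.get("id"))][e["ph"]] = e["ts"]
    return [(slot["B"], slot["E"]) for slot in slots.values()]


def pair_with_stack(evs):
    """Push each B, pop it on the matching E: every pair survives."""
    stacks = defaultdict(list)
    pairs = []
    for e in evs:
        key = (e["pid"], e["tid"], e.get("id"))
        if e["ph"] == "B":
            stacks[key].append(e["ts"])
        elif e["ph"] == "E" and stacks[key]:
            pairs.append((stacks[key].pop(), e["ts"]))
    return pairs


dict_pairs = pair_with_dict(events)
stack_pairs = pair_with_stack(events)
```

On this input the dictionary version keeps only the last interval, while the stack version recovers both.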

Comment on lines +52 to +55
```bash
python -m simpler_setup.tools.l0_swimlane \
--test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
```

medium

Avoid using angle brackets for literal placeholders (e.g., <case>, <name>, <N>) in Markdown bash code blocks to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders or safe, quoted placeholders instead.

Suggested change
```diff
-python -m simpler_setup.tools.l0_swimlane \
-  --test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
+python -m simpler_setup.tools.l0_swimlane \
+  --test tests/st/"$CASE_DIR"/test_"$NAME".py --func-id "$FUNC_ID" --platform a2a3sim
```
References
  1. In Markdown bash code blocks, avoid using angle brackets for literal placeholders to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders like $ISSUE_NUMBER instead.

```bash
# Environment (once per shell): activate the venv and source CANN.
source .venv/bin/activate
export ASCEND_HOME_PATH=<your CANN install> # e.g. .../cann-9.0.0
```

medium

Avoid using angle brackets for placeholders (e.g., <your CANN install>) in Markdown bash code blocks to prevent them from being parsed as input redirection. Use safe, quoted placeholders instead.

Suggested change
-export ASCEND_HOME_PATH=<your CANN install> # e.g. .../cann-9.0.0
+export ASCEND_HOME_PATH="path/to/cann/install" # e.g. .../cann-9.0.0
References
  1. In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.

Comment on lines +268 to +274
```bash
# First kernel: runs the sim dump.
python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim

# Subsequent kernels: point at the manifest the first run produced.
python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
  --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
```

medium

Avoid using angle brackets for placeholders (e.g., <file>, <ClassName>, <Case>, <ts>) in Markdown bash code blocks to prevent shell syntax errors or input redirection. Use safe, quoted placeholders or standard shell variables instead.

Suggested change
```diff
 # First kernel: runs the sim dump.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 0 --platform a2a3sim
 # Subsequent kernels: point at the manifest the first run produced.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
-  --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 2 --platform a2a3sim \
+  --dump-json outputs/"$CLASS_NAME"_"$CASE"_"$TIMESTAMP"/tensor_dump/tensor_dump.json
```
References
  1. In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@simpler_setup/tools/l0_swimlane.py`:
- Around line 193-198: The `inc["source"]` value is embedded directly into the
generated `#include` directive without validation or escaping. A path containing
quotes or backslashes yields invalid C++, and an absolute or out-of-repo path can
compile an unintended kernel. In simpler_setup/tools/l0_swimlane.py at lines
193-198 (the dictionary construction where "source" is set to
Path(inc["source"])), resolve the source relative to the test file/repo root and
reject out-of-repo paths. At lines 417-419 and 766-768, pass the source through
_cpp_string_literal_path() so it is escaped before being embedded in the include
path.
- Around line 719-741: Clear the msprof_collect and insight_export directories
before running the collection to avoid selecting stale OPPROF_* artifacts from
previous failed or partial runs. Add an rm -rf of COLLECT_DIR and EXPORT_ROOT
immediately before the mkdir -p that recreates them, so both directories are
empty when the msprof op simulator command executes. Apply the same cleanup at
the sibling location (lines 1225-1231 in the same file).
- Around line 340-349: Add validation to fail fast when tensor rank exceeds the
descriptor capacity of 7 dimensions. The descriptor layout has a fixed capacity
for shapes (byte 44) and strides (byte 72), so any tensor with len(shape) > 7
will silently corrupt adjacent descriptor fields in make_desc. Insert a check
that len(shape) <= 7 at the three affected sites in
simpler_setup/tools/l0_swimlane.py (around lines 340-349 where shape is
initialized, around lines 493-496, and around lines 863-866) and raise a clear
error message immediately if the rank exceeds this limit, preventing corrupt
replay generation.
- Around line 460-496: The issue is that buf_bytes includes start_offset and is
used for both memory allocation (aclrtMalloc) and as the buffer.size parameter
in make_desc, which causes the descriptor to overstate the tensor's logical span
when start_offset is non-zero. Calculate the logical tensor size separately
(without start_offset) and use buf_bytes for the allocation size (aclrtMalloc
with t{ti}Bytes), but pass only the logical tensor span (calculated as
_extent_elem(shape, strides) * esz) to the make_desc function call instead of
t{ti}Bytes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d308910-0bc0-4ec7-bc12-28f236c11b0e

📥 Commits

Reviewing files that changed from the base of the PR and between c5ded40 and 812d8a7.

📒 Files selected for processing (14)
  • .claude/skills/insight-trace/SKILL.md
  • docs/dfx/l0-swimlane-profiling.md
  • docs/dfx/tensor-dump.md
  • simpler_setup/tools/l0_swimlane.py
  • src/a2a3/platform/include/common/tensor_dump.h
  • src/a5/platform/include/common/tensor_dump.h
  • src/common/platform/include/aicpu/tensor_dump_aicpu.h
  • src/common/platform/include/host/tensor_dump_collector.h
  • src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
  • src/common/platform/shared/host/tensor_dump_collector.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
  • tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
  • tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py

Comment on lines +193 to +198
            same_src = [i for _, i in incores if i["source"] == inc["source"]]
            core_types = {i["core_type"] for i in same_src}
            is_mix = "aic" in core_types and "aiv" in core_types
            return {
                "source": Path(inc["source"]),
                "core_type": inc["core_type"],

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate and escape the generated #include path.

inc["source"] comes from the imported test metadata and is emitted verbatim into #include "{source}". A source path containing quotes/backslashes will generate invalid C++, and an unexpected absolute/out-of-repo path can make the replay compile a different kernel than the selected test intended. Resolve the path relative to the test file/repo root, reject paths outside the allowed tree, and escape it before codegen.

🛡️ Proposed fix direction
+def _cpp_string_literal_path(p: Path) -> str:
+    return str(p).replace("\\", "\\\\").replace('"', '\\"')
+
 def load_kernel_meta(test_path: Path, func_id: int, platform: str):
@@
-            return {
-                "source": Path(inc["source"]),
+            src = Path(inc["source"])
+            if not src.is_absolute():
+                src = (test_path.parent / src).resolve()
+            else:
+                src = src.resolve()
+            try:
+                src.relative_to(PROJECT_ROOT)
+            except ValueError as exc:
+                raise ValueError(f"CALLABLE source must stay under repo root: {src}") from exc
+            return {
+                "source": src,

Then call _cpp_string_literal_path(source) when embedding the include path.

-{_prologue(cfg)}#include "{source}"
+{_prologue(cfg)}#include "{_cpp_string_literal_path(source)}"

Also applies to: 417-419, 766-768
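The proposed direction combines two separate steps, containment checking and string escaping. A self-contained sketch of both follows, assuming POSIX paths; `resolve_source` and its parameters are hypothetical names, and the repo root is passed in explicitly rather than taken from a `PROJECT_ROOT` constant.

```python
from pathlib import Path


def cpp_string_literal_path(p: Path) -> str:
    """Escape backslashes and double quotes for use inside a C++ "..." literal."""
    return str(p).replace("\\", "\\\\").replace('"', '\\"')


def resolve_source(source: str, test_dir: Path, repo_root: Path) -> Path:
    """Resolve a kernel source path and reject anything outside repo_root."""
    src = Path(source)
    if not src.is_absolute():
        src = test_dir / src
    src = src.resolve()
    try:
        src.relative_to(repo_root.resolve())
    except ValueError as exc:
        raise ValueError(f"source must stay under repo root: {src}") from exc
    return src


# Hypothetical layout used only to exercise the helpers.
root = Path("/nonexistent_repo_root")
inside = resolve_source("kernels/k.cpp", root / "tests", root)
escaped = cpp_string_literal_path(Path('dir/a"b.h'))
```

Resolving before the containment check matters here: a relative path such as `../../outside.cpp` only reveals that it escapes the tree after `..` components are collapsed.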

Comment on lines +340 to +349
        shape = list(t["shape"])
        strides = list(t.get("strides") or _row_major(shape))
        args.append(
            {
                "kind": "tensor",
                "slot": t["arg_index"],
                "dtype": dt,
                "shape": shape,
                "strides": strides,
                "start_offset": int(t.get("start_offset", 0)),

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when tensor rank exceeds descriptor capacity.

The generated descriptor layout stores shapes at byte 44 and strides at byte 72, which leaves room for 7 dimensions. A dump with len(shape) > 7 will silently overwrite adjacent descriptor fields in make_desc; validate the rank before codegen instead of emitting a corrupt replay.

🐛 Proposed guard
         dt = t["dtype"].upper()
         shape = list(t["shape"])
+        if len(shape) > 7:
+            raise ValueError(f"tensor arg {t['arg_index']} rank {len(shape)} exceeds descriptor capacity (7)")
         strides = list(t.get("strides") or _row_major(shape))
+        if len(strides) != len(shape):
+            raise ValueError(f"tensor arg {t['arg_index']} has shape/strides rank mismatch")

Also applies to: 493-496, 863-866
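The limit of 7 follows from the two byte offsets quoted above. A short sketch of that arithmetic and the resulting guard is shown here; the 4-byte dimension width is an assumption based on the `uint32_t shp[]` and `strd[]` arrays in the generated code, and `check_rank` is an illustrative name.

```python
SHAPES_OFFSET = 44   # byte offset of the shape array, from the review comment
STRIDES_OFFSET = 72  # byte offset of the stride array, from the review comment
DIM_BYTES = 4        # assumption: each dimension is stored as a uint32_t

# Number of dimensions that fit before the stride array begins.
MAX_RANK = (STRIDES_OFFSET - SHAPES_OFFSET) // DIM_BYTES


def check_rank(arg_index, shape, strides):
    """Raise early instead of emitting a descriptor that overflows its fields."""
    if len(shape) > MAX_RANK:
        raise ValueError(
            f"tensor arg {arg_index} rank {len(shape)} exceeds descriptor capacity ({MAX_RANK})"
        )
    if len(strides) != len(shape):
        raise ValueError(f"tensor arg {arg_index} has shape/strides rank mismatch")


check_rank(0, [2, 3, 4], [12, 4, 1])  # a rank-3 tensor passes silently
```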

Comment on lines +460 to +496
        buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
        contig = 1 if _is_contiguous(shape, strides, a["start_offset"]) else 0
        ndims = len(shape)
        shp = ", ".join(str(x) for x in shape)
        strd = ", ".join(str(x) for x in strides)
        # Default: data memset to 0 (only descriptor metadata is real). When
        # --set-arg fills this tensor, write VALUE into every element instead —
        # for control tensors whose CONTENT drives the kernel (e.g. paged
        # attention reads n_blocks from the context_lens tensor). The low `esz`
        # bytes of the int64 VALUE are copied per element (correct for any
        # integer width, little-endian).
        fill = a.get("fill")
        if fill is None:
            init = f"    ACL_CHECK(aclrtMemset(d_t{ti}, t{ti}Bytes, 0, t{ti}Bytes));"
        else:
            init = (
                f"    {{\n"
                f"        std::vector<unsigned char> hbuf{ti}(t{ti}Bytes, 0);\n"
                f"        const int64_t fillv{ti} = {fill}LL;\n"
                f"        for (size_t off = 0; off + {esz} <= t{ti}Bytes; off += {esz})\n"
                f"            memcpy(hbuf{ti}.data() + off, &fillv{ti}, {esz});\n"
                f"        ACL_CHECK(aclrtMemcpy(d_t{ti}, t{ti}Bytes, hbuf{ti}.data(), t{ti}Bytes,\n"
                f"                              ACL_MEMCPY_HOST_TO_DEVICE));\n"
                f"    }}"
            )
        alloc.append(
            f"    void *d_t{ti} = nullptr;\n"
            f"    const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
            f"    ACL_CHECK(aclrtMalloc(&d_t{ti}, t{ti}Bytes, ACL_MEM_MALLOC_HUGE_FIRST));\n"
            f"{init}"
        )
        descs.append(
            f"    {{\n"
            f"        const uint32_t shp[] = {{{shp}}};\n"
            f"        const uint32_t strd[] = {{{strd}}};\n"
            f"        make_desc(h_tensors.data() + {ti} * 128, (uint64_t)(uintptr_t)d_t{ti},\n"
            f"                  t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decouple allocation size from descriptor buffer.size.

buf_bytes includes start_offset and is also written into the descriptor as buffer.size, while start_offset is written separately. For non-zero-offset dump records this overstates the tensor’s logical span by start_offset elements; allocate enough bytes for the offset view, but pass the captured/logical tensor span to make_desc.

🐛 Proposed fix direction
-        buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
+        desc_bytes = _extent_elem(shape, strides) * esz
+        alloc_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
@@
-            f"    const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
+            f"    const size_t t{ti}Bytes = {alloc_bytes}ULL;\n"
+            f"    const size_t t{ti}DescBytes = {desc_bytes}ULL;\n"
@@
-            f"                  t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
+            f"                  t{ti}DescBytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
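A worked example may help show how far the two sizes diverge. The sketch below assumes that `_extent_elem` returns the number of elements spanned by a strided view (one plus the largest reachable element index); the tool's helper is not shown in this thread, so `extent_elems` is a hypothetical stand-in.

```python
def extent_elems(shape, strides):
    """Elements spanned by a strided view: 1 + sum((dim - 1) * stride).

    Hypothetical stand-in for the tool's _extent_elem helper.
    """
    if any(d == 0 for d in shape):
        return 0
    return 1 + sum((d - 1) * s for d, s in zip(shape, strides))


def alloc_and_desc_bytes(shape, strides, start_offset, esz):
    """Return (allocation size, descriptor span) in bytes."""
    span = extent_elems(shape, strides)
    alloc_bytes = (start_offset + span) * esz  # room for the offset view
    desc_bytes = span * esz                    # logical tensor span only
    return alloc_bytes, desc_bytes


# A 4x8 row-major FP16 view that starts 16 elements into its buffer.
alloc_bytes, desc_bytes = alloc_and_desc_bytes([4, 8], [8, 1], 16, 2)
```

For this view the span is 32 elements, so the allocation needs 96 bytes while the logical tensor covers 64; the 32-byte gap is exactly `start_offset * esz`.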

Comment on lines +719 to +741
BUILD_DIR="$WS/build"
COLLECT_DIR="$WS/msprof_collect"
EXPORT_ROOT="$WS/insight_export"

source "$CANN_HOME/set_env.sh"
export ASCEND_HOME_PATH="$CANN_HOME"
SIM_LIB_DIR="$CANN_HOME/aarch64-linux/simulator/$SOC_VERSION/lib"
LD_LIBS="$BUILD_DIR:$SIM_LIB_DIR:$CANN_HOME/lib64"
LD_LIBS="$LD_LIBS:$CANN_HOME/aarch64-linux/devlib:$CANN_HOME/devlib"
export LD_LIBRARY_PATH="$LD_LIBS:${LD_LIBRARY_PATH:-}"
export ACL_DEVICE_ID="$DEVICE_ID"
mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

cmake -G Ninja -S "$WS" -B "$BUILD_DIR" \\
-DSOC_VERSION="$SOC_VERSION" -DPTO_ISA_ROOT="$PTO_ISA_ROOT" -DREPO_ROOT="$REPO_ROOT"
cmake --build "$BUILD_DIR" --target replay_host

msprof op simulator \\
--application="$BUILD_DIR/replay_host" --kernel-name="replay_entry" \\
--launch-count=1 --soc-version="$SOC_VERSION" --timeout=120 \\
--output="$COLLECT_DIR/out" 2>&1 | tee "$COLLECT_DIR/msprof_collect.log"

OPPROF_DIR="$(find "$COLLECT_DIR/out" -maxdepth 1 -mindepth 1 -type d -name 'OPPROF_*' | sort | tail -n 1)"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear per-run collect/export directories before selecting OPPROF_*.

run_collect.sh reuses msprof_collect and insight_export, then both the shell script and Python pick the newest matching export. If a rerun leaves stale OPPROF_* artifacts after a failed or partial collect, the tool can return an old trace for the current kernel. Remove those directories or create a unique run subdirectory before each collection.

🧹 Proposed fix
 BUILD_DIR="$WS/build"
 COLLECT_DIR="$WS/msprof_collect"
 EXPORT_ROOT="$WS/insight_export"
@@
-mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"
+rm -rf "$COLLECT_DIR" "$EXPORT_ROOT"
+mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

Also applies to: 1225-1231
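Since the Python side also picks the newest export, the same reset could be done there before invoking the script. The following is a minimal sketch using only the standard library; `reset_run_dirs` is a hypothetical helper and the temporary workspace exists purely to demonstrate the effect.

```python
import shutil
import tempfile
from pathlib import Path


def reset_run_dirs(*dirs: Path) -> None:
    """Delete and recreate each directory so no stale artifacts survive."""
    for d in dirs:
        shutil.rmtree(d, ignore_errors=True)
        d.mkdir(parents=True)


# Demonstration in a throwaway workspace with a stale OPPROF_* directory.
workspace = Path(tempfile.mkdtemp())
collect_dir = workspace / "msprof_collect"
export_root = workspace / "insight_export"
(collect_dir / "out" / "OPPROF_stale").mkdir(parents=True)

reset_run_dirs(collect_dir, export_root)
leftover = list(collect_dir.rglob("OPPROF_*"))
dirs_exist = collect_dir.is_dir() and export_root.is_dir()

shutil.rmtree(workspace)  # clean up the demo workspace
```

An alternative with the same effect would be a unique per-run subdirectory, which keeps earlier results around for comparison at the cost of disk space.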

@indigo1973 changed the title from "feat(dfx): add l0_swimlane intra-core pipeline trace tool" to "[WIP] feat(dfx): add l0_swimlane intra-core pipeline trace tool" on Jun 16, 2026