[WIP] feat(dfx): add l0_swimlane intra-core pipeline trace tool #1053
indigo1973 wants to merge 1 commit into
Generate an AICore intra-core swimlane trace.json for a single kernel:
read the test's CALLABLE to resolve the kernel by func_id, run a JSON-only
tensor dump to capture the real per-task args, reconstruct them filtered by
func_id (zero hand-guessing), emit a 5-file msprof-op-simulator replay
workspace, and collect the camodel trace. The dump's golden PASS is the
gate that the captured args are trustworthy.
Handles five kernel shapes: AIC-only, AIV-only, SPMD AIV, cooperative SPMD
mix, and offset subtasks (an independent kernel packed into a mix dispatch,
whose args start at a non-zero slot). --set-arg shrinks a replay loop count
(scalar n_blocks or the context_lens control tensor) without distorting the
pipeline structure.
Supporting changes:
- tensor dump: stamp each record with the originating kernel's func_id
(task->kernel_id[slot]) so a multi-kernel dump can be filtered per kernel;
new func_id field in TensorDumpRecord/TensorDumpInfo and the collector
JSON (-1 when unknown). Required by l0_swimlane's func_id reconstruction.
- tests: declare the incore tensor signatures the dump needs
(paged_attention_unroll, spmd_multiblock_aiv SPMD_WRITE_AIV,
spmd_paged_attention PA_AIC full 9-tensor layout) and add a small manual
SmallCase1 to spmd_paged_attention as an onboard mix trace target.
- docs: new docs/dfx/l0-swimlane-profiling.md (usage, kernel-shape table,
--set-arg loop shrinking, the cooperative-mix signature rule — declare on
exactly one of the pair); tensor-dump.md documents the func_id field;
insight-trace SKILL.md records the manual recipe the tool automates.
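As a rough illustration of the per-kernel filtering described above, the sketch below selects dump records by func_id. The manifest layout (a top-level `tensors` list of dicts) and the helper name `records_for_kernel` are assumptions made for this example, not the collector's actual schema; only the `func_id` field, its -1 "unknown" value, and the `arg_index` key come from this PR.

```python
import json

# Hypothetical manifest layout: a top-level "tensors" list of per-record dicts.
# Only func_id (with -1 meaning "unknown") and arg_index are taken from the PR.
manifest = json.loads("""
{"tensors": [
  {"func_id": 0,  "arg_index": 0, "shape": [16, 128]},
  {"func_id": 2,  "arg_index": 1, "shape": [4]},
  {"func_id": 2,  "arg_index": 0, "shape": [4, 64]},
  {"func_id": -1, "arg_index": 0, "shape": [1]}
]}
""")


def records_for_kernel(records, func_id):
    """Return the records stamped with func_id, ordered by arg slot."""
    if func_id < 0:
        raise ValueError("func_id must be non-negative; -1 marks unknown records")
    picked = [r for r in records if r.get("func_id", -1) == func_id]
    return sorted(picked, key=lambda r: r["arg_index"])


kernel2 = records_for_kernel(manifest["tensors"], 2)
print([r["arg_index"] for r in kernel2])  # [0, 1]
```

Records stamped -1 never match a requested kernel, so a dump written without func_id information yields an empty selection rather than a silently wrong one.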
📝 Walkthrough
Changes: L0 Swimlane profiling tool and func_id pipeline
Sequence Diagram(s)
sequenceDiagram
participant User
participant l0_swimlane_main as l0_swimlane main()
participant get_or_run_dump
participant reconstruct_task_args
participant generate_workspace
participant smoke_build
participant collect
participant _to_perfetto
User->>l0_swimlane_main: --test, --func-id, --platform, --case, --set-arg
l0_swimlane_main->>get_or_run_dump: run SceneTest --dump-tensor 3
get_or_run_dump-->>l0_swimlane_main: tensor_dump.json path
l0_swimlane_main->>reconstruct_task_args: filter by func_id, merge arg stages
reconstruct_task_args-->>l0_swimlane_main: tensor descriptors + scalar args
l0_swimlane_main->>generate_workspace: emit 5-file replay workspace
generate_workspace-->>l0_swimlane_main: workspace directory
l0_swimlane_main->>smoke_build: cmake/ninja + symbol check
smoke_build-->>l0_swimlane_main: build OK
l0_swimlane_main->>collect: run msprof op simulator, locate trace.json
collect->>_to_perfetto: repack intervals, merge B/E→X, rewrite tid
_to_perfetto-->>collect: trace_perfetto.json
collect-->>User: trace.json + trace_perfetto.json
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Code Review
This pull request introduces the l0_swimlane.py tool to automate AICore intra-core pipeline trace generation, along with supporting documentation and runtime updates to track func_id in tensor dumps. The review feedback is highly constructive and should be addressed: specifically, a critical bug in l0_swimlane.py where FLOAT32 scalars are truncated to integers during reconstruction, and another bug where sequential B/E trace events overwrite each other in loops instead of being matched via a stack. Additionally, replacing angle-bracket placeholders in the Markdown bash snippets with safe shell variables will prevent syntax and redirection errors.
    for s in scalars:
        args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
For FLOAT32 scalars, the tensor dump collector writes the actual float value (e.g., 1.5) into the JSON manifest. Simply casting this value to an integer using int(s["value"]) will truncate the float and result in an incorrect integer value (e.g., 1 instead of the IEEE 754 binary representation of 1.5f, which is 1069547520). When the kernel later interprets this value as a float, it will read a completely wrong value (almost zero).
We should check if the scalar's dtype is FLOAT32 and convert it to its 32-bit float binary representation as an integer using struct.
-for s in scalars:
-    args.append({"kind": "scalar", "slot": s["arg_index"], "value": int(s["value"])})
+for s in scalars:
+    val = s["value"]
+    if s.get("dtype", "").upper() == "FLOAT32":
+        import struct
+        val = struct.unpack("<I", struct.pack("<f", float(val)))[0]
+    else:
+        val = int(val)
+    args.append({"kind": "scalar", "slot": s["arg_index"], "value": val})
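To make the difference concrete, here is a small standalone check of the two conversions. The helper names `f32_bits` and `bits_to_f32` are made up for this sketch; it only uses the standard `struct` module and does not touch the tool's code.

```python
import struct


def f32_bits(value: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned int as a binary32 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


truncated = int(1.5)           # what int(s["value"]) produces: 1
preserved = f32_bits(1.5)      # bit pattern the kernel should receive

print(preserved, hex(preserved))   # 1069547520 0x3fc00000
print(bits_to_f32(truncated))      # a denormal around 1.4e-45, not 1.5
print(bits_to_f32(preserved))      # 1.5
```

The round trip through `f32_bits` recovers 1.5 exactly, while the truncated integer reinterpreted as a float lands on the smallest positive denormal.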
    intervals = defaultdict(list)
    be = defaultdict(dict)
    for e in evs:
        if not is_core(e):
            continue
        ph = e.get("ph")
        key = (e["pid"], e["tid"])
        if ph == "X":
            intervals[key].append(
                {
                    "ts": e["ts"],
                    "end": e["ts"] + e.get("dur", 0.0),
                    "name": e["name"],
                    "args": e.get("args", {}),
                }
            )
        elif ph in ("B", "E"):
            slot = be[(e["pid"], e["tid"], e.get("id"))]
            slot[ph] = e["ts"]
            slot.setdefault("src", e)
    for (pid, tid, eid), slot in be.items():
        s = slot.get("B", slot.get("E"))
        en = slot.get("E", slot.get("B"))
        if s is not None and en is not None and s > en:
            s, en = en, s
        src = slot["src"]
        intervals[(pid, tid)].append(
            {
                "ts": s,
                "end": en,
                "name": src["name"],
                "args": src.get("args", {}),
                "id": eid,
            }
        )
The current logic for pairing B and E events uses a single dictionary slot per (pid, tid, id) key. If a flag is set and waited multiple times (which is extremely common in loops), or if multiple events share the same id (or have None as id), subsequent B or E events will overwrite the previous ones in the be dictionary. This results in silent loss of all intermediate events and corrupted intervals (e.g., pairing the first event's metadata with the last event's timestamps).
We should use a stack-based matching approach to correctly pair nested or sequential B and E events on the same thread/id.
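The overwrite can be reproduced on a tiny synthetic trace. The events and the two helper functions below are invented for illustration and are much simpler than the tool's `_to_perfetto` logic; they only contrast one-slot-per-key pairing with stack-based pairing.

```python
from collections import defaultdict

# Synthetic events: the same (pid, tid, id) flag is waited on twice, as in a loop.
events = [
    {"ph": "B", "pid": 0, "tid": 1, "id": 7, "ts": 10, "name": "wait_flag"},
    {"ph": "E", "pid": 0, "tid": 1, "id": 7, "ts": 14, "name": "wait_flag"},
    {"ph": "B", "pid": 0, "tid": 1, "id": 7, "ts": 20, "name": "wait_flag"},
    {"ph": "E", "pid": 0, "tid": 1, "id": 7, "ts": 26, "name": "wait_flag"},
]


def pair_with_dict(evs):
    """One slot per key: later B/E timestamps overwrite earlier ones."""
    slots = defaultdict(dict)
    for e in evs:
        slots[(e["pid"], e["tid"], e["id"])][e["ph"]] = e["ts"]
    return [(s["B"], s["E"]) for s in slots.values()]


def pair_with_stack(evs):
    """One stack per key: each E closes the most recent open B."""
    stacks = defaultdict(list)
    out = []
    for e in evs:
        key = (e["pid"], e["tid"], e["id"])
        if e["ph"] == "B":
            stacks[key].append(e["ts"])
        elif e["ph"] == "E" and stacks[key]:
            out.append((stacks[key].pop(), e["ts"]))
    return out


print(pair_with_dict(events))   # [(20, 26)]
print(pair_with_stack(events))  # [(10, 14), (20, 26)]
```

With the dictionary, the first wait disappears entirely; with the stack, both intervals survive.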
    intervals = defaultdict(list)
    be_stacks = defaultdict(list)
    for e in evs:
        if not is_core(e):
            continue
        ph = e.get("ph")
        if ph == "X":
            intervals[(e["pid"], e["tid"])].append(
                {
                    "ts": e["ts"],
                    "end": e["ts"] + e.get("dur", 0.0),
                    "name": e["name"],
                    "args": e.get("args", {}),
                }
            )
        elif ph == "B":
            be_stacks[(e["pid"], e["tid"], e.get("id"))].append(e)
        elif ph == "E":
            key = (e["pid"], e["tid"], e.get("id"))
            stack = be_stacks[key]
            if stack:
                b_ev = stack.pop()
                s = b_ev["ts"]
                en = e["ts"]
                if s > en:
                    s, en = en, s
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": s,
                        "end": en,
                        "name": b_ev["name"],
                        "args": b_ev.get("args", {}),
                        "id": e.get("id"),
                    }
                )
            else:
                intervals[(e["pid"], e["tid"])].append(
                    {
                        "ts": e["ts"],
                        "end": e["ts"],
                        "name": e["name"],
                        "args": e.get("args", {}),
                        "id": e.get("id"),
                    }
                )
    for key, stack in be_stacks.items():
        pid, tid, eid = key
        for b_ev in stack:
            intervals[(pid, tid)].append(
                {
                    "ts": b_ev["ts"],
                    "end": b_ev["ts"],
                    "name": b_ev["name"],
                    "args": b_ev.get("args", {}),
                    "id": eid,
                }
            )

    ```bash
    python -m simpler_setup.tools.l0_swimlane \
      --test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
    ```
Avoid using angle brackets for literal placeholders (e.g., <case>, <name>, <N>) in Markdown bash code blocks to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders or safe, quoted placeholders instead.
-```bash
-python -m simpler_setup.tools.l0_swimlane \
-  --test tests/st/<case>/test_<name>.py --func-id <N> --platform a2a3sim
-```
+python -m simpler_setup.tools.l0_swimlane \
+  --test tests/st/"$CASE_DIR"/test_"$NAME".py --func-id "$FUNC_ID" --platform a2a3sim
References
- In Markdown bash code blocks, avoid using angle brackets for literal placeholders to prevent shell syntax errors and ensure snippets are copy-paste safe. Use standard shell variable placeholders like $ISSUE_NUMBER instead.
    ```bash
    # Environment (once per shell): activate the venv and source CANN.
    source .venv/bin/activate
    export ASCEND_HOME_PATH=<your CANN install>   # e.g. .../cann-9.0.0
Avoid using angle brackets for placeholders (e.g., <your CANN install>) in Markdown bash code blocks to prevent them from being parsed as input redirection. Use safe, quoted placeholders instead.
-export ASCEND_HOME_PATH=<your CANN install>   # e.g. .../cann-9.0.0
+export ASCEND_HOME_PATH="path/to/cann/install"   # e.g. .../cann-9.0.0
References
- In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.
    # First kernel: runs the sim dump.
    python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim

    # Subsequent kernels: point at the manifest the first run produced.
    python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
      --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
    ```
Avoid using angle brackets for placeholders (e.g., <file>, <ClassName>, <Case>, <ts>) in Markdown bash code blocks to prevent shell syntax errors or input redirection. Use safe, quoted placeholders or standard shell variables instead.
-# First kernel: runs the sim dump.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 0 --platform a2a3sim
-# Subsequent kernels: point at the manifest the first run produced.
-python -m simpler_setup.tools.l0_swimlane --test <file> --func-id 2 --platform a2a3sim \
-  --dump-json outputs/<ClassName>_<Case>_<ts>/tensor_dump/tensor_dump.json
-```
+# First kernel: runs the sim dump.
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 0 --platform a2a3sim
+# Subsequent kernels: point at the manifest the first run produced.
+python -m simpler_setup.tools.l0_swimlane --test "path/to/test.py" --func-id 2 --platform a2a3sim \
+  --dump-json outputs/"$CLASS_NAME"_"$CASE"_"$TIMESTAMP"/tensor_dump/tensor_dump.json
References
- In Markdown bash code blocks, avoid using angle brackets for placeholders to prevent them from being parsed as input redirection. Use safe, quoted placeholders like "path/to/file" instead.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@simpler_setup/tools/l0_swimlane.py`:
- Around line 193-198: The `inc["source"]` value is embedded directly into the
generated `#include` directive without validation or escaping. A path that
contains quotes or backslashes produces invalid C++, and an absolute or
out-of-repo path can compile an unintended kernel. In
simpler_setup/tools/l0_swimlane.py at lines 193-198 (the dictionary construction
where "source" is set to Path(inc["source"])), at lines 417-419, and at lines
766-768, resolve the source relative to the test file/repo root, reject
out-of-repo paths, and apply the _cpp_string_literal_path() function to escape
the value before it is used in the include path.
- Around line 719-741: Clear the msprof_collect and insight_export directories
before running the collection to avoid selecting stale OPPROF_* artifacts from
previous failed or partial runs. Add an rm -rf of COLLECT_DIR and EXPORT_ROOT
immediately before the mkdir -p that creates them, so that both directories are
empty when the msprof op simulator command executes. Apply the same cleanup at
the sibling location at lines 1225-1231 in the same file.
- Around line 340-349: Add validation to fail fast when tensor rank exceeds the
descriptor capacity of 7 dimensions. The descriptor layout has a fixed capacity
for shapes (byte 44) and strides (byte 72), so any tensor with len(shape) > 7
will silently corrupt adjacent descriptor fields in make_desc. Insert a check
that len(shape) <= 7 at the three affected sites in
simpler_setup/tools/l0_swimlane.py (around lines 340-349 where shape is
initialized, around lines 493-496, and around lines 863-866) and raise a clear
error message immediately if the rank exceeds this limit, preventing corrupt
replay generation.
- Around line 460-496: The issue is that buf_bytes includes start_offset and is
used for both memory allocation (aclrtMalloc) and as the buffer.size parameter
in make_desc, which causes the descriptor to overstate the tensor's logical span
when start_offset is non-zero. Calculate the logical tensor size separately
(without start_offset) and use buf_bytes for the allocation size (aclrtMalloc
with t{ti}Bytes), but pass only the logical tensor span (calculated as
_extent_elem(shape, strides) * esz) to the make_desc function call instead of
t{ti}Bytes.
📒 Files selected for processing (14)
- .claude/skills/insight-trace/SKILL.md
- docs/dfx/l0-swimlane-profiling.md
- docs/dfx/tensor-dump.md
- simpler_setup/tools/l0_swimlane.py
- src/a2a3/platform/include/common/tensor_dump.h
- src/a5/platform/include/common/tensor_dump.h
- src/common/platform/include/aicpu/tensor_dump_aicpu.h
- src/common/platform/include/host/tensor_dump_collector.h
- src/common/platform/shared/aicpu/tensor_dump_aicpu.cpp
- src/common/platform/shared/host/tensor_dump_collector.cpp
- tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_multiblock_aiv/test_spmd_multiblock_aiv.py
- tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/test_spmd_paged_attention.py
- tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
    same_src = [i for _, i in incores if i["source"] == inc["source"]]
    core_types = {i["core_type"] for i in same_src}
    is_mix = "aic" in core_types and "aiv" in core_types
    return {
        "source": Path(inc["source"]),
        "core_type": inc["core_type"],
Validate and escape the generated #include path.
inc["source"] comes from the imported test metadata and is emitted verbatim into #include "{source}". A source path containing quotes/backslashes will generate invalid C++, and an unexpected absolute/out-of-repo path can make the replay compile a different kernel than the selected test intended. Resolve the path relative to the test file/repo root, reject paths outside the allowed tree, and escape it before codegen.
🛡️ Proposed fix direction
+def _cpp_string_literal_path(p: Path) -> str:
+    return str(p).replace("\\", "\\\\").replace('"', '\\"')
+
 def load_kernel_meta(test_path: Path, func_id: int, platform: str):
@@
-    return {
-        "source": Path(inc["source"]),
+    src = Path(inc["source"])
+    if not src.is_absolute():
+        src = (test_path.parent / src).resolve()
+    else:
+        src = src.resolve()
+    try:
+        src.relative_to(PROJECT_ROOT)
+    except ValueError as exc:
+        raise ValueError(f"CALLABLE source must stay under repo root: {src}") from exc
+    return {
+        "source": src,

Then call _cpp_string_literal_path(source) when embedding the include path.

-{_prologue(cfg)}#include "{source}"
+{_prologue(cfg)}#include "{_cpp_string_literal_path(source)}"

Also applies to: 417-419, 766-768
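For readers who want to try the idea outside the tool, the following is a minimal standalone sketch of the same two steps: resolve and confine the path, then escape it. The names `resolve_under_root` and `cpp_string_literal` and the temporary directory standing in for the repo root are assumptions for this example, not identifiers from l0_swimlane.py.

```python
import shutil
import tempfile
from pathlib import Path


def cpp_string_literal(path: Path) -> str:
    """Escape backslashes and double quotes for use inside a C++ string literal."""
    return str(path).replace("\\", "\\\\").replace('"', '\\"')


def resolve_under_root(source: str, test_dir: Path, root: Path) -> Path:
    """Resolve source relative to test_dir and reject anything outside root."""
    src = Path(source)
    if not src.is_absolute():
        src = test_dir / src
    src = src.resolve()
    try:
        src.relative_to(root.resolve())
    except ValueError as exc:
        raise ValueError(f"source escapes repo root: {src}") from exc
    return src


# A temporary directory plays the role of the repo root in this sketch.
root = Path(tempfile.mkdtemp()).resolve()
test_dir = root / "tests" / "st"
test_dir.mkdir(parents=True)

inside = resolve_under_root("../../kernels/k.cpp", test_dir, root)
try:
    resolve_under_root("../../../outside.cpp", test_dir, root)
    outside_accepted = True
except ValueError:
    outside_accepted = False

print(inside.relative_to(root), outside_accepted)
shutil.rmtree(root)
```

Confinement and escaping are kept as separate functions here because they fail differently: one is a policy error, the other is a pure string transformation.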
    shape = list(t["shape"])
    strides = list(t.get("strides") or _row_major(shape))
    args.append(
        {
            "kind": "tensor",
            "slot": t["arg_index"],
            "dtype": dt,
            "shape": shape,
            "strides": strides,
            "start_offset": int(t.get("start_offset", 0)),
Fail fast when tensor rank exceeds descriptor capacity.
The generated descriptor layout stores shapes at byte 44 and strides at byte 72, which leaves room for 7 dimensions. A dump with len(shape) > 7 will silently overwrite adjacent descriptor fields in make_desc; validate the rank before codegen instead of emitting a corrupt replay.
🐛 Proposed guard
     dt = t["dtype"].upper()
     shape = list(t["shape"])
+    if len(shape) > 7:
+        raise ValueError(f"tensor arg {t['arg_index']} rank {len(shape)} exceeds descriptor capacity (7)")
     strides = list(t.get("strides") or _row_major(shape))
+    if len(strides) != len(shape):
+        raise ValueError(f"tensor arg {t['arg_index']} has shape/strides rank mismatch")

Also applies to: 493-496, 863-866
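The limit of 7 follows from the two offsets quoted above, which the sketch below spells out. It assumes each shape and stride entry is a packed 4-byte value, consistent with the `uint32_t shp[]`/`strd[]` arrays in the generated host code; the constant and function names are hypothetical.

```python
# Offsets quoted in the review comment.
SHAPES_OFFSET = 44
STRIDES_OFFSET = 72
# Assumption: entries are packed 4-byte (uint32) values.
ENTRY_BYTES = 4

# Shapes may only occupy the bytes before the strides field begins.
MAX_RANK = (STRIDES_OFFSET - SHAPES_OFFSET) // ENTRY_BYTES


def check_rank(arg_index, shape, strides):
    """Raise early instead of letting shapes spill into the strides field."""
    if len(shape) > MAX_RANK:
        raise ValueError(
            f"tensor arg {arg_index} rank {len(shape)} exceeds descriptor capacity ({MAX_RANK})"
        )
    if len(strides) != len(shape):
        raise ValueError(f"tensor arg {arg_index} has shape/strides rank mismatch")


check_rank(0, [2, 3, 4], [12, 4, 1])  # rank 3 passes silently

try:
    check_rank(1, [1] * 8, [1] * 8)
    rank8_rejected = False
except ValueError:
    rank8_rejected = True

print(MAX_RANK, rank8_rejected)  # 7 True
```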
    buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
    contig = 1 if _is_contiguous(shape, strides, a["start_offset"]) else 0
    ndims = len(shape)
    shp = ", ".join(str(x) for x in shape)
    strd = ", ".join(str(x) for x in strides)
    # Default: data memset to 0 (only descriptor metadata is real). When
    # --set-arg fills this tensor, write VALUE into every element instead,
    # for control tensors whose CONTENT drives the kernel (e.g. paged
    # attention reads n_blocks from the context_lens tensor). The low `esz`
    # bytes of the int64 VALUE are copied per element (correct for any
    # integer width, little-endian).
    fill = a.get("fill")
    if fill is None:
        init = f" ACL_CHECK(aclrtMemset(d_t{ti}, t{ti}Bytes, 0, t{ti}Bytes));"
    else:
        init = (
            f" {{\n"
            f" std::vector<unsigned char> hbuf{ti}(t{ti}Bytes, 0);\n"
            f" const int64_t fillv{ti} = {fill}LL;\n"
            f" for (size_t off = 0; off + {esz} <= t{ti}Bytes; off += {esz})\n"
            f" memcpy(hbuf{ti}.data() + off, &fillv{ti}, {esz});\n"
            f" ACL_CHECK(aclrtMemcpy(d_t{ti}, t{ti}Bytes, hbuf{ti}.data(), t{ti}Bytes,\n"
            f" ACL_MEMCPY_HOST_TO_DEVICE));\n"
            f" }}"
        )
    alloc.append(
        f" void *d_t{ti} = nullptr;\n"
        f" const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
        f" ACL_CHECK(aclrtMalloc(&d_t{ti}, t{ti}Bytes, ACL_MEM_MALLOC_HUGE_FIRST));\n"
        f"{init}"
    )
    descs.append(
        f" {{\n"
        f" const uint32_t shp[] = {{{shp}}};\n"
        f" const uint32_t strd[] = {{{strd}}};\n"
        f" make_desc(h_tensors.data() + {ti} * 128, (uint64_t)(uintptr_t)d_t{ti},\n"
        f" t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
Decouple allocation size from descriptor buffer.size.
buf_bytes includes start_offset and is also written into the descriptor as buffer.size, while start_offset is written separately. For non-zero-offset dump records this overstates the tensor’s logical span by start_offset elements; allocate enough bytes for the offset view, but pass the captured/logical tensor span to make_desc.
🐛 Proposed fix direction
-    buf_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
+    desc_bytes = _extent_elem(shape, strides) * esz
+    alloc_bytes = (a["start_offset"] + _extent_elem(shape, strides)) * esz
@@
-        f" const size_t t{ti}Bytes = {buf_bytes}ULL;\n"
+        f" const size_t t{ti}Bytes = {alloc_bytes}ULL;\n"
+        f" const size_t t{ti}DescBytes = {desc_bytes}ULL;\n"
@@
-        f" t{ti}Bytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
+        f" t{ti}DescBytes, {a['start_offset']}ULL, shp, strd, {ndims}, {DTYPE_RAW[dt]}, {contig});\n"
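A worked example makes the gap between the two sizes visible. The sketch assumes `_extent_elem` returns the number of elements spanned from the first to the last addressed element; that definition, and the function names below, are guesses for illustration since the helper's body is not shown in this review.

```python
def extent_elems(shape, strides):
    """Assumed semantics of _extent_elem: span in elements, first to last inclusive."""
    if any(d == 0 for d in shape):
        return 0
    return 1 + sum((d - 1) * s for d, s in zip(shape, strides))


def buffer_sizes(shape, strides, start_offset, esz):
    """Return (alloc_bytes, desc_bytes) for an offset view."""
    span = extent_elems(shape, strides)
    desc_bytes = span * esz                    # logical span of the tensor
    alloc_bytes = (start_offset + span) * esz  # room for the offset plus the span
    return alloc_bytes, desc_bytes


# A 4x8 row-major float32 view that starts 16 elements into its buffer.
alloc_bytes, desc_bytes = buffer_sizes([4, 8], [8, 1], start_offset=16, esz=4)
print(alloc_bytes, desc_bytes, alloc_bytes - desc_bytes)  # 192 128 64
```

The 64-byte difference is exactly `start_offset * esz`, which is the amount by which a descriptor built from the allocation size would overstate the span.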
    BUILD_DIR="$WS/build"
    COLLECT_DIR="$WS/msprof_collect"
    EXPORT_ROOT="$WS/insight_export"

    source "$CANN_HOME/set_env.sh"
    export ASCEND_HOME_PATH="$CANN_HOME"
    SIM_LIB_DIR="$CANN_HOME/aarch64-linux/simulator/$SOC_VERSION/lib"
    LD_LIBS="$BUILD_DIR:$SIM_LIB_DIR:$CANN_HOME/lib64"
    LD_LIBS="$LD_LIBS:$CANN_HOME/aarch64-linux/devlib:$CANN_HOME/devlib"
    export LD_LIBRARY_PATH="$LD_LIBS:${LD_LIBRARY_PATH:-}"
    export ACL_DEVICE_ID="$DEVICE_ID"
    mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

    cmake -G Ninja -S "$WS" -B "$BUILD_DIR" \\
      -DSOC_VERSION="$SOC_VERSION" -DPTO_ISA_ROOT="$PTO_ISA_ROOT" -DREPO_ROOT="$REPO_ROOT"
    cmake --build "$BUILD_DIR" --target replay_host

    msprof op simulator \\
      --application="$BUILD_DIR/replay_host" --kernel-name="replay_entry" \\
      --launch-count=1 --soc-version="$SOC_VERSION" --timeout=120 \\
      --output="$COLLECT_DIR/out" 2>&1 | tee "$COLLECT_DIR/msprof_collect.log"

    OPPROF_DIR="$(find "$COLLECT_DIR/out" -maxdepth 1 -mindepth 1 -type d -name 'OPPROF_*' | sort | tail -n 1)"
Clear per-run collect/export directories before selecting OPPROF_*.
run_collect.sh reuses msprof_collect and insight_export, then both the shell script and Python pick the newest matching export. If a rerun leaves stale OPPROF_* artifacts after a failed or partial collect, the tool can return an old trace for the current kernel. Remove those directories or create a unique run subdirectory before each collection.
🧹 Proposed fix
 BUILD_DIR="$WS/build"
 COLLECT_DIR="$WS/msprof_collect"
 EXPORT_ROOT="$WS/insight_export"
@@
-mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"
+rm -rf "$COLLECT_DIR" "$EXPORT_ROOT"
+mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

Also applies to: 1225-1231
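The stale-artifact failure mode is easy to reproduce in isolation. The sketch below is not the tool's code: `newest_opprof` and `fresh_dir` are hypothetical helpers, the directory name `OPPROF_20240101_old` is invented, and a temporary directory stands in for the workspace.

```python
import shutil
import tempfile
from pathlib import Path


def newest_opprof(out_dir: Path):
    """Pick the lexicographically last OPPROF_* directory, like `sort | tail -n 1`."""
    found = sorted(p for p in out_dir.glob("OPPROF_*") if p.is_dir())
    return found[-1] if found else None


def fresh_dir(path: Path) -> Path:
    """Delete path if it exists, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)
    path.mkdir(parents=True)
    return path


ws = Path(tempfile.mkdtemp())
collect = ws / "msprof_collect"

# A previous run left an export behind.
(collect / "out" / "OPPROF_20240101_old").mkdir(parents=True)

# The current run fails before writing anything: the stale export is still picked.
stale_pick = newest_opprof(collect / "out")

# Clearing the directory first turns the same failure into a visible "nothing found".
fresh_dir(collect)
(collect / "out").mkdir()
clean_pick = newest_opprof(collect / "out")

print(stale_pick.name, clean_pick)
shutil.rmtree(ws)
```

A unique per-run subdirectory would achieve the same isolation without deleting earlier results, at the cost of accumulating old runs in the workspace.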