Skip to content

Refactor: share dump arg selection state#1069

Open
zmnobug wants to merge 1 commit into
hw-native-sys:mainfrom
zmnobug:feature/selective-scalar-dump
Open

Refactor: share dump arg selection state#1069
zmnobug wants to merge 1 commit into
hw-native-sys:mainfrom
zmnobug:feature/selective-scalar-dump

Conversation

@zmnobug

@zmnobug zmnobug commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Centralize PTO2 profiling macro defaults in profiling_config.h
  • Move selective dump mask and scalar metadata bookkeeping into DumpArgSelection
  • Keep runtime Arg dump state gated by PTO2_PROFILING

Testing

  • conda run -n zm_pypto pre-commit run --files src/common/task_interface/profiling_config.h src/common/platform/include/aicpu/dump_arg_selection.h src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h src/a5/runtime/host_build_graph/runtime/pto_runtime2_types.h src/a5/runtime/host_build_graph/runtime/tensor_info.h src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
  • cmake -B tests/ut/cpp/build -S tests/ut/cpp
  • cmake --build tests/ut/cpp/build --target a2a3_rt_objs a5_rt_objs
  • cmake -B /tmp/simpler-cpput-prof0 -S tests/ut/cpp -DCMAKE_CXX_FLAGS=-DPTO2_PROFILING=0
  • cmake --build /tmp/simpler-cpput-prof0 --target a2a3_rt_objs a5_rt_objs

Closes #1043

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 580142c4-e702-447c-babe-776fa677bffd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces profiling_config.h as a single canonical source for four PTO2_* profiling macros (with #error dependency enforcement) and DumpArgSelection as a new struct encapsulating dump-argument bitmask and scalar-slot metadata. All eight runtime headers across a5 and a2a3 drop their inline macro fallbacks in favor of #include "profiling_config.h", and both pto_types.h variants replace scattered inline profiling fields with a DumpArgSelection dump_arg_selection_ member.

Changes

Profiling Config Centralization and DumpArgSelection Extraction

Layer / File(s) Summary
New shared headers: profiling_config and DumpArgSelection
src/common/task_interface/profiling_config.h, src/common/platform/include/aicpu/dump_arg_selection.h
profiling_config.h defines PTO2_PROFILING (default 1), PTO2_ORCH_PROFILING, PTO2_SCHED_PROFILING, PTO2_TENSORMAP_PROFILING (defaults 0) with #error dependency checks. dump_arg_selection.h defines DumpArgSelection with a 64-bit selection mask, a 64-bit ambiguity mask, per-scalar-slot source pointers and dtypes, and public methods for marking, clearing, pointer-based scalar resolution, metadata recording, and dtype copying.
Runtime headers migrated to profiling_config.h
src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h, src/a5/runtime/host_build_graph/runtime/pto_runtime2_types.h, src/a5/runtime/host_build_graph/runtime/tensor_info.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
Each file replaces its inline #ifndef PTO2_PROFILING / #define/#endif`` fallback block (and where present, the full four-macro hierarchy with #error checks) with a single `#include "profiling_config.h"`.
Arg refactored onto DumpArgSelection (a2a3 and a5 pto_types.h)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
Includes aicpu/dump_arg_selection.h; removes private fields dump_arg_mask_, dump_arg_index_ambiguous_mask_, scalar_source_ptrs_[], scalar_dtypes_[] and adds DumpArgSelection dump_arg_selection_. All profiling code paths — clear(), tensor_dump_arg_mask(), tensor_dump_arg_index_ambiguous_mask(), add_scalars, add_scalars_i32, add_scalar_one, copy_scalars_from, scalar_dtypes(), mark_all_dump_args, mark_dump_arg(Tensor), mark_dump_arg(TensorCreateInfo), mark_dump_arg(scalar) — are rewired to call dump_arg_selection_ methods.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/simpler#953: Directly consumes Arg::tensor_dump_arg_mask() to enable selective dump mode, which this PR refactors to read from DumpArgSelection.
  • hw-native-sys/simpler#1028: Extends Arg::dump() marking for tensor vs. scalar lvalues and ambiguity handling — the exact logic this PR moves into DumpArgSelection.

Poem

🐇 Hopping through headers, I swept up the mess,
One profiling_config.h to rule all the rest!
DumpArgSelection born, tidy and trim,
No more mask fields scattered on a whim.
Eight headers rejoice, their clutter now gone —
The rabbit refactors, and hops gently on. 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.30% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Refactor: share dump arg selection state' clearly and concisely describes the main change—the extraction and sharing of dump argument selection functionality.
Description check ✅ Passed The description provides a clear summary of the three main objectives: centralizing profiling macros, moving dump selection state, and maintaining runtime gating, plus testing instructions.
Linked Issues check ✅ Passed The PR fully addresses both problems identified in #1043: centralizes PTO2_PROFILING macros in profiling_config.h and extracts dump/scalar-selection machinery into DumpArgSelection.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the #1043 objectives: profiling macro centralization and dump arg selection extraction, with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors profiling and tensor dump configurations across multiple runtime directories by introducing a centralized profiling_config.h header and a new DumpArgSelection helper class to manage argument selection and masking. The code review feedback highlights several critical safety concerns in the new DumpArgSelection class, specifically pointing out potential undefined behavior from out-of-bounds bitwise shifts and several instances of missing bounds checks on indices and counts that could lead to out-of-bounds memory reads or writes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

uint64_t dump_arg_index_ambiguous_mask() const { return dump_arg_index_ambiguous_mask_; }
const uint8_t *scalar_dtypes() const { return scalar_dtypes_; }

void mark_index(int32_t index) { dump_arg_mask_ |= (uint64_t{1} << index); }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Shifting a 64-bit integer by an out-of-bounds index (negative or >= 64) is undefined behavior in C++. To prevent this, validate that index is within the safe range [0, 63] before performing the bitwise shift.

    void mark_index(int32_t index) {
        if (index >= 0 && index < 64) {
            dump_arg_mask_ |= (uint64_t{1} << index);
        }
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: mark_index now validates the argument with is_valid_arg_index(index) before doing the 64-bit shift.

Comment on lines +45 to +46
bool mark_scalar_by_ptr(uintptr_t ptr, int32_t scalar_count, int32_t tensor_offset) {
int32_t first_match = -1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Validate that scalar_count is non-negative and does not exceed CORE_MAX_SCALAR_ARGS to prevent out-of-bounds reads from scalar_source_ptrs_.

    bool mark_scalar_by_ptr(uintptr_t ptr, int32_t scalar_count, int32_t tensor_offset) {
        if (scalar_count < 0 || scalar_count > CORE_MAX_SCALAR_ARGS) {
            return false;
        }
        int32_t first_match = -1;
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: mark_scalar_by_ptr now validates scalar_count, tensor_offset, the arg span, and the computed arg_index before scanning or marking.

Comment on lines +68 to +71
void record_scalar_source(int32_t slot, uintptr_t ptr, uint8_t dtype) {
scalar_source_ptrs_[slot] = ptr;
scalar_dtypes_[slot] = dtype;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Validate that slot is within the bounds of CORE_MAX_SCALAR_ARGS to prevent out-of-bounds writes to scalar_source_ptrs_ and scalar_dtypes_.

    void record_scalar_source(int32_t slot, uintptr_t ptr, uint8_t dtype) {
        if (slot >= 0 && slot < CORE_MAX_SCALAR_ARGS) {
            scalar_source_ptrs_[slot] = ptr;
            scalar_dtypes_[slot] = dtype;
        }
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: record_scalar_source now returns unless slot is a valid single-slot scalar range.

Comment on lines +78 to +81
void copy_scalar_dtypes_from(const DumpArgSelection &src, int32_t dst_offset, int32_t src_offset, int32_t count) {
memcpy(&scalar_dtypes_[dst_offset], &src.scalar_dtypes_[src_offset], count * sizeof(uint8_t));
clear_scalar_sources(dst_offset, count);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Validate that count, dst_offset, and src_offset are non-negative and within the bounds of CORE_MAX_SCALAR_ARGS to prevent out-of-bounds memory operations during memcpy and clear_scalar_sources.

Suggested change
void copy_scalar_dtypes_from(const DumpArgSelection &src, int32_t dst_offset, int32_t src_offset, int32_t count) {
memcpy(&scalar_dtypes_[dst_offset], &src.scalar_dtypes_[src_offset], count * sizeof(uint8_t));
clear_scalar_sources(dst_offset, count);
}
void copy_scalar_dtypes_from(const DumpArgSelection &src, int32_t dst_offset, int32_t src_offset, int32_t count) {
if (count < 0 || dst_offset < 0 || src_offset < 0 ||
dst_offset + count > CORE_MAX_SCALAR_ARGS ||
src_offset + count > CORE_MAX_SCALAR_ARGS) {
return;
}
memcpy(&scalar_dtypes_[dst_offset], &src.scalar_dtypes_[src_offset], count * sizeof(uint8_t));
clear_scalar_sources(dst_offset, count);
}
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: copy_scalar_dtypes_from now validates both destination and source scalar ranges before memcpy and source clearing.

}

private:
void mark_index_ambiguous(int32_t index) { dump_arg_index_ambiguous_mask_ |= (uint64_t{1} << index); }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Shifting a 64-bit integer by an out-of-bounds index (negative or >= 64) is undefined behavior in C++. Validate that index is within the safe range [0, 63] before performing the bitwise shift.

    void mark_index_ambiguous(int32_t index) {
        if (index >= 0 && index < 64) {
            dump_arg_index_ambiguous_mask_ |= (uint64_t{1} << index);
        }
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: mark_index_ambiguous now validates the index before setting the ambiguity mask bit.

Comment on lines +88 to +92
void clear_scalar_sources(int32_t start, int32_t count) {
for (int32_t i = 0; i < count; i++) {
scalar_source_ptrs_[start + i] = 0;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Validate that start and count are non-negative and within the bounds of CORE_MAX_SCALAR_ARGS to prevent out-of-bounds writes to scalar_source_ptrs_.

    void clear_scalar_sources(int32_t start, int32_t count) {
        if (count < 0 || start < 0 || start + count > CORE_MAX_SCALAR_ARGS) {
            return;
        }
        for (int32_t i = 0; i < count; i++) {
            scalar_source_ptrs_[start + i] = 0;
        }
    }
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: clear_scalar_sources(start, count) now validates the scalar range before writing scalar_source_ptrs_.

Comment on lines +94 to +96
void clear_scalar_dtypes(int32_t start, int32_t count) {
memset(&scalar_dtypes_[start], 0, count * sizeof(uint8_t));
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Validate that start and count are non-negative and within the bounds of CORE_MAX_SCALAR_ARGS to prevent out-of-bounds writes to scalar_dtypes_ during memset.

Suggested change
void clear_scalar_dtypes(int32_t start, int32_t count) {
memset(&scalar_dtypes_[start], 0, count * sizeof(uint8_t));
}
void clear_scalar_dtypes(int32_t start, int32_t count) {
if (count < 0 || start < 0 || start + count > CORE_MAX_SCALAR_ARGS) {
return;
}
memset(&scalar_dtypes_[start], 0, count * sizeof(uint8_t));
}
References
  1. When using a signed integer for a count or size parameter in functions that perform bounds checks and memory operations, always validate that the count is non-negative to prevent bypassing bounds checks and causing out-of-bounds writes or signed-to-unsigned conversion overflows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in current head 9166ebb: clear_scalar_dtypes(start, count) now validates the scalar range before memset.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h (1)

392-403: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject negative src_offset in both copy_scalars_from() mirrors.

Both implementations only check count < 0 and src_offset + count > src.scalar_count_; a negative src_offset can pass and then index before src.scalars_ and src.dump_arg_selection_ metadata.

  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h#L392-L403: add src_offset < 0 to the source-range validation before the memcpy and copy_scalar_dtypes_from() call.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h#L390-L401: apply the same lower-bound check in the mirror.
Proposed fix pattern
 void copy_scalars_from(const Arg &src, int src_offset, int count) {
-    if (count < 0 || src_offset + count > src.scalar_count_) {
+    if (src_offset < 0 || count < 0 || src_offset + count > src.scalar_count_) {
         set_error("Source scalar range out of bounds in copy_scalars_from");
         return;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h` around lines
392 - 403, Both implementations of the copy_scalars_from() method have an
incomplete bounds check that allows negative src_offset values to pass through
and cause out-of-bounds memory access. In
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h (lines 392-403,
anchor), add a check for src_offset < 0 in the initial validation condition
alongside the existing count < 0 and src_offset + count > src.scalar_count_
checks. Apply the identical lower-bound check (src_offset < 0) to the mirror
implementation in src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
(lines 390-401, sibling) to ensure consistent validation across both copies of
the method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/common/platform/include/aicpu/dump_arg_selection.h`:
- Around line 34-95: Add bounds checking to all index and range-based mutators
in the DumpArgSelection struct to prevent undefined behavior. In mark_index and
mark_index_ambiguous, validate that index is within [0, 63] before performing
the bit-shift operation. In mark_all, validate that tensor_count and
scalar_count are non-negative and that their sum does not exceed valid argument
limits. In mark_scalar_by_ptr, validate that scalar_count is non-negative,
tensor_offset is valid, and the computed arg_index is within [0, 63]. In
record_scalar_source, validate that slot is within [0, CORE_MAX_SCALAR_ARGS). In
clear_scalar_sources and clear_scalar_dtypes (both overloads), validate that
start is non-negative, count is non-negative, and start + count does not exceed
CORE_MAX_SCALAR_ARGS. In copy_scalar_dtypes_from, validate that dst_offset,
src_offset, and count are all non-negative and that both source and destination
ranges fit within the scalar arrays without overflow.

---

Outside diff comments:
In `@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h`:
- Around line 392-403: Both implementations of the copy_scalars_from() method
have an incomplete bounds check that allows negative src_offset values to pass
through and cause out-of-bounds memory access. In
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h (lines 392-403,
anchor), add a check for src_offset < 0 in the initial validation condition
alongside the existing count < 0 and src_offset + count > src.scalar_count_
checks. Apply the identical lower-bound check (src_offset < 0) to the mirror
implementation in src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
(lines 390-401, sibling) to ensure consistent validation across both copies of
the method.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0380cc61-e1fe-4642-a282-8ed187bf9bf9

📥 Commits

Reviewing files that changed from the base of the PR and between 19b2c0b and 55f4751.

📒 Files selected for processing (11)
  • src/a2a3/runtime/host_build_graph/runtime/pto_runtime2_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
  • src/a5/runtime/host_build_graph/runtime/pto_runtime2_types.h
  • src/a5/runtime/host_build_graph/runtime/tensor_info.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h
  • src/common/platform/include/aicpu/dump_arg_selection.h
  • src/common/task_interface/profiling_config.h

Comment thread src/common/platform/include/aicpu/dump_arg_selection.h Outdated
@zmnobug zmnobug force-pushed the feature/selective-scalar-dump branch from e902a8d to 9166ebb Compare June 16, 2026 13:01
  - Centralize PTO2 profiling macro defaults in profiling_config.h
  - Move tensor/scalar dump selection metadata into DumpArgSelection
  - Reuse the shared helper from a2a3 and a5 Arg implementations
  - Update profiling and tensor dump docs for the new macro source
@zmnobug zmnobug force-pushed the feature/selective-scalar-dump branch from 9166ebb to 1266722 Compare June 17, 2026 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Code Health] Extract dump/scalar-selection out of runtime Arg + give PTO2_PROFILING a single source of truth

1 participant