fix(runtime): allow zero-size view at any offset (Tensor::view bounds-check) by csy0225 · Pull Request #1023 · hw-native-sys/simpler

csy0225 · 2026-06-10T16:12:49Z

Summary

Tensor::view() asserts view_offsets[i] + view_shapes[i] <= shapes[i] for every dim — but the orchestration codegen emits guards of the form (off >= shape ? 0 : min(...)) to clamp the shape dim to 0 when a dynamic offset runs past the parent, leaving the offset as the raw (possibly out-of-range) value and expecting the runtime to honour the empty-view contract. With the current assertion, that codegen pattern fails on board.

This PR loosens view() to detect any zero-shape dim up front and skip both the per-dim offset assertion and the start_offset advancement (no bytes captured, so offset bounds are irrelevant) and the trailing assert_in_buffer_bounds(). All previously valid views still pass through the same code path; only previously-asserting-but-now-OK views (zero shape + out-of-range offset) now correctly return an empty Tensor.

Repro signature on board (a2a3)

The current assertion surfaces as a generic AICPU exception errorCode=0x2a → host-side aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018. With ASCEND_GLOBAL_LOG_LEVEL=1 and ASCEND_PROCESS_LOG_PATH=<dir>, the device-side plog shows:

[ERROR] AICPU: Assertion failed:
        view_offsets[i] + view_shapes[i] <= shapes[i]
[ERROR] AICPU: Location:
        .../runtime/tensormap_and_ringbuffer/runtime/tensor.h:356

The trigger is the codegen lowering of dynamic offsets like static_cast<uint32_t>(pos) where pos = ctx_len - 1. When ctx_len is 0 (the very first prefill step, or a dummy-weight bring-up where seq_lens is not pre-populated), pos = -1 and the cast underflows to 0xFFFFFFFF. The codegen guard correctly sets cos_row_shapes[0] = 0 for this offset, but the OFFSET passed to view() is the raw 0xFFFFFFFF, which fails the assertion.

A typical generated chip-side site (paraphrased from aicpu_orchestration_entry):

int32_t ctx_len = get_tensor_data<int32_t>(ext_seq_lens, 1, indices_ctx_len);
int64_t pos = (static_cast<int64_t>(ctx_len) - 1);
uint32_t cos_row_offsets[2] = {static_cast<uint32_t>(pos), 0};
uint32_t cos_row_shapes[2] = {
    (cos_row_offsets[0] >= ext_rope_cos.shapes[0]
        ? 0u
        : std::min<uint32_t>(1, ext_rope_cos.shapes[0] - cos_row_offsets[0])),
    ...
};
Tensor cos_row = ext_rope_cos.view(cos_row_shapes, cos_row_offsets);  // <-- asserts

Why fix it in the runtime, not the codegen

The codegen's intent is unambiguous: (off >= shape ? 0 : ...) is exactly "if offset is out of range, return an empty view". Hardening every codegen call site to also clamp the offset would touch many emit paths and is prone to regression; honoring the empty-view contract once in the runtime is a strict improvement and matches what the codegen already documents in its emitted comments.

Diff shape

Two byte-identical sibling headers (a2a3 and a5) updated symmetrically.
ABI unchanged: sizeof(Tensor) is untouched and the static_asserts in comm_context.h continue to hold.
No call-site changes anywhere — purely a one-method semantic widening.

Test plan

Existing tests/ut/cpp/{a2a3,a5}/test_tensormap.cpp view-based overlap tests continue to pass (no preexisting test exercises the zero-shape-view + out-of-range-offset path).
Suggest adding a regression test: base.view([0, ...], [0xFFFFFFFF, 0]) returns an empty Tensor with numel()==0 and does not abort.
Validated locally on Ascend a2a3 single-rank decode: with this fix the assertion never fires; the chip-side run progresses past the rope-view sites it previously aborted at.

References

The originating model-side investigation is documented in this companion issue / mirror in the consumer project: pypto-side root-cause walkthrough exists at docs/upstream-issues/pypto-1702-followup.md (separate codepath from PR hw-native-sys/pypto#1718).

Tensor::view() asserted view_offsets[i] + view_shapes[i] <= shapes[i] for every dim, which fails even when view_shapes[i] == 0 if the offset alone exceeds the parent shape. The orchestration codegen relies on the inverse contract: it emits guards of the form cos_row_shapes[0] = (off >= shape ? 0u : std::min<uint32_t>(1, shape - off)); Tensor cos_row = ext_rope_cos.view(cos_row_shapes, cos_row_offsets); to produce a zero-size view when the dynamic offset (e.g. context-length-1) runs past the parent. The codegen clamps the SHAPE dim to 0 but leaves the OFFSET dim as the raw — possibly out-of-range — value, expecting the runtime to honour the empty-view contract (no bytes captured, so offset bounds are immaterial). The current assertion is what surfaces on board as the Ascend a2a3 AICPU exception 0x2a / aclrtSynchronizeStream 507018 chain, with this exact device-side stack: [ERROR] AICPU: Assertion failed: view_offsets[i] + view_shapes[i] <= shapes[i] [ERROR] AICPU: Location: .../runtime/tensor.h:356 triggered by `static_cast<uint32_t>(pos)` where `pos = ctx_len - 1` underflows to 0xFFFFFFFF when ctx_len happens to be 0 (e.g. the very first prefill step or a dummy-weight bring-up where the seq_lens input has not been pre-populated). Fix: detect any zero-shape dim up front and skip both the per-dim offset assertion and the start_offset advancement (no bytes accessed, so start_offset is irrelevant) and the trailing assert_in_buffer_bounds() (buffer-bounds check is meaningless on an empty tensor). All previously valid views still pass through the same code path; only previously asserting-but-now-OK views (zero shape + out-of-range offset) start returning a clean empty Tensor. Applied symmetrically to a2a3 and a5; the two tensor.h variants were byte-identical before and after this change. ABI is unchanged (sizeof(Tensor) untouched; the static_asserts in comm_context.h continue to hold), so neither the AICPU shared SO nor host orchestration .so need to be rebuilt against the new header beyond the runtime that hosts this Tensor type itself.

coderabbitai · 2026-06-10T16:15:45Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 03b45a46-eb7f-4a61-aafd-1f3dac45dc2d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR applies identical empty-view handling logic to the Tensor::view(...) method in two parallel tensor runtime implementations. When a view has any zero-size dimension, bounds validation checks are skipped; non-empty views retain their original offset and buffer bounds assertions.

Changes

Empty view bounds checking in Tensor::view

Layer / File(s)	Summary
Empty view bounds checking skip `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h`	`Tensor::view` detects empty views (any dimension is zero) and skips per-dimension offset bounds debug assertions, `start_offset` updates, and final `assert_in_buffer_bounds()` calls; non-empty views preserve original validation.

🎯 2 (Simple) | ⏱️ ~8 minutes

A view that's truly empty needs no bounds—
Zero dimensions, no offset confounds!
The tensor now skips checks with grace,
For views that just vanish without trace. 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: allowing zero-size views at any offset in the Tensor::view() bounds-check logic.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the problem, solution, rationale, and validation approach for the Tensor::view() bounds-check modification.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…#1018) HcclGetRootInfo / HcclCommInitRootInfo are declared weak in CANN's hccl_comm.h. Under a toolchain that defaults to --as-needed (Ubuntu / conda gcc on x86) a weak undefined ref does not mark libhcomm "needed", so its DT_NEEDED is dropped and the symbols resolve to NULL at runtime -> SIGSEGV on the first HcclGetRootInfo call from comm_init. Wrap ${HCCL_LINK_TARGETS} in -Wl,--no-as-needed ... -Wl,--as-needed so the dependency survives any default linker policy. Verified on the host pod: rebuilt libhost_runtime.so now lists libhcomm.so in its NEEDED, and 2-rank allreduce_distributed no longer faults in comm_init. Mirrors a2a3 and a5 onboard.

Three downstream fixes surfaced after hw-native-sys#1018 cleared the comm_init SIGSEGV; together they get a 2-rank allreduce_distributed past comm_alloc_domain_windows. 1. SIMPLER_ENABLE_PTO_SDMA_WORKSPACE: ON -> OFF. On the current driver/CANN the ShmemSdmaStarsQuery AICPU kernel (libcpu_kernels.so::RunCpuKernel) raises an AICPU exception (errorCode 0x2a) during comm bootstrap, which faults the whole chip and cascades every subsequent op to 507899 / 507018 (notably the aclrtIpcMemImportByKey in domain_alloc_via_ipc). The workspace is orthogonal to comm bootstrap (only the sdma_async_completion_demo needs it) and the leaving-workSpace==0 fallback is supported (sync TLOAD / TSTORE + the IPC window path do not need it). Re-enable once ShmemSdmaStarsQuery is fixed for the async-SDMA demo. 2. aclrtIpcMemImportByKey: flags 0 -> ACL_RT_IPC_MEM_IMPORT_FLAG_ENABLE_PEER_ACCESS. The imported buffer lives on a peer device, so the import must request cross-device peer access. Without it the driver's halShmemOpenHandle rejects the cross-card handle (drvRetCode=8 -> rtsIpcMemImportByKey 507899). Applied in both alloc_windows_via_ipc and domain_alloc_via_ipc. 3. aclrtDevicePeerAccessStatus: bounded best-effort instead of 30s fatal poll. Under ASCEND_VISIBLE_DEVICES remapping with one fork per chip, each process aclrtSetDevice's only its own device, so the status query cannot resolve the peer's logical id to a physical one and reports status=0 indefinitely even when EnablePeerAccess succeeded and the HCCS topology is up. Confirm quickly where reliable (ARM / openEuler non-remapped paths still return status=1) and fall through with a warning otherwise; the file_barrier below already synchronizes every rank's enable, and a dead link still surfaces as a kernel-side TWAIT hang. Applied to both call sites; 30s -> 3s deadline.

…TASK-19) Dispatcher .so built for the aicpu side is aarch64 even when the host is x86. GNU strip 2.38 (Ubuntu 22.04 default) cannot read foreign-arch ELF and aborts; install on x86 hosts fails on RuntimeCompiler. Prefer llvm-strip (multi-arch via LLVM Object) when available, fall back to plain strip otherwise so existing aarch64 / openEuler hosts are unaffected.

csy0225 added 3 commits June 12, 2026 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runtime): allow zero-size view at any offset (Tensor::view bounds-check)#1023

fix(runtime): allow zero-size view at any offset (Tensor::view bounds-check)#1023
csy0225 wants to merge 4 commits into
hw-native-sys:mainfrom
csy0225:fix/tensor-zero-size-view-bounds

csy0225 commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

csy0225 commented Jun 10, 2026

Summary

Repro signature on board (a2a3)

Why fix it in the runtime, not the codegen

Diff shape

Test plan

References

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading