fix(runtime): allow zero-size view at any offset (Tensor::view bounds-check)#1023
fix(runtime): allow zero-size view at any offset (Tensor::view bounds-check)#1023csy0225 wants to merge 4 commits into
Conversation
Tensor::view() asserted view_offsets[i] + view_shapes[i] <= shapes[i] for
every dim, which fails even when view_shapes[i] == 0 if the offset alone
exceeds the parent shape. The orchestration codegen relies on the inverse
contract: it emits guards of the form
cos_row_shapes[0] = (off >= shape ? 0u : std::min<uint32_t>(1, shape - off));
Tensor cos_row = ext_rope_cos.view(cos_row_shapes, cos_row_offsets);
to produce a zero-size view when the dynamic offset (e.g. context-length-1)
runs past the parent. The codegen clamps the SHAPE dim to 0 but leaves the
OFFSET dim as the raw — possibly out-of-range — value, expecting the
runtime to honour the empty-view contract (no bytes captured, so offset
bounds are immaterial).
The current assertion is what surfaces on board as the Ascend a2a3 AICPU
exception 0x2a / aclrtSynchronizeStream 507018 chain, with this exact
device-side stack:
[ERROR] AICPU: Assertion failed:
view_offsets[i] + view_shapes[i] <= shapes[i]
[ERROR] AICPU: Location: .../runtime/tensor.h:356
triggered by `static_cast<uint32_t>(pos)` where `pos = ctx_len - 1`
underflows to 0xFFFFFFFF when ctx_len happens to be 0 (e.g. the very
first prefill step or a dummy-weight bring-up where the seq_lens input
has not been pre-populated).
Fix: detect any zero-shape dim up front and skip both the per-dim offset
assertion and the start_offset advancement (no bytes accessed, so
start_offset is irrelevant) and the trailing assert_in_buffer_bounds()
(buffer-bounds check is meaningless on an empty tensor). All previously
valid views still pass through the same code path; only previously
asserting-but-now-OK views (zero shape + out-of-range offset) start
returning a clean empty Tensor.
Applied symmetrically to a2a3 and a5; the two tensor.h variants were
byte-identical before and after this change.
ABI is unchanged (sizeof(Tensor) untouched; the static_asserts in
comm_context.h continue to hold), so neither the AICPU shared SO nor
host orchestration .so need to be rebuilt against the new header beyond
the runtime that hosts this Tensor type itself.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR applies identical empty-view handling logic to the ChangesEmpty view bounds checking in Tensor::view
🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…#1018) HcclGetRootInfo / HcclCommInitRootInfo are declared weak in CANN's hccl_comm.h. Under a toolchain that defaults to --as-needed (Ubuntu / conda gcc on x86) a weak undefined ref does not mark libhcomm "needed", so its DT_NEEDED is dropped and the symbols resolve to NULL at runtime -> SIGSEGV on the first HcclGetRootInfo call from comm_init. Wrap ${HCCL_LINK_TARGETS} in -Wl,--no-as-needed ... -Wl,--as-needed so the dependency survives any default linker policy. Verified on the host pod: rebuilt libhost_runtime.so now lists libhcomm.so in its NEEDED, and 2-rank allreduce_distributed no longer faults in comm_init. Mirrors a2a3 and a5 onboard.
Three downstream fixes surfaced after hw-native-sys#1018 cleared the comm_init SIGSEGV; together they get a 2-rank allreduce_distributed past comm_alloc_domain_windows. 1. SIMPLER_ENABLE_PTO_SDMA_WORKSPACE: ON -> OFF. On the current driver/CANN the ShmemSdmaStarsQuery AICPU kernel (libcpu_kernels.so::RunCpuKernel) raises an AICPU exception (errorCode 0x2a) during comm bootstrap, which faults the whole chip and cascades every subsequent op to 507899 / 507018 (notably the aclrtIpcMemImportByKey in domain_alloc_via_ipc). The workspace is orthogonal to comm bootstrap (only the sdma_async_completion_demo needs it) and the leaving-workSpace==0 fallback is supported (sync TLOAD / TSTORE + the IPC window path do not need it). Re-enable once ShmemSdmaStarsQuery is fixed for the async-SDMA demo. 2. aclrtIpcMemImportByKey: flags 0 -> ACL_RT_IPC_MEM_IMPORT_FLAG_ENABLE_PEER_ACCESS. The imported buffer lives on a peer device, so the import must request cross-device peer access. Without it the driver's halShmemOpenHandle rejects the cross-card handle (drvRetCode=8 -> rtsIpcMemImportByKey 507899). Applied in both alloc_windows_via_ipc and domain_alloc_via_ipc. 3. aclrtDevicePeerAccessStatus: bounded best-effort instead of 30s fatal poll. Under ASCEND_VISIBLE_DEVICES remapping with one fork per chip, each process aclrtSetDevice's only its own device, so the status query cannot resolve the peer's logical id to a physical one and reports status=0 indefinitely even when EnablePeerAccess succeeded and the HCCS topology is up. Confirm quickly where reliable (ARM / openEuler non-remapped paths still return status=1) and fall through with a warning otherwise; the file_barrier below already synchronizes every rank's enable, and a dead link still surfaces as a kernel-side TWAIT hang. Applied to both call sites; 30s -> 3s deadline.
…TASK-19) Dispatcher .so built for the aicpu side is aarch64 even when the host is x86. GNU strip 2.38 (Ubuntu 22.04 default) cannot read foreign-arch ELF and aborts; install on x86 hosts fails on RuntimeCompiler. Prefer llvm-strip (multi-arch via LLVM Object) when available, fall back to plain strip otherwise so existing aarch64 / openEuler hosts are unaffected.
Summary
Tensor::view()assertsview_offsets[i] + view_shapes[i] <= shapes[i]for every dim — but the orchestration codegen emits guards of the form(off >= shape ? 0 : min(...))to clamp the shape dim to 0 when a dynamic offset runs past the parent, leaving the offset as the raw (possibly out-of-range) value and expecting the runtime to honour the empty-view contract. With the current assertion, that codegen pattern fails on board.This PR loosens
view()to detect any zero-shape dim up front and skip both the per-dim offset assertion and thestart_offsetadvancement (no bytes captured, so offset bounds are irrelevant) and the trailingassert_in_buffer_bounds(). All previously valid views still pass through the same code path; only previously-asserting-but-now-OK views (zero shape + out-of-range offset) now correctly return an empty Tensor.Repro signature on board (a2a3)
The current assertion surfaces as a generic AICPU exception
errorCode=0x2a→ host-sideaclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018. WithASCEND_GLOBAL_LOG_LEVEL=1andASCEND_PROCESS_LOG_PATH=<dir>, the device-side plog shows:The trigger is the codegen lowering of dynamic offsets like
static_cast<uint32_t>(pos)wherepos = ctx_len - 1. Whenctx_lenis 0 (the very first prefill step, or a dummy-weight bring-up where seq_lens is not pre-populated),pos = -1and the cast underflows to0xFFFFFFFF. The codegen guard correctly setscos_row_shapes[0] = 0for this offset, but the OFFSET passed toview()is the raw0xFFFFFFFF, which fails the assertion.A typical generated chip-side site (paraphrased from
aicpu_orchestration_entry):Why fix it in the runtime, not the codegen
The codegen's intent is unambiguous:
(off >= shape ? 0 : ...)is exactly "if offset is out of range, return an empty view". Hardening every codegen call site to also clamp the offset would touch many emit paths and is prone to regression; honoring the empty-view contract once in the runtime is a strict improvement and matches what the codegen already documents in its emitted comments.Diff shape
sizeof(Tensor)is untouched and thestatic_assertsincomm_context.hcontinue to hold.Test plan
tests/ut/cpp/{a2a3,a5}/test_tensormap.cppview-based overlap tests continue to pass (no preexisting test exercises the zero-shape-view + out-of-range-offset path).base.view([0, ...], [0xFFFFFFFF, 0])returns an empty Tensor withnumel()==0and does not abort.References
The originating model-side investigation is documented in this companion issue / mirror in the consumer project: pypto-side root-cause walkthrough exists at
docs/upstream-issues/pypto-1702-followup.md(separate codepath from PRhw-native-sys/pypto#1718).