Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239
Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239jiafatom wants to merge 6 commits into
Conversation
Bool initializers provided via TensorProto raw_data are copied verbatim, so
their bytes are not guaranteed to be the canonical {0, 1}. Kernels assume bool
tensors hold {0, 1}, and the CUDA Compress sizing path in particular sign-extends
the condition bytes (int8 -> int32) to size the output while the kernel selects
elements using bool truthiness. For bytes outside {0, 1} the two interpretations
disagree, producing an incorrectly sized output.
Normalize bool raw_data to {0, 1} in UnpackTensor<bool> so every consumer sees a
consistent value, and harden the CUDA Compress CastToInt32 functor to normalize
as well so its sizing path matches its write predicate.
Add a unit test covering bool raw_data with non-canonical bytes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The decoder initial-feed construction conflated a byte count with an element count when sizing the staging buffer and the device-copy spans, causing the copy to run past the input_ids tensor. Separate the element count (used for the spans and staging allocation) from the per-sequence byte count (used for the per-beam memcpy), mirroring the T5 decoder sibling. Add a regression test (DummyWhisperWithSequenceInputIds) plus a model generator that exercises the use_sequence_as_input_ids path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.
Fixes an out-of-bounds write in Whisper decoder initial feed construction when use_sequence_as_input_ids is enabled, and adds regression coverage to exercise that path.
Changes:
- Corrected element-count vs byte-count math in
WhisperDecoderSubgraph::CreateInitialFeedsand fixed per-beammemcpysizing. - Added a deterministic dummy Whisper BeamSearch ONNX model generator plus a new BeamSearch regression test.
- Normalized non-canonical bool
raw_datavalues to{0,1}in tensor unpacking and aligned CUDA Compress sizing with bool “truthiness”.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/contrib_ops/cpu/transformers/subgraph_whisper_decoder.cc | Fixes initial feed buffer sizing/copy logic for sequence-as-input-ids path. |
| onnxruntime/test/contrib_ops/beam_search_test.cc | Adds regression test exercising Whisper sequence-as-input-ids decoder path (CPU/CUDA). |
| onnxruntime/test/testdata/dummy_whisper_model_generator.py | Adds script to generate the dummy Whisper BeamSearch model used by the new test. |
| onnxruntime/core/framework/tensorprotoutils.cc | Normalizes bool tensor raw_data to canonical {0,1} during unpack. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds unit test covering bool raw_data normalization behavior. |
| onnxruntime/core/providers/cuda/tensor/compress_impl.cu | Normalizes condition bytes for prefix-sum sizing to match selection truthiness. |
- subgraph_whisper_decoder.cc: use int32_t consistently for staging buffer/spans - tensorprotoutils.cc: add static_assert(sizeof(bool) == 1) for bool normalization loop - dummy_whisper_model_generator.py: add --no-run flag and lazy onnxruntime import - beam_search_test.cc: also validate scores output in DummyWhisperWithSequenceInputIds Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tianleiwu
left a comment
There was a problem hiding this comment.
Review summary
The core fix is correct. CreateInitialFeeds previously conflated a byte count with an element count: total_size = cur_len * batch_beam_size * sizeof(int) was reused both as the MakeUniquePtr<int> element count and as the gsl::span extent, so the staging buffer and the device-copy spans claimed 4x the real batch_beam_size * cur_len extent, overrunning the input_ids allocation. Splitting into total_size (element count) and sequence_bytes (per-beam byte count) resolves the overrun, and switching int->int32_t matches the tensor/sequence element type. The new memcpy length (sequence_bytes = cur_len * sizeof(int32_t)) is bounded by Sequences::GetSequence(i) (length current_length_), and the per-beam write at seq_copy_ptr + i*cur_len stays within total_size elements. Both multiplications cast to size_t before multiplying, so no intermediate overflow.
Note: this is actually slightly tighter than the T5 sibling it mirrors — subgraph_t5_decoder.cc still passes total_size_bytes (bytes) as the element count to MakeUniquePtr<int>, over-allocating ~4x (wasteful but safe). Out of scope here, but worth aligning T5 in a follow-up.
Independent changes bundled in: the bool-normalization in tensorprotoutils.cc (UnpackTensor<bool> raw_data path) and compress_impl.cu (CastToInt32) are logically separate from the Whisper bounds fix. The PR description now documents them, which is good; just calling out that a reviewer bisecting a future regression in CUDA Compress or bool initializer loading would not expect those in a "Whisper decoder feed bounds" PR.
Tests: the bounds fix gets a dedicated regression model/test, and the bool unpack path gets UnpackBoolTensorWithRawDataNormalizesToZeroOne. The CUDA Compress normalization has no direct coverage (see inline). Verdict: looks good, only minor suggestions.
Most existing inline comments on this PR are from the automated reviewer against an older commit (097246e) and appear already addressed on the current head (e.g. int->int32_t, the static_assert, the scores output, the --no-run flag).
…ess regression test - compress_impl.cu: note that runtime-produced bool tensors are the remaining source of non-canonical condition bytes (initializers are normalized on unpack). - Add CUDA Compress test feeding a raw 0xFF condition byte (which OpTester cannot produce) to lock in that output sizing agrees with truthiness-based element selection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tianleiwu
left a comment
There was a problem hiding this comment.
Re-reviewed at b7a842a. The only change since my last review (7f2c2a9) is the follow-up that addresses my prior feedback on the CUDA Compress normalization:
compress_impl.cu: added a comment clarifying that bool initializers are normalized on unpack, so the remaining source of non-canonical bytes is runtime-produced bool condition tensors — this is no longer dead-looking defensive code.compress_op.test.cc: newCompress_cuda_non_canonical_bool_conditionregression test. It builds the graph/session directly and writes a raw0xFFcondition byte (which OpTester can't produce because it normalizes bool inputs), then asserts the output is sized by truthiness. Verified the expected result: condition{0x00, 0xFF, 0x01}over input rows[[1,2],[3,4],[5,6]]selects rows 1 and 2 → shape{2,2}, values{3,4,5,6}. Without the fix,0xFF(int8_t -1) would corrupt the prefix-sum sizing rather than counting as one selected element.
The earlier core fix (byte-count vs element-count bounds in CreateInitialFeeds) and the bool UnpackTensor normalization remain correct. All my prior feedback is resolved. LGTM.
Description
WhisperDecoderSubgraph::CreateInitialFeedsconstructed the decoder initial feeds using a single value that mixed a byte count with an element count. The total size was computed ascur_len * batch_beam_size * sizeof(int)(bytes) and then reused as:MakeUniquePtr<int>), andgsl::span<int>source/destination passed to the device copy.Because the
input_idstensor is allocated for exactlybatch_beam_size * cur_lenint32 elements, the spans claimed 4x the real extent, so the device copy ran past the end of the buffer. The per-beammemcpyalso used the same combined value as its length instead of a single sequence's byte size.This mirrors the correct T5 sibling (
subgraph_t5_decoder.cc), which separates the element count (used for the spans/staging allocation) from the per-sequence byte count (used for thememcpy).Changes
subgraph_whisper_decoder.cc:total_sizeis now the element countcur_len * batch_beam_size; introducedsequence_bytes = cur_len * sizeof(int32_t)for the per-beammemcpy. The staging buffer and spans useint32_tconsistently to match theint32_ttensors/sequences.BeamSearchTest.DummyWhisperWithSequenceInputIds(CPU, and CUDA underUSE_CUDA) exercising theuse_sequence_as_input_idspath, with a deterministic dummy model and its generator script. The test validates both thesequencesandscoresoutputs.Related bool-tensor normalization fixes
While exercising the Whisper path, bool tensors copied from raw data could hold non-canonical byte values (anything non-zero rather than strictly
{0, 1}), causing provider-dependent behavior. To keep the fix self-contained, the following normalization changes are included:tensorprotoutils.cc:UnpackTensor<bool>normalizes raw-data bytes to{0, 1}(with astatic_assert(sizeof(bool) == 1)guarding the byte-wise loop).compress_impl.cu(CUDACompress): the prefix-sum sizing predicate normalizes bool bytes to{0, 1}so the output sizing agrees with the element-selection truthiness check. Since bool initializers are now normalized on unpack, the remaining exposure is runtime-produced bool condition tensors.CompressTest.Compress_cuda_non_canonical_bool_condition(underUSE_CUDA), which feeds a raw0xFFcondition byte through a session-level run (OpTesternormalizes bool inputs and so cannot reproduce this) and asserts the Compress output is sized by truthiness rather than by the sign-extended byte value.Motivation
The decoder shares one implementation file across CPU/CUDA/ROCm, so this single change covers all execution providers. The previous behavior could overrun the staging/feed buffers for models that drive the sequence-as-input-ids decoder path.
Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com