Skip to content

Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239

Open
jiafatom wants to merge 6 commits into
microsoft:mainfrom
jiafatom:jiafatom/whisper-decoder-feed-bounds
Open

Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239
jiafatom wants to merge 6 commits into
microsoft:mainfrom
jiafatom:jiafatom/whisper-decoder-feed-bounds

Conversation

@jiafatom

@jiafatom jiafatom commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description

WhisperDecoderSubgraph::CreateInitialFeeds constructed the decoder initial feeds using a single value that mixed a byte count with an element count. The total size was computed as cur_len * batch_beam_size * sizeof(int) (bytes) and then reused as:

  • the element count for the int32 staging buffer (MakeUniquePtr<int>), and
  • the element count for the gsl::span<int> source/destination passed to the device copy.

Because the input_ids tensor is allocated for exactly batch_beam_size * cur_len int32 elements, the spans claimed 4x the real extent, so the device copy ran past the end of the buffer. The per-beam memcpy also used the same combined value as its length instead of a single sequence's byte size.

This mirrors the correct T5 sibling (subgraph_t5_decoder.cc), which separates the element count (used for the spans/staging allocation) from the per-sequence byte count (used for the memcpy).

Changes

  • subgraph_whisper_decoder.cc: total_size is now the element count cur_len * batch_beam_size; introduced sequence_bytes = cur_len * sizeof(int32_t) for the per-beam memcpy. The staging buffer and spans use int32_t consistently to match the int32_t tensors/sequences.
  • Added regression test BeamSearchTest.DummyWhisperWithSequenceInputIds (CPU, and CUDA under USE_CUDA) exercising the use_sequence_as_input_ids path, with a deterministic dummy model and its generator script. The test validates both the sequences and scores outputs.

Related bool-tensor normalization fixes

While exercising the Whisper path, bool tensors copied from raw data could hold non-canonical byte values (anything non-zero rather than strictly {0, 1}), causing provider-dependent behavior. To keep the fix self-contained, the following normalization changes are included:

  • tensorprotoutils.cc: UnpackTensor<bool> normalizes raw-data bytes to {0, 1} (with a static_assert(sizeof(bool) == 1) guarding the byte-wise loop).
  • compress_impl.cu (CUDA Compress): the prefix-sum sizing predicate normalizes bool bytes to {0, 1} so the output sizing agrees with the element-selection truthiness check. Since bool initializers are now normalized on unpack, the remaining exposure is runtime-produced bool condition tensors.
  • Added CompressTest.Compress_cuda_non_canonical_bool_condition (under USE_CUDA), which feeds a raw 0xFF condition byte through a session-level run (OpTester normalizes bool inputs and so cannot reproduce this) and asserts the Compress output is sized by truthiness rather than by the sign-extended byte value.

Motivation

The decoder shares one implementation file across CPU/CUDA/ROCm, so this single change covers all execution providers. The previous behavior could overrun the staging/feed buffers for models that drive the sequence-as-input-ids decoder path.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

jiafatom and others added 2 commits June 23, 2026 20:50
Bool initializers provided via TensorProto raw_data are copied verbatim, so
their bytes are not guaranteed to be the canonical {0, 1}. Kernels assume bool
tensors hold {0, 1}, and the CUDA Compress sizing path in particular sign-extends
the condition bytes (int8 -> int32) to size the output while the kernel selects
elements using bool truthiness. For bytes outside {0, 1} the two interpretations
disagree, producing an incorrectly sized output.

Normalize bool raw_data to {0, 1} in UnpackTensor<bool> so every consumer sees a
consistent value, and harden the CUDA Compress CastToInt32 functor to normalize
as well so its sizing path matches its write predicate.

Add a unit test covering bool raw_data with non-canonical bytes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The decoder initial-feed construction conflated a byte count with an element
count when sizing the staging buffer and the device-copy spans, causing the
copy to run past the input_ids tensor. Separate the element count (used for
the spans and staging allocation) from the per-sequence byte count (used for
the per-beam memcpy), mirroring the T5 decoder sibling.

Add a regression test (DummyWhisperWithSequenceInputIds) plus a model
generator that exercises the use_sequence_as_input_ids path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread onnxruntime/test/testdata/dummy_whisper_model_generator.py Fixed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom requested a review from Copilot June 24, 2026 03:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Fixes an out-of-bounds write in Whisper decoder initial feed construction when use_sequence_as_input_ids is enabled, and adds regression coverage to exercise that path.

Changes:

  • Corrected element-count vs byte-count math in WhisperDecoderSubgraph::CreateInitialFeeds and fixed per-beam memcpy sizing.
  • Added a deterministic dummy Whisper BeamSearch ONNX model generator plus a new BeamSearch regression test.
  • Normalized non-canonical bool raw_data values to {0,1} in tensor unpacking and aligned CUDA Compress sizing with bool “truthiness”.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
onnxruntime/contrib_ops/cpu/transformers/subgraph_whisper_decoder.cc Fixes initial feed buffer sizing/copy logic for sequence-as-input-ids path.
onnxruntime/test/contrib_ops/beam_search_test.cc Adds regression test exercising Whisper sequence-as-input-ids decoder path (CPU/CUDA).
onnxruntime/test/testdata/dummy_whisper_model_generator.py Adds script to generate the dummy Whisper BeamSearch model used by the new test.
onnxruntime/core/framework/tensorprotoutils.cc Normalizes bool tensor raw_data to canonical {0,1} during unpack.
onnxruntime/test/framework/tensorutils_test.cc Adds unit test covering bool raw_data normalization behavior.
onnxruntime/core/providers/cuda/tensor/compress_impl.cu Normalizes condition bytes for prefix-sum sizing to match selection truthiness.

Comment thread onnxruntime/contrib_ops/cpu/transformers/subgraph_whisper_decoder.cc Outdated
Comment thread onnxruntime/contrib_ops/cpu/transformers/subgraph_whisper_decoder.cc Outdated
Comment thread onnxruntime/core/framework/tensorprotoutils.cc
Comment thread onnxruntime/test/testdata/dummy_whisper_model_generator.py Outdated
Comment thread onnxruntime/test/contrib_ops/beam_search_test.cc
Comment thread onnxruntime/core/providers/cuda/tensor/compress_impl.cu
jiafatom and others added 2 commits June 24, 2026 14:12
- subgraph_whisper_decoder.cc: use int32_t consistently for staging buffer/spans
- tensorprotoutils.cc: add static_assert(sizeof(bool) == 1) for bool normalization loop
- dummy_whisper_model_generator.py: add --no-run flag and lazy onnxruntime import
- beam_search_test.cc: also validate scores output in DummyWhisperWithSequenceInputIds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

The core fix is correct. CreateInitialFeeds previously conflated a byte count with an element count: total_size = cur_len * batch_beam_size * sizeof(int) was reused both as the MakeUniquePtr<int> element count and as the gsl::span extent, so the staging buffer and the device-copy spans claimed 4x the real batch_beam_size * cur_len extent, overrunning the input_ids allocation. Splitting into total_size (element count) and sequence_bytes (per-beam byte count) resolves the overrun, and switching int->int32_t matches the tensor/sequence element type. The new memcpy length (sequence_bytes = cur_len * sizeof(int32_t)) is bounded by Sequences::GetSequence(i) (length current_length_), and the per-beam write at seq_copy_ptr + i*cur_len stays within total_size elements. Both multiplications cast to size_t before multiplying, so no intermediate overflow.

Note: this is actually slightly tighter than the T5 sibling it mirrors — subgraph_t5_decoder.cc still passes total_size_bytes (bytes) as the element count to MakeUniquePtr<int>, over-allocating ~4x (wasteful but safe). Out of scope here, but worth aligning T5 in a follow-up.

Independent changes bundled in: the bool-normalization in tensorprotoutils.cc (UnpackTensor<bool> raw_data path) and compress_impl.cu (CastToInt32) are logically separate from the Whisper bounds fix. The PR description now documents them, which is good; just calling out that a reviewer bisecting a future regression in CUDA Compress or bool initializer loading would not expect those in a "Whisper decoder feed bounds" PR.

Tests: the bounds fix gets a dedicated regression model/test, and the bool unpack path gets UnpackBoolTensorWithRawDataNormalizesToZeroOne. The CUDA Compress normalization has no direct coverage (see inline). Verdict: looks good, only minor suggestions.

Most existing inline comments on this PR are from the automated reviewer against an older commit (097246e) and appear already addressed on the current head (e.g. int->int32_t, the static_assert, the scores output, the --no-run flag).

Comment thread onnxruntime/core/providers/cuda/tensor/compress_impl.cu
…ess regression test

- compress_impl.cu: note that runtime-produced bool tensors are the
  remaining source of non-canonical condition bytes (initializers are
  normalized on unpack).
- Add CUDA Compress test feeding a raw 0xFF condition byte (which
  OpTester cannot produce) to lock in that output sizing agrees with
  truthiness-based element selection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at b7a842a. The only change since my last review (7f2c2a9) is the follow-up that addresses my prior feedback on the CUDA Compress normalization:

  • compress_impl.cu: added a comment clarifying that bool initializers are normalized on unpack, so the remaining source of non-canonical bytes is runtime-produced bool condition tensors — this is no longer dead-looking defensive code.
  • compress_op.test.cc: new Compress_cuda_non_canonical_bool_condition regression test. It builds the graph/session directly and writes a raw 0xFF condition byte (which OpTester can't produce because it normalizes bool inputs), then asserts the output is sized by truthiness. Verified the expected result: condition {0x00, 0xFF, 0x01} over input rows [[1,2],[3,4],[5,6]] selects rows 1 and 2 → shape {2,2}, values {3,4,5,6}. Without the fix, 0xFF (int8_t -1) would corrupt the prefix-sum sizing rather than counting as one selected element.

The earlier core fix (byte-count vs element-count bounds in CreateInitialFeeds) and the bool UnpackTensor normalization remain correct. All my prior feedback is resolved. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants