Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds by jiafatom · Pull Request #29239 · microsoft/onnxruntime

jiafatom · 2026-06-23T21:18:32Z

Description

WhisperDecoderSubgraph::CreateInitialFeeds constructed the decoder initial feeds using a single value that mixed a byte count with an element count. The total size was computed as cur_len * batch_beam_size * sizeof(int) (bytes) and then reused as:

the element count for the int32 staging buffer (MakeUniquePtr<int>), and
the element count for the gsl::span<int> source/destination passed to the device copy.

Because the input_ids tensor is allocated for exactly batch_beam_size * cur_len int32 elements, the spans claimed 4x the real extent, so the device copy ran past the end of the buffer. The per-beam memcpy also used the same combined value as its length instead of a single sequence's byte size.

This mirrors the correct T5 sibling (subgraph_t5_decoder.cc), which separates the element count (used for the spans/staging allocation) from the per-sequence byte count (used for the memcpy).

Changes

subgraph_whisper_decoder.cc: total_size is now the element count cur_len * batch_beam_size; introduced sequence_bytes = cur_len * sizeof(int32_t) for the per-beam memcpy. The staging buffer and spans use int32_t consistently to match the int32_t tensors/sequences.
Added regression test BeamSearchTest.DummyWhisperWithSequenceInputIds (CPU, and CUDA under USE_CUDA) exercising the use_sequence_as_input_ids path, with a deterministic dummy model and its generator script. The test validates both the sequences and scores outputs.

Related bool-tensor normalization fixes

While exercising the Whisper path, bool tensors copied from raw data could hold non-canonical byte values (anything non-zero rather than strictly {0, 1}), causing provider-dependent behavior. To keep the fix self-contained, the following normalization changes are included:

tensorprotoutils.cc: UnpackTensor<bool> normalizes raw-data bytes to {0, 1} (with a static_assert(sizeof(bool) == 1) guarding the byte-wise loop).
compress_impl.cu (CUDA Compress): the prefix-sum sizing predicate normalizes bool bytes to {0, 1} so the output sizing agrees with the element-selection truthiness check. Since bool initializers are now normalized on unpack, the remaining exposure is runtime-produced bool condition tensors.
Added CompressTest.Compress_cuda_non_canonical_bool_condition (under USE_CUDA), which feeds a raw 0xFF condition byte through a session-level run (OpTester normalizes bool inputs and so cannot reproduce this) and asserts the Compress output is sized by truthiness rather than by the sign-extended byte value.

Motivation

The decoder shares one implementation file across CPU/CUDA/ROCm, so this single change covers all execution providers. The previous behavior could overrun the staging/feed buffers for models that drive the sequence-as-input-ids decoder path.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Bool initializers provided via TensorProto raw_data are copied verbatim, so their bytes are not guaranteed to be the canonical {0, 1}. Kernels assume bool tensors hold {0, 1}, and the CUDA Compress sizing path in particular sign-extends the condition bytes (int8 -> int32) to size the output while the kernel selects elements using bool truthiness. For bytes outside {0, 1} the two interpretations disagree, producing an incorrectly sized output. Normalize bool raw_data to {0, 1} in UnpackTensor<bool> so every consumer sees a consistent value, and harden the CUDA Compress CastToInt32 functor to normalize as well so its sizing path matches its write predicate. Add a unit test covering bool raw_data with non-canonical bytes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The decoder initial-feed construction conflated a byte count with an element count when sizing the staging buffer and the device-copy spans, causing the copy to run past the input_ids tensor. Separate the element count (used for the spans and staging allocation) from the per-sequence byte count (used for the per-beam memcpy), mirroring the T5 decoder sibling. Add a regression test (DummyWhisperWithSequenceInputIds) plus a model generator that exercises the use_sequence_as_input_ids path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Fixes an out-of-bounds write in Whisper decoder initial feed construction when use_sequence_as_input_ids is enabled, and adds regression coverage to exercise that path.

Changes:

Corrected element-count vs byte-count math in WhisperDecoderSubgraph::CreateInitialFeeds and fixed per-beam memcpy sizing.
Added a deterministic dummy Whisper BeamSearch ONNX model generator plus a new BeamSearch regression test.
Normalized non-canonical bool raw_data values to {0,1} in tensor unpacking and aligned CUDA Compress sizing with bool “truthiness”.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
onnxruntime/contrib_ops/cpu/transformers/subgraph_whisper_decoder.cc	Fixes initial feed buffer sizing/copy logic for sequence-as-input-ids path.
onnxruntime/test/contrib_ops/beam_search_test.cc	Adds regression test exercising Whisper sequence-as-input-ids decoder path (CPU/CUDA).
onnxruntime/test/testdata/dummy_whisper_model_generator.py	Adds script to generate the dummy Whisper BeamSearch model used by the new test.
onnxruntime/core/framework/tensorprotoutils.cc	Normalizes bool tensor `raw_data` to canonical `{0,1}` during unpack.
onnxruntime/test/framework/tensorutils_test.cc	Adds unit test covering bool `raw_data` normalization behavior.
onnxruntime/core/providers/cuda/tensor/compress_impl.cu	Normalizes condition bytes for prefix-sum sizing to match selection truthiness.

- subgraph_whisper_decoder.cc: use int32_t consistently for staging buffer/spans - tensorprotoutils.cc: add static_assert(sizeof(bool) == 1) for bool normalization loop - dummy_whisper_model_generator.py: add --no-run flag and lazy onnxruntime import - beam_search_test.cc: also validate scores output in DummyWhisperWithSequenceInputIds Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tianleiwu

Review summary

The core fix is correct. CreateInitialFeeds previously conflated a byte count with an element count: total_size = cur_len * batch_beam_size * sizeof(int) was reused both as the MakeUniquePtr<int> element count and as the gsl::span extent, so the staging buffer and the device-copy spans claimed 4x the real batch_beam_size * cur_len extent, overrunning the input_ids allocation. Splitting into total_size (element count) and sequence_bytes (per-beam byte count) resolves the overrun, and switching int->int32_t matches the tensor/sequence element type. The new memcpy length (sequence_bytes = cur_len * sizeof(int32_t)) is bounded by Sequences::GetSequence(i) (length current_length_), and the per-beam write at seq_copy_ptr + i*cur_len stays within total_size elements. Both multiplications cast to size_t before multiplying, so no intermediate overflow.

Note: this is actually slightly tighter than the T5 sibling it mirrors — subgraph_t5_decoder.cc still passes total_size_bytes (bytes) as the element count to MakeUniquePtr<int>, over-allocating ~4x (wasteful but safe). Out of scope here, but worth aligning T5 in a follow-up.

Independent changes bundled in: the bool-normalization in tensorprotoutils.cc (UnpackTensor<bool> raw_data path) and compress_impl.cu (CastToInt32) are logically separate from the Whisper bounds fix. The PR description now documents them, which is good; just calling out that a reviewer bisecting a future regression in CUDA Compress or bool initializer loading would not expect those in a "Whisper decoder feed bounds" PR.

Tests: the bounds fix gets a dedicated regression model/test, and the bool unpack path gets UnpackBoolTensorWithRawDataNormalizesToZeroOne. The CUDA Compress normalization has no direct coverage (see inline). Verdict: looks good, only minor suggestions.

Most existing inline comments on this PR are from the automated reviewer against an older commit (097246e) and appear already addressed on the current head (e.g. int->int32_t, the static_assert, the scores output, the --no-run flag).

…ess regression test - compress_impl.cu: note that runtime-produced bool tensors are the remaining source of non-canonical condition bytes (initializers are normalized on unpack). - Add CUDA Compress test feeding a raw 0xFF condition byte (which OpTester cannot produce) to lock in that output sizing agrees with truthiness-based element selection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tianleiwu

Re-reviewed at b7a842a. The only change since my last review (7f2c2a9) is the follow-up that addresses my prior feedback on the CUDA Compress normalization:

compress_impl.cu: added a comment clarifying that bool initializers are normalized on unpack, so the remaining source of non-canonical bytes is runtime-produced bool condition tensors — this is no longer dead-looking defensive code.
compress_op.test.cc: new Compress_cuda_non_canonical_bool_condition regression test. It builds the graph/session directly and writes a raw 0xFF condition byte (which OpTester can't produce because it normalizes bool inputs), then asserts the output is sized by truthiness. Verified the expected result: condition {0x00, 0xFF, 0x01} over input rows [[1,2],[3,4],[5,6]] selects rows 1 and 2 → shape {2,2}, values {3,4,5,6}. Without the fix, 0xFF (int8_t -1) would corrupt the prefix-sum sizing rather than counting as one selected element.

The earlier core fix (byte-count vs element-count bounds in CreateInitialFeeds) and the bool UnpackTensor normalization remain correct. All my prior feedback is resolved. LGTM.

jiafatom and others added 2 commits June 23, 2026 20:50

github-advanced-security AI found potential problems Jun 23, 2026

View reviewed changes

Comment thread onnxruntime/test/testdata/dummy_whisper_model_generator.py Fixed

Apply ruff format to dummy_whisper_model_generator.py

097246e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom requested a review from Copilot June 24, 2026 03:59

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Copilot started reviewing on behalf of jiafatom June 24, 2026 04:45 View session

jiafatom and others added 2 commits June 24, 2026 14:12

Fix clang-format alignment in beam_search_test scores output

7f2c2a9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tianleiwu reviewed Jun 25, 2026

View reviewed changes

Comment thread onnxruntime/core/providers/cuda/tensor/compress_impl.cu

tianleiwu approved these changes Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239

Fix bounds in WhisperDecoderSubgraph::CreateInitialFeeds initial feeds#29239
jiafatom wants to merge 6 commits into
microsoft:mainfrom
jiafatom:jiafatom/whisper-decoder-feed-bounds

jiafatom commented Jun 23, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tianleiwu left a comment

Uh oh!

Uh oh!

tianleiwu left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

jiafatom commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Related bool-tensor normalization fixes

Motivation

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Review summary

Uh oh!

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jiafatom commented Jun 23, 2026 •

edited

Loading