Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing by titaiwangms · Pull Request #29264 · microsoft/onnxruntime

titaiwangms · 2026-06-25T17:48:12Z

Summary

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very large output tensors the element count can exceed INT32_MAX, at which point the offset is no longer representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid and hidden_size — there is no data-dependent write indexing — so the maximum index is exactly output_element_count - 1, which the host knows from the input shapes before launch. This PR adds a host-side guard in each op's ComputeInternal that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range.
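To make the host-side reasoning concrete, here is a minimal sketch, not the code in this PR. `MaxWriteIndex` and `FitsInInt32` are hypothetical helper names, and the sketch assumes the output is laid out as `rows` contiguous rows of `hidden_size` elements, as the summary describes.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: largest element index written when the output is
// `rows` contiguous rows of `hidden_size` elements (element count - 1).
// The product is formed in 64-bit, so it cannot wrap for int32-sized inputs.
int64_t MaxWriteIndex(int64_t rows, int64_t hidden_size) {
  assert(rows > 0 && hidden_size > 0);
  return rows * hidden_size - 1;
}

// Hypothetical helper: true when `value` is representable as int32_t.
bool FitsInInt32(int64_t value) {
  return value >= std::numeric_limits<int32_t>::min() &&
         value <= std::numeric_limits<int32_t>::max();
}
```

With 4096 rows (for example batch 8, sequence length 512) and hidden_size 768, the maximum index is 3145727 and fits in 32 bits. With 2097153 rows and hidden_size 1024 it is 2147484671, which does not.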

Design

EmbedLayerNormalization (embed_layer_norm.cc): output_element_count = (int64)batch_size * sequence_length * hidden_size, guarded with ORT_RETURN_IF_NOT(... <= INT32_MAX, ...).
SkipLayerNormalization (skip_layer_norm.cc): output_element_count = input->Shape().Size() (output shares the input shape), same guard.
Kernels are unchanged — they keep the original int32 indexing, so there is no extra register/occupancy cost in the hot path. This is pure host-side validation.
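A standalone sketch of what such a guard could look like is shown here. It is not the PR's code: the real ops use ORT_RETURN_IF_NOT and return an ORT status, while `GuardResult` and `CheckOutputIndexable` are hypothetical names chosen so the example compiles on its own. The sketch also avoids overflowing int64_t while checking, which is an extra precaution not described in the PR.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for a status type; the real ops return an ORT status.
struct GuardResult {
  bool ok;
  std::string message;
};

// Hypothetical helper: accepts a shape only when
// batch_size * sequence_length * hidden_size <= INT32_MAX.
GuardResult CheckOutputIndexable(int64_t batch_size, int64_t sequence_length,
                                 int64_t hidden_size) {
  constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max();
  const std::string too_large =
      "output element count exceeds the supported 32-bit indexing range";

  if (batch_size < 0 || sequence_length < 0 || hidden_size < 0) {
    return {false, "dimensions must be non-negative"};
  }
  if (batch_size == 0 || sequence_length == 0 || hidden_size == 0) {
    return {true, ""};  // empty output, nothing to index
  }
  // All dimensions are >= 1 here, so one oversized dimension is enough to fail.
  if (batch_size > kInt32Max || sequence_length > kInt32Max ||
      hidden_size > kInt32Max) {
    return {false, too_large};
  }
  // Both factors are at most INT32_MAX, so this product fits in int64_t.
  const int64_t rows = batch_size * sequence_length;
  // Equivalent to rows * hidden_size > INT32_MAX, without forming the product.
  if (rows > kInt32Max / hidden_size) {
    return {false, too_large};
  }
  return {true, ""};
}
```

For example, `CheckOutputIndexable(8, 512, 768)` passes, while `CheckOutputIndexable(4096, 512, 1024)` is rejected because the element count is exactly 2³¹, one more than INT32_MAX.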

Behavior

This rejects (rather than silently attempting) single-op LayerNorm outputs with more than INT32_MAX (2³¹ - 1) elements, a regime no real BERT-family model produces (it would require a multi-GB single-op activation). For all supported shapes there is no behavior or numeric change.
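As a rough check on the "multi-GB" figure, the arithmetic can be sketched as follows. The helper names are hypothetical, and the element sizes (2 bytes for fp16, 4 bytes for fp32) are assumptions about the tensor type rather than something this PR specifies.

```cpp
#include <cstdint>

// Smallest element count the guard rejects: INT32_MAX + 1 = 2^31.
constexpr int64_t kFirstRejectedCount = int64_t{1} << 31;

// Hypothetical helper: bytes needed for one output tensor.
constexpr int64_t ActivationBytes(int64_t element_count,
                                  int64_t bytes_per_element) {
  return element_count * bytes_per_element;
}

// Hypothetical helper: converts bytes to GiB (2^30 bytes).
constexpr double ToGiB(int64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions the smallest rejected output is 4 GiB in fp16 and 8 GiB in fp32. Taking sequence_length 512 and hidden_size 1024 as illustrative values, the batch size would have to reach 4096 before the guard triggers.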

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses integer overflow in CUDA BERT LayerNorm-family kernels by widening the global element write offset (row * hidden_size) from 32-bit to 64-bit, preventing wrapped output indexing for very large tensors.

Changes:

Widen LayerNorm device helper offset/index parameters to int64_t in layer_norm.cuh.
Compute per-row offsets/indices in 64-bit in skip_layer_norm_impl.cu kernels.
Compute output_offset in 64-bit in embed_layer_norm_impl.cu and pass it through to LayerNorm.
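The wrapped indexing this overview refers to can be illustrated on the host with a small sketch. It is not the kernel code: `RowOffsetWrapped32` and `RowOffset64` are hypothetical names, and unsigned arithmetic is used to emulate a 32-bit result because signed overflow is undefined behavior in C++.

```cpp
#include <cstdint>

// Hypothetical helper: row * hidden_size truncated to 32 bits, emulating an
// offset that is stored in a 32-bit variable.
uint32_t RowOffsetWrapped32(uint32_t row, uint32_t hidden_size) {
  return static_cast<uint32_t>(static_cast<uint64_t>(row) * hidden_size);
}

// Hypothetical helper: the same offset computed entirely in 64-bit.
int64_t RowOffset64(int64_t row, int64_t hidden_size) {
  return row * hidden_size;
}
```

For row 4194305 and hidden_size 1024 the true offset is 4294968320, but the 32-bit result is 1024, so a write meant for a late row would land near the start of the buffer. For small shapes, such as row 4095 and hidden_size 768, both functions return 3144960.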

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/bert/skip_layer_norm_impl.cu | Uses `int64_t` for per-row `offset`/`idx` to avoid overflow when indexing large output tensors. |
| onnxruntime/contrib_ops/cuda/bert/layer_norm.cuh | Updates LayerNorm helpers to accept 64-bit offsets and use 64-bit indices for global element access. |
| onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu | Uses 64-bit `output_offset` for writing/normalizing large outputs in EmbedLayerNorm. |

tianleiwu · 2026-06-25T20:41:12Z

Is it needed? The typical max sequence length for a BERT model is 512, so an int32 offset is enough.
You could just check in host code, e.g. sequence_length * hidden_size < int32_max, with no need to do it in the CUDA kernel. Using int64 in the CUDA kernel will use more registers.

… output indexing

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very large output tensors the element count can exceed INT32_MAX and the offset would no longer be representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid and hidden_size (no data-dependent write indexing), so the maximum index is exactly output_element_count - 1, which the host knows from the input shapes before launch.

Add a host-side guard in each ComputeInternal that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range, instead of silently relying on the int32 kernels for shapes they cannot index.

Kernels are unchanged (int32 baseline); no numeric behavior change for supported shapes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

titaiwangms requested a review from Copilot June 25, 2026 18:21

Copilot started reviewing on behalf of titaiwangms June 25, 2026 18:22

Copilot AI reviewed Jun 25, 2026

Comment thread onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu Outdated

titaiwangms force-pushed the standalone-int64-bert-layernorm-write-offsets branch from 1379258 to 0b9d5e2 Compare June 25, 2026 22:12

titaiwangms changed the title ~~Use 64-bit element offsets in CUDA BERT LayerNorm/SkipLayerNorm write index~~ Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing Jun 25, 2026

titaiwangms wants to merge 1 commit into microsoft:main from titaiwangms:standalone-int64-bert-layernorm-write-offsets

3 participants
