Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing #29264

Open
titaiwangms wants to merge 1 commit into microsoft:main from titaiwangms:standalone-int64-bert-layernorm-write-offsets

Conversation

@titaiwangms titaiwangms commented Jun 25, 2026

Contributor

Summary

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very large output tensors the element count can exceed INT32_MAX, at which point the offset is no longer representable in 32 bits.
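To make the failure mode concrete, here is a minimal host-side sketch of the same multiplication at two widths. The helper names `RowOffset32` and `RowOffset64` are hypothetical and do not appear in the PR. The 32-bit version uses unsigned arithmetic only so that the wraparound is well defined in C++; it is an illustration, not the kernel code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: row offset computed entirely in 32 bits.
// Unsigned operands are used so the wraparound (modulo 2^32) is well defined;
// signed 32-bit overflow would be undefined behavior in C++.
inline uint32_t RowOffset32(uint32_t row_index, uint32_t hidden_size) {
  return row_index * hidden_size;
}

// Hypothetical helper: the same offset with one operand widened to 64 bits
// before the multiply, so the full product is preserved.
inline int64_t RowOffset64(int32_t row_index, int32_t hidden_size) {
  return static_cast<int64_t>(row_index) * hidden_size;
}
```

With `row_index = 4194304` and `hidden_size = 1024` the true offset is 2³², which the 32-bit helper wraps to 0 while the 64-bit helper returns 4294967296.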

Every output write index in these kernels is a pure function of the launch grid and hidden_size — there is no data-dependent write indexing — so the maximum index is exactly output_element_count - 1, which the host knows from the input shapes before launch. This PR adds a host-side guard in each op's ComputeInternal that computes the output element count in 64-bit arithmetic and returns a clear error when it exceeds the supported 32-bit indexing range.

Design

  • EmbedLayerNormalization (embed_layer_norm.cc): output_element_count = (int64)batch_size * sequence_length * hidden_size, guarded with ORT_RETURN_IF_NOT(... <= INT32_MAX, ...).
  • SkipLayerNormalization (skip_layer_norm.cc): output_element_count = input->Shape().Size() (output shares the input shape), same guard.
  • Kernels are unchanged — they keep the original int32 indexing, so there is no extra register/occupancy cost in the hot path. This is pure host-side validation.
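The guard described above can be sketched in standalone C++ roughly as follows. This is an illustration under assumptions, not the code in the PR: `CheckOutputIndexable` is a hypothetical name, it returns a `std::string` instead of an ONNX Runtime status, and it assumes the dimensions are non-negative and that their product fits in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical stand-in for the host-side check (not the ONNX Runtime API).
// Returns an empty string when the output can be indexed with int32,
// otherwise a human-readable error message.
inline std::string CheckOutputIndexable(int64_t batch_size,
                                        int64_t sequence_length,
                                        int64_t hidden_size) {
  // All operands are already 64-bit, so the product is not truncated.
  const int64_t output_element_count =
      batch_size * sequence_length * hidden_size;
  if (output_element_count > std::numeric_limits<int32_t>::max()) {
    return "output element count " + std::to_string(output_element_count) +
           " exceeds the supported 32-bit indexing range";
  }
  return {};
}
```

Because the check runs once per op invocation on the host, it adds nothing to the kernel's register footprint, which matches the motivation given for leaving the kernels untouched.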

Behavior

This rejects, rather than silently attempting, single-op LayerNorm outputs of 2³¹ or more elements (that is, more than INT32_MAX), a regime no real BERT-family model produces, since it would require a multi-GB single-op activation. For all supported shapes there is no behavior or numeric change.
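As a rough sanity check on the "multi-GB" figure, the arithmetic can be written out as below. The helper `ActivationBytes` and the example shape (batch 32, sequence length 512, hidden size 1024) are assumptions chosen for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: size in bytes of a dense activation tensor.
constexpr int64_t ActivationBytes(int64_t element_count,
                                  int64_t bytes_per_element) {
  return element_count * bytes_per_element;
}

// Smallest rejected element count: INT32_MAX + 1 = 2^31.
constexpr int64_t kFirstRejectedCount = int64_t{1} << 31;

// Assumed example shape: batch 32, sequence length 512, hidden size 1024.
constexpr int64_t kExampleCount = int64_t{32} * 512 * 1024;
```

Under these assumptions the smallest rejected output occupies 4 GiB in fp16 (2 bytes per element) and 8 GiB in fp32, while the example shape has 2²⁴ elements, a factor of 128 below the limit.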

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses integer overflow in CUDA BERT LayerNorm-family kernels by widening the global element write offset (row * hidden_size) from 32-bit to 64-bit, preventing wrapped output indexing for very large tensors.

Changes:

  • Widen LayerNorm device helper offset/index parameters to int64_t in layer_norm.cuh.
  • Compute per-row offsets/indices in 64-bit in skip_layer_norm_impl.cu kernels.
  • Compute output_offset in 64-bit in embed_layer_norm_impl.cu and pass it through to LayerNorm.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/bert/skip_layer_norm_impl.cu | Uses int64_t for per-row offset/idx to avoid overflow when indexing large output tensors. |
| onnxruntime/contrib_ops/cuda/bert/layer_norm.cuh | Updates LayerNorm helpers to accept 64-bit offsets and use 64-bit indices for global element access. |
| onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu | Uses 64-bit output_offset for writing/normalizing large outputs in EmbedLayerNorm. |

Comment thread onnxruntime/contrib_ops/cuda/bert/embed_layer_norm_impl.cu Outdated
tianleiwu commented Jun 25, 2026

Contributor

Is it needed? The typical max sequence length for a BERT model is 512, so an int32 offset is enough.
You could just check in host code, e.g. sequence_length * hidden_size < int32_max, with no need to do it in the CUDA kernel. Using int64 in the CUDA kernel will use more registers.

… output indexing

The CUDA EmbedLayerNormalization and SkipLayerNormalization kernels compute
output write offsets (row_index * hidden_size) using 32-bit arithmetic. For very
large output tensors the element count can exceed INT32_MAX and the offset would
no longer be representable in 32 bits.

Every output write index in these kernels is a pure function of the launch grid
and hidden_size (no data-dependent write indexing), so the maximum index is
exactly output_element_count - 1, which the host knows from the input shapes
before launch. Add a host-side guard in each ComputeInternal that computes the
output element count in 64-bit arithmetic and returns a clear error when it
exceeds the supported 32-bit indexing range, instead of silently relying on the
int32 kernels for shapes they cannot index.

Kernels are unchanged (int32 baseline); no numeric behavior change for supported
shapes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the standalone-int64-bert-layernorm-write-offsets branch from 1379258 to 0b9d5e2 on June 25, 2026 22:12
@titaiwangms titaiwangms changed the title from "Use 64-bit element offsets in CUDA BERT LayerNorm/SkipLayerNorm write index" to "Reject CUDA BERT EmbedLayerNorm/SkipLayerNorm shapes exceeding 32-bit output indexing" on Jun 25, 2026