[https://nvbugs/6242591][fix] Fix bugs in Beam Search kernels#15621

Draft
wili-65535 wants to merge 1 commit into NVIDIA:main from wili-65535:wili/fix-beam-search

@wili-65535 wili-65535 commented Jun 25, 2026

Collaborator

Description

Fix bugs in the beam search V2 path with beam_width >= 9 and beam_search_diversity_rate > 0.

Root cause:

  1. addCumLogProbs compares a candidate-slot index i (range 0..nBMIn*nBMOut*2-1) against a vocabulary token ID (endIds[slot]) for the EOS forcing condition on finished beams. The comparison is always false, so those candidates receive inflated scores, rank into the top-nBM every step, and continuously increment numBeamsCBA.
  2. A separate overflow guard uses == nBM instead of >= nBM, so once numBeamsCBA exceeds nBM the guard never triggers, leading to unbounded out-of-bounds writes into the CBA output buffer and eventual SIGSEGV.
  3. EOS candidate length-penalty scoring uses the post-sort candidate rank i as the parent beam index instead of (topId / nV) % nBM, producing wrong scores in the CBA.
  4. When all top candidates for a step are EOS tokens (nBeamForNextStep < nBMOut), the state-update block reads uninitialized outputIdsPtr slots, corrupting parentIdsPtr/outputIdsPtr/sequenceLengths and the traceback chain.
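
Root causes 1 and 3 are both index mix-ups, which a small host-side sketch can make concrete. This is not the kernel code: the helper names are hypothetical, and it assumes that `stage1Ids[i]` holds the vocabulary token ID of candidate slot `i` and that `topId % nV` yields the token; the real layout of `pStage1Ids` may differ.

```cpp
#include <cassert>

// Buggy pattern (root cause 1): compares a candidate-slot index against a
// vocabulary token ID. With a small slot range and a large EOS ID the
// comparison never matches.
inline bool isEosCandidateBuggy(int slotIndex, int endId)
{
    return slotIndex == endId;
}

// Sketch of the fix: look up the candidate's token first, then compare.
// ASSUMPTION: stage1Ids[slotIndex] is the token ID of that candidate.
inline bool isEosCandidate(int slotIndex, int const* stage1Ids, int endId)
{
    return stage1Ids[slotIndex] == endId;
}

// Root cause 3: the parent beam comes from the flattened candidate ID,
// not from the post-sort rank. The formula is the one quoted in the
// description: (topId / nV) % nBM.
inline int parentBeamOf(int topId, int nV, int nBM)
{
    return (topId / nV) % nBM;
}

// ASSUMPTION: the token is the remainder of the same flattening.
inline int tokenOf(int topId, int nV)
{
    return topId % nV;
}
```

For example, with `nV = 100` and `nBM = 4`, a candidate with `topId = 705` belongs to parent beam 3 regardless of the rank it ends up at after sorting.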

Changed Files

| File | Change |
| --- | --- |
| cpp/tensorrt_llm/kernels/beamSearchKernels.h | Add pStage1Ids parameter to both addCumLogProbs overloads |
| cpp/tensorrt_llm/kernels/beamSearchKernels.cu | Use pStage1Ids for EOS token lookup on finished beams |
| cpp/tensorrt_llm/kernels/beamSearchKernels/beamSearchKernelsTemplate.h | Change three == nBM into >= nBM; Use parentBeam in EOS scoring |
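
To illustrate why the `== nBM` to `>= nBM` change matters, here is a minimal host-side sketch. `CbaSketch` and the two guard helpers are hypothetical stand-ins, not types from the kernels; the sketch only assumes a fixed-capacity output buffer with `nBM` slots and a counter that may already have overshot.

```cpp
#include <cassert>
#include <vector>

// An equality guard only fires at exactly nBM; once the counter has
// overshot, it stays silent and writes continue past the buffer.
inline bool guardEq(int numBeams, int nBM) { return numBeams == nBM; }

// A >= guard also fires for any overshoot.
inline bool guardGe(int numBeams, int nBM) { return numBeams >= nBM; }

// Hypothetical fixed-capacity buffer standing in for the CBA output.
struct CbaSketch
{
    int nBM;
    int numBeams = 0;
    std::vector<float> scores;

    explicit CbaSketch(int capacity)
        : nBM(capacity)
        , scores(capacity, 0.f)
    {
    }

    // Appends a score unless the buffer is full; returns whether it wrote.
    bool tryAppend(float score)
    {
        if (guardGe(numBeams, nBM))
        {
            return false; // full or overshot: refuse the write
        }
        scores[numBeams++] = score;
        return true;
    }
};
```

With `nBM = 9`, `guardEq(10, 9)` is false while `guardGe(10, 9)` is true, which is the difference between writing out of bounds and stopping.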

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@wili-65535

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55813 [ run ] triggered by Bot. Commit: 6ae77a8 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55813 [ run ] completed with state SUCCESS. Commit: 6ae77a8
/LLM/main/L0_MergeRequest_PR pipeline #44705 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@wili-65535 wili-65535 force-pushed the wili/fix-beam-search branch from 6ae77a8 to 73131cd Compare June 26, 2026 00:51
Signed-off-by: wili-65535 <12345678+wili@users.noreply.github.com>
@wili-65535 wili-65535 force-pushed the wili/fix-beam-search branch from 73131cd to d1af560 Compare June 26, 2026 00:59
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 26, 2026
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 26, 2026
@NVIDIA NVIDIA deleted a comment from github-actions Bot Jun 26, 2026
@wili-65535

Collaborator Author

/bot run --disable-fail-fast
