[TRTLLM-12982][chore] relocate torch_multi_arange #15416

Merged
ixlmar merged 4 commits into NVIDIA:main from ixlmar:chore/move-torch-multi-arange
Jun 25, 2026

Conversation

@ixlmar

@ixlmar ixlmar commented Jun 16, 2026

Collaborator

Description

Follow-up on #14693 (comment).

Commit 800c7ee is from #15413, which is to be merged before this PR.

Test Coverage

Covered by existing tests

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • Improvements

    • Encoder CUDA graphs now properly detect multi-item scoring scenarios and fall back to eager execution when necessary.
  • Refactoring

    • Optimized attention metadata to accept multi-item configuration during preparation phase instead of forward pass.
    • Reorganized utility functions for improved code maintainability.
  • Chores

    • Updated test infrastructure and file organization.

@ixlmar ixlmar requested a review from Funatiq June 16, 2026 12:35
@ixlmar

ixlmar commented Jun 16, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #54587 [ run ] triggered by Bot. Commit: faad7dc Link to invocation

@brb-nv brb-nv left a comment

Collaborator

LGTM.

@ixlmar ixlmar marked this pull request as ready for review June 16, 2026 17:55
@ixlmar ixlmar requested review from a team as code owners June 16, 2026 17:55
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor

📝 Walkthrough

torch_multi_arange (with _AcceptSyncCompute/ACCEPT_SYNC_COMPUTE) is moved from sampling_utils.py to utils.py and all import sites are updated. Separately, multi_item_part_lens is removed from AttentionForwardArgs and from all attention forward signatures, and instead passed as a keyword argument to AttentionMetadata.prepare(), with FlashInfer caching the computed FlashInferMultiItemParams in _multi_item_params for use during plan().
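For readers unfamiliar with the relocated helper, the following is a minimal pure-Python sketch of what a "multi-arange" computes: several ranges concatenated into one flat sequence. The name `multi_arange_reference`, its argument order, and its defaults are assumptions made for illustration; the real `torch_multi_arange` operates on tensors and its signature may differ.

```python
from typing import List, Optional, Sequence


def multi_arange_reference(
    ends: Sequence[int],
    starts: Optional[Sequence[int]] = None,
    steps: Optional[Sequence[int]] = None,
) -> List[int]:
    """Concatenate range(start, end, step) per entry (hypothetical reference)."""
    n = len(ends)
    # Assumed defaults: every range starts at 0 and advances by 1.
    starts = [0] * n if starts is None else list(starts)
    steps = [1] * n if steps is None else list(steps)
    if not (len(starts) == len(steps) == n):
        raise ValueError("starts, ends, and steps must have the same length")
    if any(step == 0 for step in steps):
        raise ValueError("steps must be non-zero")

    out: List[int] = []
    for start, end, step in zip(starts, ends, steps):
        # An empty range (for example start == end) contributes no elements.
        out.extend(range(start, end, step))
    return out


print(multi_arange_reference([3, 2]))
print(multi_arange_reference([5, 5, 7], starts=[3, 5, 4]))
```

The second call includes an empty range in the middle (`start == end == 5`), the case the review below singles out for the vectorized implementation. A loop-based reference like this one is mainly useful as a test oracle for a tensor-based version.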

Changes

torch_multi_arange relocation and multi_item_part_lens prepare() refactor

Layer: torch_multi_arange relocated to utils.py
Files: tensorrt_llm/_torch/utils.py, tensorrt_llm/_torch/pyexecutor/sampling_utils.py, tensorrt_llm/_torch/pyexecutor/sampler.py, tests/unittest/_torch/test_torch_multi_arange.py, tests/integration/test_lists/test-db/l0_a10.yml, .pre-commit-config.yaml, legacy-files.txt, pyproject.toml, ruff-legacy.toml
Summary: _AcceptSyncCompute, ACCEPT_SYNC_COMPUTE, and torch_multi_arange are added to utils.py and deleted from sampling_utils.py; sampler.py import is redirected; test, test-list, and lint config entries are updated to the new path.

Layer: AttentionMetadata.prepare() and AttentionForwardArgs contract
Files: tensorrt_llm/_torch/attention_backend/interface.py
Summary: AttentionMetadata.prepare() gains a keyword-only multi_item_part_lens parameter; AttentionForwardArgs drops its multi_item_part_lens field, removing multi-item layout from per-forward args.

Layer: Backend prepare() enforce/reject multi_item_part_lens
Files: tensorrt_llm/_torch/attention_backend/vanilla.py, tensorrt_llm/_torch/attention_backend/star_flashinfer.py, tensorrt_llm/_torch/attention_backend/trtllm.py
Summary: VanillaAttentionMetadata, StarAttentionMetadata, TrtllmAttentionMetadata, and prepare_encoder_only each add the keyword-only multi_item_part_lens parameter and raise ValueError when non-None; per-forward ValueError checks are removed.

Layer: FlashInfer metadata caches multi_item_params at prepare() time
Files: tensorrt_llm/_torch/attention_backend/flashinfer.py
Summary: FlashInferAttentionMetadata gains _multi_item_params field and _process_multi_item_part_lens() instance method; prepare() computes and stores multi-item tensors; plan() passes _multi_item_params into PlanParams; forward_impl/forward() have multi_item_part_lens removed; metadata.plan() is wrapped in nvtx_range.

Layer: Attention module removes multi_item_part_lens from forward path
Files: tensorrt_llm/_torch/modules/attention.py
Summary: _attn_impl, forward_impl, and forward drop multi_item_part_lens parameters; AttentionForwardArgs construction no longer includes it; the RoPE position_ids rewrite block for multi-item scoring is deleted.

Layer: Executor and LLM API wire multi_item_part_lens into prepare()
Files: tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/llmapi/llm.py
Summary: EncoderCUDAGraphRunner falls back to eager when multi_item_part_lens is present; model_engine._prepare_encoder_inputs reads and passes multi_item_part_lens to prepare_encoder_only()/prepare(), asserting None on CUDA-graph replay; llm.py encode() gains @torch.inference_mode() and computes CUDA position_ids via torch_multi_arange for multi-item scoring.
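The contract described in the summaries above (a keyword-only `multi_item_part_lens` on `prepare()`, rejected by backends without multi-item support and cached by the one that has it) could be sketched roughly as follows. The class names, the `cached_params` attribute, and the dictionary layout are hypothetical stand-ins, not the TensorRT-LLM classes; the interpretation of each entry as a prefix length followed by item lengths is taken from the review comments further down.

```python
from typing import Dict, List, Optional

PartLens = List[List[int]]  # per request: [prefix_len, item_len_0, item_len_1, ...]


class SketchRejectingMetadata:
    """Hypothetical backend without multi-item scoring support."""

    def prepare(self, *, multi_item_part_lens: Optional[PartLens] = None) -> None:
        # Reject once at prepare() time instead of on every forward call.
        if multi_item_part_lens is not None:
            raise ValueError("multi_item_part_lens is not supported by this backend")


class SketchCachingMetadata:
    """Hypothetical backend that derives and caches multi-item parameters."""

    def __init__(self) -> None:
        self.cached_params: Optional[Dict[str, List[int]]] = None

    def prepare(self, *, multi_item_part_lens: Optional[PartLens] = None) -> None:
        if multi_item_part_lens is None:
            self.cached_params = None
            return
        # Assumed derivation: first entry is the prefix, the rest are item lengths.
        self.cached_params = {
            "prefix_lens": [lens[0] for lens in multi_item_part_lens],
            "max_item_lens": [max(lens[1:]) for lens in multi_item_part_lens],
        }

    def plan(self) -> Dict[str, Optional[Dict[str, List[int]]]]:
        # plan() only reads what prepare() cached; forward() needs no extra argument.
        return {"multi_item_params": self.cached_params}


meta = SketchCachingMetadata()
meta.prepare(multi_item_part_lens=[[4, 2, 3], [1, 5]])
print(meta.plan())
```

Making the parameter keyword-only keeps existing positional `prepare()` call sites unaffected, and computing the derived values once per `prepare()` avoids repeating the work in each attention layer's forward pass.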

Sequence Diagram(s)

sequenceDiagram
    participant encode as llm.encode()
    participant model_engine as _prepare_encoder_inputs
    participant cuda_runner as EncoderCUDAGraphRunner
    participant metadata as FlashInferAttentionMetadata
    participant plan as FlashInferAttentionMetadata.plan()

    encode->>encode: compute position_ids via torch_multi_arange
    encode->>model_engine: inputs (multi_item_part_lens, position_ids)
    model_engine->>cuda_runner: maybe_get_cuda_graph(inputs)
    cuda_runner-->>model_engine: (None, None) — fallback to eager
    model_engine->>metadata: prepare(multi_item_part_lens=...)
    metadata->>metadata: _process_multi_item_part_lens() → _multi_item_params
    model_engine->>plan: plan(...)
    plan->>plan: PlanParams(multi_item_params=_multi_item_params)
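The eager fallback shown in the diagram can be illustrated with a small dispatcher. This is only a sketch under stated assumptions: `maybe_get_graph`, `run_encoder`, the dictionary-based inputs, and the lookup of captured graphs by input length are all hypothetical and do not reflect the actual `EncoderCUDAGraphRunner` interface.

```python
from typing import Any, Callable, Dict, Optional

Inputs = Dict[str, Any]
Runner = Callable[[Inputs], str]


def maybe_get_graph(inputs: Inputs, graphs: Dict[int, Runner]) -> Optional[Runner]:
    """Return a captured-graph runner, or None to request eager execution."""
    if inputs.get("multi_item_part_lens") is not None:
        # This sketch treats any multi-item input as ineligible for graph replay.
        return None
    # Hypothetical lookup: one captured graph per input length.
    return graphs.get(len(inputs["input_ids"]))


def run_encoder(inputs: Inputs, graphs: Dict[int, Runner], eager: Runner) -> str:
    runner = maybe_get_graph(inputs, graphs)
    return eager(inputs) if runner is None else runner(inputs)


graphs = {3: lambda inputs: "graph"}
eager = lambda inputs: "eager"
print(run_encoder({"input_ids": [1, 2, 3]}, graphs, eager))
print(run_encoder({"input_ids": [1, 2, 3], "multi_item_part_lens": [[2, 1]]}, graphs, eager))
```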

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14693: Introduced the original multi-item scoring support via multi_item_part_lens in the FlashInfer backend and AttentionForwardArgs, which this PR refactors by moving the handling from the forward path into prepare().

Suggested reviewers

  • tburt-nv
  • Funatiq
  • brb-nv
  • chang-l
  • eopXD
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately describes the main change: relocating the torch_multi_arange function as a chore task.
Description check | ✅ Passed | The PR description follows the template structure, includes a clear explanation referencing the related PR and commit dependencies, specifies test coverage, and completes the PR checklist.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/attention_backend/flashinfer.py`:
- Around line 744-755: The code accesses req_part_lens[0] and req_part_lens[1:]
without validating that each req_part_lens in multi_item_part_lens has the
required structure, which can cause IndexError or ValueError when constructing
tensors for malformed entries like empty lists or lists with only a prefix_len.
Before constructing the prefix_len_ptr and max_item_len_ptr tensors, add
validation to ensure each req_part_lens has at least two elements (one for
prefix_len and at least one for scored items), and raise an API-level ValueError
with a descriptive message if any request part list fails this validation.
- Around line 762-770: The zip() call combining multi_item_part_lens and
token_pos_in_items_raw_lens should pass strict=True to make explicit that
both iterables have the same length, which resolves the B905 lint finding.
Additionally, replace the list concatenation in the innermost for loop
(req_part_lens[1:] + [token_pos_in_items_len - token_pos_in_items_raw_len]) with
iterable unpacking to resolve the RUF005 lint finding.

In `@tensorrt_llm/_torch/utils.py`:
- Around line 574-580: The variable repeats is initialized as an alias to the
ends tensor, and when starts is None, this alias is never broken before the
in-place multiplication operation repeats *= steps.sign() on line 579. This
mutates the caller's ends tensor. Fix this by using out-of-place arithmetic for
the repeat count calculation: instead of the in-place multiplication repeats *=
steps.sign(), use repeats = repeats * steps.sign() to create a new tensor and
avoid mutating the input.
- Around line 584-602: The prev_range_ends calculation using range_ends.roll(1)
doesn't account for empty ranges where repeats == 0. When a range is empty, its
nominal end value should not be used as the previous range end for the next
range; instead, the end of the last non-empty range should be carried forward.
Modify the logic that computes prev_range_ends to propagate the previous
non-empty range's end value through empty ranges, ensuring that jumps
calculations correctly reflect transitions only between actual non-empty ranges.
- Around line 541-557: Replace the assert statements in the function that
validates dtype, shape, and device compatibility between ends, steps, and starts
parameters with explicit ValueError exceptions that include descriptive error
messages. Additionally, add validation at the function entry to ensure that all
input tensors (starts, ends, and steps) are 1-D tensors, raising ValueError if
they are not, since the implementation later uses unsqueeze and torch.cat
operations that expect 1-D inputs.

In `@tensorrt_llm/llmapi/llm.py`:
- Around line 904-932: The code does not sufficiently validate the structure of
multi_item_part_lens before constructing starts_cuda and ends_cuda, allowing
malformed inputs like [prefix_len] with no item lengths to pass through and fail
later in FlashInfer. Add validation before the torch.tensor calls that construct
starts_cuda and ends_cuda to ensure that each multi_item_part_lens in
batch_multi_item_part_lens has length greater than 1 (meaning at least one item
length in addition to the prefix length) and that all length values are
non-negative. Reject the inputs early with a clear error message if these
conditions are not met.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 89ea82b0-3e34-43c9-bc70-8761dbd903f9

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0a03e and faad7dc.

📒 Files selected for processing (18)
  • .pre-commit-config.yaml
  • legacy-files.txt
  • pyproject.toml
  • ruff-legacy.toml
  • tensorrt_llm/_torch/attention_backend/flashinfer.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/attention_backend/star_flashinfer.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/vanilla.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/sampler.py
  • tensorrt_llm/_torch/pyexecutor/sampling_utils.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/test_torch_multi_arange.py
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/pyexecutor/sampling_utils.py
  • tensorrt_llm/_torch/modules/attention.py

Comment thread tensorrt_llm/_torch/attention_backend/flashinfer.py
Comment thread tensorrt_llm/_torch/attention_backend/flashinfer.py
Comment thread tensorrt_llm/_torch/utils.py Outdated
Comment thread tensorrt_llm/_torch/utils.py Outdated
Comment thread tensorrt_llm/_torch/utils.py Outdated
Comment thread tensorrt_llm/llmapi/llm.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #54587 [ run ] completed with state FAILURE. Commit: faad7dc
/LLM/main/L0_MergeRequest_PR pipeline #43630 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Comment thread legacy-files.txt Outdated
Comment thread tensorrt_llm/_torch/attention_backend/interface.py Outdated

@juney-nvidia juney-nvidia left a comment

Collaborator

Approved

@ixlmar ixlmar removed the request for review from ZhanruiSunCh June 17, 2026 09:37
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
@ixlmar ixlmar force-pushed the chore/move-torch-multi-arange branch from faad7dc to 6cf0c06 Compare June 24, 2026 10:10
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
ixlmar added 2 commits June 24, 2026 15:46
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
@ixlmar

ixlmar commented Jun 24, 2026

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55558 [ run ] triggered by Bot. Commit: 7d978c9 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55558 [ run ] completed with state SUCCESS. Commit: 7d978c9
/LLM/main/L0_MergeRequest_PR pipeline #44479 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@ixlmar ixlmar merged commit 157e533 into NVIDIA:main Jun 25, 2026
7 checks passed
@ixlmar ixlmar deleted the chore/move-torch-multi-arange branch June 25, 2026 07:30
Labels

None yet

Projects

None yet

8 participants