[https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached… by chenfeiz0326 · Pull Request #15611 · NVIDIA/TensorRT-LLM

chenfeiz0326 · 2026-06-25T02:58:35Z

Summary

Root cause: FmhaDispatcher::isSupported() and FmhaDispatcher::run() are on the per-iteration FMHA dispatch hot path, and each invocation called tensorrt_llm::common::getMultiProcessorCount(), which issues a cudaDeviceGetAttribute() call into the CUDA driver. The repeated driver round-trips show up as host overhead on llama-3.3-70B FP4 TP4 (llama70b_fp4_tp4_512_32-con512_iter10_512_32) and caused the perf regression observed in PR #12643.
Fix: Cache the SM count once in FmhaDispatcher's constructor and reuse the cached value on every isSupported() / run() invocation. A new private member, mMultiProcessorCount, is initialized from getMultiProcessorCount() in the constructor initializer list; both call sites that previously called getMultiProcessorCount() now read mMultiProcessorCount. The SM count of a device does not change while the process runs, so caching is safe.
Automated fix generated by repair-bot
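
The caching pattern described above can be sketched in a few lines. This is a minimal illustration, not the actual FmhaDispatcher code: `queryMultiProcessorCount()`, `RunnerParams`, `DispatcherSketch`, and `countQueriesOverIterations()` are hypothetical names, and the query is a counting stub so the sketch runs without a GPU (the real helper calls cudaDeviceGetAttribute()).

```cpp
#include <cassert>

// Hypothetical stand-in for tensorrt_llm::common::getMultiProcessorCount().
// It counts its own invocations so the effect of caching is observable.
static int gQueryCalls = 0;

static int queryMultiProcessorCount()
{
    ++gQueryCalls;
    return 132; // arbitrary illustrative SM count, not a measured value
}

// Hypothetical, simplified parameter struct; only the field named in the PR is shown.
struct RunnerParams
{
    int mMultiProcessorCount{0};
};

class DispatcherSketch
{
public:
    // Query once, in the constructor initializer list.
    DispatcherSketch()
        : mMultiProcessorCount(queryMultiProcessorCount())
    {
    }

    // Hot-path calls read the cached member instead of querying again.
    bool isSupported() const
    {
        RunnerParams params;
        params.mMultiProcessorCount = mMultiProcessorCount;
        return params.mMultiProcessorCount > 0;
    }

    int run() const
    {
        RunnerParams params;
        params.mMultiProcessorCount = mMultiProcessorCount;
        return params.mMultiProcessorCount;
    }

private:
    int mMultiProcessorCount{0};
};

// Simulates `iterations` dispatch iterations and returns how many
// device queries were issued in total.
static int countQueriesOverIterations(int iterations)
{
    assert(iterations >= 0);
    gQueryCalls = 0;
    DispatcherSketch dispatcher;
    for (int i = 0; i < iterations; ++i)
    {
        if (dispatcher.isSupported())
        {
            (void) dispatcher.run();
        }
    }
    return gQueryCalls;
}
```

With this shape, the number of queries depends on how many dispatcher objects are constructed rather than on how many iterations run.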

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6368480

Summary by CodeRabbit

Refactor
- Improved dispatch efficiency by caching a GPU capability value instead of recalculating it repeatedly.
- This should reduce overhead in attention kernel setup and help keep inference runs more efficient.

coderabbitai · 2026-06-25T03:01:37Z

Walkthrough

FmhaDispatcher now caches the multi-processor count on construction and reuses it when setting tllmRunnerParams.mMultiProcessorCount in isSupported() and run().

Changes

FmhaDispatcher SM count cache

| Layer / File(s) | Summary |
| --- | --- |
| Cache and reuse multi-processor count: `cpp/tensorrt_llm/kernels/fmhaDispatcher.h`, `cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp` | `FmhaDispatcher` adds a cached multi-processor count member, initializes it in the constructor, and uses it to populate `tllmRunnerParams.mMultiProcessorCount` in `isSupported()` and `run()`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title matches the change and includes the required ticket and fix type. |
| Description check | ✅ Passed | It explains the issue, fix, and test plan, though it uses Summary/Test plan instead of the template's exact section names. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

🧹 Nitpick comments (1)

cpp/tensorrt_llm/kernels/fmhaDispatcher.h (1)
64-66: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the cached SM count immutable.

mMultiProcessorCount is initialized once and then only read; declare it const to enforce that contract.
Suggested change
-    int mMultiProcessorCount{0};
+    int const mMultiProcessorCount{0};
As per coding guidelines, "Variables not modified after initialization should be declared as const."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.h` around lines 64 - 66, The cached
SM count in FmhaDispatcher is only initialized once and then read, so make
mMultiProcessorCount immutable by declaring it const where it is defined. Update
the member declaration in FmhaDispatcher to reflect that it is never modified
after construction, while keeping the existing initialization and any read
access in isSupported/run unchanged.
Source: Coding guidelines
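
A small sketch of what the suggested `const` member implies in C++. The class and helper names (`ConstCacheSketch`, `querySmCountStub()`, `readConstCache()`) are hypothetical and the SM count is an arbitrary stub value. The sketch shows one trade-off worth checking before applying the suggestion: a `const` non-static data member deletes the implicit copy and move assignment operators. Whether the real FmhaDispatcher is ever assigned is not stated in this PR.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the SM-count query; returns an arbitrary value.
static int querySmCountStub()
{
    return 132;
}

class ConstCacheSketch
{
public:
    // A const member must be set in the initializer list (or by its
    // default member initializer); it cannot be assigned in the body.
    ConstCacheSketch()
        : mMultiProcessorCount(querySmCountStub())
    {
    }

    int smCount() const
    {
        return mMultiProcessorCount;
    }

private:
    int const mMultiProcessorCount{0};
};

// Copy construction still works with a const member...
static_assert(std::is_copy_constructible<ConstCacheSketch>::value,
    "const members do not block copy construction");
// ...but the implicit assignment operators are deleted.
static_assert(!std::is_copy_assignable<ConstCacheSketch>::value,
    "const members delete implicit copy assignment");
static_assert(!std::is_move_assignable<ConstCacheSketch>::value,
    "const members delete implicit move assignment");

// Reads the cached value through the const accessor.
static int readConstCache()
{
    ConstCacheSketch cache;
    assert(cache.smCount() > 0);
    return cache.smCount();
}
```

If the owning class has to stay assignable, a non-const member that is simply never written after construction gives the same runtime behavior without the compile-time guarantee.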

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8b53a9c6-0157-4cb9-aede-74f646cdada8

📥 Commits

Reviewing files that changed from the base of the PR and between 7cc568c and d0f68ce.

📒 Files selected for processing (2)

cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
cpp/tensorrt_llm/kernels/fmhaDispatcher.h

chenfeiz0326 · 2026-06-25T03:04:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T03:09:30Z

PR_Github #55678 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

tensorrt-cicd · 2026-06-25T03:14:00Z

PR_Github #55679 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

tensorrt-cicd · 2026-06-25T03:19:12Z

PR_Github #55678 [ run ] completed with state ABORTED. Commit: 8959236

Link to invocation

tensorrt-cicd · 2026-06-25T04:32:17Z

PR_Github #55679 [ run ] completed with state FAILURE. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44585 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

longlee0622 · 2026-06-25T04:33:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T04:40:29Z

PR_Github #55698 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

tensorrt-cicd · 2026-06-25T11:38:03Z

PR_Github #55698 [ run ] completed with state SUCCESS. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44601 completed with status: 'FAILURE'

longlee0622 · 2026-06-25T12:03:52Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T12:09:37Z

PR_Github #55778 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

tensorrt-cicd · 2026-06-25T14:09:49Z

PR_Github #55778 [ run ] completed with state SUCCESS. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44675 completed with status: 'FAILURE'

…id per-iter cudaDeviceGetAttribute

tensorrt_llm::common::getMultiProcessorCount() calls cudaDeviceGetAttribute every invocation. FmhaDispatcher::isSupported() and run() are on the per-iter FMHA dispatch hot path, so the repeated CUDA driver calls show up as host overhead in the llama-3.3-70B FP4 TP4 perf case (con512_iter10_512_32). Cache the value in a member at construction and reuse it.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

longlee0622 · 2026-06-25T15:25:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T15:32:42Z

PR_Github #55808 [ run ] triggered by Bot. Commit: 5718551 Link to invocation

tensorrt-cicd · 2026-06-25T22:44:46Z

PR_Github #55808 [ run ] completed with state FAILURE. Commit: 5718551
/LLM/main/L0_MergeRequest_PR pipeline #44702 completed with status: 'FAILURE'


github-actions Bot assigned chenfeiz0326 Jun 25, 2026

longlee0622 requested a review from yuxianq June 25, 2026 03:00

coderabbitai Bot reviewed Jun 25, 2026

chenfeiz0326 force-pushed the repair-bot-bug6368480 branch from d0f68ce to 8959236 Compare June 25, 2026 03:02

yuxianq approved these changes Jun 25, 2026

chenfeiz0326 enabled auto-merge (squash) June 25, 2026 03:29

longlee0622 force-pushed the repair-bot-bug6368480 branch from 8959236 to 5718551 Compare June 25, 2026 15:25
