Skip to content

[https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached…#15611

Open
chenfeiz0326 wants to merge 1 commit into
NVIDIA:mainfrom
chenfeiz0326:repair-bot-bug6368480
Open

[https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached…#15611
chenfeiz0326 wants to merge 1 commit into
NVIDIA:mainfrom
chenfeiz0326:repair-bot-bug6368480

Conversation

@chenfeiz0326

@chenfeiz0326 chenfeiz0326 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Root cause: FmhaDispatcher::isSupported() and FmhaDispatcher::run() are on the per-iter FMHA dispatch hot path, and each invocation called tensorrt_llm::common::getMultiProcessorCount(), which is a cudaDeviceGetAttribute() call into the CUDA driver. The repeated driver round-trips show up as host overhead on llama-3.3-70B FP4 TP4 (llama70b_fp4_tp4_512_32-con512_iter10_512_32) and caused the perf regression observed in PR12643.
  • Fix: Cache the SM count once in FmhaDispatcher's constructor and reuse the cached value on every isSupported() / run() invocation. New private member mMultiProcessorCount is initialized from getMultiProcessorCount() in the constructor initializer list; both call sites that previously called getMultiProcessorCount() now read mMultiProcessorCount. SM count is fixed per process, so caching is safe.
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Refactor
    • Improved dispatch efficiency by caching a GPU capability value instead of recalculating it repeatedly.
    • This should reduce overhead in attention kernel setup and help keep inference runs more efficient.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

FmhaDispatcher now caches the multi-processor count on construction and reuses it when setting tllmRunnerParams.mMultiProcessorCount in isSupported() and run().

Changes

FmhaDispatcher SM count cache

Layer / File(s) Summary
Cache and reuse multi-processor count
cpp/tensorrt_llm/kernels/fmhaDispatcher.h, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
FmhaDispatcher adds a cached multi-processor count member, initializes it in the constructor, and uses it to populate tllmRunnerParams.mMultiProcessorCount in isSupported() and run().

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the change and includes the required ticket and fix type.
Description check ✅ Passed It explains the issue, fix, and test plan, though it uses Summary/Test plan instead of the template's exact section names.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/fmhaDispatcher.h (1)

64-66: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the cached SM count immutable.

mMultiProcessorCount is initialized once and then only read; declare it const to enforce that contract.

Suggested change
-    int mMultiProcessorCount{0};
+    int const mMultiProcessorCount{0};

As per coding guidelines, "Variables not modified after initialization should be declared as const."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.h` around lines 64 - 66, The cached
SM count in FMHADispatcher is only initialized once and then read, so make
mMultiProcessorCount immutable by declaring it const where it is defined. Update
the member declaration in FMHADispatcher to reflect that it is never modified
after construction, while keeping the existing initialization and any read
access in isSupported/run unchanged.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.h`:
- Around line 64-66: The cached SM count in FMHADispatcher is only initialized
once and then read, so make mMultiProcessorCount immutable by declaring it const
where it is defined. Update the member declaration in FMHADispatcher to reflect
that it is never modified after construction, while keeping the existing
initialization and any read access in isSupported/run unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8b53a9c6-0157-4cb9-aede-74f646cdada8

📥 Commits

Reviewing files that changed from the base of the PR and between 7cc568c and d0f68ce.

📒 Files selected for processing (2)
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.h

@chenfeiz0326 chenfeiz0326 force-pushed the repair-bot-bug6368480 branch from d0f68ce to 8959236 Compare June 25, 2026 03:02
@chenfeiz0326

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55678 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55679 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55678 [ run ] completed with state ABORTED. Commit: 8959236

Link to invocation

@chenfeiz0326 chenfeiz0326 enabled auto-merge (squash) June 25, 2026 03:29
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55679 [ run ] completed with state FAILURE. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44585 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longlee0622

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55698 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55698 [ run ] completed with state SUCCESS. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44601 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@longlee0622

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55778 [ run ] triggered by Bot. Commit: 8959236 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55778 [ run ] completed with state SUCCESS. Commit: 8959236
/LLM/main/L0_MergeRequest_PR pipeline #44675 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…id per-iter cudaDeviceGetAttribute

tensorrt_llm::common::getMultiProcessorCount() calls cudaDeviceGetAttribute every invocation. FmhaDispatcher::isSupported() and run() are on the per-iter FMHA dispatch hot path, so the repeated CUDA driver calls show up as host overhead in the llama-3.3-70B FP4 TP4 perf case (con512_iter10_512_32). Cache the value in a member at construction and reuse it.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@longlee0622 longlee0622 force-pushed the repair-bot-bug6368480 branch from 8959236 to 5718551 Compare June 25, 2026 15:25
@longlee0622

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55808 [ run ] triggered by Bot. Commit: 5718551 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55808 [ run ] completed with state FAILURE. Commit: 5718551
/LLM/main/L0_MergeRequest_PR pipeline #44702 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants