[https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached…#15611
Conversation
📝 WalkthroughWalkthrough
ChangesFmhaDispatcher SM count cache
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/fmhaDispatcher.h (1)
64-66: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake the cached SM count immutable.
mMultiProcessorCountis initialized once and then only read; declare itconstto enforce that contract.Suggested change
- int mMultiProcessorCount{0}; + int const mMultiProcessorCount{0};As per coding guidelines, "Variables not modified after initialization should be declared as
const."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.h` around lines 64 - 66, The cached SM count in FMHADispatcher is only initialized once and then read, so make mMultiProcessorCount immutable by declaring it const where it is defined. Update the member declaration in FMHADispatcher to reflect that it is never modified after construction, while keeping the existing initialization and any read access in isSupported/run unchanged.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.h`:
- Around line 64-66: The cached SM count in FMHADispatcher is only initialized
once and then read, so make mMultiProcessorCount immutable by declaring it const
where it is defined. Update the member declaration in FMHADispatcher to reflect
that it is never modified after construction, while keeping the existing
initialization and any read access in isSupported/run unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8b53a9c6-0157-4cb9-aede-74f646cdada8
📒 Files selected for processing (2)
cpp/tensorrt_llm/kernels/fmhaDispatcher.cppcpp/tensorrt_llm/kernels/fmhaDispatcher.h
d0f68ce to
8959236
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #55678 [ run ] triggered by Bot. Commit: |
|
PR_Github #55679 [ run ] triggered by Bot. Commit: |
|
PR_Github #55678 [ run ] completed with state |
|
PR_Github #55679 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55698 [ run ] triggered by Bot. Commit: |
|
PR_Github #55698 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55778 [ run ] triggered by Bot. Commit: |
|
PR_Github #55778 [ run ] completed with state
|
…id per-iter cudaDeviceGetAttribute tensorrt_llm::common::getMultiProcessorCount() calls cudaDeviceGetAttribute every invocation. FmhaDispatcher::isSupported() and run() are on the per-iter FMHA dispatch hot path, so the repeated CUDA driver calls show up as host overhead in the llama-3.3-70B FP4 TP4 perf case (con512_iter10_512_32). Cache the value in a member at construction and reuse it. Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
8959236 to
5718551
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #55808 [ run ] triggered by Bot. Commit: |
|
PR_Github #55808 [ run ] completed with state
|
Summary
Test plan
Links
Summary by CodeRabbit