Improve handling of the NVTE_CUDA_ARCHS #2665

ptrendx · 2026-02-09T23:09:22Z

Description

Improve handling of the NVTE_CUDA_ARCHS env variable:

add the regular architectures to the build of the sources with specific architectures to enable some support for GPU architectures in the family that were not specialized directly.
automatically add sm75 to the build in case the CMAKE_CUDA_ARCHITECTURES becomes empty (which currently should only happen when cmake < 4.0.2 and sm120 is the only selected architecture)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

greptile-apps · 2026-02-09T23:20:51Z

Greptile Overview

Greptile Summary

This PR adjusts Transformer Engine’s CUDA arch selection logic by splitting CMAKE_CUDA_ARCHITECTURES into generic vs arch-specific compile passes (and defines NVTE_HAS_ARCH_SPECIFIC_TARGETS for the arch-specific sources), plus a fallback for older CMake when the architecture list becomes empty.

The main concern is in transformer_engine/common/CMakeLists.txt: the new handling for 100/101 appends 100a/101a but does not remove 100/101 from CMAKE_CUDA_ARCHITECTURES nor route them into NVTE_GENERIC_ARCHS, which breaks the intended separation and can cause unintended compilation targeting/duplication. The older-CMake fallback message also doesn’t match behavior: it replaces the arch list with 75 rather than adding 75 alongside the originally requested target(s).

Confidence Score: 2/5

This PR is not safe to merge as-is due to incorrect CUDA architecture list handling in CMake for sm_100/sm_101 and a fallback that can drop requested targets.
The change set is small, but the CMake logic as written will keep 100/101 in CMAKE_CUDA_ARCHITECTURES while also compiling arch-specific sources for 100a/101a, contradicting the intended generic-vs-specific split used for 110/120. Additionally, the old-CMake fallback replaces the architecture list with 75, which can remove the user’s requested architecture and change build output.
transformer_engine/common/CMakeLists.txt

Important Files Changed

Filename	Overview
transformer_engine/common/CMakeLists.txt	Refactors CUDA arch handling into generic vs specific arch lists and adds fallback for older CMake; current logic leaves arch 100/101 in CMAKE_CUDA_ARCHITECTURES while also adding 100a/101a, leading to unintended double-compilation / wrong arch list.
transformer_engine/common/util/ptx.cuh	Updates NVTE_CUDA_ARCH_MATCHES compatibility checks to behave differently when building arch-specific targets (NVTE_HAS_ARCH_SPECIFIC_TARGETS). Change looks consistent and avoids compile-time static_asserts for arch-specific builds.

Sequence Diagram

sequenceDiagram
  participant UserEnv as User env (NVTE_CUDA_ARCHS)
  participant CMake as CMake configure
  participant TE as common/CMakeLists.txt
  participant NVCC as nvcc
  participant Gen as Generic .cu sources
  participant Spec as Arch-specific .cu sources
  participant PTX as util/ptx.cuh

  UserEnv->>CMake: "Configure build"
  CMake->>TE: "Evaluate arch selection"
  TE->>TE: "Split into NVTE_GENERIC_ARCHS / NVTE_SPECIFIC_ARCHS"

  alt arch is 110/120
    TE->>TE: "Remove base arch from CMAKE_CUDA_ARCHITECTURES"
    TE->>TE: "Append base to NVTE_GENERIC_ARCHS"
    TE->>TE: "Append suffixed arch to NVTE_SPECIFIC_ARCHS"
  else arch is 100/101 (current PR)
    TE->>TE: "Keep base arch in CMAKE_CUDA_ARCHITECTURES"
    TE->>TE: "Append 100a/101a to NVTE_SPECIFIC_ARCHS"
  end

  TE->>NVCC: "Compile Gen with --generate-code for NVTE_GENERIC_ARCHS"
  TE->>NVCC: "Compile Spec with --generate-code for NVTE_SPECIFIC_ARCHS"
  TE->>NVCC: "Compile all sources for CMAKE_CUDA_ARCHITECTURES"

  NVCC->>PTX: "Define NVTE_HAS_ARCH_SPECIFIC_TARGETS for Spec"
  PTX->>PTX: "NVTE_CUDA_ARCH_MATCHES: runtime check when macro defined"

greptile-apps

_{2 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-09T23:20:54Z

transformer_engine/common/CMakeLists.txt

@@ -54,31 +52,57 @@ endif()
 # Check for architecture 101 (if we see this we are in toolkit <= 12.9)
 list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index)
 if(NOT arch_101_index EQUAL -1)
-  list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101")
-  list(APPEND NVTE_GENERIC_ARCHS "101")
  list(APPEND NVTE_SPECIFIC_ARCHS "101a")
 endif()


Arch 100/101 not removed

When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.

@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.

greptile-apps · 2026-02-09T23:20:56Z

transformer_engine/common/CMakeLists.txt

+if(CMAKE_VERSION VERSION_LESS 4.0.2)
+  if(NOT CMAKE_CUDA_ARCHITECTURES)
+    message(WARNING
+            "CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. "
+            "Please upgrade to CMake 4.0.2+ for native 'f' architecture support. "
+            "Adding sm_75 target in addition to the specified target to avoid configuration "
+            "errors - this will result in longer build time, but does not affect correctness.")
+    set(CMAKE_CUDA_ARCHITECTURES 75)
  endif()


Fallback wipes user archs

On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.

@greptileai If the list is empty, then setting is the same as appending.

ptrendx added 4 commits January 26, 2026 15:57

Trial and error

8f9804a

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Should be ok

09e6ca8

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Fix one issue and add better message to the fallback path

5740300

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Fixes

71041ae

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx requested a review from ksivaman February 9, 2026 23:09

greptile-apps bot reviewed Feb 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of the NVTE_CUDA_ARCHS #2665

Improve handling of the NVTE_CUDA_ARCHS #2665

ptrendx commented Feb 9, 2026

Uh oh!

greptile-apps bot commented Feb 9, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 9, 2026

Uh oh!

ptrendx Feb 10, 2026

Uh oh!

greptile-apps bot Feb 9, 2026

Uh oh!

ptrendx Feb 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Improve handling of the NVTE_CUDA_ARCHS #2665

Are you sure you want to change the base?

Improve handling of the NVTE_CUDA_ARCHS #2665

Conversation

ptrendx commented Feb 9, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ptrendx Feb 10, 2026 •

edited

Loading