Skip to content

Conversation

@ptrendx
Copy link
Member

@ptrendx ptrendx commented Feb 9, 2026

Description

Improve handling of the NVTE_CUDA_ARCHS env variable:

  • add the regular architectures to the build of the sources with specific architectures to enable some support for GPU architectures in the family that were not specialized directly.
  • automatically add sm75 to the build in case the CMAKE_CUDA_ARCHITECTURES becomes empty (which currently should only happen when cmake < 4.0.2 and sm120 is the only selected architecture)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx requested a review from ksivaman February 9, 2026 23:09
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR adjusts Transformer Engine’s CUDA arch selection logic by splitting CMAKE_CUDA_ARCHITECTURES into generic vs arch-specific compile passes (and defines NVTE_HAS_ARCH_SPECIFIC_TARGETS for the arch-specific sources), plus a fallback for older CMake when the architecture list becomes empty.

The main concern is in transformer_engine/common/CMakeLists.txt: the new handling for 100/101 appends 100a/101a but does not remove 100/101 from CMAKE_CUDA_ARCHITECTURES nor route them into NVTE_GENERIC_ARCHS, which breaks the intended separation and can cause unintended compilation targeting/duplication. The older-CMake fallback message also doesn’t match behavior: it replaces the arch list with 75 rather than adding 75 alongside the originally requested target(s).

Confidence Score: 2/5

  • This PR is not safe to merge as-is due to incorrect CUDA architecture list handling in CMake for sm_100/sm_101 and a fallback that can drop requested targets.
  • The change set is small, but the CMake logic as written will keep 100/101 in CMAKE_CUDA_ARCHITECTURES while also compiling arch-specific sources for 100a/101a, contradicting the intended generic-vs-specific split used for 110/120. Additionally, the old-CMake fallback replaces the architecture list with 75, which can remove the user’s requested architecture and change build output.
  • transformer_engine/common/CMakeLists.txt

Important Files Changed

Filename Overview
transformer_engine/common/CMakeLists.txt Refactors CUDA arch handling into generic vs specific arch lists and adds fallback for older CMake; current logic leaves arch 100/101 in CMAKE_CUDA_ARCHITECTURES while also adding 100a/101a, leading to unintended double-compilation / wrong arch list.
transformer_engine/common/util/ptx.cuh Updates NVTE_CUDA_ARCH_MATCHES compatibility checks to behave differently when building arch-specific targets (NVTE_HAS_ARCH_SPECIFIC_TARGETS). Change looks consistent and avoids compile-time static_asserts for arch-specific builds.

Sequence Diagram

sequenceDiagram
  participant UserEnv as User env (NVTE_CUDA_ARCHS)
  participant CMake as CMake configure
  participant TE as common/CMakeLists.txt
  participant NVCC as nvcc
  participant Gen as Generic .cu sources
  participant Spec as Arch-specific .cu sources
  participant PTX as util/ptx.cuh

  UserEnv->>CMake: "Configure build"
  CMake->>TE: "Evaluate arch selection"
  TE->>TE: "Split into NVTE_GENERIC_ARCHS / NVTE_SPECIFIC_ARCHS"

  alt arch is 110/120
    TE->>TE: "Remove base arch from CMAKE_CUDA_ARCHITECTURES"
    TE->>TE: "Append base to NVTE_GENERIC_ARCHS"
    TE->>TE: "Append suffixed arch to NVTE_SPECIFIC_ARCHS"
  else arch is 100/101 (current PR)
    TE->>TE: "Keep base arch in CMAKE_CUDA_ARCHITECTURES"
    TE->>TE: "Append 100a/101a to NVTE_SPECIFIC_ARCHS"
  end

  TE->>NVCC: "Compile Gen with --generate-code for NVTE_GENERIC_ARCHS"
  TE->>NVCC: "Compile Spec with --generate-code for NVTE_SPECIFIC_ARCHS"
  TE->>NVCC: "Compile all sources for CMAKE_CUDA_ARCHITECTURES"

  NVCC->>PTX: "Define NVTE_HAS_ARCH_SPECIFIC_TARGETS for Spec"
  PTX->>PTX: "NVTE_CUDA_ARCH_MATCHES: runtime check when macro defined"
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 43 to 56
@@ -54,31 +52,57 @@ endif()
# Check for architecture 101 (if we see this we are in toolkit <= 12.9)
list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index)
if(NOT arch_101_index EQUAL -1)
list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101")
list(APPEND NVTE_GENERIC_ARCHS "101")
list(APPEND NVTE_SPECIFIC_ARCHS "101a")
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arch 100/101 not removed

When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.

Comment on lines +93 to 101
if(CMAKE_VERSION VERSION_LESS 4.0.2)
if(NOT CMAKE_CUDA_ARCHITECTURES)
message(WARNING
"CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. "
"Please upgrade to CMake 4.0.2+ for native 'f' architecture support. "
"Adding sm_75 target in addition to the specified target to avoid configuration "
"errors - this will result in longer build time, but does not affect correctness.")
set(CMAKE_CUDA_ARCHITECTURES 75)
endif()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback wipes user archs

On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.

Copy link
Member Author

@ptrendx ptrendx Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai If the list is empty, then setting is the same as appending.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant