Skip to content

Conversation

@vcherepanov-nv
Copy link
Collaborator

Description

Starting with cuBLASMp 0.8.0, they're moving away from using nvshmem for symmetric memory, use NCCL instead.
This change adapts the to changed API.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Get rid of nvshmem dependency (for cuBLASMp)
  • Update cuBLASMp API usage

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

vcherepanov-nv and others added 5 commits January 27, 2026 02:36
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 9, 2026

Greptile Overview

Greptile Summary

This PR updates TransformerEngine’s cuBLASMp integration to reflect cuBLASMp >= 0.8.0 moving symmetric-memory support from nvshmem to NCCL. It removes the nvshmem build dependency, updates CMake linkage to bring in NCCL for the cuBLASMp path, and adjusts the comm GEMM implementation and its public header to use the updated cuBLASMp APIs.

I did not find additional fix-required issues in the changed files beyond the two items already raised in prior threads (NCCL discoverability in CMake, and workspace lifetime/synchronization in ctx destroy).

Confidence Score: 2/5

  • This PR has known merge-blocking issues called out in prior review threads that should be resolved before merging.
  • Review of the changed files did not surface additional defects beyond prior threads, but the existing unresolved items are significant: CMake must reliably find/link NCCL when NVTE_WITH_CUBLASMP is enabled, and ctx destroy currently lacks a guarantee that in-flight work using ctx->workspace has completed before deregistration/free. Those can break builds or cause use-after-free in valid usage patterns.
  • transformer_engine/common/CMakeLists.txt, transformer_engine/common/comm_gemm/comm_gemm.cpp, transformer_engine/common/include/transformer_engine/comm_gemm.h

Important Files Changed

Filename Overview
setup.py Updates build configuration to remove nvshmem dependency and pass cuBLASMp-related settings; no correctness issues found in this review.
transformer_engine/common/CMakeLists.txt Adjusts CMake options/dependencies for cuBLASMp to use NCCL instead of nvshmem; no additional issues found beyond already-tracked NCCL discovery comment.
transformer_engine/common/comm_gemm/comm_gemm.cpp Adapts cuBLASMp comm GEMM implementation to updated API (NCCL symmetric memory); no new functional issues found beyond already-commented workspace lifetime/synchronization concern.
transformer_engine/common/include/transformer_engine/comm_gemm.h Updates comm GEMM public API/comments to match new cuBLASMp/NCCL integration; no new issues found.

Sequence Diagram

sequenceDiagram
    participant Py as setup.py (build)
    participant CM as CMakeLists.txt
    participant CG as comm_gemm.cpp
    participant MP as cuBLASMp
    participant NC as NCCL

    Py->>CM: Configure NVTE_WITH_CUBLASMP and dependency paths
    CM->>CM: Link cuBLASMp + NCCL (nvshmem removed)
    CG->>MP: Create/Configure cuBLASMp handle
    CG->>NC: Initialize/associate NCCL-backed symmetric memory
    CG->>MP: Register/deregister workspace buffer (new API)
    CG->>MP: Enqueue cublasMpMatmul on main_stream
    CG-->>MP: Destroy ctx/handle (must ensure stream work complete)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +292 to +296
find_library(NCCL_LIB
NAMES nccl libnccl
PATH_SUFFIXES lib
REQUIRED)
target_link_libraries(transformer_engine PUBLIC ${CUBLASMP_LIB} ${NVSHMEM_HOST_LIB})
target_link_libraries(transformer_engine PUBLIC ${NCCL_LIB} ${CUBLASMP_LIB})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NCCL library not discoverable

find_library(NCCL_LIB ...) is missing a PATHS/hint variable (unlike CUBLASMP_LIB, which uses ${CUBLASMP_DIR}). Unless NCCL is already in the default linker search paths, enabling NVTE_WITH_CUBLASMP will fail at configure time with NCCL_LIB not found. This PR should add a way to point CMake at NCCL (env var / CMake cache var) and pass it from setup.py similarly to CUBLASMP_DIR.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 9, 2026

Additional Comments (3)

transformer_engine/common/CMakeLists.txt
Undefined NVSHMEM_DIR include

In the NVTE_WITH_CUBLASMP block, target_include_directories(... ${NVSHMEM_DIR}/include) still references NVSHMEM_DIR, but this PR removed the only place that was setting it (setup.py). With NVTE_WITH_CUBLASMP=ON, CMake will now expand an unset var here, which can break configuration/build (and it also contradicts the goal of removing nvshmem for cuBLASMp).


transformer_engine/common/comm_gemm/comm_gemm.cpp
Uninitialized workspace fields

NVTECommGemmCtx has raw workspace/workspace_size fields, but nvte_comm_gemm_ctx_create()'s designated initializer never sets them. That means cublasmp_gemm() will read uninitialized ctx->workspace_size and may also treat an uninitialized ctx->workspace as non-null (leading to invalid deregister/free calls). These need explicit initialization (e.g., workspace=nullptr, workspace_size=0) in the create path.


transformer_engine/common/comm_gemm/comm_gemm.cpp
Workspace leak + missing cleanup

After switching from nvshmem_malloc/free to cublasMpMalloc/cublasMpFree + cublasMpBufferRegister, nvte_comm_gemm_ctx_destroy() now just deletes the ctx and never deregisters/frees ctx->workspace. Since workspace is a raw pointer (not RAII-managed), this leaks device memory and leaves the cuBLASMp registration state dirty for the lifetime of the process.

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines 479 to 485
void nvte_comm_gemm_ctx_destroy(NVTECommGemmCtx* ctx) {
NVTE_API_CALL(nvte_comm_gemm_ctx_destroy);
nvshmemx_sync_all_on_stream(ctx->stream.get());
if (ctx->workspace) {
NVTE_CHECK_CUBLASMP(cublasMpBufferDeregister(ctx->grid_row_major.get(), ctx->workspace));
NVTE_CHECK_CUBLASMP(cublasMpFree(ctx->grid_col_major.get(), ctx->workspace));
}
delete ctx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsafe workspace free

nvte_comm_gemm_ctx_destroy can deregister/free ctx->workspace while cublasMpMatmul work on the user-provided main_stream is still in flight. cublasmp_gemm sets the cuBLASMp handle stream to main_stream (comm_gemm.cpp:389) and uses ctx->workspace in the enqueue (comm_gemm.cpp:436-439), but destroy() does not synchronize main_stream (or otherwise ensure completion) before calling cublasMpBufferDeregister/cublasMpFree.

This can become a use-after-free if a caller destroys the ctx shortly after launching a comm GEMM. Either synchronize the relevant stream(s) before freeing, or explicitly document (in comm_gemm.h) that callers must synchronize main_stream before calling destroy().

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant