Symmetric memory pytorch backends #6023

Open
saivishal1999 wants to merge 11 commits into main from symmetric-memory-pytorch-backends

Conversation

@saivishal1999
Collaborator

No description provided.

@github-actions

github-actions bot commented Mar 2, 2026

Review updated until commit 6996d05

Description

  • Add PyTorch symmetric memory backends (NCCL, NVSHMEM, CUDA) as alternatives to native VMM

  • Implement getSymmetricMemoryBackend() to select backend via NVFUSER_ENABLE=symmetric_memory_backend option

  • Integrate PyTorch's c10d::symmetric_memory for allocation, rendezvous, and remote tensor access

  • Add Communicator methods to expose Store and Backend for PyTorch symmetric memory integration

Changes walkthrough

Relevant files

Enhancement (6 files)
  • ipc_utils.h: Add SymmetricMemoryBackend enum and getter (+13/-0)
  • ipc_utils.cpp: Implement getSymmetricMemoryBackend option parsing (+18/-0)
  • symmetric_tensor.h: Add PyTorch symmetric memory handle member (+15/-6)
  • symmetric_tensor.cpp: Implement PyTorch backend allocation and remote access (+162/-1)
  • communicator.h: Declare getStore and getWorldBackendIntrusivePtr (+13/-0)
  • communicator.cpp: Implement getStore and getWorldBackendIntrusivePtr (+16/-0)

Configuration changes (2 files)
  • options.h: Add SymmetricMemoryBackend to EnableOption enum (+2/-0)
  • options.cpp: Register symmetric_memory_backend enable option (+1/-0)

Tests (1 file)
  • test_multidevice_symmetric_tensor.cpp: Add tests for symmetric memory backend selection (+108/-0)

Miscellaneous (1 file)
  • fbuild.sh: Add build script for development (+24/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Silent fallback to Native backend

When an invalid argument is passed to symmetric_memory_backend option (e.g., "pytorch_invalid"),
getSymmetricMemoryBackend() silently falls back to Native instead of reporting an error.
This could mask user configuration mistakes. Consider adding validation to warn or error
on unknown backend arguments.

SymmetricMemoryBackend getSymmetricMemoryBackend() {
  if (isOptionEnabled(EnableOption::SymmetricMemoryBackend)) {
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nccl")) {
      return SymmetricMemoryBackend::PyTorchNccl;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nvshmem")) {
      return SymmetricMemoryBackend::PyTorchNvshmem;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_cuda")) {
      return SymmetricMemoryBackend::PyTorchCuda;
    }
  }
  return SymmetricMemoryBackend::Native;
}
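
One way to surface the configuration mistake described above is to reject unknown arguments instead of falling through to Native. The following is a minimal standalone sketch, not the PR's code: `parseSymmetricMemoryBackend` is a hypothetical helper, the enum is a stand-in for the one added in ipc_utils.h, and accepting an explicit "native" argument is an assumption.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Stand-in for the enum the PR adds in ipc_utils.h.
enum class SymmetricMemoryBackend {
  Native,
  PyTorchNccl,
  PyTorchNvshmem,
  PyTorchCuda
};

// Hypothetical helper (not part of the PR): maps the option argument to a
// backend and throws on anything it does not recognize.
// std::nullopt models "option not enabled", which keeps the Native default.
// Accepting an explicit "native" argument is an assumption of this sketch.
SymmetricMemoryBackend parseSymmetricMemoryBackend(
    const std::optional<std::string>& arg) {
  if (!arg.has_value() || *arg == "native") {
    return SymmetricMemoryBackend::Native;
  }
  if (*arg == "pytorch_nccl") {
    return SymmetricMemoryBackend::PyTorchNccl;
  }
  if (*arg == "pytorch_nvshmem") {
    return SymmetricMemoryBackend::PyTorchNvshmem;
  }
  if (*arg == "pytorch_cuda") {
    return SymmetricMemoryBackend::PyTorchCuda;
  }
  throw std::invalid_argument(
      "Unknown symmetric_memory_backend argument '" + *arg +
      "'; expected native, pytorch_nccl, pytorch_nvshmem, or pytorch_cuda");
}
```

With this shape, a typo such as "pytorch_invalid" fails at option-parsing time rather than silently selecting the native path.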
PyTorch backend tests commented out

The test PyTorchBackend_RemoteAccessCorrectness (lines 125-163) is commented out. Since this
PR introduces PyTorch symmetric memory backends, having at least one active test for the
non-native paths would be valuable to ensure correctness. Consider enabling or adding an
alternative test for the PyTorch backend path.

// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
Unnecessary build script added

A new file fbuild.sh was added which appears to be a local development/build script with
hardcoded paths (e.g., /opt/hpcx/ucc). This should likely be removed from the PR as it's
not part of the feature implementation and contains machine-specific configuration.

#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation

@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR integrates PyTorch's torch.distributed._symmetric_memory as an optional backend for SymmetricTensor, selectable via NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl|pytorch_nvshmem|pytorch_cuda). It adds c10d::ProcessGroup registration to Communicator so PyTorch's symmetric-memory rendezvous can resolve the group by name, introduces a SymmetricMemoryBackend enum + option-string parser, and wires the PyTorch path through allocation (empty_strided_p2p), rendezvous, and remote tensor access while keeping the native VMM path as the default.

Key issues found:

  • Debug print in tests (std::cout << "Vishal chishta") left in SmallAllocation — must be removed before merging.
  • NCCL availability guard is over-broad: ensurePyTorchSymmMemBackend unconditionally checks isBackendAvailable(kNccl) and emits an error message hardcoded to "(nccl)" even when the active backend is PyTorchNvshmem or PyTorchCuda.
  • Static once_flag silently ignores the second backend: if the function is first called with PyTorchCuda (which skips set_backend), a later call with PyTorchNccl will never invoke set_backend("NCCL"), leaving PyTorch symmetric memory misconfigured.
  • setupMulticast silent success path: when torch_symm_handle_ is already set but multicast was not supported at rendezvous time, setupMulticast returns without error or setting is_multicast_setup_, then multicastPtr() throws — misleadingly blamed on the caller rather than the missing capability.
  • getSymmMemGroupKey formatting: function body is indented at column 0 and the file is missing a trailing newline.
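
The once_flag issue in the list above could be made visible by remembering which backend was configured first and rejecting a later, different request. This is a standalone sketch under stated assumptions: `BackendGuard` is a hypothetical class, the enum is a stand-in for the PR's, and the place where the real code would call set_backend is only marked by a comment.

```cpp
#include <mutex>
#include <optional>
#include <stdexcept>

// Stand-in for the enum the PR adds in ipc_utils.h.
enum class SymmetricMemoryBackend {
  Native,
  PyTorchNccl,
  PyTorchNvshmem,
  PyTorchCuda
};

// Hypothetical guard (not from the PR): records the first backend it is asked
// to configure and refuses a later call that names a different one.
class BackendGuard {
 public:
  // Returns true if this call performed the first-time configuration and
  // false if the same backend was already configured. Throws on a mismatch.
  bool ensure(SymmetricMemoryBackend backend) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!configured_.has_value()) {
      // The real code would configure the PyTorch backend at this point.
      configured_ = backend;
      return true;
    }
    if (*configured_ != backend) {
      throw std::logic_error(
          "Symmetric memory backend was already configured with a different "
          "value");
    }
    return false;
  }

 private:
  std::mutex mutex_;
  std::optional<SymmetricMemoryBackend> configured_;
};
```

The design choice here is to fail loudly on a mismatch; whether switching backends within one process should instead be supported is a separate question this sketch does not answer.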

Confidence Score: 2/5

  • Not safe to merge — accidental debug output in tests, logic errors in backend initialization, and a silent failure path in multicast setup need to be addressed first.
  • Several P1 logic bugs are present: the static once_flag can silently misconfigure the backend if called with different backends, the NCCL check incorrectly gates non-NCCL backends, and setupMulticast has a misleading silent-return path. A debug print also left in a test method will pollute all CI runs. The personal build script (file 1) was already flagged but remains unaddressed. The infrastructure additions (ProcessGroup registration, enum, option parsing) are structurally sound, but the integration layer in symmetric_tensor.cpp needs further hardening.
  • Primary attention needed on csrc/multidevice/symmetric_tensor.cpp (backend initialization logic, multicast guard) and tests/cpp/test_multidevice_symmetric_tensor.cpp (debug print, lack of CI-runnable PyTorch backend test).

Important Files Changed

Filename: Overview
  • csrc/multidevice/symmetric_tensor.cpp: Core implementation of PyTorch symmetric memory backend integration. Contains several logic issues: static once_flag silently ignores backend changes after first call, NCCL availability check incorrectly applied to non-NCCL backends with a wrong error message, and setupMulticast silently returns without error when torch handle is set but multicast is unsupported.
  • csrc/multidevice/communicator.cpp: Adds ProcessGroup registration for NCCL backends and new getSymmMemGroupKey helper. The new function body is unindented (style) and the file is missing a trailing newline. Logic is otherwise sound; registered keys are consistent between getBackendForTeam and getSymmMemGroupKey since no prefix is passed.
  • csrc/multidevice/communicator.h: Adds NVFUSER_CAN_REGISTER_C10D_PROCESS_GROUP compile-time guard, ProcessGroup storage, and getSymmMemGroupKey declaration. Changes are well-structured and backward-compatible.
  • csrc/multidevice/ipc_utils.h: Introduces SymmetricMemoryBackend enum and declares getSymmetricMemoryBackend(). Straightforward and well-documented.
  • tests/cpp/test_multidevice_symmetric_tensor.cpp: Adds backend-aware test skipping for ContiguousView, but leaves an accidental debug print ("Vishal chishta") in SmallAllocation that pollutes test output. PyTorch backend is tested only manually (commented-out test), leaving the new code paths without CI coverage.
  • 1: Personal developer build script with machine-specific toolchain paths (clang-20, mold linker, /opt/hpcx). Should not be committed to the repository (already flagged in a previous review thread).

Sequence Diagram

sequenceDiagram
    participant Caller
    participant SymmetricTensor
    participant ensurePyTorchSymmMemBackend
    participant Communicator
    participant c10d_symm_mem as c10d::symmetric_memory

    Caller->>SymmetricTensor: allocate(sizes, dtype, device)
    SymmetricTensor->>ensurePyTorchSymmMemBackend: backend (PyTorchNccl/Nvshmem/Cuda)
    ensurePyTorchSymmMemBackend->>c10d_symm_mem: set_backend(name) [once]
    ensurePyTorchSymmMemBackend->>Communicator: getSymmMemGroupKey(kNccl)
    Communicator->>Communicator: getBackendForTeam(all_ranks, kNccl)
    Communicator-->>Communicator: register_process_group(team_key, pg)
    Communicator-->>ensurePyTorchSymmMemBackend: group_name
    ensurePyTorchSymmMemBackend->>c10d_symm_mem: resolve_process_group("0") [fallback alias]
    ensurePyTorchSymmMemBackend->>Communicator: barrier(kNccl)
    ensurePyTorchSymmMemBackend-->>SymmetricTensor: group_name
    SymmetricTensor->>c10d_symm_mem: empty_strided_p2p(sizes, strides, ...)
    c10d_symm_mem-->>SymmetricTensor: local_tensor (PyTorch-managed)
    SymmetricTensor-->>Caller: local_tensor

    Caller->>SymmetricTensor: setupRemoteHandles(tag)
    SymmetricTensor->>ensurePyTorchSymmMemBackend: backend
    ensurePyTorchSymmMemBackend-->>SymmetricTensor: group_name
    SymmetricTensor->>c10d_symm_mem: rendezvous(local_tensor, group_name)
    c10d_symm_mem-->>SymmetricTensor: torch_symm_handle_
    Note over SymmetricTensor: Sets is_multicast_setup_ if has_multicast_support()

    Caller->>SymmetricTensor: remoteTensor(rank)
    SymmetricTensor->>c10d_symm_mem: get_remote_tensor(rank, sizes, dtype)
    c10d_symm_mem-->>Caller: remote at::Tensor

Comments Outside Diff (1)

  1. csrc/multidevice/symmetric_tensor.cpp, line 620-631 (link)

    P1 setupMulticast silently succeeds even when multicast is not supported by the PyTorch handle

    When torch_symm_handle_ is already populated (e.g., from an earlier setupRemoteHandles call that found no multicast support, leaving is_multicast_setup_ == false), the function reaches the return at line 631 without ever throwing or setting is_multicast_setup_ = true. A subsequent call to multicastPtr() then hits the NVF_CHECK(is_multicast_setup_, "Multicast not setup") assertion, which is confusing because setupMulticast returned without error.

    When torch_symm_handle_ is set but is_multicast_setup_ is false, the function should explicitly fail:

    if (torch_symm_handle_) {
      NVF_CHECK(
          is_multicast_setup_,
          "PyTorch symmetric memory handle does not support multicast");
      return;
    }

Last reviewed commit: "Merge branch 'main' ..."

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 4 comments

fbuild.sh Outdated
Comment on lines +1 to +24
#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation
Contributor

Personal developer build script committed to repository

This script contains machine-specific, hardcoded toolchain paths that are unlikely to work anywhere except the author's development machine:

  • clang-20 and clang++-20 — not a standard compiler version available broadly
  • -fuse-ld=mold — requires the mold linker to be installed
  • /opt/hpcx/ucc and /opt/hpcx/ucx — HPC-X installation path specific to the author's environment
  • $BUILD_DIRECTORY is used but never validated; if it is unset, NVFUSER_BUILD_INSTALL_DIR and NVFUSER_BUILD_DIR will silently be empty strings, likely breaking the build

This kind of personal convenience script should live outside version control (e.g., in a .gitignore-d directory or in the author's home directory). Committing it to the main repo risks confusing other contributors and cluttering the root directory.

Comment on lines +46 to +72
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
Contributor

NCCL backend initialization is incomplete — register_process_group is never called

ensurePyTorchSymmMemBackend calls set_group_info but never calls c10d::register_process_group. According to the comment added to communicator.h for getWorldBackendIntrusivePtr:

Returns the world backend as an intrusive_ptr so it can be registered with c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL rendezvous, which resolves the group by name).

getWorldBackendIntrusivePtr was clearly introduced to supply the backend for this registration, yet the call to c10d::register_process_group is absent from ensurePyTorchSymmMemBackend. PyTorch's NCCL symmetric-memory rendezvous resolves the process group by name at the point it is called; without a prior register_process_group(kPyTorchSymmMemGroupName, ...), the NCCL backend path will fail to locate the group and throw at rendezvous time.

The missing call should be something like:

// After set_group_info, for NCCL backend:
c10d::register_process_group(
    kPyTorchSymmMemGroupName,
    comm.getWorldBackendIntrusivePtr(CommunicatorBackend::kNccl));

The fact that getWorldBackendIntrusivePtr was added in this exact PR but is never invoked strongly suggests this step was accidentally left out.

Comment on lines +125 to +163
// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
Contributor

Entire PyTorch backend correctness test is commented out

PyTorchBackend_RemoteAccessCorrectness is the only test that exercises the new PyTorch backend path end-to-end (allocation → rendezvous → remote access). Leaving it commented out means the three new backend variants (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) have zero test coverage in CI.

The comment says it should be run manually with NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl), but that means regressions in the PyTorch path will go undetected in normal CI runs.

If the test can't pass yet (e.g., because the NCCL register_process_group call is missing), that's a strong signal to fix the underlying issue rather than suppress the test. If the test is intentionally deferred, consider converting it into a proper GTEST_SKIP with an explanatory message so the intent is visible to reviewers and CI.

Comment on lines +150 to +152
std::vector<int64_t> strides(sizes.size());
strides.back() = 1;
for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {
Contributor

Undefined behavior when sizes is empty (0-dim tensor)

std::vector<int64_t> strides(sizes.size());
strides.back() = 1;   // UB if sizes is empty

std::vector::back() on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the PyTorch path that was just added adds a new callsite where callers may pass {} as sizes. A simple check is sufficient:

NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");

or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
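
The defensive alternative mentioned above can be sketched as a small standalone function. The name `contiguousStrides` is hypothetical and this is not the PR's code; it only illustrates computing row-major strides without ever calling back() on an empty vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides for `sizes`. A 0-dim tensor has no
// dimensions, so the loop body never runs and the result is an empty vector.
std::vector<std::int64_t> contiguousStrides(
    const std::vector<std::int64_t>& sizes) {
  std::vector<std::int64_t> strides(sizes.size());
  std::int64_t running = 1;
  // Walk from the innermost dimension outwards; `i-- > 0` avoids unsigned
  // underflow when sizes is empty.
  for (std::size_t i = sizes.size(); i-- > 0;) {
    strides[i] = running;
    running *= sizes[i];
  }
  return strides;
}
```

For sizes {2, 3, 4} this yields strides {12, 4, 1}, and for an empty sizes vector it yields an empty strides vector.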

@nsarka
Member

nsarka commented Mar 3, 2026

Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.

Comment on lines +46 to +72
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
Contributor

std::call_once exception-safety leaves set_backend in a permanently broken state on retry

std::call_once resets its once_flag if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls set_backend(name) before set_group_info(...). If set_backend succeeds but set_group_info subsequently throws (e.g., because the store is unavailable), once_flag is reset and the next allocate() call will attempt set_backend(name) a second time. PyTorch's symmetric memory layer is likely to throw on that second set_backend call (backend already configured), making it impossible to recover without restarting the process.

A straightforward mitigation is to separate the two calls into distinct phases or to wrap set_backend in its own protection:

// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);

Alternatively, split the once_flag so set_backend has its own dedicated guard that truly runs at most once, while set_group_info can retry on failure.
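
The "separate phases" idea can be sketched without any PyTorch dependency. The sketch below swaps std::call_once for a plain mutex with one completion flag per step; `TwoPhaseInit` is a hypothetical class and the two callables stand in for the set_backend and set_group_info calls.

```cpp
#include <functional>
#include <mutex>

// Hypothetical helper (not from the PR): runs two setup steps under one mutex
// and records completion of each step separately, so a failure in the second
// step does not cause the first step to run again on retry.
class TwoPhaseInit {
 public:
  void run(
      const std::function<void()>& set_backend,
      const std::function<void()>& set_group_info) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!backend_done_) {
      set_backend();
      backend_done_ = true;  // Reached only if set_backend did not throw.
    }
    if (!group_done_) {
      set_group_info();
      group_done_ = true;  // Reached only if set_group_info did not throw.
    }
  }

 private:
  std::mutex mutex_;
  bool backend_done_ = false;
  bool group_done_ = false;
};
```

If the second callable throws on the first attempt, a later call to run() skips the first callable and retries only the second.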

Comment on lines +504 to +511
void* SymmetricTensor::multicastPtr() const {
#ifdef NVFUSER_DISTRIBUTED
  if (py_symm_handle_) {
    return py_symm_handle_->has_multicast_support()
        ? py_symm_handle_->get_multicast_ptr()
        : nullptr;
  }
#endif
Contributor

multicastPtr() silently returns nullptr for PyTorch backend when multicast is not supported, which is inconsistent with the native path (which calls NVF_CHECK(is_multicast_setup_, "Multicast not setup")).

Any caller that does not check for nullptr before using the pointer will trigger a null pointer dereference / silent GPU fault rather than a clear diagnostic error.

Consider throwing or at least asserting instead of silently returning nullptr:

Suggested change
- void* SymmetricTensor::multicastPtr() const {
- #ifdef NVFUSER_DISTRIBUTED
-   if (py_symm_handle_) {
-     return py_symm_handle_->has_multicast_support()
-         ? py_symm_handle_->get_multicast_ptr()
-         : nullptr;
-   }
- #endif
+ void* SymmetricTensor::multicastPtr() const {
+ #ifdef NVFUSER_DISTRIBUTED
+   if (py_symm_handle_) {
+     NVF_CHECK(
+         py_symm_handle_->has_multicast_support(),
+         "Multicast not supported by the selected PyTorch symmetric memory backend.");
+     return py_symm_handle_->get_multicast_ptr();
+   }
+ #endif
+   NVF_CHECK(is_multicast_setup_, "Multicast not setup");
+   return mc_ptr_;
+ }

This brings the error contract in line with the native path, where multicastPtr() always either returns a valid pointer or throws.

Comment on lines +398 to +399
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
Contributor

getSymmetricMemoryBackend() is invoked twice in back-to-back lines, which redundantly re-parses the option string on each call. A single local variable should be used:

Suggested change
- if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
-   ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
+ SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
+ if (backend != SymmetricMemoryBackend::Native) {
+   ensurePyTorchSymmMemBackend(backend);

Comment on lines +20 to +28
TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  EXPECT_TRUE(
      backend == SymmetricMemoryBackend::Native ||
      backend == SymmetricMemoryBackend::PyTorchNccl ||
      backend == SymmetricMemoryBackend::PyTorchNvshmem ||
      backend == SymmetricMemoryBackend::PyTorchCuda)
      << "getSymmetricMemoryBackend() returned an invalid backend value";
}
Contributor

GetSymmetricMemoryBackend_ReturnsValidBackend test is trivially true. Every branch of getSymmetricMemoryBackend() explicitly returns one of the four enum values listed in the EXPECT_TRUE condition, so there is no code path that could return a fifth value. This test can never fail and provides no meaningful coverage.

If the intent is to document the valid values, a static assertion in ipc_utils.cpp would be more appropriate. If the intent is to test that the env-var parsing correctly maps strings to enum values, the test should set up specific NVFUSER_ENABLE strings and assert the exact expected enum variant (e.g., set pytorch_nccl and assert PyTorchNccl).

Collaborator

@samnordmann samnordmann left a comment

Thank you! Some minor comments.
Please add a test, fix the linter errors, and run the CI with the !test command (comment directly on the PR).

- name: Run lintrunner

// Symmetric memory backend and option tests
// -----------------------------------------------------------------------------

TEST_F(SymmetricTensorTest, GetSymmetricMemoryBackend_ReturnsValidBackend) {
Collaborator

not a useful test

}
}

// Same remote-access correctness as BasicAllocation but only runs when
Collaborator

This is the only test, but it is commented out. Either remove it or uncomment it. One idea would be to reuse the pre-existing tests and parametrize them with the new backends.

// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
// - PyTorch (Nccl, Nvshmem, Cuda): Use PyTorch's symmetric memory
// (torch.distributed._symmetric_memory) with the chosen transport backend.
// Select via NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl|pytorch_nvshmem|pytorch_cuda).
Collaborator

The selection should also cover native and include it as an option.

// further fragment the memory. On the other hand, having our own implementation
// allows us to experiment more advanced features like contigous view creation.
// Backends (see SymmetricMemoryBackend in ipc_utils.h):
// - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
Collaborator

Suggested change
- // - Native (default): Fuser's own CUDA VMM + IPC implementation; maintained.
+ // - Native (default): Fuser's own CUDA VMM + IPC implementation.

Comment on lines +88 to +89
// When set, remote/multicast APIs delegate to PyTorch symmetric memory.
c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> py_symm_handle_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The py_ prefix wrongly suggests Python.
I am not sure I understand the comment.

Suggested change
- // When set, remote/multicast APIs delegate to PyTorch symmetric memory.
- c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> py_symm_handle_;
+ c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory> symm_handle_;

#ifdef NVFUSER_DISTRIBUTED
// PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
if (getSymmetricMemoryBackend() != SymmetricMemoryBackend::Native) {
  ensurePyTorchSymmMemBackend(getSymmetricMemoryBackend());
Collaborator

has already been called in the constructor

Comment on lines +522 to +523
NVF_ERROR(
false,
Collaborator

Suggested change
- NVF_ERROR(
-     false,
+ NVF_THROW(

return store_.get();
}

#ifdef NVFUSER_DISTRIBUTED
Collaborator

Why do we need a guard here?


#ifdef NVFUSER_DISTRIBUTED
#include <torch/csrc/distributed/c10d/Backend.hpp>
#include <torch/csrc/distributed/c10d/Store.hpp>
Collaborator

not needed

Comment on lines +129 to +137
// Returns the store as an intrusive_ptr for use with PyTorch symmetric
// memory (c10d::symmetric_memory::set_group_info).
c10::intrusive_ptr<c10d::Store> getStore() const;

// Returns the world backend as an intrusive_ptr so it can be registered with
// c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL
// rendezvous, which resolves the group by name).
c10::intrusive_ptr<c10d::Backend> getWorldBackendIntrusivePtr(
    std::optional<CommunicatorBackend> backend = std::nullopt);
Collaborator

rather, change the signature of the existing getter method to return intrusive_ptr instead of raw pointer

Comment on lines +461 to +468
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
  return getTeamKey(all_ranks, b);
}
Contributor

getSymmMemGroupKey returns key without "symm_mem_" prefix — mismatch with registered process group

getBackendForTeam(all_ranks, b, "symm_mem_") registers the process group under the key "symm_mem_" + getTeamKey(all_ranks, b) (see the register_process_group call in that function). However, getSymmMemGroupKey then returns just getTeamKey(all_ranks, b) — without the "symm_mem_" prefix.

The returned key is subsequently used in ensurePyTorchSymmMemBackend as the group_name passed to both set_group_info and rendezvous. Newer NCCL builds resolve the process group by name at rendezvous time; they will look for a process group registered as "nccl0,1,..." but only "symm_mem_nccl0,1,..." exists, causing rendezvous to fail.

The current workaround that registers under "0" papers over this for older NCCL, but the mismatch will surface as soon as the TODO comment is resolved and older-NCCL special-casing is removed.

The return statement should return the full team_key including the prefix:

Suggested change
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  const std::string prefix = "symm_mem_";
  (void)getBackendForTeam(all_ranks, b, prefix);
  return prefix + getTeamKey(all_ranks, b);
}
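
The mismatch described above can be reproduced in isolation. This is a minimal sketch, assuming a plain std::map stands in for the process-group registry; `Registry`, `teamKey`, `registerTeam`, and `lookup` are hypothetical names for illustration, not nvFuser or PyTorch APIs.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a process-group registry (name -> group id).
using Registry = std::map<std::string, int>;

// Builds a key such as "nccl0,1,2" from a backend name and the team's ranks.
std::string teamKey(const std::string& backend, const std::vector<int>& ranks) {
  std::string key = backend;
  for (std::size_t i = 0; i < ranks.size(); ++i) {
    if (i > 0) {
      key += ",";
    }
    key += std::to_string(ranks[i]);
  }
  return key;
}

// Registers the team under prefix + key and returns the name that was used.
std::string registerTeam(
    Registry& registry,
    const std::string& prefix,
    const std::string& backend,
    const std::vector<int>& ranks) {
  const std::string name = prefix + teamKey(backend, ranks);
  const int id = static_cast<int>(registry.size());
  registry[name] = id;
  return name;
}

// A by-name lookup succeeds only with exactly the registered name.
bool lookup(const Registry& registry, const std::string& name) {
  return registry.count(name) > 0;
}
```

After registering ranks {0, 1, 2} with the prefix "symm_mem_", a lookup with the bare `teamKey` result fails, while a lookup with the prefixed name succeeds. Returning the registered name from one place, as `registerTeam` does here, is one way to keep the two from drifting apart.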

Comment on lines +142 to +144
c10::intrusive_ptr<c10d::Store> getStore() const {
return c10::intrusive_ptr<c10d::Store>(store_);
}

getStore() uses non-idiomatic intrusive_ptr construction

c10::intrusive_ptr<c10d::Store>(store_) passes the raw TCPStore* obtained from store_ (via the implicit operator T* of intrusive_ptr) to a new intrusive_ptr<Store>. This calls the unsafe intrusive_ptr<T>(T*, bool) constructor that takes an already-retained raw pointer — but store_ is managed and this path risks a ref-count imbalance.

The idiomatic way is to let the intrusive_ptr copy-conversion handle it:

Suggested change
c10::intrusive_ptr<c10d::Store> getStore() const {
  return store_;
}
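
As an analogy only, the copy-conversion idea can be sketched with std::shared_ptr standing in for c10::intrusive_ptr (the two differ in where the reference count lives, so this does not demonstrate c10 behaviour). `Store`, `TcpStore`, and `Holder` are hypothetical types for illustration.

```cpp
#include <memory>

// Hypothetical base and derived store types.
struct Store {
  virtual ~Store() = default;
};
struct TcpStore : Store {};

// Hypothetical holder mirroring the shape of the getter under review.
class Holder {
 public:
  Holder() : store_(std::make_shared<TcpStore>()) {}

  // Implicit derived-to-base converting copy: shares ownership with store_.
  std::shared_ptr<Store> getStore() const {
    return store_;
  }

  // Number of owners currently sharing the store.
  long useCount() const {
    return store_.use_count();
  }

 private:
  std::shared_ptr<TcpStore> store_;
};
```

Each pointer returned by `getStore()` adds one owner while it is alive and releases it on destruction, so the count returns to its starting value without any manual bookkeeping.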

Comment on lines +405 to +418
if(is_multicast_setup_==false) {
SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
if (backend != SymmetricMemoryBackend::Native) {
const std::string group_name = ensurePyTorchSymmMemBackend(backend);
torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
local_tensor_, group_name);
are_remote_tensors_setup_ = true;
if (torch_symm_handle_->has_multicast_support()) {
is_multicast_setup_ = true;
mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
}
return;
}
}

P1 if(is_multicast_setup_==false) guard is dead code for PyTorch backend

is_multicast_setup_ is never set to true before setupRemoteHandles is called on the PyTorch path: setupMulticast returns unconditionally at line ~615 when torch_symm_handle_ is set, so is_multicast_setup_ remains false. The outer guard is therefore always true and provides no real protection.

Conversely, if any future code path set is_multicast_setup_ = true before this point, the guard would make the rendezvous code unreachable. The intent, "skip rendezvous if multicast is already fully set up", is already achieved by the are_remote_tensors_setup_ early return at the top of the function, not by this inner guard.

Consider removing this redundant outer condition to make the control flow clearer:

#ifdef NVFUSER_DISTRIBUTED
  // PyTorch backend: perform rendezvous here (lazy, on first setupRemoteHandles).
  SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
  if (backend != SymmetricMemoryBackend::Native) {
    const std::string group_name = ensurePyTorchSymmMemBackend(backend);
    torch_symm_handle_ = c10d::symmetric_memory::rendezvous(
        local_tensor_, group_name);
    are_remote_tensors_setup_ = true;
    if (torch_symm_handle_->has_multicast_support()) {
      is_multicast_setup_ = true;
      mc_ptr_ = torch_symm_handle_->get_multicast_ptr();
    }
    return;
  }
#endif
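
The control flow being suggested can be sketched without any distributed dependencies. This assumes a single "already set up" flag is the only idempotency guard; `LazyHandles` and its members are hypothetical stand-ins, and the counter merely takes the place of the rendezvous call.

```cpp
// Hypothetical stand-in for a tensor whose remote handles are set up lazily.
class LazyHandles {
 public:
  explicit LazyHandles(bool has_multicast) : has_multicast_(has_multicast) {}

  void setupRemoteHandles() {
    if (remote_setup_) {
      return;  // The single guard that makes repeated calls a no-op.
    }
    ++rendezvous_calls_;  // Stands in for the rendezvous call.
    remote_setup_ = true;
    if (has_multicast_) {
      multicast_setup_ = true;  // Multicast state is derived, never a guard.
    }
  }

  int rendezvousCalls() const {
    return rendezvous_calls_;
  }
  bool multicastSetup() const {
    return multicast_setup_;
  }

 private:
  bool has_multicast_;
  bool remote_setup_ = false;
  bool multicast_setup_ = false;
  int rendezvous_calls_ = 0;
};
```

Calling `setupRemoteHandles()` twice performs the stand-in rendezvous once, whether or not multicast is available, which is the behaviour the redundant outer condition was presumably meant to protect.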

Comment on lines +537 to +541
NVF_THROW(
false,
"Contiguous view is not yet supported for PyTorch symmetric memory backend. "
"Use native backend for SymmetricContiguousView.");
}

P1 NVF_THROW with false as first argument produces a garbled error message

NVF_THROW(...) is an unconditional throw whose variadic arguments are all concatenated into the error message via to_str(__VA_ARGS__). Passing false as the first argument does not act as a condition — it is serialised as part of the message (e.g. "0Contiguous view is not yet...") by to_str. This makes the resulting error message confusing and hard to read in diagnostics.

The same pattern is used again in getContiguousView (lines 607–611).
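
The effect of stream-style concatenation on a leading `false` can be shown in isolation. This is a sketch under the assumption that the macro's arguments are all streamed into one string, as the comment describes; `toStr` below is a hypothetical helper written for this illustration, not nvFuser's implementation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: streams every argument into one string, in order.
template <typename... Args>
std::string toStr(const Args&... args) {
  std::ostringstream oss;
  // Expand the pack left to right; the leading 0 keeps the array non-empty.
  int expand[] = {0, ((oss << args), 0)...};
  (void)expand;
  return oss.str();
}
```

With this helper, `toStr(false, "Contiguous view ...")` starts with the character `0`, because a bool is streamed as an integer unless std::boolalpha is set, while `toStr("Contiguous view ...")` yields the message unchanged.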

Use NVF_THROW with only the message string, or use the established NVF_ERROR(false, "msg") pattern that is already used elsewhere in this file (e.g. line 74):

Suggested change
  NVF_THROW(
      "Contiguous view is not yet supported for PyTorch symmetric memory backend. "
      "Use native backend for SymmetricContiguousView.");
}

Comment on lines +41 to +43
case SymmetricMemoryBackend::PyTorchCuda:
name = "CUDA";
break;

P1 set_backend is never called for the PyTorchCuda backend

For PyTorchNccl and PyTorchNvshmem, c10d::symmetric_memory::set_backend(name) is called inside the call_once lambda. For PyTorchCuda, name is assigned "CUDA" but set_backend is never invoked. If PyTorch's symmetric-memory layer requires an explicit set_backend call before allocating with a CUDA transport, every empty_strided_p2p call on the CUDA path will either use whatever backend was previously configured (potentially NCCL or NVSHMEM) or fail silently at rendezvous time.

If PyTorchCuda truly requires no set_backend call (e.g., because "CUDA" is the implicit default), please add a comment explaining this so future maintainers don't perceive it as an oversight. Otherwise, add the missing call:

case SymmetricMemoryBackend::PyTorchCuda:
  name = "CUDA";
  c10d::symmetric_memory::set_backend(name);
  break;

if (communicator_->size() == 1) {
GTEST_SKIP() << "Skipping test for single device";
}
std::cout << "Vishal chishta" << std::endl;

P1 Debug print statement must be removed

std::cout << "Vishal chishta" << std::endl; is an accidental debug line that will pollute test output for all CI runs of SmallAllocation. This should be removed before merging.

Suggested change (delete this line)
std::cout << "Vishal chishta" << std::endl;

Comment on lines +53 to +56
if (backend != SymmetricMemoryBackend::Native) {
NVF_CHECK(
comm.isBackendAvailable(CommunicatorBackend::kNccl),
"NCCL backend is required for symmetric_memory_backend(nccl)");

P1 NCCL availability check incorrectly required for all PyTorch backends

isBackendAvailable(CommunicatorBackend::kNccl) is checked unconditionally for every non-Native backend — including PyTorchNvshmem and PyTorchCuda. If those backends don't actually require an NCCL process group (e.g., NVSHMEM uses its own transport), this check will spuriously reject them on systems where NCCL is unavailable.

Additionally, the error message hardcodes "(nccl)" even when the active backend is NVSHMEM or CUDA, which will confuse users:

"NCCL backend is required for symmetric_memory_backend(nccl)"
// fired even when NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nvshmem)

Consider guarding the NCCL check only for PyTorchNccl, and adjusting the error message dynamically:

if (backend == SymmetricMemoryBackend::PyTorchNccl) {
  NVF_CHECK(
      comm.isBackendAvailable(CommunicatorBackend::kNccl),
      "NCCL backend is required for symmetric_memory_backend(pytorch_nccl)");
}
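
A dependency-free sketch of the per-backend check, with the option name taken from the active backend rather than hardcoded. It assumes only PyTorchNccl needs an NCCL process group, which the comment above raises as a possibility and does not confirm. `SymmBackend`, `optionName`, and `checkRequirements` are hypothetical, and the "native" and "pytorch_cuda" spellings are guesses by analogy with "pytorch_nccl" and "pytorch_nvshmem".

```cpp
#include <string>

// Hypothetical mirror of the backend enum discussed in this review.
enum class SymmBackend { Native, PyTorchNccl, PyTorchNvshmem, PyTorchCuda };

// Option spelling for each backend, used to build accurate error messages.
// "native" and "pytorch_cuda" are assumed spellings.
std::string optionName(SymmBackend backend) {
  switch (backend) {
    case SymmBackend::Native:
      return "native";
    case SymmBackend::PyTorchNccl:
      return "pytorch_nccl";
    case SymmBackend::PyTorchNvshmem:
      return "pytorch_nvshmem";
    case SymmBackend::PyTorchCuda:
      return "pytorch_cuda";
  }
  return "unknown";
}

// Returns an empty string when requirements are met, else an error message.
std::string checkRequirements(SymmBackend backend, bool nccl_available) {
  // Assumption: only the NCCL-based backend needs an NCCL process group.
  if (backend == SymmBackend::PyTorchNccl && !nccl_available) {
    return "NCCL backend is required for symmetric_memory_backend(" +
        optionName(backend) + ")";
  }
  return "";
}
```

If NVSHMEM or CUDA turn out to need NCCL as well, the condition would widen, but deriving the message from `optionName(backend)` keeps the reported option name correct either way.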

Comment on lines +29 to +47
static std::once_flag once;
std::call_once(once, [backend]() {
const char* name = nullptr;
switch (backend) {
case SymmetricMemoryBackend::PyTorchNccl:
name = "NCCL";
c10d::symmetric_memory::set_backend(name);
break;
case SymmetricMemoryBackend::PyTorchNvshmem:
name = "NVSHMEM";
c10d::symmetric_memory::set_backend(name);
break;
case SymmetricMemoryBackend::PyTorchCuda:
name = "CUDA";
break;
default:
NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
}
});

P1 Static once_flag binds to whichever backend is passed first — silently ignores later backends

once is a static std::once_flag, so set_backend(name) is called exactly once for the lifetime of the process. If the flag fires on the first call (e.g., PyTorchCuda), a later call with PyTorchNccl won't call set_backend("NCCL") at all — the wrong (or absent) backend will silently remain active.

In practice a single process shouldn't mix backends, but the current structure provides no error if it does. The typical guard is to also capture the name into a static and assert consistency on subsequent calls:

// Static storage duration: visible inside the lambda without being captured.
static std::string configured_name;
std::call_once(once, [backend]() {
  // ... set backend and populate configured_name
});
// expected_name: the backend name implied by `backend` on this call.
NVF_CHECK(
    configured_name == expected_name,
    "symmetric memory backend already configured as '", configured_name,
    "', cannot reconfigure to '", expected_name, "'");

Or, at minimum, document that mixing backends within a process is undefined behaviour.
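
The consistency check can be sketched in standalone form. Note the swap: this version uses a function-local static (initialized exactly once, thread-safe since C++11) instead of std::once_flag, purely to keep the example free of threading dependencies. `PtBackend`, `backendName`, and `ensureBackend` are hypothetical names, and the place where a real implementation would configure the backend is only marked by a comment.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical mirror of the PyTorch-backed options discussed above.
enum class PtBackend { PyTorchNccl, PyTorchNvshmem, PyTorchCuda };

// Maps each backend to the name it would be configured under.
const char* backendName(PtBackend backend) {
  switch (backend) {
    case PtBackend::PyTorchNccl:
      return "NCCL";
    case PtBackend::PyTorchNvshmem:
      return "NVSHMEM";
    case PtBackend::PyTorchCuda:
      return "CUDA";
  }
  return "UNKNOWN";
}

// Configures the backend on the first call; later calls must request the
// same backend or an exception is thrown.
const std::string& ensureBackend(PtBackend backend) {
  // Initialized once, on the first call only. A real implementation would
  // perform its one-time backend configuration here.
  static const std::string configured = backendName(backend);
  if (configured != backendName(backend)) {
    throw std::runtime_error(
        "symmetric memory backend already configured as '" + configured +
        "', cannot reconfigure to '" + backendName(backend) + "'");
  }
  return configured;
}
```

The first call fixes the configured name for the lifetime of the process; a later call with a different backend throws instead of silently keeping the earlier configuration.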

Comment on lines +461 to +468
std::string Communicator::getSymmMemGroupKey(
std::optional<CommunicatorBackend> backend) {
std::vector<RankType> all_ranks(size_);
std::iota(all_ranks.begin(), all_ranks.end(), 0);
CommunicatorBackend b = backend.value_or(default_backend_);
(void)getBackendForTeam(all_ranks, b);
return getTeamKey(all_ranks, b);
}

P2 getSymmMemGroupKey body is incorrectly indented and missing trailing newline

The function body uses column-0 indentation, inconsistent with every other member function in this file. All statements should be indented at the standard 2-space level. Additionally, the file is missing a trailing newline (shown by \ No newline at end of file in the diff).

Suggested change
std::string Communicator::getSymmMemGroupKey(
    std::optional<CommunicatorBackend> backend) {
  std::vector<RankType> all_ranks(size_);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  CommunicatorBackend b = backend.value_or(default_backend_);
  (void)getBackendForTeam(all_ranks, b);
  return getTeamKey(all_ranks, b);
}
