[BUG]: DeviceMemoryResource should query driver for peer access state on non-owned pools #1720

@Andy-Jost

Type of Bug

Runtime Error

Component

cuda.core

Describe the bug

When DeviceMemoryResource is created without options (e.g., DeviceMemoryResource(dev)), it wraps the default device memory pool. This is a non-owned pool that is shared across all such references.

Currently, cuda.core initializes the internal _peer_accessible_by tracking variable to () (empty tuple), assuming no peer access. However, the actual driver-side peer access state of the default pool may differ if:

  1. Other code has modified peer access on the shared default pool
  2. Previous operations in the same process modified peer access and didn't clean up
  3. The Python tests use the shared pool and leave it in a modified state

This causes a mismatch between cuda.core's tracked state and the actual driver state, leading to:

  • Incorrect peer_accessible_by property values
  • Unexpected behavior when setting peer access (the setter is a no-op when the tracked state already matches the target state, even if the driver state differs)
  • Test failures that depend on a clean initial peer access state
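
To make the no-op case concrete, here is a small pure-Python model of the failure mode. It is only a sketch: `FakeDefaultPool` and `CachingResource` are hypothetical stand-ins invented for illustration, not cuda.core classes, and the setter logic is an assumption based on the behavior described above.

```python
class FakeDefaultPool:
    """Hypothetical stand-in for the shared, driver-side default pool."""

    def __init__(self):
        self.access = set()  # device IDs the "driver" currently grants access to


class CachingResource:
    """Hypothetical wrapper that trusts its own cache instead of asking the driver."""

    def __init__(self, pool):
        self._pool = pool
        self._peer_accessible_by = ()  # assumed clean, never queried

    @property
    def peer_accessible_by(self):
        return self._peer_accessible_by

    @peer_accessible_by.setter
    def peer_accessible_by(self, devices):
        target = tuple(sorted(set(devices)))
        if target == self._peer_accessible_by:
            return  # no-op: the cache already matches the requested state
        self._pool.access = set(target)
        self._peer_accessible_by = target


pool = FakeDefaultPool()

r1 = CachingResource(pool)
r1.peer_accessible_by = (1,)        # the shared pool now grants access to device 1

r2 = CachingResource(pool)          # second wrapper around the same pool
stale_view = r2.peer_accessible_by  # cache says no peers

r2.peer_accessible_by = ()          # intended to revoke access, but skipped as a no-op
still_enabled = 1 in pool.access    # the pool still grants access to device 1
```

In this model the second wrapper both reports the wrong state and fails to revoke access, which matches the first two symptoms listed above.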

How to Reproduce

from cuda.core import Device, DeviceMemoryResource

dev = Device(0)

# Create a DMR with the default pool and modify peer access
dmr1 = DeviceMemoryResource(dev)
dmr1.peer_accessible_by = (1,)  # Enable peer access for device 1

# Create another DMR with the same default pool
dmr2 = DeviceMemoryResource(dev)

# dmr2._peer_accessible_by is (), but the driver has peer access enabled for device 1
print(dmr2.peer_accessible_by)  # Prints () -- WRONG! Should reflect the actual driver state

Expected behavior

When wrapping a non-owned pool (the default device memory pool), DeviceMemoryResource should lazily query the CUDA driver to determine the actual peer access state.

This could be done using cuMemPoolGetAccess to query each peer device's access permissions on pool initialization.
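
A minimal sketch of what such a query could look like, assuming the low-level `cuda.bindings.driver` module is used. The helper names `peer_accessible_by_from_driver` and `make_driver_query` are hypothetical, and the driver-backed part is untested here; it assumes `pool` is a valid `CUmemoryPool` handle and that only read-write access counts as peer access.

```python
from typing import Callable, Iterable, Tuple


def peer_accessible_by_from_driver(
    owner_id: int,
    device_ids: Iterable[int],
    has_access: Callable[[int], bool],
) -> Tuple[int, ...]:
    """Return the sorted IDs of peer devices reported as having access to the pool."""
    return tuple(sorted(d for d in device_ids if d != owner_id and has_access(d)))


def make_driver_query(pool) -> Callable[[int], bool]:
    """Build a has_access callable backed by cuMemPoolGetAccess.

    Assumption: requires cuda.bindings, an initialized driver, and a valid pool handle.
    """
    from cuda.bindings import driver  # imported lazily so the helper above stays CUDA-free

    def has_access(device_id: int) -> bool:
        loc = driver.CUmemLocation()
        loc.type = driver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
        loc.id = device_id
        err, flags = driver.cuMemPoolGetAccess(pool, loc)
        if err != driver.CUresult.CUDA_SUCCESS:
            raise RuntimeError(f"cuMemPoolGetAccess failed: {err}")
        return flags == driver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE

    return has_access


# Demonstration with a fake query, so the logic can be checked without a GPU.
fake_driver_state = {1: True, 2: False}
result = peer_accessible_by_from_driver(
    owner_id=0,
    device_ids=range(3),
    has_access=lambda d: fake_driver_state.get(d, False),
)
```

Separating the enumeration logic from the driver call keeps the former testable on machines without multiple GPUs; on real hardware, `make_driver_query(pool)` would be passed as `has_access`.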

Workaround

Use owned pools by specifying options:

from cuda.core import Device, DeviceMemoryResource, DeviceMemoryResourceOptions

dev = Device(0)
dmr = DeviceMemoryResource(dev, DeviceMemoryResourceOptions())

This creates an owned pool with a known clean initial state.

Operating System

N/A (affects all platforms)

nvidia-smi output

N/A

Labels

  • P1: Medium priority - Should do
  • bug: Something isn't working
  • cuda.core: Everything related to the cuda.core module
