Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Runtime Error
Component
cuda.core
Describe the bug
When DeviceMemoryResource is created without options (e.g., DeviceMemoryResource(dev)), it wraps the default device memory pool. This is a non-owned pool that is shared across all such references.
Currently, cuda.core initializes the internal _peer_accessible_by tracking variable to () (empty tuple), assuming no peer access. However, the actual driver-side peer access state of the default pool may differ if:
- Other code has modified peer access on the shared default pool
- Previous operations in the same process modified peer access and didn't clean up
- The Python tests use the shared pool and leave it in a modified state
This causes a mismatch between cuda.core's tracked state and the actual driver state, leading to:
- Incorrect peer_accessible_by property values
- Unexpected behavior when setting peer access (no-op if we think we're already in the target state)
- Test failures that depend on a clean initial peer access state
How to Reproduce
from cuda.core import Device, DeviceMemoryResource
dev = Device(0)
# Create a DMR with the default pool and modify peer access
dmr1 = DeviceMemoryResource(dev)
dmr1.peer_accessible_by = (1,) # Enable peer access for device 1
# Create another DMR with the same default pool
dmr2 = DeviceMemoryResource(dev)
# dmr2._peer_accessible_by is (), but actual driver state has peer access for device 1
print(dmr2.peer_accessible_by)  # Returns () -- WRONG! Should reflect actual state
Expected behavior
When wrapping a non-owned pool (the default device memory pool), DeviceMemoryResource should lazily query the CUDA driver to determine the actual peer access state.
This could be done using cuMemPoolGetAccess to query each peer device's access permissions on pool initialization.
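The lazy-query idea can be sketched without a GPU by modelling the driver state in plain Python. This is only an illustration under stated assumptions, not cuda.core's implementation: FakeDriverPool, LazyPeerAccess, and the num_devices parameter are hypothetical names, and get_access stands in for what would be a cuMemPoolGetAccess call per peer device.

```python
class FakeDriverPool:
    """Hypothetical stand-in for the driver-side state of a shared default pool."""

    def __init__(self):
        self._peers = set()

    def set_access(self, peer_id, enabled):
        # Stand-in for a driver call that changes peer access on the pool.
        if enabled:
            self._peers.add(peer_id)
        else:
            self._peers.discard(peer_id)

    def get_access(self, peer_id):
        # Stand-in for a cuMemPoolGetAccess query for one peer device.
        return peer_id in self._peers


class LazyPeerAccess:
    """Hypothetical wrapper: owned pools start clean, non-owned pools ask the driver."""

    def __init__(self, pool, device_id, num_devices, owned):
        self._pool = pool
        self._device_id = device_id
        self._num_devices = num_devices
        # None means "unknown until queried"; only an owned pool is known to be ().
        self._cached = () if owned else None

    def _query_driver(self):
        # Ask the (fake) driver about every other device.
        return tuple(
            peer
            for peer in range(self._num_devices)
            if peer != self._device_id and self._pool.get_access(peer)
        )

    @property
    def peer_accessible_by(self):
        if self._cached is None:
            self._cached = self._query_driver()
        return self._cached

    @peer_accessible_by.setter
    def peer_accessible_by(self, peers):
        target = set(peers)
        current = set(self.peer_accessible_by)  # triggers the lazy query if needed
        for peer in target - current:
            self._pool.set_access(peer, True)
        for peer in current - target:
            self._pool.set_access(peer, False)
        self._cached = tuple(sorted(target))


# Two wrappers around the same shared (non-owned) pool.
shared_pool = FakeDriverPool()
dmr1 = LazyPeerAccess(shared_pool, device_id=0, num_devices=4, owned=False)
dmr1.peer_accessible_by = (1,)

dmr2 = LazyPeerAccess(shared_pool, device_id=0, num_devices=4, owned=False)
observed = dmr2.peer_accessible_by  # reflects the state left behind by dmr1
```

Note that caching after the first query can still go stale if another wrapper changes the pool later; querying on every access would avoid that at the cost of extra driver calls.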
Workaround
Use owned pools by specifying options:
from cuda.core import Device, DeviceMemoryResource, DeviceMemoryResourceOptions
dev = Device(0)
dmr = DeviceMemoryResource(dev, DeviceMemoryResourceOptions())
This creates an owned pool with a known clean initial state.
Operating System
N/A (affects all platforms)
nvidia-smi output
N/A