Fix PinnedMemoryResource IPC NUMA ID derivation #1699
Andy-Jost wants to merge 8 commits into NVIDIA:main from
Conversation
…ce-specific concerns

Move _dev_id, device_id, and peer_accessible_by from _MemPool into DeviceMemoryResource. Eliminate _MemPoolOptions and refactor pool initialization into freestanding cdef functions (MP_init_create_pool, MP_init_current_pool, MP_raise_release_threshold) for cross-module visibility. Extract __init__ bodies into inline cdef helpers (_DMR_init, _PMR_init, _MMR_init) for consistency and shorter class definitions. Implement device_id as -1 for PinnedMemoryResource and ManagedMemoryResource since they are not device-bound.

Made-with: Cursor
…IDIA#1603)

PinnedMemoryResource(ipc_enabled=True) hardcoded host NUMA ID 0, causing failures on multi-NUMA systems where the active device is attached to a different NUMA node. Now derives the NUMA ID from the current device's host_numa_id attribute, and adds an explicit numa_id option for manual override. Removes the _check_numa_nodes warning machinery in favor of proper NUMA node selection.

Made-with: Cursor
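In rough terms, the resolution order described above might look like the following sketch. The function name `resolve_numa_id` and its arguments are illustrative stand-ins, not the actual cuda.core internals.

```python
# Hedged sketch of the described behavior; names here are hypothetical.
def resolve_numa_id(ipc_enabled, numa_id, device_host_numa_id):
    """Explicit numa_id wins; an IPC-enabled pool otherwise derives the
    node from the active device (previously hardcoded to 0)."""
    if numa_id is not None:
        if numa_id < 0:
            raise ValueError(f"numa_id must be non-negative, got {numa_id}")
        return numa_id  # explicit override
    if ipc_enabled:
        # Derived from the current device's host_numa_id, per the fix above.
        return device_host_numa_id
    return None  # without IPC, NUMA placement is not required
```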
/ok to test e55a26b
```cython
@property
def device_id(self) -> int:
    """Return -1. Managed memory migrates automatically and is not tied to a specific device."""
    return -1
```
Q: Wouldn't this be a breaking change? In the old implementation, device_id was used to initialize _MemPool._dev_id, which backs .device_id, but the new implementation returns .device_id = -1 unconditionally. I understand we meant to say the pages are migratable (not pinned), but maybe there is a better way to restore the capability of querying the preferred location?
I didn't mean for this to be reviewed yet. Now marked as Draft.
In the old API, device_id has an overloaded meaning. For DeviceMemoryResource, it refers to the device owning memory. In ManagedMemoryResource, it indicates the preferred location if that location is a device. That is not expressive enough because there are four possible types of preferred location:
- A device
- The host
- A host NUMA node
- None
I'm still considering options, but perhaps we should add a new property, preferred_location, for managed pools. There is, unfortunately, no uniform way to refer to all preferred locations, so we might need to resort to something like the following:
```python
ManagedMemoryResource.preferred_location ->
    ('device_id', 0)   # device 0
    ('host', None)     # host preferred
    ('numa_id', 0)     # NUMA node 0
    None               # no preferred location
```
I'm not in a position to say whether we need to retain backwards compatibility for this attribute. Do you think so? If yes, then I can retain the original meaning, possibly with a deprecation warning.
I created #1710 to track this as a separate issue, but I'm including it in this PR.
I restored the old behavior where MMR.device_id returns the preferred location if it is a device and -1 otherwise. I'd prefer to unconditionally return -1 if the breaking change is tolerable. A middle ground is to issue a deprecation warning. What do you think @leofang?
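The deprecation-warning middle ground mentioned here could look roughly like this sketch. The class name, warning message, and constructor are hypothetical; only the return behavior (preferred device ordinal or -1) comes from the discussion above.

```python
import warnings

class ManagedMemoryResourceSketch:
    """Illustrative stand-in, not the real ManagedMemoryResource."""

    def __init__(self, preferred_device=None):
        self._preferred_device = preferred_device  # device ordinal or None

    @property
    def device_id(self):
        # Middle ground: keep the old behavior but warn callers.
        warnings.warn(
            "device_id on managed pools is deprecated; "
            "query the preferred location instead",
            DeprecationWarning, stacklevel=2)
        # Old behavior: preferred device ordinal if set, else -1.
        return -1 if self._preferred_device is None else self._preferred_device
```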
/ok to test 11d22a2
Force-pushed 11d22a2 to 96b663a
/ok to test 96b663a
Force-pushed 96b663a to ff4ee01
/ok to test ff4ee01
Force-pushed ff4ee01 to 363ddb7
/ok to test 363ddb7
…yResource
Extends ManagedMemoryResourceOptions with a preferred_location_type field
("device", "host", "host_numa", or None) enabling NUMA-aware managed memory
pool placement. Adds ManagedMemoryResource.preferred_location property to
query the resolved setting. Fully backwards-compatible: existing code using
preferred_location alone continues to work unchanged.
Made-with: Cursor
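The resolution described in this commit might be sketched as follows. The function name echoes the `_resolve_preferred_location` helper mentioned later in the PR summary, but the signature, return shape, and error messages here are assumptions, not the actual implementation.

```python
# Hedged sketch: resolve a (preferred_location, preferred_location_type)
# option pair into a tagged value, per the description above.
_VALID_TYPES = ("device", "host", "host_numa", None)

def resolve_preferred_location(preferred_location, preferred_location_type):
    if preferred_location_type not in _VALID_TYPES:
        raise ValueError(
            f"invalid preferred_location_type: {preferred_location_type!r}")
    if preferred_location_type is None:
        # Legacy behavior: an int is a device ordinal, None means no preference.
        if preferred_location is None:
            return None
        return ("device", preferred_location)
    if preferred_location_type == "host":
        return ("host", None)
    # "device" and "host_numa" both need an ordinal / NUMA node id.
    if preferred_location is None:
        raise ValueError(
            f"{preferred_location_type!r} requires preferred_location")
    return (preferred_location_type, preferred_location)
```

This keeps the backwards-compatible path (type omitted) identical to the legacy behavior while making the four-way distinction explicit.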
Force-pushed 363ddb7 to 2dd3d1a
/ok to test 2dd3d1a
```cython
def __dealloc__(self):
    try:
        self.close()
```
You're not supposed to call methods in __dealloc__: https://cython.readthedocs.io/en/stable/src/userguide/special_methods.html#finalization-methods-dealloc-and-del
In particular, don’t call any other methods of the object or do anything which might cause the object to be resurrected. It’s best if you stick to just deallocating C data.
_peer_accessible_by is a Python object. _DMR_set_peer_accessible_by looks like it's modifying the object based on the way it's being invoked.
Sounds like a job for __del__ and also adding a context manager.
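In plain Python terms, the suggested shape might look like the generic sketch below (not the actual cuda.core implementation): an idempotent close() callable from __del__, __exit__, or user code, while Cython's __dealloc__ would be left to free raw C data only.

```python
class PoolWrapper:
    """Generic resource-wrapper sketch; the class name is hypothetical."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Idempotent: safe to call from __del__, __exit__, or directly.
        if not self.closed:
            self.closed = True
            # Python-visible cleanup goes here (e.g. peer-access state).

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions

    def __del__(self):
        # Unlike Cython's __dealloc__, __del__ may safely call methods.
        self.close()
```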
This was to work around a driver bug (5698116). When a mempool is destroyed and a new one allocated, if the new pool happens to reuse the previous memory address, it incorrectly inherits the peer access attributes of the destroyed pool. This code is all about clearing the peer access attributes on destruction to avoid that.
The real fix is in C++ under resource_handles.* and should only be applied to owned pools created by cuda.core. The changes here, in Python, were a misadventure. This was resetting peer access for the shared default pool, which is not being destroyed but just released.
```cython
return False
```

```cython
cdef inline _DMR_set_peer_accessible_by(DeviceMemoryResource self, devices):
```
This seems like something that should almost certainly never be invoked in __dealloc__. It's reading from and writing to the Python object whose __dealloc__ method is being called.
Thanks, yeah this was a mess. I removed all peer-access-clearing logic from Python. The real problem was that certain tests were using the default memory pool, which is persistent and should keep its peer access settings. Those tests incorrectly assumed the beginning state should be "no peer access." I updated those to simply use fresh pools.
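The distinction matters because the default pool is shared process-wide. A toy model (all names illustrative, no CUDA involved) of why the updated tests use fresh owned pools:

```python
# Toy model: state set on a shared default pool leaks between "tests",
# while a fresh owned pool starts clean. All names are hypothetical.
class ToyPool:
    def __init__(self):
        self.peer_accessible_by = set()

_default_pool = ToyPool()  # shared, like a device's default mempool

def get_default_pool():
    return _default_pool

def make_owned_pool():
    return ToyPool()  # fresh state, like a pool built from explicit options

# "Test 1" enables peer access on the shared default pool:
get_default_pool().peer_accessible_by.add(1)
# "Test 2" now sees stale state if it assumes a clean start:
stale = get_default_pool().peer_accessible_by
# A fresh owned pool has no carry-over:
clean = make_owned_pool().peer_accessible_by
```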
- Remove __dealloc__ and close() override from DeviceMemoryResource that cleared peer access before destruction. The C++ RAII deleter already handles this for owned pools (nvbug 5698116 workaround). For non-owned pools (default device pool), clearing peer access on handle disposal was incorrect behavior. - Update peer access tests to use owned pools (DeviceMemoryResourceOptions()) instead of default pools. Default pools are shared and may have stale peer access state from prior tests, causing test failures. Made-with: Cursor
/ok to test c7f85ff
Made-with: Cursor
/ok to test 4c36462
@cpcloud Please take another look
Summary

- `PinnedMemoryResource(ipc_enabled=True)` hardcoded host NUMA ID `0`, causing failures on multi-NUMA systems when the active device is on a different NUMA node
- Adds a `preferred_location_type` option to `ManagedMemoryResourceOptions` and a `preferred_location` query property to `ManagedMemoryResource`
- Adds a `numa_id` option to `PinnedMemoryResourceOptions` for explicit NUMA node selection
- Refactors the `_MemPool` hierarchy to separate shared pool machinery from device-specific concerns

Changes

Bugfix (`_pinned_memory_resource.pyx`):
- When `ipc_enabled=True` and `numa_id` is not specified, derives the NUMA node from the current device's `host_numa_id` attribute (requires an active CUDA context)
- Adds `numa_id: int | None = None` to `PinnedMemoryResourceOptions`
- Adds a `numa_id` property to `PinnedMemoryResource`
- Removes the `_check_numa_nodes` warning machinery in favor of proper NUMA node selection

`ManagedMemoryResource` preferred location (`_managed_memory_resource.pyx`):
- Adds a `preferred_location_type: str | None` field to `ManagedMemoryResourceOptions` with values `"device"`, `"host"`, `"host_numa"`, or `None` (legacy behavior)
- Adds a `preferred_location` query property to `ManagedMemoryResource`
- Adds a `_resolve_preferred_location` cdef function with a CUDA 13 compile-time guard
- `device_id` property restored to backwards-compatible behavior (returns the preferred device ordinal or `-1`)

Refactor (`_MemPool` hierarchy):
- Moves `_dev_id`, `device_id`, and `peer_accessible_by` from `_MemPool` into `DeviceMemoryResource`
- Eliminates `_MemPoolOptions`; pool initialization refactored into freestanding `cdef` functions (`MP_init_create_pool`, `MP_init_current_pool`, `MP_raise_release_threshold`)
- Extracts `__init__` bodies into inline `cdef` helpers (`_DMR_init`, `_PMR_init`, `_MMR_init`)
- `ManagedMemoryResource.device_id` returns the preferred device ordinal when the location is a device, otherwise `-1`
- `PinnedMemoryResource.device_id` always returns `-1`

Test Coverage

- `numa_id` behavior matrix: default without IPC, default with IPC, explicit NUMA ID, negative NUMA ID error
- `ManagedMemoryResource` preferred location: default, device, host, host_numa, validation errors, NUMA auto-resolve failure
- `skip_if_managed_memory_unsupported` updated to handle CUDA 12 builds
- `numa_id` values