
Add TMA TensorMapDescriptor support #1687

Open

rparolin wants to merge 10 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200

@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 24, 2026

Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

This comment was marked as resolved.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
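The fix suggested above can be sketched in plain Python (standing in for the PR's Cython; the function name is illustrative, not code from this PR): zero-fill both fixed-size boxes before writing the first n_spatial entries, so any trailing slots the driver might read are deterministic.

```python
# Sketch of the suggested fix: initialize all 3 slots of each pixel box
# before filling only the active (reversed) spatial entries.
MAX_SPATIAL = 3  # the im2col API supports at most 3 spatial dims (rank 5 - 2)

def build_pixel_boxes(lower, upper):
    n_spatial = len(lower)
    assert len(upper) == n_spatial and n_spatial <= MAX_SPATIAL
    # Zero-initialize the full fixed-size arrays first.
    box_lower = [0] * MAX_SPATIAL
    box_upper = [0] * MAX_SPATIAL
    # Reverse spatial dimensions for lower/upper corners, mirroring the
    # Cython loop in the excerpt above.
    for i in range(n_spatial):
        box_lower[i] = int(lower[n_spatial - 1 - i])
        box_upper[i] = int(upper[n_spatial - 1 - i])
    return box_lower, box_upper
```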



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor


Copilot AI Feb 25, 2026


TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.

@rparolin
Collaborator Author

/ok to test

@github-actions

@rparolin
Collaborator Author

/ok to test

1 similar comment
@rparolin
Collaborator Author

/ok to test

Member

@leofang leofang left a comment


There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?

@fbusato

fbusato commented Feb 27, 2026

Fighting with poor documentation and bugs doesn't make me an expert :).
Anyway, we provide two main functionalities in this direction:

The implementation of make_tma_descriptor is here: https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__tma/make_tma_descriptor.h. Please let me know if there are functionalities that need to be isolated for reuse.

@rparolin rparolin added the cuda.core Everything related to the cuda.core module label Mar 2, 2026
@rparolin rparolin requested a review from Copilot March 3, 2026 23:56
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

cuda_core/tests/test_tensor_map.py:102

  • This test passes a raw Buffer from dev.allocate() with data_type=FLOAT32. Buffer exports via DLPack as an int8 tensor with shape=(n_bytes,), so the TMA encoder will treat shape[0] as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in _DeviceArray(buf, (1024,), dtype=np.float32) (or StridedMemoryView.from_buffer with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
        buf = dev.allocate(1024 * 4)  # 1024 float32 elements
        desc = TensorMapDescriptor.from_tiled(
            buf,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )
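The byte-length to element-count translation this comment asks for can be sketched as a small helper (the function name is an assumption for illustration, not this PR's API): a raw Buffer DLPack-exports as int8 with shape=(n_bytes,), so a float32 descriptor must divide the byte length by the element size, requiring divisibility.

```python
# Sketch: compute a descriptor's global_dim from a raw buffer's byte size,
# rejecting sizes that are not a whole number of elements.
def global_dim_for_buffer(n_bytes: int, elem_size: int) -> int:
    n_elems, rem = divmod(n_bytes, elem_size)
    if rem:
        raise ValueError(
            f"buffer size {n_bytes} is not a multiple of element size {elem_size}"
        )
    return n_elems
```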

cuda_core/tests/test_tensor_map.py:277

  • Same issue as test_from_tiled_1d: building a descriptor from a raw Buffer with data_type=FLOAT32 relies on the implementation translating the buffer's byte-length into a float32 element count. To avoid encoding a descriptor with incorrect global_dim, wrap buf1/buf2 in _DeviceArray(..., dtype=np.float32) (or a StridedMemoryView with the intended dtype/shape) before calling from_tiled() / replace_address().
    def test_replace_address(self, dev, skip_if_no_tma):
        buf1 = dev.allocate(1024 * 4)
        desc = TensorMapDescriptor.from_tiled(
            buf1,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )

cuda_core/cuda/core/_kernel_arg_handler.pyx:305

  • Support for passing TensorMapDescriptor as a kernel argument is added here, but there’s no test exercising the full path (ParamHolder → cuLaunchKernel) with a real TensorMapDescriptor argument. Given cuda_core/tests/test_launcher.py already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a CUtensorMap by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
            elif arg_type is tensor_map_descriptor_type:
                prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
                continue
            elif arg_type is bool:
                prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)
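The "128-byte payload" check the comment mentions can be sketched in plain Python with ctypes (the helper name is illustrative, not the PR's code): a CUtensorMap is an opaque 128-byte struct passed by value, so the argument buffer handed to cuLaunchKernel must carry all 128 bytes.

```python
import ctypes

TENSOR_MAP_SIZE = 128  # sizeof(CUtensorMap) per the CUDA driver API

def prepare_tensor_map_arg_sketch(payload: bytes) -> ctypes.Array:
    # Reject anything that is not a full descriptor payload, then copy it
    # into a buffer that can be addressed in the kernel argument array.
    if len(payload) != TENSOR_MAP_SIZE:
        raise ValueError(
            f"expected {TENSOR_MAP_SIZE}-byte descriptor, got {len(payload)}"
        )
    buf = (ctypes.c_char * TENSOR_MAP_SIZE)()
    ctypes.memmove(buf, payload, TENSOR_MAP_SIZE)
    return buf
```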



element_strides = _validate_element_strides(element_strides, rank)

cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]

Copilot AI Mar 4, 2026


shape = view.shape is taken directly from the source tensor, but elem_size/data_type can be overridden. If the underlying view dtype itemsize differs from _TMA_DATA_TYPE_SIZE[tma_dt] (common for Buffer, which DLPack-exports as 1-byte int8 with shape=(n_bytes,)), c_global_dim/c_global_strides will be encoded in the wrong units and can lead to out-of-bounds accesses at runtime. Consider either (a) validating that an explicit data_type has the same byte size as view.dtype.itemsize (allowing same-sized variants like TFLOAT32/FTZ), or (b) special-casing 1D Buffer so global_dim is computed as buf.size // elem_size (and requiring divisibility). Apply the same rule to from_im2col* as well.

Suggested change
cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
view_itemsize = view.itemsize
if view_itemsize != elem_size:
    raise ValueError(
        f"data_type element size ({elem_size} bytes) does not match "
        f"underlying tensor element size ({view_itemsize} bytes)")

rparolin and others added 9 commits March 6, 2026 21:26
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards.

Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871.

Made-with: Cursor
@cpcloud cpcloud force-pushed the rparolin/tma_feature branch from f4875f6 to 96a3e84 Compare March 7, 2026 02:28
@cpcloud
Contributor

cpcloud commented Mar 7, 2026

/ok to test

Comment on lines +353 to +355
elif isinstance(arg, tensor_map_descriptor_type):
    prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
    continue
Member


I don't remember off the top of my head why we had to repeat the checks in this block; I believe it has to do with backward compatibility. Since TMA support is new, we don't need to add it here.

!*_impl.cpp
!cuda_bindings/cuda/bindings/_lib/param_packer.cpp
!cuda_bindings/cuda/bindings/_bindings/loader.cpp
!cuda_core/cuda/core/_cpp/*.cpp
Member


why do we ignore them...?

Comment on lines +72 to +78
TensorMapDataType,
TensorMapDescriptor,
TensorMapIm2ColWideMode,
TensorMapInterleave,
TensorMapL2Promotion,
TensorMapOOBFill,
TensorMapSwizzle,
Member


Critical:

  1. Only expose TensorMapDescriptor under cuda.core
  2. Since StridedMemoryView already parses DLPack content, we expose StridedMemoryView.as_tensor_map (parallel to as_mdspan in WIP: Support passing SMV as a kernel argument of type cuda::std::mdspan #1387) and return a TensorMapDescriptor
  3. I don't think we should offer alternative constructors, at least for now, because otherwise we shoot ourselves in the foot when DevTechs come and complain about perf like in [BUG]: CUDA API calls through cuda-bindings 3x slower than direct CUDA C++ API calls #659. If we really, really want to do it, we must provide a TensorMapOptions encapsulating Interleave, Swizzle, etc, and expose Device.create_tensor_map(options: TensorMapOptions). Right now, this does not follow our general design pattern, and makes cuda.core namespace crowded.
  4. Make sure TensorMapDescriptor tracks the current device/context like streams and events. It is CUDA context-dependent. (Hence it makes sense to offer Device.create_tensor_map.)
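The API shape proposed in points 2-4 could look roughly like this (every name below is hypothetical; none of this exists in cuda.core): bundle the knobs into an options object and hang creation off Device, the way streams and events are created.

```python
# Hypothetical sketch of the proposed design, not real cuda.core API.
from dataclasses import dataclass

@dataclass
class TensorMapOptions:
    # Placeholders standing in for TensorMapInterleave, TensorMapSwizzle,
    # TensorMapL2Promotion, TensorMapOOBFill enum members.
    box_dim: tuple = ()
    interleave: str = "none"
    swizzle: str = "none"
    l2_promotion: str = "none"
    oob_fill: str = "none"

class Device:  # illustrative stub, not the real cuda.core.Device
    def create_tensor_map(self, tensor, options: TensorMapOptions):
        # A real implementation would also capture the current context here,
        # so the descriptor tracks device/context as point 4 requires.
        return (tensor, options)
```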

Member


👍

prog = Program(
    code,
    code_type="c++",
    options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
Member


nit: this is simpler

Suggested change
options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
options=ProgramOptions(std="c++17", arch=f"sm_{dev.arch}"),

Member


👍


Labels

cuda.core Everything related to the cuda.core module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Design the TensorMap object
EPIC: Support TMA descriptor

5 participants