
Add TMA TensorMapDescriptor support #1687

Open

rparolin wants to merge 10 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200

@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 24, 2026

Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

This comment was marked as resolved.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
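The fix suggested above can be sketched in plain Python (standing in for the PR's Cython; the function name is illustrative, not code from this PR): zero-fill both fixed-size boxes before writing the first n_spatial entries, so any trailing slots the driver might read are deterministic.

```python
# Sketch of the suggested fix: initialize all 3 slots of each pixel box
# before filling only the active (reversed) spatial entries.
MAX_SPATIAL = 3  # the im2col API supports at most 3 spatial dims (rank 5 - 2)

def build_pixel_boxes(lower, upper):
    n_spatial = len(lower)
    assert len(upper) == n_spatial and n_spatial <= MAX_SPATIAL
    # Zero-initialize the full fixed-size arrays first.
    box_lower = [0] * MAX_SPATIAL
    box_upper = [0] * MAX_SPATIAL
    # Reverse spatial dimensions for lower/upper corners, mirroring the
    # Cython loop in the excerpt above.
    for i in range(n_spatial):
        box_lower[i] = int(lower[n_spatial - 1 - i])
        box_upper[i] = int(upper[n_spatial - 1 - i])
    return box_lower, box_upper
```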



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor


Copilot AI Feb 25, 2026


TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.

@rparolin
Collaborator Author

/ok to test

@github-actions

@rparolin
Collaborator Author

/ok to test

1 similar comment
@rparolin
Collaborator Author

/ok to test

Member

@leofang leofang left a comment


There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?

@fbusato

fbusato commented Feb 27, 2026

Fighting with poor documentation and bugs doesn't make me an expert :).
Anyway, we provide two main functionalities in this direction:

The implementation of make_tma_descriptor is here: https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__tma/make_tma_descriptor.h. Please let me know if there are functionalities that need to be isolated for reuse.

@rparolin rparolin added the cuda.core Everything related to the cuda.core module label Mar 2, 2026
@rparolin rparolin requested a review from Copilot March 3, 2026 23:56
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

cuda_core/tests/test_tensor_map.py:102

  • This test passes a raw Buffer from dev.allocate() with data_type=FLOAT32. Buffer exports via DLPack as an int8 tensor with shape=(n_bytes,), so the TMA encoder will treat shape[0] as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in _DeviceArray(buf, (1024,), dtype=np.float32) (or StridedMemoryView.from_buffer with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
        buf = dev.allocate(1024 * 4)  # 1024 float32 elements
        desc = TensorMapDescriptor.from_tiled(
            buf,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )
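The byte-length to element-count translation this comment asks for can be sketched as a small helper (the function name is an assumption for illustration, not this PR's API): a raw Buffer DLPack-exports as int8 with shape=(n_bytes,), so a float32 descriptor must divide the byte length by the element size, requiring divisibility.

```python
# Sketch: compute a descriptor's global_dim from a raw buffer's byte size,
# rejecting sizes that are not a whole number of elements.
def global_dim_for_buffer(n_bytes: int, elem_size: int) -> int:
    n_elems, rem = divmod(n_bytes, elem_size)
    if rem:
        raise ValueError(
            f"buffer size {n_bytes} is not a multiple of element size {elem_size}"
        )
    return n_elems
```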

cuda_core/tests/test_tensor_map.py:277

  • Same issue as test_from_tiled_1d: building a descriptor from a raw Buffer with data_type=FLOAT32 relies on the implementation translating the buffer's byte-length into a float32 element count. To avoid encoding a descriptor with incorrect global_dim, wrap buf1/buf2 in _DeviceArray(..., dtype=np.float32) (or a StridedMemoryView with the intended dtype/shape) before calling from_tiled() / replace_address().
    def test_replace_address(self, dev, skip_if_no_tma):
        buf1 = dev.allocate(1024 * 4)
        desc = TensorMapDescriptor.from_tiled(
            buf1,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )

cuda_core/cuda/core/_kernel_arg_handler.pyx:305

  • Support for passing TensorMapDescriptor as a kernel argument is added here, but there’s no test exercising the full path (ParamHolder → cuLaunchKernel) with a real TensorMapDescriptor argument. Given cuda_core/tests/test_launcher.py already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a CUtensorMap by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
            elif arg_type is tensor_map_descriptor_type:
                prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
                continue
            elif arg_type is bool:
                prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)
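The "128-byte payload" check the comment mentions can be sketched in plain Python with ctypes (the helper name is illustrative, not the PR's code): a CUtensorMap is an opaque 128-byte struct passed by value, so the argument buffer handed to cuLaunchKernel must carry all 128 bytes.

```python
import ctypes

TENSOR_MAP_SIZE = 128  # sizeof(CUtensorMap) per the CUDA driver API

def prepare_tensor_map_arg_sketch(payload: bytes) -> ctypes.Array:
    # Reject anything that is not a full descriptor payload, then copy it
    # into a buffer that can be addressed in the kernel argument array.
    if len(payload) != TENSOR_MAP_SIZE:
        raise ValueError(
            f"expected {TENSOR_MAP_SIZE}-byte descriptor, got {len(payload)}"
        )
    buf = (ctypes.c_char * TENSOR_MAP_SIZE)()
    ctypes.memmove(buf, payload, TENSOR_MAP_SIZE)
    return buf
```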



element_strides = _validate_element_strides(element_strides, rank)

cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]

Copilot AI Mar 4, 2026


shape = view.shape is taken directly from the source tensor, but elem_size/data_type can be overridden. If the underlying view dtype itemsize differs from _TMA_DATA_TYPE_SIZE[tma_dt] (common for Buffer, which DLPack-exports as 1-byte int8 with shape=(n_bytes,)), c_global_dim/c_global_strides will be encoded in the wrong units and can lead to out-of-bounds accesses at runtime. Consider either (a) validating that an explicit data_type has the same byte size as view.dtype.itemsize (allowing same-sized variants like TFLOAT32/FTZ), or (b) special-casing 1D Buffer so global_dim is computed as buf.size // elem_size (and requiring divisibility). Apply the same rule to from_im2col* as well.

Suggested change
cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
view_itemsize = view.itemsize
if view_itemsize != elem_size:
    raise ValueError(
        f"data_type element size ({elem_size} bytes) does not match "
        f"underlying tensor element size ({view_itemsize} bytes)")

rparolin and others added 9 commits March 6, 2026 21:26
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards.

Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871.

Made-with: Cursor
@cpcloud cpcloud force-pushed the rparolin/tma_feature branch from f4875f6 to 96a3e84 Compare March 7, 2026 02:28
@cpcloud
Contributor

cpcloud commented Mar 7, 2026

/ok to test

Comment on lines +353 to +355
elif isinstance(arg, tensor_map_descriptor_type):
    prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
    continue
Member


I don't remember off the top of my head why we had to repeat the checks in this block; I believe it has to do with backward compatibility. Since TMA support is new, we don't need to add it here.

!*_impl.cpp
!cuda_bindings/cuda/bindings/_lib/param_packer.cpp
!cuda_bindings/cuda/bindings/_bindings/loader.cpp
!cuda_core/cuda/core/_cpp/*.cpp
Member


why do we ignore them...?

Comment on lines +72 to +78
TensorMapDataType,
TensorMapDescriptor,
TensorMapIm2ColWideMode,
TensorMapInterleave,
TensorMapL2Promotion,
TensorMapOOBFill,
TensorMapSwizzle,
Member


Critical:

  1. Only expose TensorMapDescriptor under cuda.core
  2. Since StridedMemoryView already parses DLPack content, we expose StridedMemoryView.as_tensor_map (parallel to as_mdspan in WIP: Support passing SMV as a kernel argument of type cuda::std::mdspan #1387) and return a TensorMapDescriptor
  3. I don't think we should offer alternative constructors, at least for now, because otherwise we shoot ourselves in the foot when DevTechs come and complain about perf like in [BUG]: CUDA API calls through cuda-bindings 3x slower than direct CUDA C++ API calls #659. If we really, really want to do it, we must provide a TensorMapOptions encapsulating Interleave, Swizzle, etc, and expose Device.create_tensor_map(options: TensorMapOptions). Right now, this does not follow our general design pattern, and makes cuda.core namespace crowded.
  4. Make sure TensorMapDescriptor tracks the current device/context like streams and events. It is CUDA context-dependent. (Hence it makes sense to offer Device.create_tensor_map.)
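The API shape proposed in points 2-4 could look roughly like this (every name below is hypothetical; none of this exists in cuda.core): bundle the knobs into an options object and hang creation off Device, the way streams and events are created.

```python
# Hypothetical sketch of the proposed design, not real cuda.core API.
from dataclasses import dataclass

@dataclass
class TensorMapOptions:
    # Placeholders standing in for TensorMapInterleave, TensorMapSwizzle,
    # TensorMapL2Promotion, TensorMapOOBFill enum members.
    box_dim: tuple = ()
    interleave: str = "none"
    swizzle: str = "none"
    l2_promotion: str = "none"
    oob_fill: str = "none"

class Device:  # illustrative stub, not the real cuda.core.Device
    def create_tensor_map(self, tensor, options: TensorMapOptions):
        # A real implementation would also capture the current context here,
        # so the descriptor tracks device/context as point 4 requires.
        return (tensor, options)
```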

Member


👍

prog = Program(
    code,
    code_type="c++",
    options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
Member


nit: this is simpler

Suggested change
options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
options=ProgramOptions(std="c++17", arch=f"sm_{dev.arch}"),

Member


👍


Labels

cuda.core Everything related to the cuda.core module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Design the TensorMap object
EPIC: Support TMA descriptor

5 participants