Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
cuda_core/pixi.toml:67
- Removing the `cu12` environment from this subproject can break the repository's top-level `pixi run -e cu12 test` workflow, which runs `pixi run --manifest-path cuda_core test` under the propagated `PIXI_ENVIRONMENT_NAME=cu12`. If cu12 testing is still expected at the workspace level, consider keeping a solvable `cu12` environment here (e.g., using conda-forge `cuda-bindings`/`cuda-version` constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.
# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
"cu13",
"test",
"cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }
cuda_core/cuda/core/_tensor_map.pyx:461
`c_pixel_box_lower`/`c_pixel_box_upper` are declared as fixed-size `int[3]` but only the first `n_spatial` entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
cdef uint64_t[5] c_global_dim
cdef uint64_t[4] c_global_strides
cdef uint32_t[5] c_element_strides
cdef int[3] c_pixel_box_lower # max 3 spatial dims (rank 5 - 2)
cdef int[3] c_pixel_box_upper
cdef int i_c
for i_c in range(rank):
c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]
for i_c in range(rank - 1):
c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]
# Reverse spatial dimensions for lower/upper corners
for i_c in range(n_spatial):
c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
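A minimal plain-Python sketch of the zero-initialization the comment suggests (the helper name is illustrative; the real code is Cython operating on fixed-size C arrays):

```python
def reversed_pixel_box(corner, n_spatial, max_spatial=3):
    """Reverse the spatial corner values and zero-pad to the full
    fixed-size array so no uninitialized slots reach the driver."""
    out = [0] * max_spatial  # initialize all 3 slots first
    for i in range(n_spatial):
        out[i] = int(corner[n_spatial - 1 - i])
    return out
```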
view = _get_validated_view(tensor)
desc._source_ref = tensor
`TensorMapDescriptor` stores `_source_ref = tensor`, but when `tensor` is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by `__dlpack__()`. Since the temporary `StridedMemoryView` (which holds the capsule and calls the deleter in `__dealloc__`) is not retained, the capsule can be released immediately, potentially invalidating `globalAddress` for exporters where the capsule owns the backing allocation. Store a strong reference to the `StridedMemoryView` (or at least its metadata capsule) instead of (or in addition to) the original tensor object.
/ok to test

/ok to test

1 similar comment

/ok to test
leofang
left a comment
There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?
Fighting with poor documentation and bugs doesn't make me an expert :).
The implementation of
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
cuda_core/tests/test_tensor_map.py:102
- This test passes a raw `Buffer` from `dev.allocate()` with `data_type=FLOAT32`. `Buffer` exports via DLPack as an int8 tensor with shape=(n_bytes,), so the TMA encoder will treat `shape[0]` as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in `_DeviceArray(buf, (1024,), dtype=np.float32)` (or `StridedMemoryView.from_buffer` with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
buf = dev.allocate(1024 * 4) # 1024 float32 elements
desc = TensorMapDescriptor.from_tiled(
buf,
box_dim=(64,),
data_type=TensorMapDataType.FLOAT32,
)
cuda_core/tests/test_tensor_map.py:277
- Same issue as `test_from_tiled_1d`: building a descriptor from a raw `Buffer` with `data_type=FLOAT32` relies on the implementation translating the buffer's byte-length into a float32 element count. To avoid encoding a descriptor with an incorrect `global_dim`, wrap `buf1`/`buf2` in `_DeviceArray(..., dtype=np.float32)` (or a `StridedMemoryView` with the intended dtype/shape) before calling `from_tiled()`/`replace_address()`.
def test_replace_address(self, dev, skip_if_no_tma):
buf1 = dev.allocate(1024 * 4)
desc = TensorMapDescriptor.from_tiled(
buf1,
box_dim=(64,),
data_type=TensorMapDataType.FLOAT32,
)
cuda_core/cuda/core/_kernel_arg_handler.pyx:305
- Support for passing `TensorMapDescriptor` as a kernel argument is added here, but there's no test exercising the full path (`ParamHolder` → `cuLaunchKernel`) with a real `TensorMapDescriptor` argument. Given `cuda_core/tests/test_launcher.py` already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a `CUtensorMap` by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
elif arg_type is tensor_map_descriptor_type:
prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
continue
elif arg_type is bool:
prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)
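For the payload-size half of such a test: `CUtensorMap` is an opaque driver struct of 16 × 64-bit words, i.e. 128 bytes. A small `ctypes` sketch of that shape (note `ctypes` does not reproduce the driver's 64-byte alignment requirement; a full integration test would still need a real kernel launch):

```python
import ctypes

class CUtensorMap(ctypes.Structure):
    # Mirrors the driver's opaque layout: cuuint64_t opaque[16].
    # Alignment (64 bytes in the driver header) is not enforced here.
    _fields_ = [("opaque", ctypes.c_uint64 * 16)]

# The argument marshalling code must copy exactly this many bytes by value.
assert ctypes.sizeof(CUtensorMap) == 128
```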
element_strides = _validate_element_strides(element_strides, rank)
cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
`shape = view.shape` is taken directly from the source tensor, but `elem_size`/`data_type` can be overridden. If the underlying view dtype itemsize differs from `_TMA_DATA_TYPE_SIZE[tma_dt]` (common for `Buffer`, which DLPack-exports as 1-byte int8 with shape=(n_bytes,)), `c_global_dim`/`c_global_strides` will be encoded in the wrong units and can lead to out-of-bounds accesses at runtime. Consider either (a) validating that an explicit `data_type` has the same byte size as `view.dtype.itemsize` (allowing same-sized variants like TFLOAT32/FTZ), or (b) special-casing 1D `Buffer` so `global_dim` is computed as `buf.size // elem_size` (and requiring divisibility). Apply the same rule to `from_im2col*` as well.
Suggested change:

cdef int elem_size = _TMA_DATA_TYPE_SIZE[tma_dt]
view_itemsize = view.itemsize
if view_itemsize != elem_size:
    raise ValueError(
        f"data_type element size ({elem_size} bytes) does not match "
        f"underlying tensor element size ({view_itemsize} bytes)")
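Option (b) from the comment can be sketched in plain Python (the helper name is illustrative, not part of the PR): derive the element count for a raw 1-D byte buffer from the requested TMA data type, rejecting sizes that do not divide evenly.

```python
def element_count_from_bytes(n_bytes: int, elem_size: int) -> int:
    """Convert a raw buffer's byte length into an element count for the
    requested TMA data type, requiring exact divisibility so the
    descriptor never covers more memory than the allocation."""
    if n_bytes % elem_size != 0:
        raise ValueError(
            f"buffer size {n_bytes} is not a multiple of element size {elem_size}")
    return n_bytes // elem_size
```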
…time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests - Add test for rank > 5 (6D tensor) to verify upper bound validation - Add NULL check for PyMem_Malloc in prepare_tensor_map_arg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained example (tma_replace_address.py) so each file covers a single concept. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards. Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871. Made-with: Cursor
f4875f6 to 96a3e84 (force-pushed)
/ok to test
elif isinstance(arg, tensor_map_descriptor_type):
    prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
    continue
I don't remember off the top of my head why we had to repeat the checks in this block; I believe it has to do with backward compatibility. Since TMA support is new, we don't need to add it here.
!*_impl.cpp
!cuda_bindings/cuda/bindings/_lib/param_packer.cpp
!cuda_bindings/cuda/bindings/_bindings/loader.cpp
!cuda_core/cuda/core/_cpp/*.cpp
TensorMapDataType,
TensorMapDescriptor,
TensorMapIm2ColWideMode,
TensorMapInterleave,
TensorMapL2Promotion,
TensorMapOOBFill,
TensorMapSwizzle,
Critical:
- Only expose `TensorMapDescriptor` under `cuda.core`.
- Since `StridedMemoryView` already parses DLPack content, we expose `StridedMemoryView.as_tensor_map` (parallel to `as_mdspan` in "WIP: Support passing SMV as a kernel argument of type `cuda::std::mdspan`" #1387) and return a `TensorMapDescriptor`.
- I don't think we should offer alternative constructors, at least for now, because otherwise we shoot ourselves in the foot when DevTechs come and complain about perf like in "[BUG]: CUDA API calls through `cuda-bindings` 3x slower than direct CUDA C++ API calls" #659. If we really, really want to do it, we must provide a `TensorMapOptions` encapsulating `Interleave`, `Swizzle`, etc., and expose `Device.create_tensor_map(options: TensorMapOptions)`. Right now, this does not follow our general design pattern, and makes the `cuda.core` namespace crowded.
- Make sure `TensorMapDescriptor` tracks the current device/context like streams and events. It is CUDA context-dependent. (Hence it makes sense to offer `Device.create_tensor_map`.)
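A sketch of the option-bundle shape the comment proposes (every name here is hypothetical, following the reviewer's suggestion, not an existing `cuda.core` API; enum values are shown as strings for brevity):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorMapOptions:
    """Hypothetical bundle of the TMA knobs mentioned above, so a single
    Device.create_tensor_map(options=...) entry point can replace many
    alternative constructors."""
    interleave: str = "NONE"     # stands in for TensorMapInterleave
    swizzle: str = "NONE"        # stands in for TensorMapSwizzle
    l2_promotion: str = "NONE"   # stands in for TensorMapL2Promotion
    oob_fill: str = "NONE"       # stands in for TensorMapOOBFill

# Hypothetical usage:
#   desc = dev.create_tensor_map(tensor, options=TensorMapOptions(swizzle="128B"))
```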
prog = Program(
    code,
    code_type="c++",
    options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
nit: this is simpler
- options=ProgramOptions(std="c++17", arch=f"sm_{arch_str}"),
+ options=ProgramOptions(std="c++17", arch=f"sm_{dev.arch}"),
Summary
- `TensorMapDescriptor` Cython class wrapping the CUDA driver's `CUtensorMap` for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
- `from_tiled()` and `from_im2col()` class methods, with automatic dtype inference, stride computation, and validation
- `TensorMapDescriptor` as a first-class kernel argument in `_kernel_arg_handler.pyx`
- Tests (`test_tensor_map.py`) and an example (`tma_tensor_map.py`)

Closes #199
Closes #200