Add explicit CUDA graph construction API#1729
Andy-Jost wants to merge 25 commits into NVIDIA:main
Conversation
…work Rename cuda/core/_graph.py to cuda/core/_graph/__init__.py to create a package that will house the explicit graph construction module alongside the existing stream-capture-based implementation. Ref: NVIDIA#1317 Made-with: Cursor
Implement explicit CUDA graph construction API as an alternative to stream capture:

- GraphDef: wraps CUgraph with instantiate(), debug_dot_print(), nodes(), and edges() methods
- Node: fluent interface for building graphs with launch(), alloc(), free(), and join() methods
- GraphAllocOptions: dataclass for allocation options (device, memory_type, peer_access)
- Add __repr__, __eq__, __hash__ to GraphDef and Node for debugging and use in collections
- Add pred/succ properties to Node for graph traversal
- Refactor GraphDebugPrintOptions._to_flags() to share logic between GraphBuilder and GraphDef

Made-with: Cursor
…zed tests Introduce AllocNode, KernelNode, EmptyNode, FreeNode subclasses with properties populated from the CUDA driver API. AllocNode exposes dptr, bytesize, device_id, memory_type, peer_access, and options; KernelNode exposes grid, block, shmem_size, kernel, and config. Node.pred/succ results are cached with automatic invalidation in builder methods. Restructure test_explicit.py around GraphSpec (topology) and NodeSpec (type + expected attributes) so that adding a new node type requires only a builder function and one _NODE_SPECS entry. Move object protocol tests to test_object_protocols.py for all node subclasses including FreeNode and KernelNode. Made-with: Cursor
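The pred/succ caching with automatic invalidation described above can be sketched in plain Python. The `Graph`/`Node` names and structure here are illustrative only, not the cuda.core API: the point is that builder methods (which mutate topology) drop every node's cached neighbor tuples so the next property access recomputes them.

```python
class Node:
    """Toy node with cached pred/succ; None marks the cache as invalid."""

    def __init__(self, graph, name):
        self.graph, self.name = graph, name
        self._pred = self._succ = None

    @property
    def pred(self):
        if self._pred is None:  # recompute only after invalidation
            self._pred = tuple(s for s, d in self.graph.edges if d is self)
        return self._pred

    @property
    def succ(self):
        if self._succ is None:
            self._succ = tuple(d for s, d in self.graph.edges if s is self)
        return self._succ


class Graph:
    def __init__(self):
        self.edges, self.nodes = [], []

    def node(self, name):
        n = Node(self, name)
        self.nodes.append(n)
        return n

    def add_edge(self, src, dst):
        """Builder method: mutates topology, so every cache is dropped."""
        self.edges.append((src, dst))
        for n in self.nodes:
            n._pred = n._succ = None
```

In the real implementation the recomputation would query the CUDA driver rather than a Python list of edges, but the invalidate-on-mutation contract is the same.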
Extract fill value parsing (int/bytes/buffer protocol → value + element size) from Buffer.fill() into cpdef _parse_fill_value in cuda_utils so it can be reused by both Buffer.fill() and Node.memset(). Add MemsetNode class with properties: dptr, value, element_size, width, height, pitch. Node.memset() builder supports 1D and 2D memset with element sizes 1, 2, and 4. Tests cover all element sizes, 2D memset, instantiate-and-execute, and object protocols. Made-with: Cursor
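A minimal sketch of the kind of fill-value parsing this commit factors out. The function name and exact rules here are assumptions for illustration (the real `_parse_fill_value` lives in cuda_utils and also accepts arbitrary buffer-protocol objects): an int or a 1/2/4-byte pattern is mapped to an integer value plus an element size, which is what both `Buffer.fill()` and `Node.memset()` need.

```python
def parse_fill_value(value):
    """Map an int, bytes, or buffer-protocol object to
    (integer value, element size), allowing sizes 1, 2, and 4 only."""
    if isinstance(value, int):
        if value < 0:
            raise ValueError("fill value must be non-negative")
        # Smallest element width that can hold the value.
        for size in (1, 2, 4):
            if value < (1 << (8 * size)):
                return value, size
        raise ValueError("fill value too large for a 4-byte element")
    # bytes, bytearray, memoryview, ... via the buffer protocol
    data = bytes(memoryview(value))
    if len(data) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes")
    return int.from_bytes(data, "little"), len(data)
```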
Implements event record/wait graph nodes with full test coverage. Adds non-owning create_event_handle_ref to RAII layer and Event.from_handle() / Event._from_raw_handle() for reconstructing Event objects from raw CUevent handles managed by the graph. Made-with: Cursor
GraphDef now exposes alloc, free, memset, launch, record_event, wait_event, and join directly. The virtual root node becomes an internal implementation detail (_entry). Also renames Event._from_raw_handle to Event._from_handle for consistency. Made-with: Cursor
…hDef.handle

- Fix stale 'root' references to 'entry' in docstrings, comments, repr
- Add Node.handle property (returns CUgraphNode as int, None for entry)
- GraphDef.handle now uses as_py() for cleaner conversion
- Update reprs to show domain-relevant payload instead of ambiguous handles: EmptyNode shows pred count, KernelNode shows grid/block, AllocNode/FreeNode/MemsetNode show dptr and params, EventRecord/WaitNode show event handle

Made-with: Cursor
Simple 1D memcpy interface: Node.memcpy(dst, src, size) auto-detects host vs device memory via cuPointerGetAttribute, falling back to device type for graph-allocated pointers. Includes MemcpyNode subclass with dst/src/size properties, GraphDef.memcpy forwarding, execution test verifying data correctness, and object protocol coverage. Made-with: Cursor
Node.embed(child) clones a GraphDef as a sub-graph node. Adds create_graph_handle_ref for non-owning graph handles (child graph is owned by the node, not the wrapper). ChildGraphNode exposes child_graph property and shows subnode count in repr. Made-with: Cursor
Implements host callback graph nodes supporting two modes:

- Python callable: GIL acquired via trampoline, nullary callbacks
- ctypes CFUNCTYPE: raw C function pointer with optional user_data (bytes copied to graph-managed buffer, or raw int passthrough)

Uses CUDA user objects to tie callback/data lifetime to the graph.

Made-with: Cursor
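The second mode above can be illustrated without CUDA at all, since it is plain ctypes mechanics: a `CFUNCTYPE` callback matching the `void (*)(void *)` shape of `CUhostFn` is reduced to a raw function-pointer integer (what a graph node would store) and later resurrected and invoked with a `user_data` pointer. The names below are illustrative, not the PR's API.

```python
import ctypes

calls = []

# void (*)(void *userData), the CUhostFn-style signature assumed here.
HOSTFN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

@HOSTFN
def host_fn(user_data):
    # In a real graph the driver calls this from an internal thread;
    # user_data arrives as the pointer stored in the node (an int here).
    calls.append(user_data)

# What the node would persist: the bare function-pointer address.
fn_ptr = ctypes.cast(host_fn, ctypes.c_void_p).value

# Simulate the launcher: rebuild a callable from the address and pass a
# raw int through as user_data.
resurrected = HOSTFN(fn_ptr)
resurrected(ctypes.c_void_p(42))
```

Keeping `host_fn` referenced (here as a module-level name) is essential: if the `CFUNCTYPE` object is garbage-collected, the raw pointer dangles, which is exactly the lifetime problem the CUDA user objects solve on the graph side.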
create_graph_handle_ref now takes a parent GraphHandle, keeping the parent graph alive while any child/branch graph handle exists. This prevents use-after-free when a ChildGraphNode outlives its parent GraphDef. Made-with: Cursor
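The parent-capture idea is easy to model in Python: a non-owning child handle simply holds a strong reference to the parent's owning handle, so the parent cannot be destroyed while any child handle is alive. This is a toy sketch (the real mechanism is a C++ `shared_ptr` captured in the child handle's control block), with illustrative names.

```python
import gc
import weakref


class GraphHandle:
    """Toy owning handle; in the C++ layer this is a shared_ptr whose
    deleter destroys the CUgraph."""

    def __init__(self, name):
        self.name = name


class ChildGraphRef:
    """Non-owning child handle that captures its parent: the child graph
    is owned by a node inside the parent, so this reference must keep
    the parent alive to stay valid."""

    def __init__(self, raw, parent):
        self.raw = raw
        self._parent = parent  # strong reference = parent capture


parent = GraphHandle("parent")
parent_wr = weakref.ref(parent)
child = ChildGraphRef("child-cugraph", parent)

del parent
gc.collect()
# Parent survives: the child's capture keeps it reachable.
assert parent_wr() is not None

del child
gc.collect()
# Once the last child reference is gone, the parent can be destroyed.
assert parent_wr() is None
```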
Implement conditional node hierarchy with Condition wrapper class, builder methods (if_cond, if_else, while_loop, switch), and branch graph access via non-owning GraphDef handles. Pre-CUDA 13.2 driver reconstruction falls back to ConditionalNode base class. Made-with: Cursor
Use cuGraphNodeGetParams (Python driver API) to recover the exact ConditionalNode subclass (IfNode, IfElseNode, WhileNode, SwitchNode) when reconstructing from the driver. Falls back to the generic ConditionalNode base on pre-13.2 drivers. Made-with: Cursor
Add __all__, from __future__ import annotations (replacing TYPE_CHECKING), return type annotations on all public methods and properties, and reorder imports per the 5-group convention. Made-with: Cursor
cuGraphAddNode replaces the phGraph_out pointer with its own internal array rather than writing into the caller-provided buffer. Read body graph handles from params.conditional.phGraph_out[i] after the call instead of from a pre-allocated vector. Add three integration tests exercising all 14 explicit-graph node types: heat diffusion (WhileNode, ChildGraphNode, EventNodes, ...), bisection root finder (IfElseNode, IfNode), and switch dispatch (SwitchNode). Made-with: Cursor
…uction

Made-with: Cursor

# Conflicts:
# cuda_core/cuda/core/_cpp/resource_handles.cpp
# cuda_core/cuda/core/_cpp/resource_handles.hpp
# cuda_core/cuda/core/_resource_handles.pyx
# cuda_core/cuda/core/_utils/cuda_utils.pyx
test_explicit_lifetime.py verifies the RAII parent-capture mechanism in create_graph_handle_ref prevents dangling references when parent GraphDef objects are deleted while child/body graph handles remain. test_explicit_errors.py covers input validation (type checks for conditional methods, invalid memset values, null free, cross-graph condition misuse), edge cases (join variants, multiple instantiation, unmatched alloc), and boundary condition execution (while-loop zero iterations, if-cond false, switch out-of-range). Made-with: Cursor
…ate Event factories

- Introduce NodeHandle (shared_ptr<CUgraphNode> with NodeBox) to tie node lifetime to owning graph, replacing raw CUgraphNode in Node objects
- Attach EventHandle/KernelHandle copies as CUDA user objects to graphs, preventing dangling references when Python wrappers are GC'd
- Consolidate Event factories to _init and _from_handle(EventHandle)
- Inline as_cu() calls throughout _graphdef.pyx
- Add lifetime tests validating event, kernel, and child-graph survival

Made-with: Cursor
…arithmetic Event no longer caches timing_disabled, busy_waited, ipc_enabled, device_id, or h_context as Python-side fields. All metadata lives on EventBox (C++ anonymous namespace) and is accessed through overloaded get_box() + getter functions (get_event_timing_disabled, etc.). The Event class now holds only _h_event and _ipc_descriptor. Made-with: Cursor
Introduce HandleRegistry<Key, Handle> class template for mapping raw CUDA handles back to their owning shared_ptr. create_event_handle_ref now checks the registry first, recovering full metadata when the event is already managed. Add tests verifying metadata preservation through reconstruction and GC. Made-with: Cursor
Introduce HandleRegistry<Key, Handle, Hash> class template for mapping raw CUDA handles back to their owning shared_ptr. Event registry enables create_event_handle_ref to recover full metadata when the driver returns a CUevent we already manage. Refactor IPC pointer cache to use the same template with a separate mutex for atomic check-then-import. Add tests verifying event metadata preservation through reconstruction and GC. Made-with: Cursor
…ntics Restructure create_kernel_handle to register directly in a HandleRegistry<CUkernel, KernelHandle>, and simplify create_kernel_handle_ref to lookup-or-ref (dropping the LibraryHandle parameter). Add get_kernel_library accessor for KernelBox metadata. Kernel.from_handle now recovers the owning handle automatically for cuda.core-created kernels, cross-checks caller-supplied ObjectCode on mismatch, and retains foreign ObjectCode via _keepalive. Rename Kernel._from_obj to _from_handle for consistency with the project. Made-with: Cursor
…inology Avoids ambiguity with potential future node types in other domains. Renames NodeBox, create_node_handle, and node_get_graph accordingly. Made-with: Cursor
Force-pushed from 214d6c7 to 73ba7fe
```python
def fill(self, value: int | BufferProtocol, *, stream: Stream | GraphBuilder):
    """Fill this buffer with a repeating byte pattern.
```
Buffer.fill was reworked to extract the input processing, so memset nodes could share that.
```cpp
struct EventBox {
    CUevent resource;
    bool timing_disabled;
    bool busy_waited;
    bool ipc_enabled;
    int device_id;
    ContextHandle h_context;
```
These properties are set at event creation time and cannot be queried through the driver API. Moreover, graph-attached events are returned from the driver as plain CUevent handles, and reconstructing the Cython Event object from one of those would lose this information.
The solution is to move the property metadata into C++ and set up a reverse look-up so that the driver-returned CUevent can be used to retrieve the managing shared_ptr, which holds this EventBox.
Graph-attached kernels are handled similarly.
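The reverse look-up can be sketched in Python with a weak-reference map: the registry maps the raw driver handle back to the owning handle object without extending its lifetime, so a driver-returned `CUevent` recovers the box holding the creation-time metadata. Names here are illustrative stand-ins for the C++ `HandleRegistry<Key, Handle>`.

```python
import weakref


class HandleRegistry:
    """Toy reverse registry: raw handle (int) -> owning handle object.
    Entries are weak so the registry never keeps a handle alive."""

    def __init__(self):
        self._map = {}

    def register(self, key, handle):
        # The callback prunes the entry when the handle is collected.
        self._map[key] = weakref.ref(
            handle, lambda _wr, k=key: self._map.pop(k, None)
        )

    def lookup(self, key):
        wr = self._map.get(key)
        return wr() if wr is not None else None


class EventHandle:
    """Owning handle carrying metadata the driver cannot report back."""

    def __init__(self, raw, timing_disabled):
        self.raw = raw
        self.timing_disabled = timing_disabled


registry = HandleRegistry()
ev = EventHandle(raw=0x1000, timing_disabled=True)
registry.register(ev.raw, ev)

# The graph hands back only the bare CUevent (0x1000); the registry
# recovers the managing handle and, with it, the metadata.
recovered = registry.lookup(0x1000)
assert recovered is ev and recovered.timing_disabled
```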
```cython
cdef void _attach_user_object(
        cydriver.CUgraph graph, void* ptr,
        cydriver.CUhostFn destroy) except *:
    """Create a CUDA user object and transfer ownership to the graph.

    On success the graph owns the resource (via MOVE semantics).
    On failure the destroy callback is invoked to clean up ptr,
    then a CUDAError is raised — callers need no try/except.
    """
    cdef cydriver.CUuserObject user_obj = NULL
    cdef cydriver.CUresult ret
    with nogil:
        ret = cydriver.cuUserObjectCreate(
            &user_obj, ptr, destroy, 1,
            cydriver.CU_USER_OBJECT_NO_DESTRUCTOR_SYNC)
        if ret == cydriver.CUDA_SUCCESS:
            ret = cydriver.cuGraphRetainUserObject(
                graph, user_obj, 1, cydriver.CU_GRAPH_USER_OBJECT_MOVE)
            if ret != cydriver.CUDA_SUCCESS:
                cydriver.cuUserObjectRelease(user_obj, 1)
    if ret != cydriver.CUDA_SUCCESS:
        if user_obj == NULL:
            destroy(ptr)
        HANDLE_RETURN(ret)
```
Resources like host functions, event handles, and kernel handles are placed into CUDA user objects, which act like capsules. This way, their lifetimes are properly tied to the graphs they appear in, and they follow graphs through cloning steps.
```python
@dataclass
class GraphAllocOptions:
    """Options for graph memory allocation nodes.

    Attributes
    ----------
    device : int or Device, optional
        The device on which to allocate memory. If None (default),
        uses the current CUDA context's device.
    memory_type : str, optional
        Type of memory to allocate. One of:

        - ``"device"`` (default): Pinned device memory, optimal for GPU kernels.
        - ``"host"``: Pinned host memory, accessible from both host and device.
          Useful for graphs containing host callback nodes. Note: may not be
          supported on all systems/drivers.
        - ``"managed"``: Managed/unified memory that automatically migrates
          between host and device. Useful for mixed host/device access patterns.
    peer_access : list of int or Device, optional
        List of devices that should have read-write access to the
        allocated memory. If None (default), only the allocating
        device has access.

    Notes
    -----
    - IPC (inter-process communication) is not supported for graph
      memory allocation nodes per CUDA documentation.
    - The allocation uses the device's default memory pool.
    """

    device: int | Device | None = None
    memory_type: str = "device"
    peer_access: list | None = None
```
I made this options class to match the way memory resources are created. But other graph operations just take arguments, and only three arguments are wrapped here. I'm not sure having this class is better than just adding these arguments to the alloc() function.
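The two styles under discussion can be put side by side. This is a self-contained toy (the real `alloc()` builds an AllocNode; here both variants just return their inputs) showing that the options object and the keyword-argument form carry exactly the same information:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraphAllocOptions:
    device: Optional[int] = None
    memory_type: str = "device"
    peer_access: Optional[list] = None


def alloc_with_options(size, options=None):
    # Style in the PR: a separate options object, matching how memory
    # resources are created.
    options = options or GraphAllocOptions()
    return (size, options.device, options.memory_type, options.peer_access)


def alloc_with_kwargs(size, *, device=None, memory_type="device",
                      peer_access=None):
    # Alternative: fold the three fields into keyword arguments.
    return (size, device, memory_type, peer_access)


assert alloc_with_options(64, GraphAllocOptions(memory_type="host")) == \
       alloc_with_kwargs(64, memory_type="host")
```

The options object pays off mainly when the same configuration is reused across many `alloc()` calls or needs to be passed around as a value; for three fields used once, the keyword form is arguably simpler.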
```cython
def alloc(self, size_t size, options: GraphAllocOptions | None = None) -> AllocNode:
    """Add an entry-point memory allocation node (no dependencies).

    See :meth:`Node.alloc` for full documentation.
    """
    return self._entry.alloc(size, options)
```
I would rather use *args, **kwargs in these forwarding functions, but I have a feeling someone would complain about IDE integration.
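The trade-off is concrete: with an explicit signature the forwarding method is self-documenting to IDEs and type checkers, while `*args, **kwargs` is shorter but opaque. A minimal side-by-side sketch (toy names, not the PR's classes):

```python
class Entry:
    def alloc(self, size, options=None):
        """The real builder method, with the full signature."""
        return ("alloc", size, options)


class GraphDefExplicit:
    """Explicit forwarding: tools see the real parameter names and
    defaults directly on GraphDef.alloc."""

    def __init__(self):
        self._entry = Entry()

    def alloc(self, size, options=None):
        return self._entry.alloc(size, options)


class GraphDefStar:
    """*args/**kwargs forwarding: one line per method, but the signature
    shown to IDEs is just (*args, **kwargs) unless restored via
    functools.wraps or a .pyi stub."""

    def __init__(self):
        self._entry = Entry()

    def alloc(self, *args, **kwargs):
        return self._entry.alloc(*args, **kwargs)


assert GraphDefExplicit().alloc(64) == GraphDefStar().alloc(64)
```

A middle ground is `*args, **kwargs` in the implementation plus a `.pyi` stub or `functools.wraps` copy of the signature, which keeps the forwarding bodies trivial without losing IDE integration.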
Summary
- `GraphDef` and a `Node` class hierarchy with 14 concrete node types: `EmptyNode`, `KernelNode`, `AllocNode`, `FreeNode`, `MemsetNode`, `MemcpyNode`, `ChildGraphNode`, `HostCallbackNode`, `EventRecordNode`, `EventWaitNode`, `IfNode`, `IfElseNode`, `WhileNode`, `SwitchNode`
- Reverse handle lookup (`HandleRegistry` template), enabling owning handle recovery and full metadata round-tripping on reconstruction
- Event metadata moved into `EventBox` with accessor functions, eliminating metadata loss on handle reconstruction
- `Kernel.from_handle` simplified: registry-first lookup for cuda.core-created kernels, `_keepalive` reference for foreign kernels

Changes
- `_graphdef.pyx`/`_graphdef.pxd`: Full explicit graph API — `GraphDef`, `Node` hierarchy, `Condition`, fluent builder methods, `__repr__`, `__eq__`/`__hash__`/`__weakref__` protocols, CUDA User Object attachment for event/kernel lifetime
- `resource_handles.{hpp,cpp}`: `GraphHandle` with owning/non-owning variants and parent-capture for child graphs; `GraphNodeHandle` (RAII node handle with parent graph reference); `HandleRegistry<Key, Handle, Hash>` template for thread-safe reverse lookup; Event/Kernel registries with `weak_ptr` tracking; `EventBox` metadata fields and `get_event_*` accessors; `get_kernel_library` accessor; IPC cache refactored to use `HandleRegistry`
- `_resource_handles.{pxd,pyx}`: Cython declarations for graph node handles, event/kernel registries, and accessor functions
- `_event.{pxd,pyx}`: Consolidated factory methods (`_init`, `_from_handle`); metadata properties delegate to `EventBox` via C++ accessors
- `_module.{pxd,pyx}`: `Kernel.from_handle` with registry lookup and `_keepalive` for foreign kernels; renamed `_from_obj` to `_from_handle`

Test Coverage
- `test_explicit.py`: 176 unit tests (topology, node types, attributes, execution)
- `test_explicit_integration.py`: 6 integration tests (heat diffusion, bisection, switch dispatch) covering all 14 node types
- `test_explicit_lifetime.py`: Lifetime tests for child graphs, conditional body graphs, events, kernels, registry handle recovery, and graph node reconstruction
- `test_explicit_errors.py`: Error handling and edge-case tests
- `test_module.py`: Library mismatch warning and foreign kernel `from_handle` tests
- `test_object_protocols.py`: Protocol tests for new types