Add explicit CUDA graph construction API#1729

Draft
Andy-Jost wants to merge 25 commits into NVIDIA:main from Andy-Jost:explicit-graph-construction

Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Mar 6, 2026

Summary

  • Implements explicit (non-stream-capture) CUDA graph construction via GraphDef and a Node class hierarchy with 14 concrete node types: EmptyNode, KernelNode, AllocNode, FreeNode, MemsetNode, MemcpyNode, ChildGraphNode, HostCallbackNode, EventRecordNode, EventWaitNode, IfNode, IfElseNode, WhileNode, SwitchNode
  • RAII lifetime management for graph resources (events, kernels) via CUDA User Objects, ensuring resources survive graph cloning and GC
  • Reverse-lookup registries for Event and Kernel handles (HandleRegistry template), enabling owning handle recovery and full metadata round-tripping on reconstruction
  • Event metadata (timing, IPC, busy-wait, device, context) stored in C++ EventBox with accessor functions, eliminating metadata loss on handle reconstruction
  • Kernel from_handle simplified: registry-first lookup for cuda.core-created kernels, _keepalive reference for foreign kernels
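
The reverse-lookup registry described above can be sketched in plain Python. This is a minimal, hypothetical analogue of the C++ `HandleRegistry` (names and semantics assumed, not taken from the PR's actual implementation): a thread-safe map from a raw handle back to its owning box, tracked weakly so the registry never extends the box's lifetime.

```python
import threading
import weakref

class HandleRegistry:
    """Sketch: map a raw handle (e.g. an integer CUevent address) back to
    its owning box object, tracked weakly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}  # raw handle -> weakref to owning box

    def register(self, raw, box):
        with self._lock:
            # Purge the entry automatically once the box is collected.
            self._map[raw] = weakref.ref(box, lambda _: self._map.pop(raw, None))

    def lookup(self, raw):
        with self._lock:
            ref = self._map.get(raw)
        return ref() if ref is not None else None

class EventBox:
    """Stand-in for the C++ EventBox carrying creation-time metadata."""
    def __init__(self, raw, timing_disabled):
        self.raw = raw
        self.timing_disabled = timing_disabled

registry = HandleRegistry()
box = EventBox(raw=0x1234, timing_disabled=True)
registry.register(box.raw, box)

# A driver-returned raw handle recovers the full-metadata box.
recovered = registry.lookup(0x1234)
```

The point of the weak reference is that registration is a side effect of handle creation, not an ownership relationship: dropping the last strong reference to the box still destroys it.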

Changes

  • _graphdef.pyx / _graphdef.pxd: Full explicit graph API — GraphDef, Node hierarchy, Condition, fluent builder methods, __repr__, __eq__/__hash__/__weakref__ protocols, CUDA User Object attachment for event/kernel lifetime
  • resource_handles.{hpp,cpp}: GraphHandle with owning/non-owning variants and parent-capture for child graphs; GraphNodeHandle (RAII node handle with parent graph reference); HandleRegistry<Key, Handle, Hash> template for thread-safe reverse lookup; Event/Kernel registries with weak_ptr tracking; EventBox metadata fields and get_event_* accessors; get_kernel_library accessor; IPC cache refactored to use HandleRegistry
  • _resource_handles.{pxd,pyx}: Cython declarations for graph node handles, event/kernel registries, and accessor functions
  • _event.{pxd,pyx}: Consolidated factory methods (_init, _from_handle); metadata properties delegate to EventBox via C++ accessors
  • _module.{pxd,pyx}: Kernel.from_handle with registry lookup and _keepalive for foreign kernels; renamed _from_obj to _from_handle

Test Coverage

  • test_explicit.py: 176 unit tests (topology, node types, attributes, execution)
  • test_explicit_integration.py: 6 integration tests (heat diffusion, bisection, switch dispatch) covering all 14 node types
  • test_explicit_lifetime.py: Lifetime tests for child graphs, conditional body graphs, events, kernels, registry handle recovery, and graph node reconstruction
  • test_explicit_errors.py: Error handling and edge-case tests
  • test_module.py: Library mismatch warning and foreign kernel from_handle tests
  • test_object_protocols.py: Protocol tests for new types
  • Conditional node tests require CC >= 9.0 (skipped on older hardware)

Andy-Jost added 17 commits March 3, 2026 15:09
…work

Rename cuda/core/_graph.py to cuda/core/_graph/__init__.py to create a
package that will house the explicit graph construction module alongside
the existing stream-capture-based implementation.

Ref: NVIDIA#1317
Made-with: Cursor
Implement explicit CUDA graph construction API as an alternative to
stream capture:

- GraphDef: wraps CUgraph with instantiate(), debug_dot_print(),
  nodes(), and edges() methods
- Node: fluent interface for building graphs with launch(), alloc(),
  free(), and join() methods
- GraphAllocOptions: dataclass for allocation options (device,
  memory_type, peer_access)
- Add __repr__, __eq__, __hash__ to GraphDef and Node for debugging
  and use in collections
- Add pred/succ properties to Node for graph traversal
- Refactor GraphDebugPrintOptions._to_flags() to share logic between
  GraphBuilder and GraphDef

Made-with: Cursor
…zed tests

Introduce AllocNode, KernelNode, EmptyNode, FreeNode subclasses with
properties populated from the CUDA driver API. AllocNode exposes dptr,
bytesize, device_id, memory_type, peer_access, and options; KernelNode
exposes grid, block, shmem_size, kernel, and config. Node.pred/succ
results are cached with automatic invalidation in builder methods.

Restructure test_explicit.py around GraphSpec (topology) and NodeSpec
(type + expected attributes) so that adding a new node type requires
only a builder function and one _NODE_SPECS entry. Move object protocol
tests to test_object_protocols.py for all node subclasses including
FreeNode and KernelNode.

Made-with: Cursor
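
The cache-with-invalidation scheme this commit describes for `Node.pred`/`Node.succ` follows a common pattern; here is a rough Python sketch (all names hypothetical, the real code queries edge topology from the CUDA driver rather than a Python list):

```python
class Node:
    """Sketch: pred results are computed once and cached; builder
    methods that change topology drop the cache."""

    def __init__(self, graph):
        self.graph = graph
        self._pred_cache = None

    @property
    def pred(self):
        if self._pred_cache is None:
            # Stand-in for querying edges from the CUDA driver.
            self._pred_cache = tuple(
                src for (src, dst) in self.graph.edges if dst is self)
        return self._pred_cache

class Graph:
    def __init__(self):
        self.edges = []
        self.nodes = []

    def add_node(self):
        node = Node(self)
        self.nodes.append(node)
        return node

    def add_edge(self, src, dst):
        self.edges.append((src, dst))
        # Builder methods invalidate cached traversal results.
        for n in self.nodes:
            n._pred_cache = None

g = Graph()
a, b, c = g.add_node(), g.add_node(), g.add_node()
g.add_edge(a, c)
first = c.pred    # computed and cached
g.add_edge(b, c)  # invalidates the cache
second = c.pred   # recomputed with the new edge
```
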
Extract fill value parsing (int/bytes/buffer protocol → value + element
size) from Buffer.fill() into cpdef _parse_fill_value in cuda_utils so
it can be reused by both Buffer.fill() and Node.memset().

Add MemsetNode class with properties: dptr, value, element_size, width,
height, pitch. Node.memset() builder supports 1D and 2D memset with
element sizes 1, 2, and 4. Tests cover all element sizes, 2D memset,
instantiate-and-execute, and object protocols.

Made-with: Cursor
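
A plausible shape for the shared fill-value parsing is sketched below. This is a hypothetical stand-in for `_parse_fill_value`, not the actual cuda_utils code; the assumption that an int implies element size 1 is mine:

```python
def parse_fill_value(value):
    """Sketch: accept an int (assumed to be one byte) or a bytes-like
    object of length 1, 2, or 4, returning (pattern, element_size)."""
    if isinstance(value, int):
        if not 0 <= value <= 0xFF:
            raise ValueError("int fill value must fit in one byte")
        return value, 1
    buf = bytes(memoryview(value))  # accepts bytes, bytearray, arrays, ...
    if len(buf) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes")
    return int.from_bytes(buf, "little"), len(buf)
```
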
Implements event record/wait graph nodes with full test coverage.
Adds non-owning create_event_handle_ref to RAII layer and
Event.from_handle() / Event._from_raw_handle() for reconstructing
Event objects from raw CUevent handles managed by the graph.

Made-with: Cursor
GraphDef now exposes alloc, free, memset, launch, record_event,
wait_event, and join directly. The virtual root node becomes an
internal implementation detail (_entry). Also renames
Event._from_raw_handle to Event._from_handle for consistency.

Made-with: Cursor
…hDef.handle

- Fix stale 'root' references to 'entry' in docstrings, comments, repr
- Add Node.handle property (returns CUgraphNode as int, None for entry)
- GraphDef.handle now uses as_py() for cleaner conversion
- Update reprs to show domain-relevant payload instead of ambiguous handles:
  EmptyNode shows pred count, KernelNode shows grid/block,
  AllocNode/FreeNode/MemsetNode show dptr and params,
  EventRecord/WaitNode show event handle

Made-with: Cursor
Simple 1D memcpy interface: Node.memcpy(dst, src, size) auto-detects
host vs device memory via cuPointerGetAttribute, falling back to device
type for graph-allocated pointers. Includes MemcpyNode subclass with
dst/src/size properties, GraphDef.memcpy forwarding, execution test
verifying data correctness, and object protocol coverage.

Made-with: Cursor
Node.embed(child) clones a GraphDef as a sub-graph node. Adds
create_graph_handle_ref for non-owning graph handles (child graph
is owned by the node, not the wrapper). ChildGraphNode exposes
child_graph property and shows subnode count in repr.

Made-with: Cursor
Implements host callback graph nodes supporting two modes:
- Python callable: GIL acquired via trampoline, nullary callbacks
- ctypes CFUNCTYPE: raw C function pointer with optional user_data
  (bytes copied to graph-managed buffer, or raw int passthrough)

Uses CUDA user objects to tie callback/data lifetime to the graph.

Made-with: Cursor
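
The Python-callable mode can be illustrated with plain ctypes. `CUhostFn` has the signature `void (*)(void*)`; this sketch wraps a nullary Python callable behind a C-callable trampoline (in the real implementation the trampoline re-acquires the GIL; ctypes callbacks do that automatically, so the concern does not show up here):

```python
import ctypes

# Matches the CUhostFn shape: void (*)(void*).
HOSTFN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

calls = []

def make_trampoline(py_callable):
    """Wrap a nullary Python callable as a C-callable function pointer.
    The returned object must be kept alive: it owns the C thunk."""
    def trampoline(user_data):
        py_callable()
    return HOSTFN(trampoline)

cb = make_trampoline(lambda: calls.append("ran"))
cb(None)  # simulate the driver invoking the host node callback
```
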
create_graph_handle_ref now takes a parent GraphHandle, keeping the
parent graph alive while any child/branch graph handle exists. This
prevents use-after-free when a ChildGraphNode outlives its parent
GraphDef.

Made-with: Cursor
Implement conditional node hierarchy with Condition wrapper class,
builder methods (if_cond, if_else, while_loop, switch), and branch
graph access via non-owning GraphDef handles. Pre-CUDA 13.2 driver
reconstruction falls back to ConditionalNode base class.

Made-with: Cursor
Use cuGraphNodeGetParams (Python driver API) to recover the exact
ConditionalNode subclass (IfNode, IfElseNode, WhileNode, SwitchNode)
when reconstructing from the driver. Falls back to the generic
ConditionalNode base on pre-13.2 drivers.

Made-with: Cursor
Add __all__, from __future__ import annotations (replacing
TYPE_CHECKING), return type annotations on all public methods and
properties, and reorder imports per the 5-group convention.

Made-with: Cursor
cuGraphAddNode replaces the phGraph_out pointer with its own internal
array rather than writing into the caller-provided buffer. Read body
graph handles from params.conditional.phGraph_out[i] after the call
instead of from a pre-allocated vector.

Add three integration tests exercising all 14 explicit-graph node types:
heat diffusion (WhileNode, ChildGraphNode, EventNodes, ...),
bisection root finder (IfElseNode, IfNode), and switch dispatch
(SwitchNode).

Made-with: Cursor
…uction

Made-with: Cursor

# Conflicts:
#	cuda_core/cuda/core/_cpp/resource_handles.cpp
#	cuda_core/cuda/core/_cpp/resource_handles.hpp
#	cuda_core/cuda/core/_resource_handles.pyx
#	cuda_core/cuda/core/_utils/cuda_utils.pyx
@Andy-Jost Andy-Jost added this to the cuda.core v0.7.0 milestone Mar 6, 2026
@Andy-Jost Andy-Jost added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Mar 6, 2026
@Andy-Jost Andy-Jost self-assigned this Mar 6, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

test_explicit_lifetime.py verifies the RAII parent-capture mechanism
in create_graph_handle_ref prevents dangling references when parent
GraphDef objects are deleted while child/body graph handles remain.

test_explicit_errors.py covers input validation (type checks for
conditional methods, invalid memset values, null free, cross-graph
condition misuse), edge cases (join variants, multiple instantiation,
unmatched alloc), and boundary condition execution (while-loop zero
iterations, if-cond false, switch out-of-range).

Made-with: Cursor
…ate Event factories

- Introduce NodeHandle (shared_ptr<CUgraphNode> with NodeBox) to tie
  node lifetime to owning graph, replacing raw CUgraphNode in Node objects
- Attach EventHandle/KernelHandle copies as CUDA user objects to graphs,
  preventing dangling references when Python wrappers are GC'd
- Consolidate Event factories to _init and _from_handle(EventHandle)
- Inline as_cu() calls throughout _graphdef.pyx
- Add lifetime tests validating event, kernel, and child-graph survival

Made-with: Cursor
…arithmetic

Event no longer caches timing_disabled, busy_waited, ipc_enabled,
device_id, or h_context as Python-side fields. All metadata lives on
EventBox (C++ anonymous namespace) and is accessed through overloaded
get_box() + getter functions (get_event_timing_disabled, etc.).
The Event class now holds only _h_event and _ipc_descriptor.

Made-with: Cursor
Introduce HandleRegistry<Key, Handle> class template for mapping raw
CUDA handles back to their owning shared_ptr. create_event_handle_ref
now checks the registry first, recovering full metadata when the event
is already managed. Add tests verifying metadata preservation through
reconstruction and GC.

Made-with: Cursor
Introduce HandleRegistry<Key, Handle, Hash> class template for mapping
raw CUDA handles back to their owning shared_ptr. Event registry enables
create_event_handle_ref to recover full metadata when the driver returns
a CUevent we already manage. Refactor IPC pointer cache to use the same
template with a separate mutex for atomic check-then-import. Add tests
verifying event metadata preservation through reconstruction and GC.

Made-with: Cursor
…ntics

Restructure create_kernel_handle to register directly in a
HandleRegistry<CUkernel, KernelHandle>, and simplify
create_kernel_handle_ref to lookup-or-ref (dropping the LibraryHandle
parameter). Add get_kernel_library accessor for KernelBox metadata.

Kernel.from_handle now recovers the owning handle automatically for
cuda.core-created kernels, cross-checks caller-supplied ObjectCode on
mismatch, and retains foreign ObjectCode via _keepalive. Rename
Kernel._from_obj to _from_handle for consistency with the project.

Made-with: Cursor
…inology

Avoids ambiguity with potential future node types in other domains.
Renames NodeBox, create_node_handle, and node_get_graph accordingly.

Made-with: Cursor
@Andy-Jost Andy-Jost force-pushed the explicit-graph-construction branch from 214d6c7 to 73ba7fe Compare March 6, 2026 23:12
Comment on lines 250 to 251
def fill(self, value: int | BufferProtocol, *, stream: Stream | GraphBuilder):
    """Fill this buffer with a repeating byte pattern.
Contributor Author

Buffer.fill was reworked to extract the input processing, so memset nodes could share that.

Comment on lines 365 to +371
struct EventBox {
    CUevent resource;
    bool timing_disabled;
    bool busy_waited;
    bool ipc_enabled;
    int device_id;
    ContextHandle h_context;
Contributor Author

These properties are set at event creation time and cannot be queried through the driver API. Moreover, graph-attached events are returned from the driver as plain CUevent handles, and reconstructing the Cython Event object from one of those would lose this information.

The solution is to move the property metadata into C++ and set up a reverse look-up so that the driver-returned CUevent can be used to retrieve the managing shared_ptr, which holds this EventBox.

Graph-attached kernels are handled similarly.

Comment on lines +127 to +150
cdef void _attach_user_object(
        cydriver.CUgraph graph, void* ptr,
        cydriver.CUhostFn destroy) except *:
    """Create a CUDA user object and transfer ownership to the graph.

    On success the graph owns the resource (via MOVE semantics).
    On failure the destroy callback is invoked to clean up ptr,
    then a CUDAError is raised — callers need no try/except.
    """
    cdef cydriver.CUuserObject user_obj = NULL
    cdef cydriver.CUresult ret
    with nogil:
        ret = cydriver.cuUserObjectCreate(
            &user_obj, ptr, destroy, 1,
            cydriver.CU_USER_OBJECT_NO_DESTRUCTOR_SYNC)
        if ret == cydriver.CUDA_SUCCESS:
            ret = cydriver.cuGraphRetainUserObject(
                graph, user_obj, 1, cydriver.CU_GRAPH_USER_OBJECT_MOVE)
            if ret != cydriver.CUDA_SUCCESS:
                cydriver.cuUserObjectRelease(user_obj, 1)
    if ret != cydriver.CUDA_SUCCESS:
        if user_obj == NULL:
            destroy(ptr)
        HANDLE_RETURN(ret)
Contributor Author

@Andy-Jost Andy-Jost Mar 6, 2026

Resources like host functions, event handles, and kernel handles are placed into CUDA user objects, which are like capsules. This way, their lifetimes are properly tied to the graphs they appear in, and they follow those graphs through cloning steps.
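
The retain/release mechanics can be sketched without CUDA. This is a rough Python analogue (all names hypothetical) of `cuUserObjectCreate`/`cuGraphRetainUserObject` with MOVE semantics, showing why a resource survives as long as any graph clone still retains it:

```python
class UserObject:
    """Capsule tying a resource's lifetime to the graphs that retain it."""
    def __init__(self, resource, destroy):
        self.resource, self._destroy, self.refs = resource, destroy, 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self._destroy(self.resource)

class Graph:
    def __init__(self):
        self._retained = []

    def retain(self, uobj):  # MOVE semantics: the graph takes the reference
        self._retained.append(uobj)

    def clone(self):  # clones share the retained user objects
        g = Graph()
        for u in self._retained:
            u.refs += 1
            g.retain(u)
        return g

    def destroy(self):
        for u in self._retained:
            u.release()

destroyed = []
g = Graph()
g.retain(UserObject("event", destroyed.append))
g2 = g.clone()
g.destroy()                          # clone still holds a reference
alive_after_first = destroyed == []  # resource survives the original graph
g2.destroy()                         # last reference gone, destructor runs
```
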

Comment on lines +238 to +271
@dataclass
class GraphAllocOptions:
    """Options for graph memory allocation nodes.

    Attributes
    ----------
    device : int or Device, optional
        The device on which to allocate memory. If None (default),
        uses the current CUDA context's device.
    memory_type : str, optional
        Type of memory to allocate. One of:

        - ``"device"`` (default): Pinned device memory, optimal for GPU kernels.
        - ``"host"``: Pinned host memory, accessible from both host and device.
          Useful for graphs containing host callback nodes. Note: may not be
          supported on all systems/drivers.
        - ``"managed"``: Managed/unified memory that automatically migrates
          between host and device. Useful for mixed host/device access patterns.

    peer_access : list of int or Device, optional
        List of devices that should have read-write access to the
        allocated memory. If None (default), only the allocating
        device has access.

    Notes
    -----
    - IPC (inter-process communication) is not supported for graph
      memory allocation nodes per CUDA documentation.
    - The allocation uses the device's default memory pool.
    """

    device: int | Device | None = None
    memory_type: str = "device"
    peer_access: list | None = None
Contributor Author

I made this options class to match the way memory resources are created. But other graph operations just take arguments, and there are only three arguments wrapped here. I'm not sure having this is better than just adding these arguments to the alloc() function.

Comment on lines +314 to +319
def alloc(self, size_t size, options: GraphAllocOptions | None = None) -> AllocNode:
    """Add an entry-point memory allocation node (no dependencies).

    See :meth:`Node.alloc` for full documentation.
    """
    return self._entry.alloc(size, options)
Contributor Author

I would rather use *args, **kwargs in these forwarding functions, but I have a feeling someone would complain about IDE integration.
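
One middle ground (a sketch, not what the PR does) is `*args, **kwargs` forwarding combined with `functools.wraps`, which copies `__doc__` and sets `__wrapped__` so that `inspect.signature` and IDEs that honor it still report the real parameter list. All class and method names here are illustrative:

```python
import functools
import inspect

class Node:
    def alloc(self, size, options=None):
        """Add a memory allocation node."""
        return ("alloc", size, options)

class GraphDef:
    def __init__(self):
        self._entry = Node()

    # Generic forwarding, but with metadata copied from the wrapped method
    # so introspection tools see Node.alloc's signature, not (*args, **kwargs).
    @functools.wraps(Node.alloc)
    def alloc(self, *args, **kwargs):
        return self._entry.alloc(*args, **kwargs)

sig = inspect.signature(GraphDef.alloc)  # follows __wrapped__ to Node.alloc
```

Whether editors pick this up reliably is exactly the concern raised above; explicit forwarding signatures remain the safest choice.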
