Add explicit CUDA graph construction API#1729

Draft
Andy-Jost wants to merge 25 commits into NVIDIA:main from Andy-Jost:explicit-graph-construction

Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Mar 6, 2026

Summary

  • Implements explicit (non-stream-capture) CUDA graph construction via GraphDef and a Node class hierarchy with 14 concrete node types: EmptyNode, KernelNode, AllocNode, FreeNode, MemsetNode, MemcpyNode, ChildGraphNode, HostCallbackNode, EventRecordNode, EventWaitNode, IfNode, IfElseNode, WhileNode, SwitchNode
  • RAII lifetime management for graph resources (events, kernels) via CUDA User Objects, ensuring resources survive graph cloning and GC
  • Reverse-lookup registries for Event and Kernel handles (HandleRegistry template), enabling owning handle recovery and full metadata round-tripping on reconstruction
  • Event metadata (timing, IPC, busy-wait, device, context) stored in C++ EventBox with accessor functions, eliminating metadata loss on handle reconstruction
  • Kernel from_handle simplified: registry-first lookup for cuda.core-created kernels, _keepalive reference for foreign kernels
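
The reverse-lookup registry described above can be sketched in plain Python. This is a minimal, hypothetical analogue of the C++ `HandleRegistry` (names and semantics assumed, not taken from the PR's actual implementation): a thread-safe map from a raw handle back to its owning box, tracked weakly so the registry never extends the box's lifetime.

```python
import threading
import weakref

class HandleRegistry:
    """Sketch: map a raw handle (e.g. an integer CUevent address) back to
    its owning box object, tracked weakly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}  # raw handle -> weakref to owning box

    def register(self, raw, box):
        with self._lock:
            # Purge the entry automatically once the box is collected.
            self._map[raw] = weakref.ref(box, lambda _: self._map.pop(raw, None))

    def lookup(self, raw):
        with self._lock:
            ref = self._map.get(raw)
        return ref() if ref is not None else None

class EventBox:
    """Stand-in for the C++ EventBox carrying creation-time metadata."""
    def __init__(self, raw, timing_disabled):
        self.raw = raw
        self.timing_disabled = timing_disabled

registry = HandleRegistry()
box = EventBox(raw=0x1234, timing_disabled=True)
registry.register(box.raw, box)

# A driver-returned raw handle recovers the full-metadata box.
recovered = registry.lookup(0x1234)
```

The point of the weak reference is that registration is a side effect of handle creation, not an ownership relationship: dropping the last strong reference to the box still destroys it.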

Changes

  • _graphdef.pyx / _graphdef.pxd: Full explicit graph API — GraphDef, Node hierarchy, Condition, fluent builder methods, __repr__, __eq__/__hash__/__weakref__ protocols, CUDA User Object attachment for event/kernel lifetime
  • resource_handles.{hpp,cpp}: GraphHandle with owning/non-owning variants and parent-capture for child graphs; GraphNodeHandle (RAII node handle with parent graph reference); HandleRegistry<Key, Handle, Hash> template for thread-safe reverse lookup; Event/Kernel registries with weak_ptr tracking; EventBox metadata fields and get_event_* accessors; get_kernel_library accessor; IPC cache refactored to use HandleRegistry
  • _resource_handles.{pxd,pyx}: Cython declarations for graph node handles, event/kernel registries, and accessor functions
  • _event.{pxd,pyx}: Consolidated factory methods (_init, _from_handle); metadata properties delegate to EventBox via C++ accessors
  • _module.{pxd,pyx}: Kernel.from_handle with registry lookup and _keepalive for foreign kernels; renamed _from_obj to _from_handle

Test Coverage

  • test_explicit.py: 176 unit tests (topology, node types, attributes, execution)
  • test_explicit_integration.py: 6 integration tests (heat diffusion, bisection, switch dispatch) covering all 14 node types
  • test_explicit_lifetime.py: Lifetime tests for child graphs, conditional body graphs, events, kernels, registry handle recovery, and graph node reconstruction
  • test_explicit_errors.py: Error handling and edge-case tests
  • test_module.py: Library mismatch warning and foreign kernel from_handle tests
  • test_object_protocols.py: Protocol tests for new types
  • Conditional node tests require CC >= 9.0 (skipped on older hardware)

Andy-Jost added 17 commits March 3, 2026 15:09
…work

Rename cuda/core/_graph.py to cuda/core/_graph/__init__.py to create a
package that will house the explicit graph construction module alongside
the existing stream-capture-based implementation.

Ref: NVIDIA#1317
Made-with: Cursor
Implement explicit CUDA graph construction API as an alternative to
stream capture:

- GraphDef: wraps CUgraph with instantiate(), debug_dot_print(),
  nodes(), and edges() methods
- Node: fluent interface for building graphs with launch(), alloc(),
  free(), and join() methods
- GraphAllocOptions: dataclass for allocation options (device,
  memory_type, peer_access)
- Add __repr__, __eq__, __hash__ to GraphDef and Node for debugging
  and use in collections
- Add pred/succ properties to Node for graph traversal
- Refactor GraphDebugPrintOptions._to_flags() to share logic between
  GraphBuilder and GraphDef

Made-with: Cursor
…zed tests

Introduce AllocNode, KernelNode, EmptyNode, FreeNode subclasses with
properties populated from the CUDA driver API. AllocNode exposes dptr,
bytesize, device_id, memory_type, peer_access, and options; KernelNode
exposes grid, block, shmem_size, kernel, and config. Node.pred/succ
results are cached with automatic invalidation in builder methods.

Restructure test_explicit.py around GraphSpec (topology) and NodeSpec
(type + expected attributes) so that adding a new node type requires
only a builder function and one _NODE_SPECS entry. Move object protocol
tests to test_object_protocols.py for all node subclasses including
FreeNode and KernelNode.

Made-with: Cursor
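
The cache-with-invalidation scheme this commit describes for `Node.pred`/`Node.succ` follows a common pattern; here is a rough Python sketch (all names hypothetical, the real code queries edge topology from the CUDA driver rather than a Python list):

```python
class Node:
    """Sketch: pred results are computed once and cached; builder
    methods that change topology drop the cache."""

    def __init__(self, graph):
        self.graph = graph
        self._pred_cache = None

    @property
    def pred(self):
        if self._pred_cache is None:
            # Stand-in for querying edges from the CUDA driver.
            self._pred_cache = tuple(
                src for (src, dst) in self.graph.edges if dst is self)
        return self._pred_cache

class Graph:
    def __init__(self):
        self.edges = []
        self.nodes = []

    def add_node(self):
        node = Node(self)
        self.nodes.append(node)
        return node

    def add_edge(self, src, dst):
        self.edges.append((src, dst))
        # Builder methods invalidate cached traversal results.
        for n in self.nodes:
            n._pred_cache = None

g = Graph()
a, b, c = g.add_node(), g.add_node(), g.add_node()
g.add_edge(a, c)
first = c.pred    # computed and cached
g.add_edge(b, c)  # invalidates the cache
second = c.pred   # recomputed with the new edge
```
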
Extract fill value parsing (int/bytes/buffer protocol → value + element
size) from Buffer.fill() into cpdef _parse_fill_value in cuda_utils so
it can be reused by both Buffer.fill() and Node.memset().

Add MemsetNode class with properties: dptr, value, element_size, width,
height, pitch. Node.memset() builder supports 1D and 2D memset with
element sizes 1, 2, and 4. Tests cover all element sizes, 2D memset,
instantiate-and-execute, and object protocols.

Made-with: Cursor
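
A plausible shape for the shared fill-value parsing is sketched below. This is a hypothetical stand-in for `_parse_fill_value`, not the actual cuda_utils code; the assumption that an int implies element size 1 is mine:

```python
def parse_fill_value(value):
    """Sketch: accept an int (assumed to be one byte) or a bytes-like
    object of length 1, 2, or 4, returning (pattern, element_size)."""
    if isinstance(value, int):
        if not 0 <= value <= 0xFF:
            raise ValueError("int fill value must fit in one byte")
        return value, 1
    buf = bytes(memoryview(value))  # accepts bytes, bytearray, arrays, ...
    if len(buf) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes")
    return int.from_bytes(buf, "little"), len(buf)
```
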
Implements event record/wait graph nodes with full test coverage.
Adds non-owning create_event_handle_ref to RAII layer and
Event.from_handle() / Event._from_raw_handle() for reconstructing
Event objects from raw CUevent handles managed by the graph.

Made-with: Cursor
GraphDef now exposes alloc, free, memset, launch, record_event,
wait_event, and join directly. The virtual root node becomes an
internal implementation detail (_entry). Also renames
Event._from_raw_handle to Event._from_handle for consistency.

Made-with: Cursor
…hDef.handle

- Fix stale 'root' references to 'entry' in docstrings, comments, repr
- Add Node.handle property (returns CUgraphNode as int, None for entry)
- GraphDef.handle now uses as_py() for cleaner conversion
- Update reprs to show domain-relevant payload instead of ambiguous handles:
  EmptyNode shows pred count, KernelNode shows grid/block,
  AllocNode/FreeNode/MemsetNode show dptr and params,
  EventRecord/WaitNode show event handle

Made-with: Cursor
Simple 1D memcpy interface: Node.memcpy(dst, src, size) auto-detects
host vs device memory via cuPointerGetAttribute, falling back to device
type for graph-allocated pointers. Includes MemcpyNode subclass with
dst/src/size properties, GraphDef.memcpy forwarding, execution test
verifying data correctness, and object protocol coverage.

Made-with: Cursor
Node.embed(child) clones a GraphDef as a sub-graph node. Adds
create_graph_handle_ref for non-owning graph handles (child graph
is owned by the node, not the wrapper). ChildGraphNode exposes
child_graph property and shows subnode count in repr.

Made-with: Cursor
Implements host callback graph nodes supporting two modes:
- Python callable: GIL acquired via trampoline, nullary callbacks
- ctypes CFUNCTYPE: raw C function pointer with optional user_data
  (bytes copied to graph-managed buffer, or raw int passthrough)

Uses CUDA user objects to tie callback/data lifetime to the graph.

Made-with: Cursor
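
The Python-callable mode can be illustrated with plain ctypes. `CUhostFn` has the signature `void (*)(void*)`; this sketch wraps a nullary Python callable behind a C-callable trampoline (in the real implementation the trampoline re-acquires the GIL; ctypes callbacks do that automatically, so the concern does not show up here):

```python
import ctypes

# Matches the CUhostFn shape: void (*)(void*).
HOSTFN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

calls = []

def make_trampoline(py_callable):
    """Wrap a nullary Python callable as a C-callable function pointer.
    The returned object must be kept alive: it owns the C thunk."""
    def trampoline(user_data):
        py_callable()
    return HOSTFN(trampoline)

cb = make_trampoline(lambda: calls.append("ran"))
cb(None)  # simulate the driver invoking the host node callback
```
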
create_graph_handle_ref now takes a parent GraphHandle, keeping the
parent graph alive while any child/branch graph handle exists. This
prevents use-after-free when a ChildGraphNode outlives its parent
GraphDef.

Made-with: Cursor
Implement conditional node hierarchy with Condition wrapper class,
builder methods (if_cond, if_else, while_loop, switch), and branch
graph access via non-owning GraphDef handles. Pre-CUDA 13.2 driver
reconstruction falls back to ConditionalNode base class.

Made-with: Cursor
Use cuGraphNodeGetParams (Python driver API) to recover the exact
ConditionalNode subclass (IfNode, IfElseNode, WhileNode, SwitchNode)
when reconstructing from the driver. Falls back to the generic
ConditionalNode base on pre-13.2 drivers.

Made-with: Cursor
Add __all__, from __future__ import annotations (replacing
TYPE_CHECKING), return type annotations on all public methods and
properties, and reorder imports per the 5-group convention.

Made-with: Cursor
cuGraphAddNode replaces the phGraph_out pointer with its own internal
array rather than writing into the caller-provided buffer. Read body
graph handles from params.conditional.phGraph_out[i] after the call
instead of from a pre-allocated vector.

Add three integration tests exercising all 14 explicit-graph node types:
heat diffusion (WhileNode, ChildGraphNode, EventNodes, ...),
bisection root finder (IfElseNode, IfNode), and switch dispatch
(SwitchNode).

Made-with: Cursor
…uction

Made-with: Cursor

# Conflicts:
#	cuda_core/cuda/core/_cpp/resource_handles.cpp
#	cuda_core/cuda/core/_cpp/resource_handles.hpp
#	cuda_core/cuda/core/_resource_handles.pyx
#	cuda_core/cuda/core/_utils/cuda_utils.pyx
@Andy-Jost Andy-Jost added this to the cuda.core v0.7.0 milestone Mar 6, 2026
@Andy-Jost Andy-Jost added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Mar 6, 2026
@Andy-Jost Andy-Jost self-assigned this Mar 6, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

test_explicit_lifetime.py verifies the RAII parent-capture mechanism
in create_graph_handle_ref prevents dangling references when parent
GraphDef objects are deleted while child/body graph handles remain.

test_explicit_errors.py covers input validation (type checks for
conditional methods, invalid memset values, null free, cross-graph
condition misuse), edge cases (join variants, multiple instantiation,
unmatched alloc), and boundary condition execution (while-loop zero
iterations, if-cond false, switch out-of-range).

Made-with: Cursor
…ate Event factories

- Introduce NodeHandle (shared_ptr<CUgraphNode> with NodeBox) to tie
  node lifetime to owning graph, replacing raw CUgraphNode in Node objects
- Attach EventHandle/KernelHandle copies as CUDA user objects to graphs,
  preventing dangling references when Python wrappers are GC'd
- Consolidate Event factories to _init and _from_handle(EventHandle)
- Inline as_cu() calls throughout _graphdef.pyx
- Add lifetime tests validating event, kernel, and child-graph survival

Made-with: Cursor
…arithmetic

Event no longer caches timing_disabled, busy_waited, ipc_enabled,
device_id, or h_context as Python-side fields. All metadata lives on
EventBox (C++ anonymous namespace) and is accessed through overloaded
get_box() + getter functions (get_event_timing_disabled, etc.).
The Event class now holds only _h_event and _ipc_descriptor.

Made-with: Cursor
Introduce HandleRegistry<Key, Handle> class template for mapping raw
CUDA handles back to their owning shared_ptr. create_event_handle_ref
now checks the registry first, recovering full metadata when the event
is already managed. Add tests verifying metadata preservation through
reconstruction and GC.

Made-with: Cursor
Introduce HandleRegistry<Key, Handle, Hash> class template for mapping
raw CUDA handles back to their owning shared_ptr. Event registry enables
create_event_handle_ref to recover full metadata when the driver returns
a CUevent we already manage. Refactor IPC pointer cache to use the same
template with a separate mutex for atomic check-then-import. Add tests
verifying event metadata preservation through reconstruction and GC.

Made-with: Cursor
…ntics

Restructure create_kernel_handle to register directly in a
HandleRegistry<CUkernel, KernelHandle>, and simplify
create_kernel_handle_ref to lookup-or-ref (dropping the LibraryHandle
parameter). Add get_kernel_library accessor for KernelBox metadata.

Kernel.from_handle now recovers the owning handle automatically for
cuda.core-created kernels, cross-checks caller-supplied ObjectCode on
mismatch, and retains foreign ObjectCode via _keepalive. Rename
Kernel._from_obj to _from_handle for consistency with the project.

Made-with: Cursor
…inology

Avoids ambiguity with potential future node types in other domains.
Renames NodeBox, create_node_handle, and node_get_graph accordingly.

Made-with: Cursor
@Andy-Jost Andy-Jost force-pushed the explicit-graph-construction branch from 214d6c7 to 73ba7fe Compare March 6, 2026 23:12
Comment on lines 250 to 251
def fill(self, value: int | BufferProtocol, *, stream: Stream | GraphBuilder):
    """Fill this buffer with a repeating byte pattern.
Contributor Author

Buffer.fill was reworked to extract the input processing, so memset nodes could share that.

Comment on lines 365 to +371
struct EventBox {
    CUevent resource;
    bool timing_disabled;
    bool busy_waited;
    bool ipc_enabled;
    int device_id;
    ContextHandle h_context;
Contributor Author

These properties are set at event creation time and cannot be queried through the driver API. Moreover, graph-attached events are returned from the driver as plain CUevent handles, and reconstructing the Cython Event object from one of those would lose this information.

The solution is to move the property metadata into C++ and set up a reverse look-up so that the driver-returned CUevent can be used to retrieve the managing shared_ptr, which holds this EventBox.

Graph-attached kernels are handled similarly.

Comment on lines +127 to +150
cdef void _attach_user_object(
        cydriver.CUgraph graph, void* ptr,
        cydriver.CUhostFn destroy) except *:
    """Create a CUDA user object and transfer ownership to the graph.

    On success the graph owns the resource (via MOVE semantics).
    On failure the destroy callback is invoked to clean up ptr,
    then a CUDAError is raised — callers need no try/except.
    """
    cdef cydriver.CUuserObject user_obj = NULL
    cdef cydriver.CUresult ret
    with nogil:
        ret = cydriver.cuUserObjectCreate(
            &user_obj, ptr, destroy, 1,
            cydriver.CU_USER_OBJECT_NO_DESTRUCTOR_SYNC)
        if ret == cydriver.CUDA_SUCCESS:
            ret = cydriver.cuGraphRetainUserObject(
                graph, user_obj, 1, cydriver.CU_GRAPH_USER_OBJECT_MOVE)
            if ret != cydriver.CUDA_SUCCESS:
                cydriver.cuUserObjectRelease(user_obj, 1)
    if ret != cydriver.CUDA_SUCCESS:
        if user_obj == NULL:
            destroy(ptr)
        HANDLE_RETURN(ret)
Contributor Author

@Andy-Jost Andy-Jost Mar 6, 2026

Resources like host functions, event handles, and kernel handles are placed into CUDA user objects, which are like capsules. This way, their lifetimes are properly tied to the graphs they appear in, and they follow those graphs through cloning steps.
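
The retain/release mechanics can be sketched without CUDA. This is a rough Python analogue (all names hypothetical) of `cuUserObjectCreate`/`cuGraphRetainUserObject` with MOVE semantics, showing why a resource survives as long as any graph clone still retains it:

```python
class UserObject:
    """Capsule tying a resource's lifetime to the graphs that retain it."""
    def __init__(self, resource, destroy):
        self.resource, self._destroy, self.refs = resource, destroy, 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self._destroy(self.resource)

class Graph:
    def __init__(self):
        self._retained = []

    def retain(self, uobj):  # MOVE semantics: the graph takes the reference
        self._retained.append(uobj)

    def clone(self):  # clones share the retained user objects
        g = Graph()
        for u in self._retained:
            u.refs += 1
            g.retain(u)
        return g

    def destroy(self):
        for u in self._retained:
            u.release()

destroyed = []
g = Graph()
g.retain(UserObject("event", destroyed.append))
g2 = g.clone()
g.destroy()                          # clone still holds a reference
alive_after_first = destroyed == []  # resource survives the original graph
g2.destroy()                         # last reference gone, destructor runs
```
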

Comment on lines +238 to +271
@dataclass
class GraphAllocOptions:
    """Options for graph memory allocation nodes.

    Attributes
    ----------
    device : int or Device, optional
        The device on which to allocate memory. If None (default),
        uses the current CUDA context's device.
    memory_type : str, optional
        Type of memory to allocate. One of:

        - ``"device"`` (default): Pinned device memory, optimal for GPU kernels.
        - ``"host"``: Pinned host memory, accessible from both host and device.
          Useful for graphs containing host callback nodes. Note: may not be
          supported on all systems/drivers.
        - ``"managed"``: Managed/unified memory that automatically migrates
          between host and device. Useful for mixed host/device access patterns.

    peer_access : list of int or Device, optional
        List of devices that should have read-write access to the
        allocated memory. If None (default), only the allocating
        device has access.

    Notes
    -----
    - IPC (inter-process communication) is not supported for graph
      memory allocation nodes per CUDA documentation.
    - The allocation uses the device's default memory pool.
    """

    device: int | Device | None = None
    memory_type: str = "device"
    peer_access: list | None = None
Contributor Author

I made this options class to match the way memory resources are created. But other graph operations just take arguments, and there are only three arguments wrapped here. I'm not sure having this is better than just adding these arguments to the alloc() function.

Comment on lines +314 to +319
def alloc(self, size_t size, options: GraphAllocOptions | None = None) -> AllocNode:
    """Add an entry-point memory allocation node (no dependencies).

    See :meth:`Node.alloc` for full documentation.
    """
    return self._entry.alloc(size, options)
Contributor Author

I would rather use *args, **kwargs in these forwarding functions, but I have a feeling someone would complain about IDE integration.
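
One middle ground (a sketch, not what the PR does) is `*args, **kwargs` forwarding combined with `functools.wraps`, which copies `__doc__` and sets `__wrapped__` so that `inspect.signature` and IDEs that honor it still report the real parameter list. All class and method names here are illustrative:

```python
import functools
import inspect

class Node:
    def alloc(self, size, options=None):
        """Add a memory allocation node."""
        return ("alloc", size, options)

class GraphDef:
    def __init__(self):
        self._entry = Node()

    # Generic forwarding, but with metadata copied from the wrapped method
    # so introspection tools see Node.alloc's signature, not (*args, **kwargs).
    @functools.wraps(Node.alloc)
    def alloc(self, *args, **kwargs):
        return self._entry.alloc(*args, **kwargs)

sig = inspect.signature(GraphDef.alloc)  # follows __wrapped__ to Node.alloc
```

Whether editors pick this up reliably is exactly the concern raised above; explicit forwarding signatures remain the safest choice.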
