
feat: add TRT-RTX native CUDA graph support#4187

Draft
tp5uiuc wants to merge 5 commits into pytorch:main from tp5uiuc:feat/trtrtx-cudagraphs

Conversation

Contributor

tp5uiuc commented Apr 14, 2026

Description

Add cuda_graph_strategy compilation setting and automatic RTX-native CUDA graph integration for the Python runtime path (PythonTorchTensorRTModule).

TensorRT-RTX has native CUDA graph support via IRuntimeConfig.cuda_graph_strategy, where the JIT compiler handles capture/replay/invalidation internally. This is superior to manual torch.cuda.CUDAGraph() capture on RTX because:

  • Manual capture freezes fallback kernels; lazy-compiled specialized kernels can never replace them
  • Runtime allocation or data-dependent shapes can cause cudaStreamBeginCapture to fail
  • The JIT compiler automatically manages graph staleness (shape changes, pointer changes, kernel readiness)

Key changes

  • New cuda_graph_strategy setting on CompilationSettings ("disabled" / "whole_graph_capture")
  • Mapped to trt.CudaGraphStrategy on IRuntimeConfig (same pattern as dynamic_shapes_kernel_specialization_strategy)
  • SUBGRAPH mode (set_cudagraphs_mode(True)): On RTX, always use RTX-native CUDA graphs — manual capture is bypassed. If cuda_graph_strategy was not explicitly set, the runtime overrides to whole_graph_capture and warns.
  • WHOLE_GRAPH mode (enable_cudagraphs() with mixed TRT + PyTorch ops): Validates all TRT engines are monolithically capturable via context.is_stream_capturable(stream) and strategy != "lazy". If capturable, proceeds with outer monolithic capture (RTX-native disabled per-engine). If not capturable, raises RuntimeError.
  • _is_monolithic_capturable() — runtime check combining stream capturability and kernel specialization strategy
  • _enable_rtx_native_cudagraphs() — recreates execution context with WHOLE_GRAPH_CAPTURE
  • _check_monolithic_capturability() in CudaGraphsTorchTensorRTModule for mixed graph validation
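The capturability rule behind `_is_monolithic_capturable()` can be sketched in isolation. This is an illustrative stand-in, not the PR's implementation: `FakeContext` is a hypothetical stub for the TensorRT execution context, which in the real code exposes `is_stream_capturable(stream)`.

```python
# Sketch of the monolithic-capturability rule described above.
# FakeContext is a hypothetical stand-in for a TRT execution context.
from dataclasses import dataclass


@dataclass
class FakeContext:
    """Stub mimicking the assumed is_stream_capturable() interface."""
    stream_capturable: bool

    def is_stream_capturable(self, stream: int) -> bool:
        return self.stream_capturable


def is_monolithic_capturable(context, stream: int, specialization_strategy: str) -> bool:
    """An engine may join an outer monolithic capture only if its stream is
    capturable AND kernel specialization is not lazy: a kernel JIT-compiled
    after capture could never replace the one frozen into the graph."""
    return context.is_stream_capturable(stream) and specialization_strategy != "lazy"
```

With an `"eager"` strategy and a capturable stream the check passes; a `"lazy"` strategy or an uncapturable stream fails it, which is what triggers the `RuntimeError` path in WHOLE_GRAPH mode.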

Behavior matrix

| Graph type | CUDA graph mode | RTX? | Behavior |
|---|---|---|---|
| TRT-only | SUBGRAPH | Yes | RTX-native always (override if needed) |
| TRT-only | SUBGRAPH | No | Manual capture (existing) |
| Mixed | WHOLE_GRAPH | Yes + capturable | Monolithic capture; RTX-native disabled per-engine |
| Mixed | WHOLE_GRAPH | Yes + NOT capturable | RuntimeError |
| Mixed | WHOLE_GRAPH | No | Monolithic capture (existing) |
| Any | No cudagraphs + strategy set | Yes | RTX-native runs transparently |
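The matrix can be encoded as a small dispatch function. This is a sketch that only mirrors the table above: the function name, mode strings, and return labels are hypothetical, not identifiers from the PR.

```python
# Hypothetical encoding of the behavior matrix; labels are illustrative.
def cudagraph_behavior(mode: str, is_rtx: bool, capturable: bool = True) -> str:
    if mode == "SUBGRAPH":  # TRT-only graphs
        return "rtx_native" if is_rtx else "manual_capture"
    if mode == "WHOLE_GRAPH":  # mixed TRT + PyTorch ops
        if not is_rtx:
            return "monolithic_capture"
        if capturable:
            # Outer capture proceeds; RTX-native graphs disabled per-engine.
            return "monolithic_capture_rtx_native_disabled"
        raise RuntimeError("TRT engine is not monolithically capturable on RTX")
    # No cudagraphs mode: with a strategy set, RTX-native runs transparently.
    return "rtx_native_transparent" if is_rtx else "no_cudagraphs"
```

Note the one hard-failure cell: WHOLE_GRAPH on RTX with a non-capturable engine raises rather than silently falling back, matching the validation described for `_check_monolithic_capturability()`.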

Depends on #4180 (runtime cache) and #4184 (dynamic shapes strategy).

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

tp5uiuc and others added 5 commits April 10, 2026 13:16
Add runtime cache support for TensorRT-RTX JIT compilation results,
replacing the timing cache which is not used by RTX (no autotuning).

Changes:
- Skip timing cache creation/saving for TensorRT-RTX in _TRTInterpreter
- Add RUNTIME_CACHE_PATH default and runtime_cache_path setting
- Wire up IRuntimeCache in PythonTorchTensorRTModule (setup, load, save)
- Persist runtime cache to disk with filelock for concurrent access safety
- Thread runtime_cache_path through all compile functions
- Add unit tests (12 tests) and E2E model tests (6 tests)
- Update docstrings and RST documentation

Fixes pytorch#3817

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Version provided by upstream torch; no pin needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expose IRuntimeConfig.setDynamicShapesKernelSpecializationStrategy()
through the Torch-TensorRT Python API. Users can now control how
shape-specialized kernels are compiled at runtime for dynamic shapes
on TensorRT-RTX via the new `dynamic_shapes_kernel_specialization_strategy`
compilation setting ("lazy", "eager", or "none").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review feedback: compile with torchtrt.Input min/opt/max
ranges so dynamic shapes are actually exercised.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cuda_graph_strategy compilation setting and automatic RTX-native
CUDA graph integration for the Python runtime path.

Key changes:
- New cuda_graph_strategy setting ("disabled" / "whole_graph_capture")
  on CompilationSettings, mapped to trt.CudaGraphStrategy on
  IRuntimeConfig (same pattern as dynamic_shapes_kernel_specialization)
- In SUBGRAPH cudagraph mode on RTX, always use RTX-native CUDA graphs
  (manual torch.cuda.CUDAGraph capture is not safe due to lazy kernel
  specialization and potential runtime allocation)
- _is_monolithic_capturable() check using context.is_stream_capturable()
  and strategy != "lazy" for WHOLE_GRAPH mode safety validation
- _enable_rtx_native_cudagraphs() for runtime context recreation
- _check_monolithic_capturability() in CudaGraphsTorchTensorRTModule
  for mixed TRT + PyTorch graph validation
- Comprehensive unit tests covering all code paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@meta-cla meta-cla bot added the cla signed label Apr 14, 2026
@github-actions github-actions bot added documentation Improvements or additions to documentation component: tests Issues re: Tests component: conversion Issues re: Conversion stage component: core Issues re: The core compiler component: build system Issues re: Build system component: api [Python] Issues re: Python API component: runtime component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Apr 14, 2026
@github-actions github-actions bot requested a review from zewenli98 April 14, 2026 11:12