Add option to use CUDA graphs with TRT for RF-DETR Object Detection in `inference_models` by mkaic · Pull Request #1938 · roboflow/inference

mkaic · 2026-01-23T06:15:28Z

What does this PR do?

Adds a use_cuda_graph flag to RFDetrForObjectDetectionTRT.forward that captures the CUDA graph once and replays it on later calls. On rfdetr-nano this unlocks a nice ~10% FPS speedup (I observed anywhere from 7% to 12%, depending on how saturated the GPU is).

After the CUDA graph is captured in execute_trt_engine, the graph and its related state are packaged into a TRTCudaGraphState dataclass, which is returned back up to the RFDetrForObjectDetectionTRT instance and cached there. This lets subsequent forward passes replay the graph instead of recapturing it.
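The capture-once, replay-afterwards flow described above can be sketched roughly as follows. This is a pure-Python stand-in, not the code from this PR: FakeGraph, GraphState, run_engine and Model are hypothetical names, and no real CUDA or TensorRT calls are made.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


class FakeGraph:
    """Hypothetical stand-in for a captured CUDA graph; it only counts replays."""

    def __init__(self) -> None:
        self.replays = 0

    def replay(self) -> None:
        self.replays += 1


@dataclass
class GraphState:
    """Hypothetical analogue of TRTCudaGraphState: the graph plus the
    static buffers it reads from and writes to."""

    graph: FakeGraph
    input_buffer: List[float]
    output_buffer: List[float]


def run_engine(
    inputs: List[float], state: Optional[GraphState]
) -> Tuple[List[float], GraphState]:
    """Capture on the first call, replay afterwards; always return the state."""
    if state is None:
        # First call: "capture" by allocating static buffers and a graph.
        state = GraphState(FakeGraph(), list(inputs), [0.0] * len(inputs))
    else:
        # Later calls: copy new data into the existing input buffer in place.
        state.input_buffer[:] = inputs
    state.graph.replay()
    # Stand-in for the engine's work: double every input element.
    state.output_buffer[:] = [2.0 * x for x in state.input_buffer]
    return list(state.output_buffer), state


class Model:
    """Caches the state returned by run_engine between forward passes."""

    def __init__(self) -> None:
        self._state: Optional[GraphState] = None

    def forward(self, inputs: List[float], use_cuda_graph: bool = False) -> List[float]:
        if not use_cuda_graph:
            # Plain path: no graph state is created or reused.
            return [2.0 * x for x in inputs]
        outputs, self._state = run_engine(inputs, self._state)
        return outputs
```

The point of returning the state to the caller is that the object that outlives a single forward pass (here Model) owns the cache, while the lower-level function stays stateless.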

Because this PR changes the return signature of infer_with_trt_engine, a lot of files have one- or two-line changes whose only purpose is to prevent unpacking errors.
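To illustrate why a changed return signature touches so many call sites, here is a minimal sketch. infer_old and infer_new are hypothetical stand-ins, not the real infer_with_trt_engine, and the actual return types in the PR may differ.

```python
from typing import Optional, Tuple


def infer_old(x: int) -> Tuple[int, int, int]:
    """Hypothetical old-style call: returns three outputs directly."""
    return x + 1, x + 2, x + 3


def infer_new(x: int) -> Tuple[Tuple[int, int, int], Optional[object]]:
    """Hypothetical new-style call: returns (outputs, graph_state)."""
    return (x + 1, x + 2, x + 3), None


# A call site written against the old signature works as before:
a, b, c = infer_old(0)

# The same unpacking against the new signature fails, because the
# function now returns a 2-tuple rather than a 3-tuple:
try:
    a, b, c = infer_new(0)
    error_message = ""
except ValueError as err:
    error_message = str(err)

# The one-line fix at each call site: unpack the outputs and ignore
# the extra state when CUDA graphs are not used there.
(a, b, c), _ = infer_new(0)
```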

Type of Change

New feature (non-breaking change that adds functionality)

Testing

I have tested this change locally
I have added/updated tests for this change

Test details:
I added an integration test at inference_models/tests/integration_tests/models/test_rfdetr_predictions_trt.py

I also added a profiling script at inference_models/development/profiling/profile_rfdetr_trt_cudagraphs.py, which benchmarks the speed of RFDetrForObjectDetectionTRT both with and without CUDA graphs. Here are some example results (I observed some run-to-run variance that depended on GPU cooldown):

==================================================
RF-DETR Nano Object Detection
10,000 iterations at 384x384 (testing on random noise) on an L4 GPU
Forward pass FPS (no CUDA graphs): 574.9
Forward pass FPS (CUDA graphs):    640.0
Speedup: 1.11x
==================================================

I intended to write a similar script for the rfdetr-seg-* models too, but as far as I can tell the seg models don't have inference_models packages yet, other than rfdetr-seg-preview, and that one only has torch and onnx packages. I did have Claude make me a script that downloads the onnx package, compiles it to TRT with one of the development compilation convenience functions, and then runs my benchmark. That script is an utter mess, so it is not included in the PR, and I'm getting the impression that TRT support for RF-DETR-Seg might still be a work in progress in inference_models. The results nonetheless seem to confirm a speedup for seg models too:

==================================================
RF-DETR Seg Preview (compiled from ONNX to TRT fp16)
10,000 iterations at 384x384 (testing on random noise) on an L4 GPU
Forward pass FPS (no CUDA graphs): 278.6
Forward pass FPS (CUDA graphs):    299.0
Speedup: 1.07x
==================================================
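As a quick check on the two result blocks above, the arithmetic can be reproduced in a few lines. The FPS figures are copied from the results; the helper names are made up for this sketch.

```python
def speedup(baseline_fps: float, graph_fps: float) -> float:
    """Ratio of FPS with CUDA graphs to FPS without."""
    return graph_fps / baseline_fps


def saved_ms_per_pass(baseline_fps: float, graph_fps: float) -> float:
    """Time saved per forward pass, in milliseconds (1000 / FPS is latency)."""
    return 1000.0 / baseline_fps - 1000.0 / graph_fps


# (FPS without CUDA graphs, FPS with CUDA graphs), as reported above.
results = {
    "RF-DETR Nano Object Detection": (574.9, 640.0),
    "RF-DETR Seg Preview": (278.6, 299.0),
}

for name, (baseline, graph) in results.items():
    print(
        f"{name}: {speedup(baseline, graph):.2f}x, "
        f"{saved_ms_per_pass(baseline, graph):.3f} ms saved per forward pass"
    )
```

Seen as latency, the gain is roughly 0.18 ms per forward pass for the nano model and roughly 0.24 ms for the seg preview model, even though the relative speedup is smaller for the latter.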

EDIT 2026-02-11: Added a VRAM-profiling script inference_models/development/profiling/profile_cudagraph_vram.py to test what CPU and GPU memory usage look like when caching multiple shapes. Below are three plots of VRAM and system RAM usage in different scenarios.

Looping through batch sizes 1..16 in ascending order with cache capacity set to 16.

Looping through batch sizes 1..16 in random order with cache capacity set to 16.

Looping through batch sizes 1..16 in random order with cache capacity set to 8, causing rolling cache evictions.

I also added a test to check that cache eviction works the way I would expect it to, in test_yolov8_object_detection_predictions_trt.py.
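For intuition about the three scenarios, here is a small simulation that counts graph captures and evictions for a generic LRU cache keyed by batch size. It is only a sketch: it assumes the loop makes several passes over the batch sizes, it does not touch the GPU, and the function names are made up rather than taken from the profiling script.

```python
import random
from collections import OrderedDict
from typing import Iterable, List, Tuple


def simulate(batch_sizes: Iterable[int], capacity: int) -> Tuple[int, int]:
    """Return (captures, evictions) for an LRU cache keyed by batch size."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    captures = 0
    evictions = 0
    for batch_size in batch_sizes:
        if batch_size in cache:
            cache.move_to_end(batch_size)  # hit: mark as most recently used
            continue
        captures += 1  # miss: a new graph would have to be captured
        cache[batch_size] = None
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the least recently used entry
            evictions += 1
    return captures, evictions


def access_order(n_passes: int, shuffle: bool, seed: int = 0) -> List[int]:
    """Batch sizes 1..16, repeated n_passes times, optionally shuffled per pass."""
    rng = random.Random(seed)
    order: List[int] = []
    for _ in range(n_passes):
        sizes = list(range(1, 17))
        if shuffle:
            rng.shuffle(sizes)
        order.extend(sizes)
    return order


ascending_16 = simulate(access_order(5, shuffle=False), capacity=16)
random_16 = simulate(access_order(5, shuffle=True), capacity=16)
random_8 = simulate(access_order(5, shuffle=True), capacity=8)
print(ascending_16, random_16, random_8)
```

With capacity 16 every batch size is captured exactly once and nothing is evicted, whatever the order. With capacity 8 the cache can never hold all 16 shapes, so captures and evictions keep happening for as long as the loop runs.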

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code where necessary, particularly in hard-to-understand areas
My changes generate no new warnings or errors
I have updated the documentation accordingly (if applicable)

Additional Context

N/A

mkaic · 2026-02-10T00:49:33Z

I have added LRU caching with (shape, dtype, device) keys for CUDA graph execution, and added a test for this caching in test_yolov8_object_detection_predictions_trt.py. While I was at it, I also added a test to make sure outputs are equivalent with use_cuda_graph=True and use_cuda_graph=False for rfdetr-seg-nano since previously I only included a test for the object detection RF-DETR.
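A minimal sketch of an LRU cache keyed by (shape, dtype, device) could look like the following. GraphLRUCache, make_key and the capture callback are hypothetical names for illustration, and the implementation in this PR may be structured differently.

```python
from collections import OrderedDict
from typing import Any, Callable, Sequence, Tuple

Key = Tuple[Tuple[int, ...], str, str]


def make_key(shape: Sequence[int], dtype: Any, device: Any) -> Key:
    """Build a hashable cache key; dtype and device are stringified."""
    return (tuple(shape), str(dtype), str(device))


class GraphLRUCache:
    """Keeps at most `capacity` captured-graph states, evicting the least
    recently used one when a new state would exceed the capacity."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries: "OrderedDict[Key, Any]" = OrderedDict()

    def get_or_capture(self, key: Key, capture: Callable[[], Any]) -> Any:
        if key in self._entries:
            # Hit: reuse the cached state and mark it as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Miss: capture a new state, store it, then evict if over capacity.
        state = capture()
        self._entries[key] = state
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return state

    def __contains__(self, key: Key) -> bool:
        return key in self._entries

    def __len__(self) -> int:
        return len(self._entries)


# Usage: two cached shapes, then a third shape evicts the stalest one.
cache = GraphLRUCache(capacity=2)
k1 = make_key((1, 3, 384, 384), "float16", "cuda:0")
k2 = make_key((2, 3, 384, 384), "float16", "cuda:0")
k3 = make_key((4, 3, 384, 384), "float16", "cuda:0")
cache.get_or_capture(k1, lambda: "state-1")
cache.get_or_capture(k2, lambda: "state-2")
cache.get_or_capture(k1, lambda: "unused")  # hit: k1 becomes most recent
cache.get_or_capture(k3, lambda: "state-3")  # miss: evicts k2
```

Including dtype and device in the key matters because a captured graph is tied to the buffers it was recorded with, so a state captured for one input layout cannot be assumed valid for another.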

I have run GPU trt_extra integration tests through GitHub actions on this branch and can confirm that they pass.

mkaic added 9 commits January 23, 2026 01:14

pass TRT graph state up and dwon call stack and acache in RFDetrObjDe…

6f6c44e

…tTRT class

actually passing it up and down the stack

549ca10

three-branch solution

6412efe

avoid breaking things due to chagne in infer_with_trt_engine API

08888e3

update unpacking in the rest of the TRT.py files

adda4aa

clean up profiling script

97fdcf0

remove tqdm from profiling script

470addb

format

8cca264

allow flag to be passed to rfdetr-seg models even though there don't …

5b7d0a5

…seem to be TRT packages for them yet.

mkaic requested review from PawelPeczek-Roboflow, grzegorz-roboflow, hansent, probicheaux and yeldarby as code owners January 23, 2026 06:15

mkaic and others added 15 commits January 23, 2026 16:41

reduce number of diffed files

a27ae37

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

f1a6afb

don't rename existing function

04c015a

add proper integration test and simplify profiling script

ac50a1a

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

c1c1329

profile how long it takes to capture cuda graph

9512229

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

d81b7d5

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

1ebe492

add LRU (shape, device, dtype) caching for CUDA graphs

d5b51f9

add USE_CUDA_GRAPHS_FOR_TRT_BACKEND environment variable which defaul…

dbd45f9

…ts to True and reference in RFDETR TRT classes

fix bug in profiling script

9502b8e

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

320fdef

use yolov8 with dynamic batch size to test shape caching for CUDA graphs

cb70538

add instance seg tests

6b1d430

update conftest

7c23300

mkaic and others added 15 commits February 10, 2026 01:07

add batch-size-cycling profiling for TRT cudagraphs with yolov8

a27c80c

Merge branch 'feature/rfdetr-trt-use-cudagraphs' of github.com:robofl…

14a45ea

…ow/inference into feature/rfdetr-trt-use-cudagraphs

fix failing test

212b2d6

first stab at responding to Pawel's feedback

4204f4f

working on memory profiling for cudagraphs

51f191c

simplify memory profiling script

a80a572

tweaks

845fabd

update tests to work with the new cache

3294cae

Merge branch 'main' into feature/rfdetr-trt-use-cudagraphs

31bb420

thanks for the PR review, Claude

bbb2540

see effect of cache size on vram profile script

4eb23fc

reduce default cache size to 16 after seeing memory usage

aa87393

make style

b5c1f6b

update default and fix profiling script

a386f3b

fix imports in trt tests

5f4d3ea

mkaic wants to merge 39 commits into main from feature/rfdetr-trt-use-cudagraphs
