Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) by shoumikhin · Pull Request #4314 · pytorch/TensorRT

shoumikhin · 2026-05-30T14:18:05Z

What problem does this solve?

The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.

This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.

What this changes

You can now tell the delegate which CUDA stream to run on, with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, nothing changes — the delegate runs on the per-thread default stream, exactly as before.

This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.

Usage example

#include <torch_tensorrt/executorch/TensorRTBackend.h>
namespace et = torch_tensorrt::executorch_backend;

// 1. Create a CUDA stream on the context you want the engine to run on.
//    For SM partitioning, create it on a CUDA green context:
cudaStream_t stream;
cuGreenCtxStreamCreate(&stream, green_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);

// 2. Run inference with that stream selected:
{
  et::CudaStreamGuard guard(stream);  // selects `stream` for this thread
  method->execute();                  // the TensorRT engine runs on `stream`
}                                     // selection is restored automatically here

// 3. For outputs that stay on the GPU, sync your stream before reading them:
cudaStreamSynchronize(stream);

The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.

How it works (in plain terms)

The delegate no longer owns a stream. While a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
A CUDA stream remembers which context it was created on, so just running on a green-context stream keeps the work inside that partition — you don't have to make the green context "current" yourself.
No guard, no change. Without a guard, the delegate runs on cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
When you do provide a stream and all outputs stay on the GPU, the delegate leaves the work queued on your stream (no forced wait) so it composes efficiently with your other GPU work; you synchronize when you read the results. If any input or output had to be copied to/from CPU, it waits so your data is ready on return.

Notes

A given method runs single-threaded (one inference at a time), and the provided stream must be on the engine's GPU.
Verified on an H100: with a green-context stream selected, the engine's kernels run only on that context's SM partition.

…tream The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning). Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232): - Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread. - execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none. - To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard. - cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established. - execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results. No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.

meta-cla Bot added the cla signed label May 30, 2026

github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026

github-actions Bot requested a review from narendasan May 30, 2026 14:18

shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch 3 times, most recently from ef539e2 to ec63f3f Compare May 31, 2026 05:58

shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch from ec63f3f to 2fe2c7a Compare May 31, 2026 06:21

narendasan requested a review from lanluo-nvidia June 1, 2026 21:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream

shoumikhin commented May 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shoumikhin commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this solve?

What this changes

Usage example

How it works (in plain terms)

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shoumikhin commented May 30, 2026 •

edited

Loading