Skip to content

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314

Draft
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream
Draft

Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support)#4314
shoumikhin wants to merge 1 commit into
pytorch:mainfrom
shoumikhin:fix/et-trt-caller-cuda-stream

Conversation

@shoumikhin
Copy link
Copy Markdown
Contributor

@shoumikhin shoumikhin commented May 30, 2026

What problem does this solve?

The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.

This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.

What this changes

You can now tell the delegate which CUDA stream to run on, with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, nothing changes — the delegate runs on the per-thread default stream, exactly as before.

This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.

Usage example

#include <torch_tensorrt/executorch/TensorRTBackend.h>
namespace et = torch_tensorrt::executorch_backend;

// 1. Create a CUDA stream on the context you want the engine to run on.
//    For SM partitioning, create it on a CUDA green context:
cudaStream_t stream;
cuGreenCtxStreamCreate(&stream, green_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);

// 2. Run inference with that stream selected:
{
  et::CudaStreamGuard guard(stream);  // selects `stream` for this thread
  method->execute();                  // the TensorRT engine runs on `stream`
}                                     // selection is restored automatically here

// 3. For outputs that stay on the GPU, sync your stream before reading them:
cudaStreamSynchronize(stream);

The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.

How it works (in plain terms)

  • The delegate no longer owns a stream. While a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
  • A CUDA stream remembers which context it was created on, so just running on a green-context stream keeps the work inside that partition — you don't have to make the green context "current" yourself.
  • No guard, no change. Without a guard, the delegate runs on cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
  • When you do provide a stream and all outputs stay on the GPU, the delegate leaves the work queued on your stream (no forced wait) so it composes efficiently with your other GPU work; you synchronize when you read the results. If any input or output had to be copied to/from CPU, it waits so your data is ready on return.

Notes

  • A given method runs single-threaded (one inference at a time), and the provided stream must be on the engine's GPU.
  • Verified on an H100: with a green-context stream selected, the engine's kernels run only on that context's SM partition.

@meta-cla meta-cla Bot added the cla signed label May 30, 2026
@github-actions github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026
@github-actions github-actions Bot requested a review from narendasan May 30, 2026 14:18
@shoumikhin shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch 3 times, most recently from ef539e2 to ec63f3f Compare May 31, 2026 05:58
…tream

The delegate created and owned a private CUDA stream in init() and ran every
enqueueV3() on it, so an application could not place inference on a specific
CUDA stream or context (for example a CUDA green context for SM partitioning).

Let the caller select the stream instead, bringing the libtorch-free ExecuTorch
runtime the same caller-stream capability the libtorch TensorRT runtime has
(pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select,
  per calling thread, the CUDA stream the delegate runs TensorRT on. With no
  guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream;
  init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a
  guard with a stream created on that green context (cuGreenCtxStreamCreate); the
  partition confinement travels with the stream, so the green context need not be
  made current. cudaStreamPerThread is invalid while a green context is current
  (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the
  current device and is restored on exit, so it no longer clobbers a context the
  caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a
  guard is active; the default path and host-staged outputs still synchronize
  before returning, preserving existing behavior. The caller synchronizes the
  selected stream when it reads device-resident results.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
@shoumikhin shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch from ec63f3f to 2fe2c7a Compare May 31, 2026 06:21
@narendasan narendasan requested a review from lanluo-nvidia June 1, 2026 21:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant