Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) #4314
Draft
shoumikhin wants to merge 1 commit into
…tream
The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning). Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232):
- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results.
No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
What problem does this solve?
The ExecuTorch TensorRT delegate used to create its own private CUDA stream and run every inference on it. That left an application with no way to make the TensorRT engine run on a specific CUDA stream or context of its choosing.
This matters most for CUDA green contexts — a CUDA feature that hands a piece of work a slice of the GPU's compute units (SMs) instead of the whole GPU, so you can run several models side by side with predictable performance. To keep an engine inside a green context, its work has to run on a stream that belongs to that green context. With a delegate-owned stream, that was impossible.
What this changes
You can now tell the delegate which CUDA stream to run on with a small RAII helper, CudaStreamGuard. Scope it around your inference call and the engine runs on your stream. If you don't use it, existing code is unaffected: the delegate runs on the per-thread default stream (cudaStreamPerThread) and still synchronizes before returning.
This gives the libtorch-free ExecuTorch runtime the same "run on the caller's stream" capability the libtorch TensorRT runtime got in #4232.
Usage example
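A minimal sketch of the scoped-guard pattern, not the delegate's actual code. It assumes stand-in types so it compiles without CUDA or ExecuTorch: StreamHandle stands in for cudaStream_t, kDefaultStream for cudaStreamPerThread, and execute_on_selected_stream() for the delegate's execute(). The real CudaStreamGuard's namespace and constructor signature are not shown here, so the class below is hypothetical. In real use, stream would be a cudaStream_t created on the green context (for example with cuGreenCtxStreamCreate).

```cpp
#include <cassert>

// Stand-in for cudaStream_t so the sketch builds without the CUDA toolkit.
using StreamHandle = const void*;

// Stand-in for cudaStreamPerThread, the fallback when no guard is active.
static const int kPerThreadTag = 0;
static const StreamHandle kDefaultStream = &kPerThreadTag;

// Per-thread selection: each calling thread sees only its own guards.
inline StreamHandle& selected_stream() {
  thread_local StreamHandle current = kDefaultStream;
  return current;
}

// Hypothetical guard: selects `stream` for the calling thread and restores
// the previous selection when the scope ends (so guards can nest).
class CudaStreamGuard {
 public:
  explicit CudaStreamGuard(StreamHandle stream) : previous_(selected_stream()) {
    selected_stream() = stream;
  }
  ~CudaStreamGuard() { selected_stream() = previous_; }
  CudaStreamGuard(const CudaStreamGuard&) = delete;
  CudaStreamGuard& operator=(const CudaStreamGuard&) = delete;

 private:
  StreamHandle previous_;
};

// Stand-in for the delegate's execute(): returns the stream it would enqueue on.
inline StreamHandle execute_on_selected_stream() { return selected_stream(); }

// Usage: scope a guard around the inference call.
inline bool usage_example() {
  int green_ctx_stream_tag = 0;  // pretend this is a green-context stream
  StreamHandle stream = &green_ctx_stream_tag;
  {
    CudaStreamGuard guard(stream);
    assert(execute_on_selected_stream() == stream);  // runs on `stream`
  }
  // Outside the scope, the per-thread default is selected again.
  return execute_on_selected_stream() == kDefaultStream;
}
```

Keeping the selection in a thread_local means a guard on one thread never redirects inference issued from another thread, which matches the per-calling-thread behavior the commit message describes.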
The engine's kernels (and any host<->device copies it needs) run on stream, so a green-context stream keeps them inside that context's SM partition.
How it works (in plain terms)
- When a CudaStreamGuard is active on the calling thread, the engine's GPU work runs on the stream you provided.
- Otherwise the delegate uses cudaStreamPerThread and waits for the work to finish before returning, exactly like before, so existing code is unaffected.
Notes