[None][perf] Improve Qwen3-VL Preprocessing Perf#15598
Conversation
uvloop is a libuv-backed asyncio event loop with significantly lower per-await overhead than the cpython default. Installing it at import time, before any module retrieves a running loop, affects every coroutine in the server. Falls back to the default loop when uvloop is not installed. Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
📝 WalkthroughWalkthroughAdds dedicated ChangesMedia I/O & Async Threading
VLM Processor Fixes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
tensorrt_llm/inputs/media_io.py (1)
50-55: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAnnotate
_media_load_runto preserve caller result types.Every async media load now goes through this helper, but the untyped signature erases
_MediaTtoAny. UseParamSpec/TypeVarso type checking followsload_bytes/load_base64/load_file.Suggested typing shape
+from collections.abc import Callable +from typing import ParamSpec + +_P = ParamSpec("_P") +_R = TypeVar("_R") + -async def _media_load_run(fn, *args, **kwargs): +async def _media_load_run( + fn: Callable[_P, _R], *args: _P.args, **kwargs: _P.kwargs +) -> _R:As per coding guidelines, “Always annotate functions. Make the return type
Noneif the function does not return anything.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/inputs/media_io.py` around lines 50 - 55, The helper `_media_load_run` is currently untyped, which causes the caller-specific media result type to degrade to Any across `load_bytes`, `load_base64`, and `load_file`. Update `_media_load_run` to use `ParamSpec` for the callable arguments and a `TypeVar` for the return value so the async wrappers preserve their concrete `_MediaT` result types. Keep the typing consistent with the existing media-loading helpers in `media_io.py` and ensure the annotation reflects that the function returns the awaited result from `run_in_executor`.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_qwen2vl.py`:
- Around line 100-103: The cache-key fallback in the key-building logic is
catching too broadly in the function that computes
`repr(sorted(kwargs.items()))`, which can mask unrelated bugs and violates the
repo’s exception-handling rule. Narrow the `except Exception` in that cache-key
construction path to only the specific failure(s) expected from sorting or repr
generation, and keep the fallback to `orig(*args, **kwargs)` only for those
concrete cases so other processor regressions still fail visibly.
In `@tensorrt_llm/_torch/models/modeling_qwen3vl.py`:
- Around line 262-265: The Qwen3VL multimodal preprocessing in the `proc_kwargs`
merge still allows `mm_processor_kwargs` to override `do_sample_frames`, which
can re-enable HF resampling. Update the `modeling_qwen3vl` merge logic so
`do_sample_frames` remains fixed to False and is never copied from
`mm_processor_kwargs`, while still preserving the other allowed keys like
`num_frames` and `fps`.
In `@tensorrt_llm/inputs/media_io.py`:
- Around line 450-459: The frame return contract for the default format path is
inconsistent: `VideoData.frames` and `get_num_tokens_per_video` still assume PIL
images or torch tensors, but `media_io.py` now returns HWC uint8 ndarrays from
the `format="pt"` flow. Update the shared `VideoData` type/doc and any consumers
like `get_num_tokens_per_video` to recognize numpy frames (using shape-based
dimensions instead of `.height`/`.width`), or change the
`load_video`/frame-return path so `format="pt"` continues to yield tensors and
add a separate explicit numpy format.
- Around line 389-397: The fallback video-loading path that uses
tempfile.NamedTemporaryFile(delete=False) leaves behind temporary .mp4 files;
update the media I/O flow so the temp path is unlinked after all processing is
finished, not immediately after opening the VideoCapture. Make the cleanup
happen after any optional audio extraction that still depends on the video
filename, and apply the same fix in both fallback sites in media_io.py so the
byte-loaded video tempfiles are always removed.
- Around line 383-384: Narrow the fallback in the backend-probing block of
media_io.py: the broad Exception handler around the OpenCV API/registry probe is
masking unrelated bugs. Update the exception handling in that probe to only
catch the expected availability failures, specifically AttributeError and
cv2.error when backend queries fail, while keeping the tempfile fallback logic
unchanged.
In `@tensorrt_llm/serve/openai_server.py`:
- Around line 115-118: Validate TRTLLM_INPUT_PROC_WORKERS before constructing
_INPUT_PROC_EXECUTOR in openai_server.py so startup fails with a clear
operator-friendly message instead of crashing on zero, negative, or non-integer
values. Update the _INPUT_PROC_WORKERS initialization to reject invalid values
and ensure ThreadPoolExecutor only receives a positive worker count, keeping the
fix near the existing module-level executor setup.
---
Nitpick comments:
In `@tensorrt_llm/inputs/media_io.py`:
- Around line 50-55: The helper `_media_load_run` is currently untyped, which
causes the caller-specific media result type to degrade to Any across
`load_bytes`, `load_base64`, and `load_file`. Update `_media_load_run` to use
`ParamSpec` for the callable arguments and a `TypeVar` for the return value so
the async wrappers preserve their concrete `_MediaT` result types. Keep the
typing consistent with the existing media-loading helpers in `media_io.py` and
ensure the annotation reflects that the function returns the awaited result from
`run_in_executor`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 446e6a41-d8ed-4df8-9012-a296acd412b6
📒 Files selected for processing (5)
tensorrt_llm/_torch/models/modeling_qwen2vl.pytensorrt_llm/_torch/models/modeling_qwen3vl.pytensorrt_llm/commands/serve.pytensorrt_llm/inputs/media_io.pytensorrt_llm/serve/openai_server.py
| # No stream-buffered backend — fall back to a tempfile. | ||
| _tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") | ||
| try: | ||
| _tmp.write(bytes(video)) | ||
| _tmp.flush() | ||
| finally: | ||
| _tmp.close() | ||
| video = _tmp.name | ||
| vidcap = cv2.VideoCapture(video) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Clean up fallback tempfiles after audio extraction.
The fallback uses NamedTemporaryFile(delete=False) but never unlinks it. On OpenCV builds without a stream-buffered backend, each byte-loaded video leaves an .mp4 in /tmp; cleanup must run after optional audio extraction because that path may still need video as the temp path.
Suggested cleanup shape
+ _tmp_path = None
_video_buf = None
...
finally:
_tmp.close()
- video = _tmp.name
+ _tmp_path = _tmp.name
+ video = _tmp_path
vidcap = cv2.VideoCapture(video)
...
- audio = None
- if extract_audio:
+ try:
+ audio = None
+ if extract_audio:
...
- return VideoData(frames=loaded_frames, metadata=metadata, audio=audio)
+ return VideoData(frames=loaded_frames, metadata=metadata, audio=audio)
+ finally:
+ if _tmp_path is not None:
+ try:
+ os.unlink(_tmp_path)
+ except FileNotFoundError:
+ passAlso applies to: 476-484
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/inputs/media_io.py` around lines 389 - 397, The fallback
video-loading path that uses tempfile.NamedTemporaryFile(delete=False) leaves
behind temporary .mp4 files; update the media I/O flow so the temp path is
unlinked after all processing is finished, not immediately after opening the
VideoCapture. Make the cleanup happen after any optional audio extraction that
still depends on the video filename, and apply the same fix in both fallback
sites in media_io.py so the byte-loaded video tempfiles are always removed.
…Executor The cpython default asyncio.to_thread executor is shared across the whole server, so blocking media decodes (bytes / base64 / file) contend for thread slots with unrelated to_thread() callers (URL validation, etc.). Add a dedicated ThreadPoolExecutor and route the three decode call sites in BaseMediaIO.async_load through it. Sized via TRTLLM_MEDIA_LOAD_WORKERS (default 8). Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
…ed ThreadPoolExecutor Same rationale as the dedicated media-load executor in inputs/media_io.py: the HF input processor / preprocess function is CPU-bound and was sharing thread slots with unrelated to_thread() callers on the default asyncio executor. Add a dedicated pool sized via TRTLLM_INPUT_PROC_WORKERS (default 8) and route both the chat and completion endpoints' preprocess calls through it. Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
Two coupled changes on the video decode hot path: 1. VideoMediaIO.load_bytes passes the mp4 bytes directly to the decoder instead of writing them to a temp file first. _load_video_by_cv2 opens cv2.VideoCapture over a BytesIO via the stream-buffered backend. A tempfile fallback is retained for cv2 builds without a stream-buffered backend. 2. _load_video_by_cv2 returns uint8 HWC numpy frames (one per sampled index) instead of pre-converted torch float32 CHW tensors. The downstream HF processor's do_rescale=True path rescales and permutes the stacked (N, H, W, 3) array in one vectorized op, so the per-frame Python torch conversion was redundant work. When extract_audio=True is set on the in-memory path, the bytes are re-wrapped in a fresh BytesIO before being handed off to extract_audio_from_video (the BytesIO consumed by cv2 is at end-of-stream). Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
…e IO loader The IO loader (_load_video_by_cv2) already samples to the target frame count before the HF processor sees the data. Passing fps/num_frames re-enters the HF processor's sample_frames branch, which either re-samples identical frames or takes a slower code path because the metadata it relies on doesn't match the frame count. Pass do_sample_frames=False instead so the HF processor treats the input as pre-sampled and takes the faster path. Caller-supplied kwargs other than fps/num_frames are preserved. Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
ProcessorMixin._merge_kwargs (transformers) is pure but runs on every processor call. When all requests pass the same kwargs (the common deployment case), caching by signature converts a per-call merge into an O(1) lookup after the first call. Implemented as a wrapper installed on the processor instance at construction time, so it doesn't require any change to transformers. Cache key is the repr of sorted kwargs items; values are deep-copied on get and put because callers mutate the returned dict. Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
08b42ad to
2b7db68
Compare
uvloop is installed as the default asyncio event loop policy at server startup (see tensorrt_llm/commands/serve.py). Declaring it as a hard dependency ensures the policy is actually active in production; the serve.py call site falls back to the default loop on ImportError, so this requirement is the switch that turns the optimization on. Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
|
/bot run |
|
PR_Github #55605 [ run ] triggered by Bot. Commit: |
Signed-off-by: Aswin Visva <31215515+aswinvisva@users.noreply.github.com>
| PAD_INDEX = -100 # NOTE: refer to https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py#L269 | ||
|
|
||
|
|
||
| def _install_merge_kwargs_cache(processor) -> None: |
There was a problem hiding this comment.
After re-evaluating the perf benchmarks with everything included, this may be removable - it doesn't seem to make a huge difference when removing it. Will test in Jet and report back
|
/bot run |
|
PR_Github #55618 [ run ] triggered by Bot. Commit: |
|
PR_Github #55605 [ run ] completed with state |
|
PR_Github #55618 [ run ] completed with state
|
|
|
||
|
|
||
| def _install_merge_kwargs_cache(processor) -> None: | ||
| """Memoize ``processor._merge_kwargs`` by input kwargs signature. |
There was a problem hiding this comment.
Nit: switch to single backticks on changed lines.
| # it back to True would re-enter the HF sample_frames branch on | ||
| # already-sampled input and break the contract this patch establishes. | ||
| proc_kwargs = {"do_sample_frames": False} | ||
| for _k, _v in dict(mm_processor_kwargs).items(): |
There was a problem hiding this comment.
Are the underscores to escape linting rules?
| # already sampled to the target frame count, so letting a caller flip | ||
| # it back to True would re-enter the HF sample_frames branch on | ||
| # already-sampled input and break the contract this patch establishes. | ||
| proc_kwargs = {"do_sample_frames": False} |
There was a problem hiding this comment.
Hm...
- are we sure the sampling is the same in media IO vs in the processor?
- is it possible to gate this on whether the
num_frames/fpsare different than what we already sampled?
| # in this process creates a loop. libuv-backed event loops have significantly | ||
| # lower per-await overhead than the cpython default; this affects every | ||
| # coroutine in the server. | ||
| uvloop.install() |
There was a problem hiding this comment.
Just checking: is there a reason to be doing this (which runs at the module-level) rather than something like uvloop.run(main()) as in the official snippet?
We could for example replace the call to asyncio.run here.
| try: | ||
| value = int(raw) | ||
| except ValueError: | ||
| logger.warning("%s=%r is not an integer; using default %d", name, raw, default) |
There was a problem hiding this comment.
Our logger is not a standard python logging.Logger instance, so this printf style formatting does not work. Use f-strings instead.
| open. Built-in backends (the FFMPEG path in the PyPI wheels) are always | ||
| safe to use. | ||
| """ | ||
| import cv2 |
There was a problem hiding this comment.
Nit: might want to mirror the comment on line 415. You could also consider borrowing from this (sadly unmerged) approach using lazy module loading: #13736
| # ``_video_buf`` because cv2 keeps a non-owning view into it. | ||
| # (c) ``video`` is mp4 bytes but no stream-buffered backend -> spill | ||
| # to a tempfile and open it. The path is unlinked at end of fn. | ||
| _video_buf: Optional[BytesIO] = None |
There was a problem hiding this comment.
Out of curiosity, why the undescored variable names?
| if _tmp_path is not None: | ||
| try: | ||
| os.unlink(_tmp_path) | ||
| except OSError: |
There was a problem hiding this comment.
Hm, I don't know how I feel about this. Could we move some of the logic in lines 426-439 to the caller VideoMediaIO.load_bytes? We could use a contextlib.ExitStack to conditionally push a tempfile.NamedTemporaryFile(..., delete=True) into it.
If not, I would still vote to use it here to rely on NamedTemporaryFile to do the cleanup, it's just a bit less readable in that we add one more indent level for the context manager.
| ) | ||
|
|
||
|
|
||
| async def _media_load_run(fn, *args, **kwargs): |
There was a problem hiding this comment.
Nit: I think _load_media might be better?
| # inputs/media_io.py: isolate this CPU-bound work from unrelated | ||
| # to_thread() callers on the default asyncio pool and allow independent | ||
| # tuning. | ||
| _INPUT_PROC_WORKERS = _read_positive_int_env("TRTLLM_INPUT_PROC_WORKERS", 8) |
There was a problem hiding this comment.
Could this + the media loading value be passed in as CLI args from trtllm-serve instead? It's more discoverable that way, and we can get a lot of validation for free:
@click.option("--input-processing-workers", type=click.IntRange(min=1))
Summary by CodeRabbit
New Features
Bug Fixes
Description
A few improvements to the preprocessing speed of Qwen3-VL (and other models should benefit too)
[uvloop](https://uvloop.readthedocs.io/)asyncio event loop backend - already default in vllm and is 2-4 faster which benefits us when event loop is under a huge queue of pending events - this is just a simple drop-in replacement that makes the largest diff_load_video_by_cv2via python list comprehension (callstorch.from_numpyfor every request) - instead return the uint8 numpy frames and let the HF processor's do_rescale=True path do the rescale + permute as a single vectorized op on the stacked (N, H, W, 3) arraydo_sample_frames=Falsevia mm_processing_kwargs so we don't duplicate the work (Media IO already does this work upstream)merge_kwargsin preprocessing_qwen3vl. This function is a pure function (i.e same args produce same output), and is called for every input processor invocation - and was taking 76ms under load. Simply memoize the output for the args passedOverall perf improvements:
benchmark: 30s 374×374
libx264mp4 video (moving_shapessynth) + 50 text tokens → ISL ≈ 2.3k–13.1k depending on fps, OSL ∈ {1, 100}, fps ∈ {1, 2, 4, 8}, concurrency ∈ {1, 8, 16, 32, 64, 128, 256}Cosmos3-Super-Reasoner
Cosmos3-Nano-Reasoner
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.