---
title: Multimodal Model Serving
subtitle: Deploy multimodal models with image, video, and audio support in Dynamo
---

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.
**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
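
Once multimodal processing is enabled, requests use the standard OpenAI-compatible chat completions format, with images passed as `image_url` content parts. A minimal client sketch; the base URL, model name, and image URL are placeholder assumptions, not fixed by Dynamo:

```python
# Minimal multimodal request against an OpenAI-compatible Dynamo frontend.
# The base_url, model name, and image URL below are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder VLM name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```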

```mermaid
---
title: Sample flow for an aggregated VLM serving scenario
---
flowchart TD
    A[Request] --> B{KV cache hit?}
    B -->|Yes| C[Use KV]
    B -->|No| D{Embedding cache hit?}
    D -->|Yes| E[Load embedding]
    D -->|No| F[Run encoder]
    F --> G[Save to cache]
    G --> H["PREFILL (image tokens + text tokens → KV cache)"]
    E --> H
    C --> I[DECODE]
    H --> I
    I --> J[Response]
```
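
In this flow, the KV cache check belongs to the serving engine; the embedding cache adds the middle branch, which decides whether the vision encoder runs at all. A minimal sketch of that branch, assuming a hypothetical `run_encoder` and a simple `OrderedDict`-based LRU keyed by the image's content hash:

```python
import hashlib
from collections import OrderedDict

MAX_ENTRIES = 1024               # illustrative CPU-side capacity
embedding_cache = OrderedDict()  # image content hash -> embedding

def run_encoder(image_bytes: bytes):
    """Placeholder for the real vision encoder forward pass."""
    raise NotImplementedError

def get_image_embedding(image_bytes: bytes):
    """Return the image's embedding, running the encoder only on a miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key in embedding_cache:               # embedding cache hit
        embedding_cache.move_to_end(key)     # refresh LRU recency
        return embedding_cache[key]
    embedding = run_encoder(image_bytes)     # miss: run the encoder
    embedding_cache[key] = embedding         # save to cache
    if len(embedding_cache) > MAX_ENTRIES:
        embedding_cache.popitem(last=False)  # evict least-recently used
    return embedding
```

Keying on a content hash rather than a URL means re-uploaded or re-hosted copies of the same image still hit the cache.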

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or independently depending on your workload characteristics:

| Feature | Description |
|---|---|
| Embedding Cache | CPU-side LRU cache that skips re-encoding repeated images |
| Encoder Disaggregation | Separate vision encoder worker for independent scaling |
| Multimodal KV Routing | MM-aware KV cache routing for optimal worker selection (sketched below) |
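
KV routing needs multimodal awareness because two requests with identical prompt text but different images must not be matched to the same cached KV blocks. A conceptual sketch of the idea, not Dynamo's actual hashing scheme, in which per-image content hashes seed the block hash chain the router matches on:

```python
import hashlib

def mm_block_hashes(token_blocks, image_hashes):
    """Chain per-block hashes, seeded by the request's image hashes.

    token_blocks: token-ID tuples, one per fixed-size KV block.
    image_hashes: content hashes of the request's images. Identical
    text paired with different images yields a different chain, so
    the router never treats their cached blocks as interchangeable.
    """
    parent = hashlib.sha256(repr(sorted(image_hashes)).encode()).hexdigest()
    chain = []
    for block in token_blocks:
        parent = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        chain.append(parent)
    return chain
```

With text-only hashing, the router could match a request to a worker holding KV blocks computed from a different image and silently reuse the wrong context.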

Modality support by backend:

| Stack | Image | Video | Audio |
|---|---|---|---|
| vLLM | ✅ | 🧪 | 🧪 |
| TRT-LLM | ✅ | ❌ | ❌ |
| SGLang | ✅ | 🧪 | ❌ |

Status: ✅ Supported | 🧪 Experimental | ❌ Not supported

For reference implementations, detailed deployment guides, configuration, and examples for each backend, see [vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), and [TRT-LLM](multimodal-trtllm.md).