---
title: Multimodal Model Serving
subtitle: Deploy multimodal models with image, video, and audio support in Dynamo
---

Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text.

**Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources.
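For a sense of what a client request looks like, here is a minimal sketch that sends an image alongside text in the OpenAI-compatible chat format. The endpoint URL and model name are placeholders for your own deployment, and the backend must have multimodal processing enabled as described above:

```python
import requests

# Hypothetical endpoint and model name; adjust both to your deployment.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/cat.jpg"},
                    },
                ],
            }
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```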
```mermaid
---
title: Sample flow for an aggregated VLM serving scenario
---
flowchart TD
    A[Request] --> B{KV cache hit?}
    B -->|Yes| C[Use KV]
    B -->|No| D{Embedding cache hit?}
    D -->|Yes| E[Load embedding]
    D -->|No| F[Run encoder]
    F --> G[Save to cache]
    G --> H["PREFILL (image tokens + text tokens → KV cache)"]
    E --> H
    C --> I[DECODE]
    H --> I
    I --> J[Response]
```
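To make the branch structure concrete, here is a minimal Python sketch of the same decision path. Everything in it is an invented simplification for illustration, not Dynamo's API: the `Request` type and the stub encoder/prefill/decode functions are hypothetical, and keying the KV cache by request id stands in for the prefix-based block matching a real serving engine performs.

```python
from dataclasses import dataclass

@dataclass
class Request:
    id: str
    image: bytes
    text: str

# Stub stages standing in for the real encoder, prefill, and decode workers.
def run_encoder(image: bytes) -> list[float]:
    return [0.0]  # pretend vision embeddings

def prefill(embedding: list[float], text: str) -> tuple:
    return (tuple(embedding), text)  # pretend KV-cache blocks

def decode(kv: tuple) -> str:
    return "response"  # pretend autoregressive decode

def serve(req: Request, kv_cache: dict, embedding_cache: dict) -> str:
    """Decision path from the flowchart above (illustrative only)."""
    if req.id in kv_cache:                    # KV cache hit: go straight to decode
        kv = kv_cache[req.id]
    else:
        image_key = hash(req.image)           # real systems hash image bytes durably
        emb = embedding_cache.get(image_key)
        if emb is None:                       # embedding cache miss: run the encoder
            emb = run_encoder(req.image)
            embedding_cache[image_key] = emb  # save to cache for repeated images
        kv = prefill(emb, req.text)           # image + text tokens -> KV cache
        kv_cache[req.id] = kv
    return decode(kv)
```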

## Key Features

Dynamo improves latency and throughput for vision-and-language workloads through the following features, which can be used together or independently, depending on your workload characteristics:

| Feature | Description |
|---------|-------------|
| Embedding Cache | CPU-side LRU cache that skips re-encoding repeated images (see the sketch after this table) |
| Encoder Disaggregation | Separate vision encoder worker for independent scaling |
| Multimodal KV Routing | Multimodal-aware KV cache routing for optimal worker selection |
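The embedding cache can be pictured as an ordinary LRU map keyed by a hash of the raw image bytes. The class below is a minimal sketch of that idea, not Dynamo's implementation; all names are invented for illustration:

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """CPU-side LRU cache: image bytes -> precomputed vision embeddings."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._entries: "OrderedDict[str, list[float]]" = OrderedDict()

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        k = self.key(image_bytes)
        if k not in self._entries:
            return None                  # miss: the caller must run the encoder
        self._entries.move_to_end(k)     # mark as most recently used
        return self._entries[k]

    def put(self, image_bytes: bytes, embedding: list[float]) -> None:
        k = self.key(image_bytes)
        self._entries[k] = embedding
        self._entries.move_to_end(k)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
```

On a hit, the stored embedding is returned and the encoder step is skipped entirely; in the flowchart above this is the "Load embedding" branch.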

## Support Matrix

| Stack   | Image | Video | Audio |
|---------|-------|-------|-------|
| vLLM    | ✅    | 🧪    | 🧪    |
| TRT-LLM | ✅    |       |       |
| SGLang  | ✅    | 🧪    |       |

Status: ✅ Supported | 🧪 Experimental | ❌ Not supported

## Example Workflows

Reference implementations for deploying multimodal models are included in each backend guide below.

## Backend Documentation

Detailed deployment guides, configuration, and examples for each backend:

- [vLLM](multimodal-vllm.md)
- [SGLang](multimodal-sglang.md)
- [TRT-LLM](multimodal-trtllm.md)