LightLLM supports inference for a variety of multimodal models. Below, we use InternVL as an example to walk through the launch commands for a multimodal service.
```bash
INTERNVL_IMAGE_LENGTH=256 \
LOADWORKER=12 \
python -m lightllm.server.api_server \
--port 8080 \
--tp 2 \
--model_dir ${MODEL_PATH} \
--mem_fraction 0.8 \
--trust_remote_code
```

- INTERNVL_IMAGE_LENGTH: Set the image token length for the InternVL model, default is 256
- LOADWORKER: Set the number of worker processes for model loading
- --port 8080: API server listening port
- --tp 2: Tensor parallelism degree
- --model_dir: InternVL model file path
- --mem_fraction 0.8: GPU memory usage ratio
- --trust_remote_code: Allow loading custom model code
The visual module can be configured further by appending the following options to the launch command:

```bash
--visual_infer_batch_size 2 \
--cache_capacity 500 \
--visual_dp dp_size \
--visual_tp tp_size
```

- --visual_infer_batch_size 2: Visual inference batch size
- --cache_capacity 500: Image embedding cache capacity
- --visual_dp dp_size: Visual model data parallelism degree
- --visual_tp tp_size: Visual model tensor parallelism degree
Note
To ensure an equal memory load on each GPU, visual_dp * visual_tp = tp is required. For example, if tp=2, then visual_dp=1 and visual_tp=2.
- --visual_tp tp_size enables tensor parallelism (the default usage)
- --visual_dp dp_size enables data parallelism
  - Different image batches are distributed across multiple GPUs
  - Each GPU runs a complete copy of the ViT model
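The visual_dp * visual_tp = tp constraint from the note above is easy to verify before launching. Below is a minimal sketch (the function name is illustrative, not part of LightLLM):

```python
def check_visual_parallelism(tp: int, visual_dp: int, visual_tp: int) -> None:
    """Raise if the ViT parallelism settings do not cover exactly tp GPUs."""
    if visual_dp * visual_tp != tp:
        raise ValueError(
            f"visual_dp * visual_tp must equal tp, "
            f"got {visual_dp} * {visual_tp} != {tp}"
        )

check_visual_parallelism(tp=2, visual_dp=1, visual_tp=2)   # OK
# check_visual_parallelism(tp=2, visual_dp=2, visual_tp=2) # would raise
```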
LightLLM caches the embeddings of input images. In multi-turn conversations, if the same image appears again, its cached embedding is reused directly, avoiding repeated inference.

- --cache_capacity: Controls the number of cached image embeddings
- Cache entries are matched by the image's MD5 hash
- Uses an LRU (Least Recently Used) eviction mechanism
- A cache hit skips ViT inference entirely
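The mechanism can be pictured as an MD5-keyed LRU map. The sketch below is illustrative only, not LightLLM's actual implementation; embed_fn is a hypothetical stand-in for the ViT forward pass:

```python
import hashlib
from collections import OrderedDict

class ImageEmbedCache:
    """Illustrative MD5-keyed LRU cache for image embeddings."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self._cache = OrderedDict()  # md5 hex digest -> embedding

    def get_or_compute(self, image_bytes, embed_fn):
        key = hashlib.md5(image_bytes).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)   # refresh LRU order
            return self._cache[key]        # cache hit: skip ViT inference
        emb = embed_fn(image_bytes)        # cache miss: run ViT
        self._cache[key] = emb
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return emb
```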
```python
import base64
import json

import requests


def run(query, uris):
    images = []
    for uri in uris:
        if uri.startswith("http"):
            # Remote image: pass the URL through to the server
            images.append({"type": "url", "data": uri})
        else:
            # Local image: read and base64-encode it
            with open(uri, "rb") as fin:
                b64 = base64.b64encode(fin.read()).decode("utf-8")
            images.append({"type": "base64", "data": b64})

    data = {
        "inputs": query,
        "parameters": {
            "max_new_tokens": 200,
            # The space before <|endoftext|> is important:
            # the server removes the first bos_token_id,
            # but the Qwen tokenizer does not have a bos_token_id.
            "stop_sequences": [" <|endoftext|>", " <|im_start|>", " <|im_end|>"],
        },
        "multimodal_params": {
            "images": images,
        },
    }

    url = "http://127.0.0.1:8080/generate"
    headers = {"Content-Type": "application/json"}
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response
query = """
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<img></img>
What is this?<|im_end|>
<|im_start|>assistant
"""
response = run(
    query=query,
    uris=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
    ],
)

if response.status_code == 200:
    print(f"Result: {response.json()}")
else:
    print(f"Error: {response.status_code}, {response.text}")
```