It is targeted at client machines equipped with an NPU accelerator.
## Prerequisites

**Model preparation**: Python 3.10 or higher with pip and HuggingFace account
**Model server**: OVMS 2025.1 or higher

**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

**(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app


## Model preparation
In this demo, a pre-converted model in OpenVINO IR format is pulled directly from HuggingFace Hub by the model server, so no manual conversion or export step is needed.
Create directory for the model:
```console
mkdir -p models
```

The default configuration should work in most cases, but the parameters can be tuned. See the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options.

## Server Deployment

:::{dropdown} **Deploying with Docker**

Running this command starts the container with NPU enabled:
```bash
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
  -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu \
  --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov \
  --task text_generation --target_device NPU
```
:::

:::{dropdown} **Deploying on Bare Metal**

Assuming you have unpacked the model server package, make sure to:

- **On Windows**: run the `setupvars` script
- **On Linux**: set the `LD_LIBRARY_PATH` and `PATH` environment variables

as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.

The model is deployed on the NPU device, as set with the `--target_device` parameter. Make sure appropriate NPU drivers are installed, so the device is accessible for the model server.

```bat
ovms --rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --task text_generation --target_device NPU
```
:::

## Readiness Check

Wait for the model to load. You can check the status with a simple command:
```console
curl http://localhost:8000/v3/models
```
```json
{
  "object": "list",
  "data": [
    {
      "id": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov",
      "object": "model",
      "created": 1773742559,
      "owned_by": "OVMS"
    }
  ]
}
```
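Instead of calling `curl` repeatedly, readiness can be polled from Python. The sketch below is a hypothetical helper, not part of the demo: `list_model_ids` just extracts the `id` fields from the `/v3/models` response shown above, and `wait_until_ready` retries until the model is listed.

```python
import json
import time
import urllib.request

def list_model_ids(models_response: dict) -> list:
    """Extract model ids from a /v3/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]

def wait_until_ready(base_url="http://localhost:8000",
                     model_id="OpenVINO/Phi-3.5-vision-instruct-fp16-ov",
                     timeout_s=300):
    """Poll /v3/models until the model is listed or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/v3/models") as resp:
                if model_id in list_model_ids(json.load(resp)):
                    return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(2)
    return False

# usage, with the server from this demo running:
# print("ready" if wait_until_ready() else "timed out")
```

Note that on the first start the server also downloads the model from HuggingFace Hub, so reaching readiness can take a while.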

## Request Generation

Download a sample image used in the requests below:
```console
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/static/images/zebra.jpeg -o zebra.jpeg
```

Referring to local filesystem images in requests requires passing the additional parameter `--allowed_local_media_path` (described in the [Model Server Parameters](../../docs/parameters.md) section) when starting the docker container:
```bash
docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
  -p 8000:8000 -v $(pwd)/models:/models:rw -v $(pwd):/images:ro openvino/model_server:latest-gpu \
  --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov \
  --task text_generation --target_device NPU --allowed_local_media_path /images
```

```bash
curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenVINO/Phi-3.5-vision-instruct-fp16-ov\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}"
```
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "...",
        "role": "assistant"
      }
    }
  ],
  "created": 1741731554,
  "model": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": ...,
    "total_tokens": ...
  }
}
```

A similar call can be made with a Python script, for example using the `requests` package:

```python
import requests
import base64
base_url='http://127.0.0.1:8000/v3'
model_name = "OpenVINO/Phi-3.5-vision-instruct-fp16-ov"

def convert_image(path):
    with open(path, 'rb') as file:
        return base64.b64encode(file.read()).decode('utf-8')

# Build the same chat payload as in the curl example, passing the image as a base64 data URL
payload = {
    "model": model_name,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is on the picture."},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64," + convert_image("zebra.jpeg")}}
            ]
        }
    ],
    "max_completion_tokens": 100
}
response = requests.post(base_url + "/chat/completions", json=payload)
print(response.text)
```
Output:
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The picture features a single zebra standing in a grassy field with well-defined black and white stripes, distinctive facial markings, and a mane that is black at the base tapering into white at the tips. The zebra pays no attention to the camera, and it is likely identified as a horse due to its body size and the visible horns that resemble small antlers, which could indicate a moment of embarrassment or enduring a little pr",
        "role": "assistant"
      }
    }
  ],
  "created": 1773738822,
  "model": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 26,
    "completion_tokens": 100,
    "total_tokens": 126
  }
}
```
:::{dropdown} **Streaming call with OpenAI Python package**

Install the OpenAI client package:
```console
pip3 install openai
```

The request below streams the generated answer chunk by chunk:
```python
from openai import OpenAI
import base64
base_url='http://localhost:8000/v3'
model_name = "OpenVINO/Phi-3.5-vision-instruct-fp16-ov"

client = OpenAI(api_key='unused', base_url=base_url)

with open("zebra.jpeg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

stream = client.chat.completions.create(
    model=model_name,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is on the picture."},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64," + image_b64}}
            ]
        }
    ],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='', flush=True)
```

Output:
```
The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain.
The picture features a single zebra standing in a grassy field with well-defined black and white stripes, distinctive facial markings, and a mane that is black at the base tapering into white at the tips. The zebra pays no attention to the camera, and it is likely identified as a horse due to its body size and the visible horns that resemble small antlers, which could indicate a moment of embarrassment or enduring a little prank. Its eyes are black, and its ears are partially open, showing some interest in its surroundings. Its tail is black with a white stripe and a black tip at its end. The zebra appears to be well-fed and healthy, walking between the lush green grasses and a small patch of yellow flowers. Overall, the scene captured is both peaceful and candid, with the zebra immersed in its natural habitat.
```
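Streaming does not shorten the overall generation, but it lets the first words appear much earlier. The metric to watch is time-to-first-token; a minimal sketch (hypothetical helper, not part of the demo) of computing it from timestamps recorded around the loop above:

```python
def first_token_latency(request_start: float, chunk_times: list) -> float:
    """Time-to-first-token: delay between sending the request and the first streamed chunk."""
    if not chunk_times:
        raise ValueError("no chunks received")
    return chunk_times[0] - request_start

# example with fabricated timestamps, in seconds
print(first_token_latency(10.0, [10.5, 10.7, 11.0]))  # -> 0.5
```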

:::
## Benchmarking text generation

Text generation performance can be measured with the vLLM benchmarking app:
```console
git clone https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
cd benchmarks
curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset
python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --endpoint /v3/chat/completions --num-prompts 10 --trust-remote-code --max-concurrency 1

```
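The benchmark script reports aggregate metrics such as mean latency, latency percentiles, and request throughput. As a rough sketch of what such a summary involves (illustrative helper, not part of the benchmark script):

```python
def summarize_latencies(latencies_s):
    """Mean and p90 latency plus throughput, assuming sequential (concurrency 1) requests."""
    xs = sorted(latencies_s)
    mean_s = sum(xs) / len(xs)
    p90_s = xs[min(len(xs) - 1, int(0.9 * len(xs)))]
    return {"mean_s": mean_s, "p90_s": p90_s, "req_per_s": len(xs) / sum(xs)}

print(summarize_latencies([1.0, 2.0, 2.0, 3.0]))  # -> {'mean_s': 2.0, 'p90_s': 3.0, 'req_per_s': 0.5}
```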
