diff --git a/demos/vlm_npu/README.md b/demos/vlm_npu/README.md index 200fe36ece..c38144b08f 100644 --- a/demos/vlm_npu/README.md +++ b/demos/vlm_npu/README.md @@ -11,58 +11,19 @@ It is targeted on client machines equipped with NPU accelerator. ## Prerequisites -**Model preparation**: Python 3.10 or higher with pip and HuggingFace account +**OVMS 2025.1 or higher** **Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md) **(Optional) Client**: git and Python for using OpenAI client package and vLLM benchmark app - -## Model preparation -Here, the original Pytorch LLM model and the tokenizer will be converted to IR format and optionally quantized. -That ensures faster initialization time, better performance and lower memory consumption. -LLM engine parameters will be defined inside the `graph.pbtxt` file. - -Download export script, install it's dependencies and create directory for the models: -```console -curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py -pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt -pip3 install transformers==4.53.3 --force-reinstall -mkdir models -``` - -Run `export_model.py` script to download and quantize the model: - -> **Note:** The users in China need to set environment variable HF_ENDPOINT="https://hf-mirror.com" before running the export script to connect to the HF Hub. - -**LLM** -```console -python export_model.py text_generation --source_model microsoft/Phi-3.5-vision-instruct --target_device NPU --config_file_path models/config.json --model_repository_path models --overwrite_models -``` - Note that by default, NPU sets limitation on the prompt length (which in VLM also include image tokens) to 1024 tokens. You can modify that limit by using `--max_prompt_len` parameter. > **Note:** You can change the model used in the demo out of any topology [tested](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms) with OpenVINO. -You should have a model folder like below: -``` -tree models -models -├── config.json -└── microsoft - └── Phi-3.5-vision-instruct - ├── config.json - ├── generation_config.json - ├── graph.pbtxt - ├── openvino_detokenizer.bin - ├── openvino_detokenizer.xml - ├── openvino_model.bin - ├── openvino_model.xml - ├── openvino_tokenizer.bin - ├── openvino_tokenizer.xml - ├── special_tokens_map.json - ├── tokenizer_config.json - └── tokenizer.json +Create directory for the model: +```console +mkdir -p models ``` The default configuration should work in most cases but the parameters can be tuned via `export_model.py` script arguments. Run the script with `--help` argument to check available parameters and see the [LLM calculator documentation](../../docs/llm/reference.md) to learn more about configuration options. @@ -74,8 +35,7 @@ The default configuration should work in most cases but the parameters can be tu Running this command starts the container with NPU enabled: ```bash -docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \ --p 8000:8000 -v $(pwd)/models:/workspace:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json +docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --task text_generation --target_device NPU ``` ::: @@ -88,10 +48,8 @@ Assuming you have unpacked model server package, make sure to: as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server. -Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server. - ```bat -ovms --rest_port 8000 --config_path ./models/config.json +ovms --rest_port 8000 --model_repository_path models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --task text_generation --target_device NPU ``` ::: @@ -99,22 +57,19 @@ ovms --rest_port 8000 --config_path ./models/config.json Wait for the model to load. You can check the status with a simple command: ```console -curl http://localhost:8000/v1/config +curl http://localhost:8000/v3/models ``` ```json { - "microsoft/Phi-3.5-vision-instruct": { - "model_version_status": [ - { - "version": "1", - "state": "AVAILABLE", - "status": { - "error_code": "OK", - "error_message": "OK" - } - } - ] + "object": "list", + "data": [ + { + "id": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov", + "object": "model", + "created": 1773742559, + "owned_by": "OVMS" } + ] } ``` @@ -130,12 +85,11 @@ curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/m Referring to local filesystem images in requests requires passing additional parameter `--allowed_local_media_path` (described in [Model Server Parameters](../../docs/parameters.md) section) when starting docker container: ```bash -docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \ --p 8000:8000 -v $(pwd)/models:/workspace:ro -v $(pwd):/images:ro openvino/model_server:latest-gpu --rest_port 8000 --config_path /workspace/config.json --allowed_local_media_path /images +docker run -d --rm --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) -p 8000:8000 -v $(pwd)/models:/models:rw openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --task text_generation --target_device NPU --allowed_local_media_path /images ``` ```bash -curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"microsoft/Phi-3.5-vision-instruct\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is one the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}" +curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"OpenVINO/Phi-3.5-vision-instruct-fp16-ov\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe what is one the picture.\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}" ``` ```json { @@ -151,7 +105,7 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js } ], "created": 1741731554, - "model": "microsoft/Phi-3.5-vision-instruct", + "model": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov", "object": "chat.completion", "usage": { "prompt_tokens": 19, @@ -168,7 +122,7 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js import requests import base64 base_url='http://127.0.0.1:8000/v3' -model_name = "microsoft/Phi-3.5-vision-instruct" +model_name = "OpenVINO/Phi-3.5-vision-instruct-fp16-ov" def convert_image(Image): with open(Image,'rb' ) as file: @@ -199,20 +153,19 @@ print(response.text) { "finish_reason": "stop", "index": 0, - "logprobs": null, "message": { - "content": "The picture features a zebra standing in a grassy plain. Zebras are known for their distinctive black and white striped patterns, which help them blend in for camouflage purposes. The zebra pictured is standing on a green field with patches of grass, indicating it may be in its natural habitat. Zebras are typically social animals and are often found in savannahs and grasslands.", + "content": "The picture features a single zebra standing in a grassy field with well-defined black and white stripes, distinctive facial markings, and a mane that is black at the base tapering into white at the tips. The zebra pays no attention to the camera, and it is likely identified as a horse due to its body size and the visible horns that resemble small antlers, which could indicate a moment of embarrassment or enduring a little pr", "role": "assistant" } } ], - "created": 1741731554, - "model": "microsoft/Phi-3.5-vision-instruct", + "created": 1773738822, + "model": "OpenVINO/Phi-3.5-vision-instruct-fp16-ov", "object": "chat.completion", "usage": { - "prompt_tokens": 19, - "completion_tokens": 83, - "total_tokens": 102 + "prompt_tokens": 26, + "completion_tokens": 100, + "total_tokens": 126 } } ``` @@ -233,7 +186,7 @@ pip3 install openai from openai import OpenAI import base64 base_url='http://localhost:8000/v3' -model_name = "microsoft/Phi-3.5-vision-instruct" +model_name = "OpenVINO/Phi-3.5-vision-instruct-fp16-ov" client = OpenAI(api_key='unused', base_url=base_url) @@ -262,7 +215,7 @@ for chunk in stream: Output: ``` -The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain. +The picture features a single zebra standing in a grassy field with well-defined black and white stripes, distinctive facial markings, and a mane that is black at the base tapering into white at the tips. The zebra pays no attention to the camera, and it is likely identified as a horse due to its body size and the visible horns that resemble small antlers, which could indicate a moment of embarrassment or enduring a little prank. Its eyes are black, and its ears are partially open, showing some interest in its surroundings. Its tail is black with a white stripe and a black tip at its end. The zebra appears to be well-fed and healthy, walking between the lush green grasses and a small patch of yellow flowers. Overall, the scene captured is both peaceful and candid, with the zebra immersed in its natural habitat. ``` ::: @@ -278,7 +231,7 @@ cd vllm pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu cd benchmarks curl -L https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -o ShareGPT_V3_unfiltered_cleaned_split.json # sample dataset -python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model microsoft/Phi-3.5-vision-instruct --endpoint /v3/chat/completions --num-prompts 10 --trust-remote-code --max-concurrency 1 +python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenVINO/Phi-3.5-vision-instruct-fp16-ov --endpoint /v3/chat/completions --num-prompts 10 --trust-remote-code --max-concurrency 1 ```