# Mkulakow/responses api #4039
@@ -16,7 +16,7 @@ ovms_demos_continuous_batching_accuracy
 ```

 This demo shows how to deploy LLM models in the OpenVINO Model Server using continuous batching and paged attention algorithms.
-Text generation use case is exposed via OpenAI API `chat/completions` and `completions` endpoints.
+Text generation use case is exposed via OpenAI API `chat/completions`, `completions` and `responses` endpoints.
 That makes it easy to use and efficient, especially on Intel® Xeon® processors and ARC GPUs.

 > **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, and Intel® Core Ultra Series on Ubuntu24 and Windows11.
@@ -73,7 +73,7 @@ curl http://localhost:8000/v3/models

 ## Request Generation

-Model exposes both `chat/completions` and `completions` endpoints with and without stream capabilities.
+Model exposes `chat/completions`, `completions` and `responses` endpoints, with and without streaming.
 The chat endpoint is expected to be used for scenarios where the conversation context should be passed by the client and the model prompt is created by the server based on the model's Jinja template.
 The completions endpoint should be used to pass the prompt directly from the client and for models without a Jinja template. This demo uses the model `Qwen/Qwen3-30B-A3B-Instruct-2507` in int4 precision. It has chat capability, so the `chat/completions` endpoint will be employed:
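To make the difference between the three request shapes concrete, here is a small illustrative sketch (not part of the diff; field names follow the OpenAI API, and the payloads are assumptions based on the descriptions above):

```python
# Illustrative request bodies for the three endpoints (built locally, not sent anywhere).
chat_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    # chat/completions: the server builds the prompt from the model's Jinja chat template
    "messages": [{"role": "user", "content": "What is OpenVINO?"}],
}
completions_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    # completions: the client passes the raw prompt directly
    "prompt": "What is OpenVINO?",
}
responses_request = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    # responses: free-form input, either a plain string or a structured list of turns
    "input": "What is OpenVINO?",
}
print(sorted(set(chat_request) | set(completions_request) | set(responses_request)))
```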
@@ -148,9 +148,76 @@ curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/
 :::

+### Unary calls to the responses endpoint using cURL
+
+::::{tab-set}
+
+:::{tab-item} Linux
+```bash
+curl http://localhost:8000/v3/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+    "max_output_tokens": 30,
+    "input": "What is OpenVINO?"
+  }' | jq .
+```
+:::
+
+:::{tab-item} Windows
+Windows PowerShell
+```powershell
+(Invoke-WebRequest -Uri "http://localhost:8000/v3/responses" `
+  -Method POST `
+  -Headers @{ "Content-Type" = "application/json" } `
+  -Body '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_output_tokens": 30, "input": "What is OpenVINO?"}').Content
+```
+
+Windows Command Prompt
+```bat
+curl -s http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"max_output_tokens\": 30, \"input\": \"What is OpenVINO?\"}"
+```
+:::
+
+::::
+
+:::{dropdown} Expected Response
+```json
+{
+  "id": "resp-1724405400",
+  "object": "response",
+  "created_at": 1724405400,
+  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+  "status": "completed",
+  "output": [
+    {
+      "id": "msg-0",
+      "type": "message",
+      "role": "assistant",
+      "status": "completed",
+      "content": [
+        {
+          "type": "output_text",
+          "text": "OpenVINO is an open-source software framework developed by Intel for optimizing and deploying computer vision, machine learning, and deep learning models on various devices,",
+          "annotations": []
+        }
+      ]
+    }
+  ],
+  "usage": {
+    "input_tokens": 27,
+    "input_tokens_details": { "cached_tokens": 0 },
+    "output_tokens": 30,
+    "output_tokens_details": { "reasoning_tokens": 0 },
+    "total_tokens": 57
+  }
+}
+```
+:::

> **Reviewer comment:** Is `reasoning_tokens` supported when the model actually reasons?
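The expected response above can be post-processed with a few lines of Python; a minimal sketch that walks the `output` list (field names taken from the sample JSON above, `sample` is a trimmed-down copy of it):

```python
def extract_output_text(response: dict) -> str:
    """Concatenate all output_text fragments from a /v3/responses reply."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for chunk in item.get("content", []):
            if chunk.get("type") == "output_text":
                parts.append(chunk["text"])
    return "".join(parts)

# Trimmed-down version of the expected response shown above
sample = {
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [
                {"type": "output_text",
                 "text": "OpenVINO is an open-source software framework",
                 "annotations": []}
            ],
        }
    ],
    "usage": {"input_tokens": 27, "output_tokens": 30, "total_tokens": 57},
}
print(extract_output_text(sample))  # -> OpenVINO is an open-source software framework
```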

 ### OpenAI Python package

-The endpoints `chat/completions` and `completions` are compatible with the OpenAI client, so they can be easily used to generate text, also in streaming mode:
+The endpoints `chat/completions`, `completions` and `responses` are compatible with the OpenAI client, so they can be easily used to generate text, also in streaming mode:

 Install the client library:
 ```console
@@ -262,6 +329,31 @@ So, **6 = 3**.
 ```
 :::
+:::{tab-item} Responses
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v3",
+    api_key="unused"
+)
+
+stream = client.responses.create(
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
+    input="Say this is a test",
+    stream=True,
+)
+for event in stream:
+    if event.type == "response.output_text.delta":
+        print(event.delta, end="", flush=True)
+```
+
+Output:
+```
+It looks like you're testing me!
+```
+:::

 ::::
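The streaming loop above only reacts to text-delta events; the same accumulation logic can be exercised offline against a stubbed event stream (a sketch; `Event` is a stand-in for the objects the OpenAI client yields, and the event type string is taken from the example above):

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Stub for the streaming events yielded by client.responses.create(stream=True)."""
    type: str
    delta: str = ""

def collect_text(events) -> str:
    # Keep only the incremental text deltas, mirroring the loop in the demo
    return "".join(e.delta for e in events if e.type == "response.output_text.delta")

stub = [
    Event("response.created"),
    Event("response.output_text.delta", "It looks like "),
    Event("response.output_text.delta", "you're testing me!"),
    Event("response.completed"),
]
print(collect_text(stub))  # -> It looks like you're testing me!
```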

 ## Check how to use AI agents with MCP servers and language models

@@ -300,5 +392,6 @@ Check the [guide of using lm-evaluation-harness](./accuracy/README.md)
 - [Official OpenVINO LLM models in HuggingFace](https://huggingface.co/collections/OpenVINO/llm)
 - [Chat Completions API](../../docs/model_server_rest_api_chat.md)
 - [Completions API](../../docs/model_server_rest_api_completions.md)
+- [Responses API](../../docs/model_server_rest_api_responses.md)
 - [Writing client code](../../docs/clients_genai.md)
 - [LLM calculator reference](../../docs/llm/reference.md)
@@ -9,7 +9,7 @@ ovms_demos_vlm_npu
 ```

 This demo shows how to deploy Vision Language Models in the OpenVINO Model Server.
-Text generation use case is exposed via OpenAI API `chat/completions` endpoint.
+Text generation use case is exposed via OpenAI API `chat/completions` and `responses` endpoints.

 > **Note:** This demo was tested on 4th - 6th generation Intel® Xeon® Scalable Processors, Intel® Arc™ GPU Series and Intel® Core Ultra Series on Ubuntu24, RedHat9 and Windows11.
@@ -119,6 +119,45 @@ curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/js
 ```
 :::

+:::{dropdown} **Unary call with curl using responses endpoint**
+**Note**: using URLs in requests requires the `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
+
+```bash
+curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
+```

> **Reviewer comment:** Can you break down the command into multiple lines to showcase what is actually being sent?

+```json
+{
+  "id": "resp-1741731554",
+  "object": "response",
+  "created_at": 1741731554,
+  "model": "OpenGVLab/InternVL2-2B",
+  "status": "completed",
+  "output": [
+    {
+      "id": "msg-0",
+      "type": "message",
+      "role": "assistant",
+      "status": "completed",
+      "content": [
+        {
+          "type": "output_text",
+          "text": "The picture features a zebra standing in a grassy plain. Zebras are known for their distinctive black and white striped patterns, which help them blend in for camouflage purposes.",
+          "annotations": []
+        }
+      ]
+    }
+  ],
+  "usage": {
+    "input_tokens": 19,
+    "input_tokens_details": { "cached_tokens": 0 },
+    "output_tokens": 83,
+    "output_tokens_details": { "reasoning_tokens": 0 },
+    "total_tokens": 102
+  }
+}
+```
+:::
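The single-line escaped JSON is hard to read; the same payload can be assembled in Python and serialized with `json` (a sketch addressing the reviewer's request; the helper name `build_vision_request` is hypothetical, and the field names come from the curl body above):

```python
import json

def build_vision_request(model: str, text: str, image_url: str,
                         max_output_tokens: int = 100) -> dict:
    """Build the same payload as the one-line curl body above: one user turn
    with a text part and an image part."""
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": text},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
        "max_output_tokens": max_output_tokens,
    }

payload = build_vision_request(
    "OpenGVLab/InternVL2-2B",
    "Describe what is on the picture.",
    "http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg",
)
print(json.dumps(payload, indent=2))
```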

 :::{dropdown} **Unary call with python requests library**

 ```console

@@ -177,9 +216,9 @@ print(response.text)
 }
 ```
 :::
-:::{dropdown} **Streaming request with OpenAI client**
+:::{dropdown} **Streaming request with OpenAI client using chat/completions**

-The endpoint `chat/completions` is compatible with the OpenAI client, so it can be easily used to generate text also in streaming mode:
+The endpoints `chat/completions` and `responses` are compatible with the OpenAI client, so they can be easily used to generate text also in streaming mode:

 Install the client library:
 ```console
@@ -223,6 +262,79 @@ The picture features a zebra standing in a grassy area. The zebra is characteriz

 :::

+:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
+
+```console
+pip3 install openai
+```
+```python
+from openai import OpenAI
+import base64
+
+base_url = 'http://localhost:8000/v3'
+model_name = "OpenGVLab/InternVL2-2B"
+
+client = OpenAI(api_key='unused', base_url=base_url)
+
+def convert_image(path):
+    # Read the image file and return its base64-encoded content
+    with open(path, 'rb') as file:
+        return base64.b64encode(file.read()).decode("utf-8")
+
+stream = client.responses.create(
+    model=model_name,
+    input=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "input_text", "text": "Describe what is on the picture."},
+                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{convert_image('zebra.jpeg')}"}
+            ]
+        }
+    ],
+    stream=True,
+)
+for event in stream:
+    if event.type == "response.output_text.delta":
+        print(event.delta, end="", flush=True)
+```
+
+Output:
+```
+The picture features a zebra standing in a grassy area. The zebra is characterized by its distinctive black and white striped pattern, which covers its entire body, including its legs, neck, and head. Zebras have small, rounded ears and a long, flowing tail. The background appears to be a natural grassy habitat, typical of a savanna or plain.
+```
+
+:::

 ## Benchmarking text generation with high concurrency

+OpenVINO Model Server employs efficient parallelization for text generation. It can be used to generate text at high concurrency, in an environment shared by multiple clients.

> **Reviewer comment:** Why is this being added in the responses API PR? Is it related?

+It can be demonstrated using the benchmarking app from the vLLM repository:
+```console
+git clone --branch v0.7.3 --depth 1 https://github.com/vllm-project/vllm
+cd vllm
+pip3 install -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
+cd benchmarks
+python benchmark_serving.py --backend openai-chat --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --host localhost --port 8000 --model OpenGVLab/InternVL2-2B --endpoint /v3/chat/completions --max-concurrency 1 --num-prompts 100 --trust-remote-code
+
+Burstiness factor: 1.0 (Poisson process)
+Maximum request concurrency: None
+============ Serving Benchmark Result ============
+Successful requests:              100
+Benchmark duration (s):           287.81
+Total input tokens:               15381
+Total generated tokens:           20109
+Request throughput (req/s):       0.35
+Output token throughput (tok/s):  69.87
+Total Token throughput (tok/s):   123.31
+---------------Time to First Token----------------
+Mean TTFT (ms):                   1513.96
+Median TTFT (ms):                 1368.93
+P99 TTFT (ms):                    2647.45
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                   6.68
+Median TPOT (ms):                 6.68
+P99 TPOT (ms):                    8.02
+```
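The reported throughput figures follow directly from the request and token totals, which makes a quick sanity check of the benchmark output easy (values taken from the run above):

```python
# Totals reported by benchmark_serving.py in the run above
successful_requests = 100
duration_s = 287.81
total_input_tokens = 15381
total_generated_tokens = 20109

# Throughput is simply totals divided by wall-clock duration
request_throughput = successful_requests / duration_s
output_tok_throughput = total_generated_tokens / duration_s
total_tok_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(round(request_throughput, 2))     # -> 0.35
print(round(output_tok_throughput, 2))  # -> 69.87
print(round(total_tok_throughput, 2))   # -> 123.31
```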

 ## Testing the model accuracy over serving API

@@ -237,5 +349,6 @@ Check [VLM usage with NPU acceleration](../../vlm_npu/README.md)
 - [Export models to OpenVINO format](../common/export_models/README.md)
 - [Supported VLM models](https://openvinotoolkit.github.io/openvino.genai/docs/supported-models/#visual-language-models-vlms)
 - [Chat Completions API](../../../docs/model_server_rest_api_chat.md)
+- [Responses API](../../../docs/model_server_rest_api_responses.md)
 - [Writing client code](../../../docs/clients_genai.md)
 - [LLM calculator reference](../../../docs/llm/reference.md)