88 changes: 48 additions & 40 deletions demos/continuous_batching/long_context/README.md
@@ -2,22 +2,34 @@

Serving LLMs with very long contexts and prompts can be particularly challenging. The key goals are maximum throughput, minimal latency and reasonable memory consumption.
Long contexts are very common in applications such as RAG chains, document summarization, question answering and many more.
Below optimizations can significantly boost performance:
The optimizations below can significantly boost performance:

- Prefix caching
- KV cache compression
- Max prompt length (NPU)
- Cache interval multiplier for linear attention
- Tuning max number of batched tokens

**Prefix caching**

Prefix caching in large language models (LLMs) is an optimization technique used to improve performance when processing repeated or static parts of input prompts. Instead of recomputing the KV cache for the same prefix (e.g., a fixed instruction or context), it is computed once and kept after the request has been processed and the response returned.
When the same prefix is encountered again, the cached KV is reused, skipping redundant computations. This reduces latency and computational overhead, especially in scenarios like chatbots or applications with repetitive prompts.
When the same prefix is encountered again, the cached KV is reused, skipping redundant computations. This reduces latency and computational overhead, especially in scenarios like chatbots or applications with repetitive prompts. It is enabled by default.

**KV cache compression**
The KV cache stores the intermediate key and value tensors generated by the model’s attention layers for each token in the input sequence.
This cache allows the model to avoid recomputing attention for previous tokens when generating new tokens, greatly speeding up inference for long contexts.
For very long contexts or high concurrency, the KV cache can consume a large amount of memory (RAM or VRAM).
Compression reduces this memory usage, enabling longer prompts or more parallel requests without running out of memory.
This parameter applies only to pipelines with continuous batching and paged attention. It is not used with the NPU device.
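
As a rough illustration of why cache precision matters, KV cache size can be approximated as 2 (keys and values) x layers x KV heads x head dimension x bits per element x tokens. The sketch below uses hypothetical model dimensions (36 layers, 8 KV heads, head dimension 128) that are not taken from any model in this demo, and it ignores quantization metadata and block allocation overhead, so the usage reported by the server will differ.

```bash
# Rough KV cache size estimate in MiB. All model dimensions are hypothetical
# placeholders; quantization metadata and paging overhead are ignored.
kv_cache_mib() {
  # $1 = tokens, $2 = bits per element (4 for u4, 8 for u8, 16 for f16)
  layers=36; kv_heads=8; head_dim=128
  # 2x for keys and values; divide by 8 to convert bits to bytes
  bytes=$(( 2 * layers * kv_heads * head_dim * $1 * $2 / 8 ))
  echo $(( bytes / 1024 / 1024 ))
}

for bits in 4 8 16; do
  echo "100000 tokens at ${bits}-bit: $(kv_cache_mib 100000 "$bits") MiB"
done
```

Under these assumptions the estimate scales linearly with both the token count and the element width, which is why halving the precision roughly halves the cache footprint.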

**Max prompt length**
Because the NPU uses static memory allocation for prompt processing, a dedicated parameter was introduced for the NPU device: `max_prompt_len`. The default value of 1024 should be adjusted to the expected request size.

**Cache interval multiplier**
This parameter is intended for models with linear attention when prefix caching is enabled. It adjusts the allocation size for state blocks inside the openvino.genai backend. To process long inputs with a low memory footprint, it is recommended to increase this parameter from the default value of 8 to a higher value such as 64.

**Max number of batched tokens**
This parameter influences the behavior of the continuous batching algorithm and the size of the prompt chunks used for batching. The default value of 256 tokens is efficient when concurrent processing is expected. When a single client typically connects to a local model, especially with long prompts, increasing the value might improve first token latency.
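
To see why a larger value can help a single long request, it may be useful to count how many prefill chunks a prompt is split into. This sketch assumes each scheduling step processes at most `max_num_batched_tokens` prompt tokens of one request, which simplifies the real scheduler; the function name `prefill_chunks` is illustrative only.

```bash
# Number of prefill chunks for one prompt, assuming each scheduler step
# handles at most max_num_batched_tokens prompt tokens (a simplification).
prefill_chunks() {
  # $1 = prompt length in tokens, $2 = max_num_batched_tokens
  echo $(( ($1 + $2 - 1) / $2 ))   # ceiling division
}

echo "50000-token prompt, 256 per step:  $(prefill_chunks 50000 256) chunks"
echo "50000-token prompt, 4096 per step: $(prefill_chunks 50000 4096) chunks"
```

Fewer chunks means fewer scheduling steps before the first token is produced, at the cost of less interleaving with other concurrent requests.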

## Deployment

@@ -32,43 +44,44 @@ mkdir models
::: {tab-item} GPU
:sync: GPU
```bash
docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/gpt-oss-20b-int4-ov --tool_parser gptoss --reasoning_parser gptoss --task text_generation --enable_prefix_caching true --target_device GPU
docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/gpt-oss-20b-int4-ov --tool_parser gptoss --reasoning_parser gptoss --task text_generation --kv_cache_precision u4 --target_device GPU --cache_size 5 --max_num_batched_tokens 4096
```
:::
:::{tab-item} NPU
```bash
docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --max_prompt_len 16000 --tool_parser hermes3 --task text_generation --enable_prefix_caching true --target_device NPU
docker run --user $(id -u):$(id -g) -d --rm -v $(pwd)/models:/models:rw -p 8000:8000 --device /dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:latest-gpu --rest_port 8000 --model_repository_path /models --source_model OpenVINO/Qwen3-8B-int4-cw-ov --max_prompt_len 16000 --tool_parser hermes3 --task text_generation --target_device NPU
```
**Note:** It's recommended to set the `--max_prompt_len` value as low as possible. This improves performance but limits the number of tokens the model will accept.
:::
::::

## Testing performance
## Testing latency

Using `vllm` benchmark it's possible to check performance of the model with desired context length. It's also possible to set prefix parameters to check the performance benefit from prefix caching.
```bash
The command below generates a synthetic load with a configurable cached prompt length (50000) and new token length (10).
```text
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu
vllm bench serve --backend openai --base-url http://localhost:8000/v3 --endpoint /completions --model OpenVINO/gpt-oss-20b-int4-ov --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 20 --prefix-repetition-num-prefixes 1 --num-prompts 2 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 42
vllm bench serve --backend openai --base-url http://localhost:8000/v3 --endpoint /completions --model OpenVINO/gpt-oss-20b-int4-ov --tokenizer openai/gpt-oss-20b --prefix-repetition-prefix-len 50000 --prefix-repetition-suffix-len 10 --prefix-repetition-output-len 20 --prefix-repetition-num-prefixes 1 --num-prompts 1 --max_concurrency 1 --dataset-name prefix_repetition --num-warmups 1 --seed 1
```

## Performance Comparison Table

::::{tab-set}
:::{tab-item} iGPU
:sync: GPU
Platform: Intel(R) Core(TM) Ultra X7 368H
| Context Length (tokens) | TTFT No Caching (ms) | TPOT No Caching (ms) | TTFT Prefix Caching (ms) | TPOT Prefix Caching (ms) | KV Cache Usage (GB) |
**iGPU Platform: Intel(R) Core(TM) Ultra X7 368H**

| Context Length (tokens) | TTFT No Cache (ms) | TPOT No Cache (ms) | TTFT Prefix Cached (ms) | TPOT Prefix Cached (ms) | KV Cache Usage (GB) |
|---------------|---------------|---------------|---------------|---------------|---------------|
| 12,000 | 17 889 | 79.30 | 577 | 57.28 | 0.2 |
| 25,000 | 54 280 | 90.81 | 2 575 | 77.94 | 0.3 |
| 50,000 | 121 360 | 123.44 | 6 792 | 101.23 | 0.6 |
| 100,000 | 122 068 | 127.04 | 6 885 | 100.47 | 1.1 |
| 1,000 | 943 | 30.48 | 175 | 28.88 | 0.02 |
| 12,000 | 6 496 | 38.19 | 328 | 37.80 | 0.2 |
| 25,000 | 21 381 | 39.87 | 408 | 48.50 | 0.3 |
| 50,000 | 88 159 | 69.46 | 945 | 82.42 | 0.6 |
| 100,000 | 248 777 | 119.20 | 1 258 | 102.36 | 1.5 |

:::
:::{tab-item} NPU
:sync: NPU
Platform: Intel(R) Core(TM) Ultra X7 368H
| Context Length (tokens) | TTFT No Caching (ms) | TPOT No Caching (ms) | TTFT Prefix Caching (ms) | TPOT Prefix Caching (ms) |
These results confirm the gain from prefix caching for repeated tokens and demonstrate low KV cache usage, thanks to quantization, even with long contexts.
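
The size of the gain can be read off the iGPU table by dividing the uncached TTFT by the cached TTFT. The sketch below does that with `awk` for the rows starting at 1,000 tokens with 943 ms; the helper name `ttft_speedup` is illustrative and the numbers are copied from the table, not newly measured.

```bash
# TTFT speedup from prefix caching, using iGPU values copied from the table.
# Input columns: context length, TTFT without cache (ms), TTFT with prefix cache (ms)
ttft_speedup() {
  awk '{ printf "%s tokens: %.1fx faster TTFT\n", $1, $2 / $3 }'
}

ttft_speedup <<'EOF'
1000 943 175
12000 6496 328
25000 21381 408
50000 88159 945
100000 248777 1258
EOF
```

The ratio grows with context length because the uncached prefill cost rises much faster than the cost of reusing a cached prefix.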



**NPU Platform: Intel(R) Core(TM) Ultra X7 368H**

| Context Length (tokens) | TTFT No Cache (ms) | TPOT No Cache (ms) | TTFT Prefix Cached (ms) | TPOT Prefix Cached (ms) |
|---------------|---------------|---------------|---------------|---------------|
| 500 | 1 514 | 76.62 | 1 491 | 77.36 |
| 1,000 | 1 366 | 78.10 | 1 374 | 79.18 |
@@ -77,32 +90,23 @@ Platform: Intel(R) Core(TM) Ultra X7 368H
| 8,000 | 15 432 | 76.74 | 3 285 | 77.51 |
| 16,000 | 43 117 | 80.30 | 5 356 | 80.97 |

:::
::::
This table shows the gain from prefix caching on the NPU device and a nearly flat per-token latency (TPOT) across the whole range of prompt lengths.

The results show that the cache usage grows with the context length.
Prefix caching is very effective in reducing the time to first token, making long-context calls practical even on slower hardware.

## Testing accuracy

Testing accuracy for use cases with long context can be done via [lm-eval_harness](../accuracy/README.md).
The only difference is that the configured testing task should include a relevant dataset.

## Cache Precision

KV cache compression has minimal impact on accuracy and significantly reduces memory consumption and benchmark time.
It's recommended to use default KV cache precision which is INT8, but it's possible to change it to INT4. To do it, use parameter `--kv_cache_precision u4`.
The default value is u8, but it's possible to change it to u4, f16 or f32.

| Context Length (tokens) | TTFT for precision u4 (ms) | Cache size for u4 (GB) | TTFT for precision u8 (ms) | Cache size for u8 (GB) |
|-----------------|-----------------|----------------|-----------------|-----------------|
| 50,000 | 6 812 | 0.6 | 6 792 | 0.6 |
| 100,000 | 7 234 | 0.6 | 6 885 | 1.1 |
| 50,000 | 945 | 0.7 | 985 | 1.5 |

Lower precision in KV Cache reduces the memory consumption and can also improve latency.

## Max prompt length for NPU

Parameter `--max_prompt_len` has significant impact on performance for NPU. The lower parameter value is the faster request will be processed.
The table below shows a comparison for a prompt with 4K tokens with `--max_prompt_len` set to 16K and 4K.
The `--max_prompt_len` parameter affects latency. Adjust it to the expected input length to optimize performance.

| Max prompt length | TTFT (ms) | TPOT (ms) |
|---------------------------|------------------|-------------------|
@@ -112,12 +116,16 @@ In a table below shows a comparison on prompt with 4K tokens with `--max_prompt_

## Recommendations

Enable prefix caching feature with `--enable_prefix_caching` parameter when you expect reusing parts of the context. That is typically the case for RAG, chat and agentic application.
Take advantage of prefix caching to process repeated tokens faster.

Use `u4` KV cache compression unless no loss of accuracy is acceptable.

Set the KV cache size via `--cache_size` parameter based on the available memory, expected concurrency and context length. It will improve the performance.

Use KV cache compression as INT8 which is the default setting.
For NPU, set the `--max_prompt_len` parameter based on the expected input length. Lower `max_prompt_len` values reduce memory usage and latency.

Set the KV cache size via `--cache_size` parameter based on the available memory, expected concurrency and context length or use default value (`0`) to make it dynamic. It will improve the performance.
For models with linear attention like Qwen3.6-35B-A3B, set the parameter `--cache_interval_multiplier=64` to reduce memory usage with prefix caching.

For NPU set parameter `--max_prompt_len` as low as possible. The lower `max_prompt_len` value, the better the performance will be.
In a scenario with low concurrency and long context, increase `max_num_batched_tokens` to a higher value such as 4096, or even the maximum model context length.

**Note:** You can limit concurrency on the server with the `--rest_workers` parameter, which by default allows as many connections as there are CPU cores. Alternatively, the limit can be set at the model level with `--max_num_seqs`.