update demo of long context handling by dtrawins · Pull Request #4279 · openvinotoolkit/model_server

dtrawins · 2026-06-10T07:51:34Z

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

Unit tests added.
The documentation updated.
Change follows security best practices.

przepeck · 2026-06-10T08:12:11Z

 This cache allows the model to avoid recomputing attention for previous tokens when generating new tokens, greatly speeding up inference for long contexts.
 For very long contexts or high concurrency, the KV cache can consume a large amount of memory (RAM or VRAM).
 Compression reduces this memory usage, enabling longer prompts or more parallel requests without running out of memory.
+This parameter is applicable only or pipelines with continuous batching and paged attention. It is not used with NPU device.


Suggested change

This parameter is applicable only or pipelines with continuous batching and paged attention. It is not used with NPU device.

This parameter is applicable only for pipelines with continuous batching and paged attention. It is not used with NPU device.
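For context on the memory claim in the lines under review, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length and with the precision of the cached values. The model shape (32 layers, 8 KV heads, head dimension 128) and the helper name `kv_cache_bytes` are hypothetical and not taken from the demo. The estimate also ignores any per-block scale or zero-point overhead that a real compressed cache would add.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Approximate KV cache size in bytes for a single sequence.

    The factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

if __name__ == "__main__":
    for tokens in (50_000, 100_000):
        for name, nbytes in (("16-bit", 2), ("8-bit", 1)):
            size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, nbytes)
            print(f"{tokens:>7} tokens, {name} cache: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens, and halving the bytes per value halves the estimate, which is the intuition behind using a lower-precision cache for long prompts.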

przepeck · 2026-06-10T08:14:22Z

-
+|     50,000      |      945     |       0.7      |      985       |       1.5       |
+|    100,000      |      1258    |       1.5      |      1713      |       3         |
+Lower precision in KV Cache reduce the memory consumption and can also improve latency. 


Suggested change

Lower precision in KV Cache reduce the memory consumption and can also improve latency.

Lower precision in KV Cache reduce the memory consumption and can also improve latency.

Co-authored-by: Paweł Rzepecki <pawel.rzepecki@intel.com>

Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>

dtrawins and others added 2 commits June 10, 2026 09:49

update demo of long context handling

fbb1c1d

Merge branch 'przepeck/long_context_demo' into long-update

b5fc73c

dtrawins requested review from dkalinowski and przepeck June 10, 2026 07:52

przepeck reviewed Jun 10, 2026

Comment thread demos/continuous_batching/long_context/README.md Outdated

przepeck reviewed Jun 10, 2026

dtrawins added 2 commits June 10, 2026 10:23

review changes

4867867

merge

95b5c1e

przepeck reviewed Jun 10, 2026

Comment thread demos/continuous_batching/long_context/README.md Outdated

Apply suggestions from code review

25cb2d3

Co-authored-by: Paweł Rzepecki <pawel.rzepecki@intel.com>

przepeck approved these changes Jun 10, 2026

dtrawins commented Jun 10, 2026

Comment thread demos/continuous_batching/long_context/README.md Outdated

Apply suggestions from code review

2ae5a7c

Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>

dtrawins wants to merge 6 commits into przepeck/long_context_demo from long-update


2 participants
