update demo of long context handling #4279

Open
dtrawins wants to merge 6 commits into przepeck/long_context_demo from long-update

Conversation

@dtrawins

Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

@dtrawins requested review from dkalinowski and przepeck on June 10, 2026 07:52
This cache allows the model to avoid recomputing attention for previous tokens when generating new tokens, greatly speeding up inference for long contexts.
For very long contexts or high concurrency, the KV cache can consume a large amount of memory (RAM or VRAM).
Compression reduces this memory usage, enabling longer prompts or more parallel requests without running out of memory.
This parameter is applicable only or pipelines with continuous batching and paged attention. It is not used with NPU device.
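As a rough illustration of why the cache grows with context length and concurrency, the sketch below estimates KV cache size from the model shape. The function name and the example dimensions (32 layers, 8 KV heads, head size 128) are assumptions chosen for illustration, not values taken from this demo, and the estimate ignores any per-block metadata a compressed format may add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element):
    """Estimate KV cache size in bytes.

    Each cached token stores one key vector and one value vector
    (hence the factor 2) for every layer and every KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# Hypothetical model shape; compare 16-bit and 8-bit cache elements.
for label, element_size in (("16-bit", 2), ("8-bit", 1)):
    size = kv_cache_bytes(32, 8, 128, 100_000, element_size)
    print(f"{label}: {size / 2**30:.1f} GiB")
```

The estimate scales linearly with both the number of cached tokens and the element size, so halving the element size roughly halves the cache footprint for the same context length.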

Collaborator

Suggested change
- This parameter is applicable only or pipelines with continuous batching and paged attention. It is not used with NPU device.
+ This parameter is applicable only for pipelines with continuous batching and paged attention. It is not used with NPU device.

Comment thread demos/continuous_batching/long_context/README.md Outdated

| 50,000 | 945 | 0.7 | 985 | 1.5 |
| 100,000 | 1258 | 1.5 | 1713 | 3 |
Lower precision in KV Cache reduce the memory consumption and can also improve latency.
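To make the precision trade-off concrete, here is a minimal sketch of symmetric 8-bit quantization applied to a single cached vector. It is a generic illustration and an assumption on my part, not the compression scheme the runtime actually uses; the helper names and the sample values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


key_vector = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up values
q, scale = quantize_int8(key_vector)
restored = dequantize(q, scale)

# Rounding error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(q, f"max error: {max_error:.4f}")
```

Each element is stored in one byte instead of two or four, at the cost of a small, bounded rounding error per element.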

Collaborator

Suggested change
Lower precision in KV Cache reduce the memory consumption and can also improve latency.
Lower precision in KV Cache reduce the memory consumption and can also improve latency.

Comment thread demos/continuous_batching/long_context/README.md Outdated
Co-authored-by: Paweł Rzepecki <pawel.rzepecki@intel.com>
Comment thread demos/continuous_batching/long_context/README.md Outdated
Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>

2 participants