[WIP] Support long-context and MTP prefix-cache hits by grimoire · Pull Request #4688 · InternLM/lmdeploy

grimoire · 2026-06-17T06:30:25Z

Summary

This PR is a follow-up to the prefix-cache refactor in #4618.

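It enables prefix-cache reuse in two cases that were previously rolled back or disabled (long-context chunked prefill and Spec/MTP), and includes the supporting allocation, SSM restore, and regression-test changes: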

- allow prefix-cache hits to resume long-context chunked prefill from the matched prefix instead of rolling back when the remaining suffix still needs chunking;
- enable prefix caching for Spec/MTP with one-block overlap recompute, so the target model recomputes the hidden-state bridge needed by the draft/MTP path;
- keep matched-but-recomputed overlap blocks private/writable during trie allocation, avoiding writes into shared cached KV blocks;
- handle SSM prefix-cache restore through exact ready checkpoints, including sparse checkpoint cases where the private recompute span may be larger than one block;
- add regressions for scheduler rollback, chunk flags, cached-token accounting, MTP overlap matching/allocation, VLM boundary expansion, and SSM checkpoint restore.
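
To make the block accounting above concrete, here is a minimal Python sketch. It is not the code in this PR: `PrefillPlan`, `plan_prefill`, and all parameter names are hypothetical and chosen only for illustration. It assumes that KV blocks hold `block_size` tokens, that at least one prompt token is always left to compute, and that SSM checkpoints are given as block counts at which a restorable state exists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PrefillPlan:
    shared_blocks: int             # matched blocks reused read-only from the cache
    private_blocks: int            # matched blocks recomputed into private, writable copies
    start_token: int               # first token position that is recomputed
    chunks: List[Tuple[int, int]]  # [start, end) token ranges for chunked prefill


def plan_prefill(num_tokens: int,
                 matched_blocks: int,
                 block_size: int,
                 chunk_size: int,
                 overlap_blocks: int = 0,
                 ready_checkpoints: Optional[List[int]] = None) -> PrefillPlan:
    """Hypothetical planner: decide where prefill resumes after a prefix-cache hit."""
    # Assumption of this sketch: never reuse the whole prompt, so that at
    # least one token is computed.
    max_reusable = (num_tokens - 1) // block_size
    matched = min(matched_blocks, max_reusable)

    # Spec/MTP: step back by the overlap so the target model recomputes the
    # hidden states that the draft path needs at the boundary.
    shared = max(matched - overlap_blocks, 0)

    # SSM: state can only be restored at an exact ready checkpoint, so fall
    # back to the largest checkpoint that does not exceed the shared prefix.
    if ready_checkpoints is not None:
        shared = max((c for c in ready_checkpoints if c <= shared), default=0)

    # Matched blocks that are recomputed must not alias shared cached blocks.
    private = matched - shared
    start = shared * block_size

    # Resume chunked prefill from the reused prefix instead of from token 0.
    chunks = [(s, min(s + chunk_size, num_tokens))
              for s in range(start, num_tokens, chunk_size)]
    return PrefillPlan(shared, private, start, chunks)


# Long-context hit: 8 of the prompt's blocks are cached, the suffix is chunked.
print(plan_prefill(1000, 8, 64, 256).chunks)
# MTP with one-block overlap: 7 blocks stay shared, 1 is recomputed privately.
print(plan_prefill(1000, 8, 64, 256, overlap_blocks=1))
# Sparse SSM checkpoints: the private recompute span grows to 4 blocks.
print(plan_prefill(1000, 8, 64, 256, ready_checkpoints=[0, 4]))
```

In this sketch the overlap and checkpoint rules only ever shrink the shared prefix, so cached blocks are never written; everything between the shared prefix and the matched prefix is treated as private recompute.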

grimoire added 5 commits June 16, 2026 18:19

- e01a174 allow prefix-cache hits to resume long-context chunks
- d5b0805 mtp
- 5421810 add test
- db99457 rename
- 078a62e better readability