[WIP] Support long-context and MTP prefix-cache hits#4688

Draft
grimoire wants to merge 5 commits into InternLM:main from grimoire:prefix-caching-part2

@grimoire

Collaborator

Summary

This PR is a follow-up to the prefix-cache refactor in #4618.

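It enables prefix-cache reuse in two cases that were previously rolled back or disabled (long-context chunked prefill and Spec/MTP), and includes the supporting allocation, SSM restore, and test changes: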

  • allow prefix-cache hits to resume long-context chunked prefill from the matched prefix instead of rolling back when the remaining suffix still needs chunking;
  • enable prefix caching for Spec/MTP with one-block overlap recompute, so the target model recomputes the hidden-state bridge needed by the draft/MTP path;
  • keep matched-but-recomputed overlap blocks private/writable during trie allocation, avoiding writes into shared cached KV blocks;
  • handle SSM prefix-cache restore through exact ready checkpoints, including sparse checkpoint cases where the private recompute span may be larger than one block;
  • add regression tests for scheduler rollback, chunk flags, cached-token accounting, MTP overlap matching/allocation, VLM boundary expansion, and SSM checkpoint restore.
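
As a rough illustration of the first two items, here is a minimal Python sketch of how a prefix-cache hit might translate into a chunked-prefill plan. It is not code from this PR: `plan_prefill`, `PrefillPlan`, and all parameter names are hypothetical, and it assumes block-granular matching, a fixed maximum chunk size, and that at least one token must always be computed. Setting `overlap_blocks=1` models the one-block overlap recompute described for Spec/MTP.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PrefillPlan:
    """Hypothetical container; not a type from this PR."""
    reused_tokens: int              # tokens served from shared cached blocks
    chunks: List[Tuple[int, int]]   # [start, end) token ranges still to prefill


def plan_prefill(num_tokens: int, matched_blocks: int, block_size: int,
                 max_chunk: int, overlap_blocks: int = 0) -> PrefillPlan:
    """Resume prefill from the matched prefix instead of rolling back to 0.

    Assumption: the last `overlap_blocks` matched blocks are recomputed into
    private blocks, so only the blocks before them are reused as shared KV.
    """
    if num_tokens <= 0 or block_size <= 0 or max_chunk <= 0:
        raise ValueError("num_tokens, block_size and max_chunk must be positive")

    reused_blocks = max(matched_blocks - overlap_blocks, 0)
    # Assumption: always leave at least one token to compute, block-aligned.
    reused_blocks = min(reused_blocks, (num_tokens - 1) // block_size)
    start = reused_blocks * block_size

    # Split the remaining suffix into chunks of at most `max_chunk` tokens.
    chunks = []
    pos = start
    while pos < num_tokens:
        end = min(pos + max_chunk, num_tokens)
        chunks.append((pos, end))
        pos = end
    return PrefillPlan(reused_tokens=start, chunks=chunks)


# A 1000-token prompt with 4 matched blocks of 64 tokens and 256-token chunks.
plain = plan_prefill(1000, matched_blocks=4, block_size=64, max_chunk=256)
mtp = plan_prefill(1000, matched_blocks=4, block_size=64, max_chunk=256,
                   overlap_blocks=1)
print(plain.reused_tokens, plain.chunks)
print(mtp.reused_tokens, mtp.chunks)
```

In this sketch the MTP-style call reuses one block fewer than the plain call, so the first chunk starts one block earlier and the target model recomputes that block privately rather than writing into a shared cached block.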
