Update prefix scorer to report cached prefix length in tokens #2053

Open

mayabar wants to merge 8 commits into kubernetes-sigs:main from mayabar:update-prefix-scorer

+158 −77

Contributor

mayabar commented Jan 5, 2026 •

What this PR does / why we need it:
Currently, the prefix length stored in the prefix cache plugin is measured in blocks.

As part of enabling easy configuration for disaggregated PD support in the inference scheduler, all configuration field units will use tokens. This involves converting from characters to tokens using the average token length constant.
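The conversion mentioned above can be sketched roughly as follows. The constant name `averageCharactersPerToken`, its value of 4, and the helper `charsToTokens` are assumptions made for illustration; they are not taken from this PR.

```go
package main

import "fmt"

// averageCharactersPerToken is an assumed value for illustration only;
// the constant actually used by the plugin may have a different name or value.
const averageCharactersPerToken = 4

// charsToTokens estimates a token count from a character count.
// Integer division drops any partial token; non-positive input yields 0.
func charsToTokens(chars int) int {
	if chars <= 0 {
		return 0
	}
	return chars / averageCharactersPerToken
}

func main() {
	// A legacy block size given in characters, expressed in tokens.
	fmt.Println(charsToTokens(64))
}
```

Because the conversion is an estimate based on an average, a length derived this way only approximates the real token count of a given prompt.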

Which issue(s) this PR fixes:
Fixes #2068

Does this PR introduce a user-facing change?:

Prefix Plugin Changes
- New parameter: Added `blockSizeTokens` to prefix plugin configuration, defining cache block length in tokens (replacing character-based sizing).
- Deprecation notice: The legacy `blockSize` parameter is deprecated. Instantiating the prefix plugin will fail if `blockSize` is defined without also specifying `blockSizeTokens`.
- Data unit update: Changed data stored in `PrepareRequestData` in the prefix plugin from blocks to tokens.
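A minimal sketch of the deprecation behavior listed above, assuming a hypothetical `prefixConfig` struct and `validate` helper (neither name comes from the PR) and assuming that a zero value means "not set":

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// prefixConfig is a hypothetical stand-in for the prefix plugin configuration.
type prefixConfig struct {
	// BlockSize is the legacy, deprecated block size in characters.
	BlockSize int `json:"blockSize"`
	// BlockSizeTokens is the cache block length in tokens.
	BlockSizeTokens int `json:"blockSizeTokens"`
}

// validate parses the raw configuration and rejects a config that sets the
// legacy blockSize without also setting blockSizeTokens.
// Assumption: a zero value is treated as "not set".
func validate(raw []byte) (*prefixConfig, error) {
	var cfg prefixConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse prefix plugin config: %w", err)
	}
	if cfg.BlockSize != 0 && cfg.BlockSizeTokens == 0 {
		return nil, errors.New("blockSize is deprecated; set blockSizeTokens (block length in tokens) instead")
	}
	return &cfg, nil
}

func main() {
	// Legacy-only configuration: instantiation is expected to fail.
	_, err := validate([]byte(`{"blockSize": 64}`))
	fmt.Println("legacy only:", err)

	// Both fields set: accepted in this sketch.
	cfg, err := validate([]byte(`{"blockSize": 64, "blockSizeTokens": 16}`))
	fmt.Println("both set:", cfg.BlockSizeTokens, err)
}
```

Failing at instantiation, rather than silently reinterpreting the old value, makes the change of unit visible to whoever deploys the plugin.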

k8s-ci-robot added the do-not-merge/work-in-progress label

k8s-ci-robot requested review from danehans and kfswain

January 5, 2026 08:31

Contributor

k8s-ci-robot commented Jan 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mayabar
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify bot commented Jan 5, 2026 •

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`c1cea68`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/695f6778d985ea0007581598
😎 Deploy Preview	https://deploy-preview-2053--gateway-api-inference-extension.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot added the cncf-cla: yes label

Contributor

k8s-ci-robot commented Jan 5, 2026

Hi @mayabar. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added needs-ok-to-test size/L labels

mayabar mentioned this pull request

[WIP] Extend support for different ways to decide if disaggregated PD is required llm-d/llm-d-inference-scheduler#531

Open

mayabar changed the title ~~WIP: Update prefix scorer to report cached prefix length in tokens~~ Update prefix scorer to report cached prefix length in tokens

k8s-ci-robot removed the do-not-merge/work-in-progress label

ahg-g reviewed

Contributor

ahg-g left a comment

Can you please add more context to the description explaining why this is needed?

/ok-to-test

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go (outdated, resolved)

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go (outdated, resolved)

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels

mayabar requested a review from ahg-g

January 5, 2026 14:03

ahg-g reviewed

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go Outdated

    
     	// The input prompt is broken into sizes of BlockSizeTokens to calculate block hashes. Requests
     	// with length shorter than the block size will be ignored.
    -	BlockSize int `json:"blockSize"`
    +	BlockSizeTokens int `json:"blockSize"`

Contributor

ahg-g Jan 5, 2026 •

We need to add to the description that this PR introduces a user-facing change (the user here being whoever deploys the EPP). This PR removes a config variable and adds a new one with different semantics.

In fact, we should keep the old variable, mark it as deprecated, and fail to instantiate the plugin if it is set, with an error message instructing the user to migrate to the new parameter and its new semantics.

k8s-ci-robot added the needs-rebase label

mayabar force-pushed the update-prefix-scorer branch from 40656e7 to b301782 Compare

January 6, 2026 12:27

k8s-ci-robot removed the needs-rebase label

ahg-g reviewed

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

    
     	state := &SchedulingContextState{
     		PrefixHashes:       hashes,
    -		PrefixCacheServers: p.matchLongestPrefix(ctx, hashes),
    +		PrefixCacheServers: p.matchLongestPrefix(ctx, hashes, blockSize),

Contributor

ahg-g Jan 6, 2026

Do we not need to set the blockSize parameter here, as we do in Score?

ahg-g reviewed

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

    
     	// A map of server to its longest prefix cache match length.
     	PrefixCacheServers map[ServerID]int
    +	// Size of a block in tokens
    +	BlockSize int

Contributor

ahg-g Jan 6, 2026

Should we be consistent and also name this BlockSizeTokens?

ahg-g reviewed

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

    
    -				// Update servers with their longest prefix match.
    -				res[server]++
    +				// Update servers with their longest prefix match, prefix length is in tokens.
    +				res[server] += blockSize

Contributor

ahg-g Jan 6, 2026

Why do we need to report the longest prefix in tokens? Isn't it enough to track it as a number of blocks? The fewer places where blockSize is a factor, the better, right?
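To make the two options in this comment concrete, here is a simplified sketch of a longest-prefix match that can count in either unit. The types, the in-memory `index` map, and the `unit` parameter are assumptions for illustration and do not reproduce the plugin's actual indexer.

```go
package main

import "fmt"

type blockHash uint64
type serverID string

// matchLongestPrefix walks the block hashes in prompt order and records, per
// server, the length of the leading run of blocks that server has cached.
// unit is the amount added per matched block: 1 to count blocks, or the block
// size in tokens to count tokens. unit must be positive.
func matchLongestPrefix(index map[blockHash][]serverID, hashes []blockHash, unit int) map[serverID]int {
	res := make(map[serverID]int)
	for i, h := range hashes {
		matched := false
		for _, s := range index[h] {
			// Only extend servers that matched every earlier block.
			if res[s] == i*unit {
				res[s] += unit
				matched = true
			}
		}
		if !matched {
			break // no server extends the prefix any further
		}
	}
	return res
}

func main() {
	index := map[blockHash][]serverID{
		1: {"a", "b"},
		2: {"a"},
		3: {"b"},
	}
	hashes := []blockHash{1, 2, 3}

	fmt.Println(matchLongestPrefix(index, hashes, 1))  // lengths in blocks
	fmt.Println(matchLongestPrefix(index, hashes, 16)) // lengths in tokens, assuming 16 tokens per block
}
```

In this sketch the token-based result is just the block-based result scaled by the block size, so the choice of unit matters mainly for whoever consumes the reported value.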

ahg-g reviewed

pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go

    
    -	total := len(state.PrefixHashes)
    +	// total prefix length in tokens
    +	total := len(state.PrefixHashes) * blockSize

Contributor

ahg-g Jan 6, 2026

If matchLongestPrefix reports the number of matched blocks, then we don't need to multiply by blockSize here, right? Maybe I am missing something, but if we do that, wouldn't we restrict the relevance and use of blockSizeTokens to the function that computes the hashes?
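The reasoning in this comment can be checked with a small sketch. It assumes the score is the ratio of matched prefix length to total prompt length, which the diff excerpt above does not show; `prefixScore` and the block size of 16 are hypothetical.

```go
package main

import "fmt"

// prefixScore returns the fraction of the prompt that is already cached.
// matched and total must be expressed in the same unit (blocks or tokens).
func prefixScore(matched, total int) float64 {
	if total <= 0 {
		return 0
	}
	return float64(matched) / float64(total)
}

func main() {
	const blockSizeTokens = 16 // assumed value for illustration
	matchedBlocks, totalBlocks := 2, 3

	inBlocks := prefixScore(matchedBlocks, totalBlocks)
	inTokens := prefixScore(matchedBlocks*blockSizeTokens, totalBlocks*blockSizeTokens)

	// The block size cancels out of the ratio, so both scores agree.
	fmt.Println(inBlocks, inTokens, inBlocks == inTokens)
}
```

Under that assumption the score itself does not depend on the unit; the token unit only matters where the absolute cached prefix length is reported to other components.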

mayabar requested a review from ahg-g

January 6, 2026 13:26

k8s-ci-robot added the needs-rebase label

mayabar added 4 commits

January 8, 2026 10:13


matchLongestPrefix returns cached prefix length in characters instead of tokens

171124e

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


- Change data stored for prefix cache plugin in the prepareData step contains length values in tokens.

760b891

- Add block size to SchedulingContextState of the prefix cache plugin.
- Tests partial updates

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


          fix merge problem

94b5f94

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


          fixes

8ae9a82

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

mayabar added 4 commits

January 8, 2026 10:13


          typo

535b33d

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


          rename BlockSize to BlockSizeTokens in prefix plugin

462591e

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


In the prefix plugin, keep both block size parameters: the legacy one defined in chars and the new one defined in tokens, update tests accordingly

b104b4a

Signed-off-by: Maya Barnea <mayab@il.ibm.com>


          fix documentation and test

c1cea68

Signed-off-by: Maya Barnea <mayab@il.ibm.com>

mayabar force-pushed the update-prefix-scorer branch from b301782 to c1cea68 Compare

January 8, 2026 08:14

k8s-ci-robot added needs-rebase and removed needs-rebase labels

Contributor

k8s-ci-robot commented Jan 9, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

cncf-cla: yes needs-rebase ok-to-test size/L