Update prefix scorer to report cached prefix length in tokens #2053
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mayabar.
Hi @mayabar. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test.
ahg-g
left a comment
Can you please add more context to the description explaining why this is needed?
/ok-to-test
  // The input prompt is broken into sizes of BlockSizeTokens to calculate block hashes. Requests
  // with length shorter than the block size will be ignored.
- BlockSize int `json:"blockSize"`
+ BlockSizeTokens int `json:"blockSize"`
We need to add to the description that this PR introduces a user-facing change (the user here being whoever deploys the EPP): it removes a config variable and adds a new one with different semantics.
In fact, we should keep the old variable, mark it as deprecated, and fail to instantiate the plugin if it is set, with an error message instructing the user to migrate to the new parameter and its new semantics.
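A minimal sketch of what that suggestion could look like, not the plugin's actual code. The `Config` type, the `parseConfig` helper, and the `blockSizeTokens` JSON key are hypothetical names chosen for illustration; only the idea (keep the old field, reject it when set, point the user at the new one) comes from the comment above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Config is a hypothetical plugin configuration, used only for illustration.
type Config struct {
	// Deprecated: BlockSize was measured in characters. It is kept only so
	// that a configuration that still sets it can be detected and rejected.
	BlockSize int `json:"blockSize,omitempty"`
	// BlockSizeTokens is the block size measured in tokens (assumed key name).
	BlockSizeTokens int `json:"blockSizeTokens"`
}

var errDeprecatedBlockSize = errors.New(
	"blockSize is deprecated; set blockSizeTokens instead (note: it is measured in tokens, not characters)")

// parseConfig decodes the raw JSON and fails if the deprecated field is set.
func parseConfig(raw []byte) (*Config, error) {
	cfg := &Config{}
	if err := json.Unmarshal(raw, cfg); err != nil {
		return nil, fmt.Errorf("failed to parse plugin config: %w", err)
	}
	if cfg.BlockSize != 0 {
		return nil, errDeprecatedBlockSize
	}
	if cfg.BlockSizeTokens <= 0 {
		return nil, fmt.Errorf("blockSizeTokens must be positive, got %d", cfg.BlockSizeTokens)
	}
	return cfg, nil
}

func main() {
	// An old-style configuration is rejected with a migration hint.
	_, err := parseConfig([]byte(`{"blockSize": 64}`))
	fmt.Println("old config:", err)

	// A new-style configuration is accepted.
	cfg, err := parseConfig([]byte(`{"blockSizeTokens": 16}`))
	fmt.Println("new config:", cfg.BlockSizeTokens, err)
}
```

Failing at instantiation, rather than silently converting the old value, makes the change in units visible to whoever deploys the EPP.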
40656e7 to b301782
  state := &SchedulingContextState{
      PrefixHashes: hashes,
-     PrefixCacheServers: p.matchLongestPrefix(ctx, hashes),
+     PrefixCacheServers: p.matchLongestPrefix(ctx, hashes, blockSize),
Do we not need to set the blockSize parameter here like we do in Score?
  // A map of server to its longest prefix cache match length.
  PrefixCacheServers map[ServerID]int
+ // Size of a block in tokens
+ BlockSize int
Should we be consistent and also name this BlockSizeTokens?
- // Update servers with their longest prefix match.
- res[server]++
+ // Update servers with their longest prefix match, prefix length is in tokens.
+ res[server] += blockSize
Why do we need to report the longest prefix in tokens? Isn't it enough to track it in terms of the number of blocks? The fewer places where blockSize is a factor, the better, right?
- total := len(state.PrefixHashes)
+ // total prefix length in tokens
+ total := len(state.PrefixHashes) * blockSize
If matchLongestPrefix reports the number of matched blocks, then we don't need to multiply by blockSize here, right? Maybe I am missing something, but if we do that, wouldn't we restrict the relevance and use of blockSizeTokens to the function that computes the hashes?
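To make the question concrete, here is a small sketch under simplifying assumptions; it is not the plugin's code. `ServerID` and the `matchLongestPrefix` name mirror identifiers in the diff, but the signature, the `cache` map, and the `score` helper are hypothetical. It illustrates that when the total is computed as the number of blocks times the block size, the ratio of matched to total is the same whether lengths are counted in blocks or in tokens.

```go
package main

import "fmt"

// ServerID is a hypothetical stand-in for the plugin's server identifier type.
type ServerID string

// matchLongestPrefix returns, per server, how many leading blocks of the
// request are cached. cache maps a block hash to the servers holding it
// (assumed to contain no duplicate servers per hash).
func matchLongestPrefix(hashes []uint64, cache map[uint64][]ServerID) map[ServerID]int {
	res := make(map[ServerID]int)
	for i, h := range hashes {
		matched := false
		for _, s := range cache[h] {
			// Extend a server's run only if it matched every earlier block.
			if res[s] == i {
				res[s]++
				matched = true
			}
		}
		if !matched {
			break // no server continues the prefix, so stop early
		}
	}
	return res
}

// score is the fraction of the prompt that is already cached.
func score(matched, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(matched) / float64(total)
}

func main() {
	const blockSize = 16 // tokens per block, illustrative value
	hashes := []uint64{1, 2, 3, 4}
	cache := map[uint64][]ServerID{
		1: {"a", "b"},
		2: {"a", "b"},
		3: {"a"},
	}
	res := matchLongestPrefix(hashes, cache)
	for _, s := range []ServerID{"a", "b"} {
		inBlocks := score(res[s], len(hashes))
		inTokens := score(res[s]*blockSize, len(hashes)*blockSize)
		fmt.Println(s, res[s], inBlocks, inTokens, inBlocks == inTokens)
	}
}
```

Under these assumptions the block size only matters where the hashes are computed; reporting tokens would matter only to a consumer that needs an absolute length rather than a ratio.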
…d of tokens Signed-off-by: Maya Barnea <mayab@il.ibm.com>
…contains length values in tokens. - Add block size to SchedulingContextState of the prefix cache plugin. - Tests partial updates Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
… defined in chars and the new one defined in tokens, update tests accordingly Signed-off-by: Maya Barnea <mayab@il.ibm.com>
Signed-off-by: Maya Barnea <mayab@il.ibm.com>
b301782 to c1cea68
PR needs rebase.
What this PR does / why we need it:
Currently, the prefix length stored in the prefix cache plugin is measured in blocks.
As part of enabling easy configuration for disaggregated prefill/decode (PD) support in the inference scheduler, all configuration field units will use tokens. Fields currently defined in characters are converted to tokens using the average token length constant.
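A rough sketch of the character-to-token conversion described above. The constant name `averageCharactersPerToken`, its value of 4, the `charsToTokens` helper, and the rounding behavior are all assumptions made for illustration; the description only says that an average token length constant is used.

```go
package main

import "fmt"

// averageCharactersPerToken is an assumed value for illustration; the real
// constant used by the plugin may differ.
const averageCharactersPerToken = 4

// charsToTokens converts a length in characters to an approximate length in
// tokens. Rounding down, with a floor of one token for any positive input,
// is a choice made for this sketch only.
func charsToTokens(chars int) int {
	if chars <= 0 {
		return 0
	}
	tokens := chars / averageCharactersPerToken
	if tokens < 1 {
		return 1
	}
	return tokens
}

func main() {
	// A block size of 64 characters corresponds to roughly 16 tokens
	// under the assumed average of 4 characters per token.
	fmt.Println(charsToTokens(64))
}
```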
Which issue(s) this PR fixes:
Fixes #2068
Does this PR introduce a user-facing change?: