Adding chat template to vllm decode. #2978
base: main
Conversation
Force-pushed from 36ae2d0 to b2ab71b
Force-pushed from b2ab71b to 5a52d69
khatwanimohit left a comment:
LGTM
Force-pushed from 5a52d69 to 883dd10
Hi @NicoGrande, I keep getting the error below. However, when I tried commenting out this line, it works, and the generation looks good to me now.
Hi @ChingTsai! To include
Hi @NicoGrande, I just tested your PR, and it seems to only support unscanned models, meaning that only models with scan_layers=false can run successfully. When using a scanned model, I ran into an error where the model generates unreasonable tokens. Since we use scanned models for SFT, could this PR also support scanned models? cc: @ChingTsai
Hi @cychiuak! Thank you for flagging this! I think we should be able to add functionality to support it. The codepath will likely involve unrolling the scanned checkpoint and then using the unscanned version for inference. I will add this to my TODO list and update this PR when the feature is ready.
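For illustration, a minimal sketch of what that unrolling could look like, assuming the scanned checkpoint stacks each layer's weights along a leading layer axis; the function and key names below are hypothetical and not the actual MaxText conversion code:

```python
import jax
import jax.numpy as jnp

def unroll_scanned_params(scanned_params, num_layers):
  # Each leaf of `scanned_params` is assumed to have shape (num_layers, ...).
  # Slice out layer i from every leaf to build a per-layer parameter tree.
  return {
      f"layers_{i}": jax.tree_util.tree_map(lambda x: x[i], scanned_params)
      for i in range(num_layers)
  }

# Toy example: two stacked weights for a hypothetical 4-layer model.
scanned = {
    "mlp": {"kernel": jnp.ones((4, 16, 16))},
    "attn": {"wq": jnp.ones((4, 16, 16))},
}
per_layer = unroll_scanned_params(scanned, num_layers=4)
print(per_layer["layers_0"]["mlp"]["kernel"].shape)  # (16, 16)
```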
Thanks @NicoGrande! Supporting scanned checkpoints would significantly improve the training user experience.
+1, thank you @NicoGrande for taking this on! This feature will unblock us. Right now the only viable workflow is to convert the checkpoint from Orbax format to HF (safetensors) format and then run inference, which is painful for hyperparameter searches. It would be great if you could add this support. Thanks!
Force-pushed from 883dd10 to 96690ab
Description
Adds optional support for chat templates in vllm_decode.py. This helps MaxText on vLLM produce responses that closely match the native vLLM model implementation.
FIXES: b/476253050
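For illustration only, here is a minimal sketch of how a chat template could be applied to a prompt before decoding, assuming a Hugging Face tokenizer is available; the model name and the use_chat_template toggle are placeholders, not the actual vllm_decode.py interface:

```python
from transformers import AutoTokenizer

# Placeholder model; a real run would use the tokenizer matching the
# checkpoint being decoded.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def build_prompt(user_prompt: str, use_chat_template: bool = True) -> str:
  # Hypothetical toggle mirroring the "optional" behavior described above.
  if not use_chat_template:
    return user_prompt
  messages = [{"role": "user", "content": user_prompt}]
  # add_generation_prompt=True appends the assistant header so generation
  # starts directly on the assistant turn.
  return tokenizer.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True)

print(build_prompt("What is the capital of France?"))
```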
Tests
Running the following command resulted in this diff with respect to the native vLLM model:
https://diff.googleplex.com/#key=j8DNmAFF934Q
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.