feat(vllm): add grammar and structured output support #8806
eureka928 wants to merge 8 commits into mudler:master
Conversation
Add two new fields to PredictOptions in the proto:
- JSONSchema (field 52): raw JSON schema string for backends that support native structured output (e.g. vLLM guided decoding)
- ResponseFormat (field 53): response format type string

These fields allow backends like vLLM to receive structured output constraints natively instead of only through GBNF grammar conversion.

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Add JSONSchema field to ModelConfig to carry the raw JSON schema string alongside the GBNF Grammar. Pass both JSONSchema and ResponseFormat through gRPCPredictOpts to backends via the new proto fields. This allows backends like vLLM to receive the original JSON schema for native structured output support. Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
In chat and completion endpoints, when response_format is json_schema, extract the raw JSON schema and store it on config.JSONSchema alongside the GBNF grammar. Also set config.ResponseFormat to the format type. This allows backends that support native structured output (like vLLM) to use the JSON schema directly instead of the GBNF grammar. Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
Update the vLLM backend to support structured output:
- Import GuidedDecodingParams from vllm.sampling_params
- Handle JSONSchema: parse and pass as GuidedDecodingParams(json_schema=...)
- Handle json_object response format: GuidedDecodingParams(json_object=True)
- Fall back to Grammar (GBNF) via GuidedDecodingParams(grammar=...)
- Remove phantom GuidedDecoding mapping (field doesn't exist in proto)
- Fix missing 'import time' and 'import json' for load_video and schema parsing

Priority: JSONSchema > json_object > Grammar (GBNF fallback)

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
- Make GuidedDecodingParams import conditional (try/except) for backwards compatibility with older vLLM versions
- Remove GBNF grammar fallback — vLLM expects EBNF, not GBNF, so passing LocalAI's GBNF grammar would produce confusing errors
- Pass JSONSchema as string directly instead of parsing to dict (safer across vLLM versions)
- Add GBNF grammar generation for json_schema in completion endpoint so non-vLLM backends (llama.cpp) also get grammar enforcement

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
- Handle both StructuredOutputsParams (vLLM latest) and GuidedDecodingParams (vLLM <=0.8.x) with graceful fallback
- Use the correct SamplingParams field name for each version (structured_outputs vs guided_decoding)
- Use 'json' parameter (not 'json_schema') matching both APIs
- Re-add grammar (GBNF/BNF) passthrough — both vLLM APIs accept a 'grammar' parameter handled by xgrammar, which supports GBNF
- Priority: JSONSchema > json_object > Grammar

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Update the compatibility notice to include vLLM alongside llama.cpp. Add a vLLM-specific section with examples for all three supported methods: json_schema, json_object, and grammar (via xgrammar). Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
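To make the three documented methods concrete, here are illustrative request bodies (as Python dicts) for each. Field names follow the OpenAI-compatible API; the model name, message contents, and the toy GBNF grammar are placeholders:

```python
# json_schema: the server enforces the supplied schema
json_schema_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Give me a user record"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
                "required": ["name", "age"],
            },
        },
    },
}

# json_object: any well-formed JSON object is accepted
json_object_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Reply in JSON"}],
    "response_format": {"type": "json_object"},
}

# grammar: a GBNF grammar, handled by xgrammar on the vLLM side
grammar_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "yes or no?"}],
    "grammar": 'root ::= "yes" | "no"',
}
```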
Force-pushed from dabd63c to bb08454
GM @mudler
Hi @localai-bot, would you review this PR?
```python
    _structured_output_field = "guided_decoding"
except ImportError:
    _structured_output_cls = None
    _structured_output_field = None
```
Do we need a fallback? We usually pin the upstream version.
Good point. I checked and vLLM is actually not pinned to a specific version — requirements-after.txt just lists vllm with no version constraint, and different platform builds (CPU/CUDA/ROCm) may end up with different vLLM versions.
That said, if the project plans to pin vLLM to a specific version, I'm happy to drop the fallback and target whichever API is current. Let me know which you'd prefer.
OK, when you say newer versions, how new? If it's a very recent change then maybe we need this, otherwise we probably don't
The rename happened between vLLM v0.8.x and the latest releases: GuidedDecodingParams was renamed to StructuredOutputsParams, and the corresponding SamplingParams field changed from guided_decoding to structured_outputs.
Since vLLM isn't pinned (requirements-after.txt just says vllm), builds can land on either version depending on when/how the image is built. If we pin to a specific version, I can drop the fallback and target that API directly — let me know which version to target.
Also in the latest push: I've refactored to use the Metadata map instead of new proto fields, as discussed.
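A rough sketch of what the Metadata-map approach looks like on the Python side. The key names (`json_schema`, `response_format`) are assumptions based on the discussion above; the authoritative names are whatever the Go endpoints write into the map:

```python
import json

def extract_structured_output(metadata):
    """Pull structured-output hints from the request's Metadata map
    (a plain string-to-string mapping) instead of dedicated proto fields."""
    schema = metadata.get("json_schema")
    fmt = metadata.get("response_format")
    if schema:
        json.loads(schema)  # fail fast on an invalid schema string
    return schema, fmt
```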
```diff
 dat, _ := json.Marshal(config.ResponseFormatMap)
 _ = json.Unmarshal(dat, &d)
-if d.Type == "json_object" {
+switch d.Type {
```
If we require changes in the OpenAI compat API, what about the OpenAI Realtime API? Also Open Responses?
Good question. The changes here are only in the chat (/v1/chat/completions) and completion (/v1/completions) endpoint handlers — the standard OpenAI endpoints where response_format is specified.
- Realtime API (`/v1/realtime`): WebSocket-based audio streaming — structured output doesn't apply here.
- Open Responses API (`/v1/responses`): this could benefit from structured output support too, but it has its own separate handler in `core/http/endpoints/openresponses/`. I'd suggest that as a follow-up to keep this PR focused.
The core plumbing (config → gRPC → backend) is shared, so extending to Open Responses later would be straightforward once the approach is settled here.
I want to see how this looks with openresponses as well
Done — added structured output support to the Open Responses API in the latest commit. The text_format parameter now generates grammar + passes JSON schema via metadata, same as chat/completion endpoints. Also added docs with examples.
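An illustrative Open Responses request using the `text_format` parameter described above. The exact payload shape is an assumption modeled on the commit notes; the model name and schema are placeholders:

```python
# Hypothetical /v1/responses request body: text_format carries a JSON
# schema, which the server turns into grammar + metadata for the backend.
responses_req = {
    "model": "my-model",
    "input": "Extract the city from: 'I live in Paris.'",
    "text_format": {
        "type": "json_schema",
        "name": "city",
        "schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
```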
Also, can you create e2e tests for this?
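One building block such an e2e test could use: a helper that asserts the model's reply respects the requested schema. This is a sketch only; a real e2e test would first POST to `/v1/chat/completions` with a `json_schema` response_format and read the reply from the response body:

```python
import json

def check_structured_reply(reply_text, required_keys):
    """Assert that a model reply is valid JSON containing the schema's
    required keys; returns the parsed object for further checks."""
    data = json.loads(reply_text)  # raises ValueError if output is not JSON
    missing = [k for k in required_keys if k not in data]
    assert not missing, f"reply is missing required keys: {missing}"
    return data
```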
…ctured output

Address review feedback:
- Remove JSONSchema and ResponseFormat proto fields; pass them via the existing Metadata map instead, avoiding proto changes
- vLLM backend reads json_schema and response_format from request.Metadata
- Add structured output support (json_schema, json_object) to Open Responses API via text_format parameter
- Update docs with Open Responses structured output examples

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Would you review again?
Description
This PR fixes #6857
Adds grammar and structured output support to the vLLM backend, enabling users to enforce structured outputs via JSON schema, JSON object, and BNF/GBNF grammar constraints.
Problem
The vLLM backend ignored all structured output parameters:

- The `Grammar` field from the proto was never read
- The `GuidedDecoding` mapping referenced a non-existent proto field
- `response_format` with `json_schema` or `json_object` had no effect on vLLM
- A missing `import time` caused a runtime crash on video input

Solution
Proto (`backend.proto`):
- Add `JSONSchema` (field 52) and `ResponseFormat` (field 53) to `PredictOptions`, allowing backends to receive the raw JSON schema and format type natively

Go endpoints (`chat.go`, `completion.go`):
- Extract the raw JSON schema from `response_format: {type: "json_schema", ...}` and store it on `config.JSONSchema`
- Set `config.ResponseFormat` to the format type (`json_object`/`json_schema`)
- Add `json_schema` grammar support to the completion endpoint (was missing)

Go backend (`options.go`, `model_config.go`):
- Pass `JSONSchema` and `ResponseFormat` through `gRPCPredictOpts` to backends

vLLM backend (`backend.py`):
- Handle both `StructuredOutputsParams` (vLLM latest) and `GuidedDecodingParams` (vLLM <=0.8.x) with graceful import fallback
- Priority: `JSONSchema` > `json_object` > `Grammar`
- Fix the missing `import time` and `import json`
- Remove the phantom `GuidedDecoding` mapping

Docs (`constrained_grammars.md`):
- Add a vLLM section with examples for `json_schema`, `json_object`, and grammar (via xgrammar)

How It Works
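The selection flow can be sketched as follows: at most one structured-output constraint is chosen per request, in the priority order the commits describe (JSONSchema > json_object > Grammar). Here `make_params` stands in for whichever of `StructuredOutputsParams` / `GuidedDecodingParams` was imported; the helper name is illustrative:

```python
def pick_structured_output(make_params, json_schema=None,
                           response_format=None, grammar=None):
    """Select at most one structured-output constraint, in priority order:
    JSONSchema > json_object > Grammar. Returns None if nothing applies."""
    if json_schema:
        return make_params(json=json_schema)      # raw JSON schema string
    if response_format == "json_object":
        return make_params(json_object=True)      # any valid JSON object
    if grammar:
        return make_params(grammar=grammar)       # GBNF/BNF via xgrammar
    return None
```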
Verification

- `GuidedDecodingParams.json` and `StructuredOutputsParams.json` both accept JSON strings
- `grammar_is_likely_lark()` in vLLM correctly identifies GBNF as non-Lark (via `::=` detection)
- Checked existing `config.ResponseFormat` usage (the image endpoint's `b64_json` goes through a different code path)
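The `::=` detection mentioned above can be illustrated with a toy check. This mirrors the idea behind vLLM's `grammar_is_likely_lark()`, not its actual implementation: GBNF/BNF productions use `::=`, while Lark productions use a bare `:`.

```python
def looks_like_gbnf(grammar: str) -> bool:
    # Toy heuristic for illustration only: GBNF/BNF rules are written
    # 'root ::= ...', whereas Lark rules are written 'start: ...'.
    return "::=" in grammar
```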
Notes for Reviewers

- `backend.proto` is committed
- Compatible with both vLLM <=0.8.x (`GuidedDecodingParams`/`guided_decoding`) and latest (`StructuredOutputsParams`/`structured_outputs`)
- Signed commits
@mudler