feat(vllm): add grammar and structured output support #8806
eureka928 wants to merge 8 commits into mudler:master
Conversation
Add two new fields to PredictOptions in the proto:
- JSONSchema (field 52): raw JSON schema string for backends that support native structured output (e.g. vLLM guided decoding)
- ResponseFormat (field 53): response format type string

These fields allow backends like vLLM to receive structured output constraints natively instead of only through GBNF grammar conversion.

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Add JSONSchema field to ModelConfig to carry the raw JSON schema string alongside the GBNF Grammar. Pass both JSONSchema and ResponseFormat through gRPCPredictOpts to backends via the new proto fields. This allows backends like vLLM to receive the original JSON schema for native structured output support. Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
In chat and completion endpoints, when response_format is json_schema, extract the raw JSON schema and store it on config.JSONSchema alongside the GBNF grammar. Also set config.ResponseFormat to the format type. This allows backends that support native structured output (like vLLM) to use the JSON schema directly instead of the GBNF grammar. Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
Update the vLLM backend to support structured output:
- Import GuidedDecodingParams from vllm.sampling_params
- Handle JSONSchema: parse and pass as GuidedDecodingParams(json_schema=...)
- Handle json_object response format: GuidedDecodingParams(json_object=True)
- Fall back to Grammar (GBNF) via GuidedDecodingParams(grammar=...)
- Remove phantom GuidedDecoding mapping (field doesn't exist in proto)
- Fix missing 'import time' and 'import json' for load_video and schema parsing

Priority: JSONSchema > json_object > Grammar (GBNF fallback)

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
- Make GuidedDecodingParams import conditional (try/except) for backwards compatibility with older vLLM versions
- Remove GBNF grammar fallback — vLLM expects EBNF, not GBNF, so passing LocalAI's GBNF grammar would produce confusing errors
- Pass JSONSchema as string directly instead of parsing to dict (safer across vLLM versions)
- Add GBNF grammar generation for json_schema in completion endpoint so non-vLLM backends (llama.cpp) also get grammar enforcement

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
- Handle both StructuredOutputsParams (vLLM latest) and GuidedDecodingParams (vLLM <=0.8.x) with graceful fallback
- Use the correct SamplingParams field name for each version (structured_outputs vs guided_decoding)
- Use 'json' parameter (not 'json_schema') matching both APIs
- Re-add grammar (GBNF/BNF) passthrough — both vLLM APIs accept a 'grammar' parameter handled by xgrammar, which supports GBNF
- Priority: JSONSchema > json_object > Grammar

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Update the compatibility notice to include vLLM alongside llama.cpp. Add a vLLM-specific section with examples for all three supported methods: json_schema, json_object, and grammar (via xgrammar). Ref: mudler#6857 Signed-off-by: eureka928 <meobius123@gmail.com>
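To make the three documented methods concrete, here are illustrative request bodies (as Python dicts) for each. Field names follow the OpenAI-compatible API; the model name, message contents, and the toy GBNF grammar are placeholders:

```python
# json_schema: the server enforces the supplied schema
json_schema_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Give me a user record"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
                "required": ["name", "age"],
            },
        },
    },
}

# json_object: any well-formed JSON object is accepted
json_object_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Reply in JSON"}],
    "response_format": {"type": "json_object"},
}

# grammar: a GBNF grammar, handled by xgrammar on the vLLM side
grammar_req = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "yes or no?"}],
    "grammar": 'root ::= "yes" | "no"',
}
```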
Force-pushed from dabd63c to bb08454
GM @mudler
Hi @localai-bot, would you review this PR?
```python
    _structured_output_field = "guided_decoding"
except ImportError:
    _structured_output_cls = None
    _structured_output_field = None
```
Do we need a fallback? We usually pin the upstream version.
Good point. I checked and vLLM is actually not pinned to a specific version — requirements-after.txt just lists vllm with no version constraint, and different platform builds (CPU/CUDA/ROCm) may end up with different vLLM versions.
That said, if the project plans to pin vLLM to a specific version, I'm happy to drop the fallback and target whichever API is current. Let me know which you'd prefer.
OK, when you say newer versions, how new? If it's a very recent change then maybe we need this, otherwise we probably don't
The rename happened between vLLM v0.8.x and the latest releases: GuidedDecodingParams was renamed to StructuredOutputsParams, and the corresponding SamplingParams field changed from guided_decoding to structured_outputs.
Since vLLM isn't pinned (requirements-after.txt just says vllm), builds can land on either version depending on when/how the image is built. If we pin to a specific version, I can drop the fallback and target that API directly — let me know which version to target.
Also in the latest push: I've refactored to use the Metadata map instead of new proto fields, as discussed.
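A rough sketch of what the Metadata-map approach looks like on the Python side. The key names (`json_schema`, `response_format`) are assumptions based on the discussion above; the authoritative names are whatever the Go endpoints write into the map:

```python
import json

def extract_structured_output(metadata):
    """Pull structured-output hints from the request's Metadata map
    (a plain string-to-string mapping) instead of dedicated proto fields."""
    schema = metadata.get("json_schema")
    fmt = metadata.get("response_format")
    if schema:
        json.loads(schema)  # fail fast on an invalid schema string
    return schema, fmt
```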
```diff
 dat, _ := json.Marshal(config.ResponseFormatMap)
 _ = json.Unmarshal(dat, &d)
-if d.Type == "json_object" {
+switch d.Type {
```
If we require changes in the OpenAI compat API, what about the OpenAI Realtime API? Also Open Responses?
Good question. The changes here are only in the chat (/v1/chat/completions) and completion (/v1/completions) endpoint handlers — the standard OpenAI endpoints where response_format is specified.
- Realtime API (`/v1/realtime`): WebSocket-based audio streaming — structured output doesn't apply here.
- Open Responses API (`/v1/responses`): this could benefit from structured output support too, but it has its own separate handler in `core/http/endpoints/openresponses/`. I'd suggest that as a follow-up to keep this PR focused.
The core plumbing (config → gRPC → backend) is shared, so extending to Open Responses later would be straightforward once the approach is settled here.
I want to see how this looks with openresponses as well
Done — added structured output support to the Open Responses API in the latest commit. The text_format parameter now generates grammar + passes JSON schema via metadata, same as chat/completion endpoints. Also added docs with examples.
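An illustrative Open Responses request using the `text_format` parameter described above. The exact payload shape is an assumption modeled on the commit notes; the model name and schema are placeholders:

```python
# Hypothetical /v1/responses request body: text_format carries a JSON
# schema, which the server turns into grammar + metadata for the backend.
responses_req = {
    "model": "my-model",
    "input": "Extract the city from: 'I live in Paris.'",
    "text_format": {
        "type": "json_schema",
        "name": "city",
        "schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
```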
Also, can you create e2e tests for this?
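One building block such an e2e test could use: a helper that asserts the model's reply respects the requested schema. This is a sketch only; a real e2e test would first POST to `/v1/chat/completions` with a `json_schema` response_format and read the reply from the response body:

```python
import json

def check_structured_reply(reply_text, required_keys):
    """Assert that a model reply is valid JSON containing the schema's
    required keys; returns the parsed object for further checks."""
    data = json.loads(reply_text)  # raises ValueError if output is not JSON
    missing = [k for k in required_keys if k not in data]
    assert not missing, f"reply is missing required keys: {missing}"
    return data
```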
…ctured output

Address review feedback:
- Remove JSONSchema and ResponseFormat proto fields; pass them via the existing Metadata map instead, avoiding proto changes
- vLLM backend reads json_schema and response_format from request.Metadata
- Add structured output support (json_schema, json_object) to Open Responses API via text_format parameter
- Update docs with Open Responses structured output examples

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
Would you review again?
Description
This PR fixes #6857
Adds grammar and structured output support to the vLLM backend, enabling users to enforce structured outputs via JSON schema, JSON object, and BNF/GBNF grammar constraints.
Problem
The vLLM backend ignored all structured output parameters:

- The `Grammar` field from the proto was never read
- The `GuidedDecoding` mapping referenced a non-existent proto field
- `response_format` with `json_schema` or `json_object` had no effect on vLLM
- A missing `import time` caused a runtime crash on video input

Solution
Proto (`backend.proto`):
- Add `JSONSchema` (field 52) and `ResponseFormat` (field 53) to `PredictOptions`, allowing backends to receive the raw JSON schema and format type natively

Go endpoints (`chat.go`, `completion.go`):
- Extract the raw JSON schema from `response_format: {type: "json_schema", ...}` and store it on `config.JSONSchema`
- Set `config.ResponseFormat` to the format type (`json_object`/`json_schema`)
- Add `json_schema` grammar support to the completion endpoint (was missing)

Go backend (`options.go`, `model_config.go`):
- Pass `JSONSchema` and `ResponseFormat` through `gRPCPredictOpts` to backends

vLLM backend (`backend.py`):
- Handle both `StructuredOutputsParams` (vLLM latest) and `GuidedDecodingParams` (vLLM <=0.8.x) with graceful import fallback
- Priority: `JSONSchema` > `json_object` > `Grammar`
- Fix the missing `import time` and `import json`
- Remove the phantom `GuidedDecoding` mapping

Docs (`constrained_grammars.md`):
- Add a vLLM section with examples for `json_schema`, `json_object`, and grammar (via xgrammar)

How It Works
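The selection flow can be sketched as follows: at most one structured-output constraint is chosen per request, in the priority order the commits describe (JSONSchema > json_object > Grammar). Here `make_params` stands in for whichever of `StructuredOutputsParams` / `GuidedDecodingParams` was imported; the helper name is illustrative:

```python
def pick_structured_output(make_params, json_schema=None,
                           response_format=None, grammar=None):
    """Select at most one structured-output constraint, in priority order:
    JSONSchema > json_object > Grammar. Returns None if nothing applies."""
    if json_schema:
        return make_params(json=json_schema)      # raw JSON schema string
    if response_format == "json_object":
        return make_params(json_object=True)      # any valid JSON object
    if grammar:
        return make_params(grammar=grammar)       # GBNF/BNF via xgrammar
    return None
```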
Verification

- `GuidedDecodingParams.json` and `StructuredOutputsParams.json` both accept JSON strings
- `grammar_is_likely_lark()` in vLLM correctly identifies GBNF as non-Lark (via `::=` detection)
- Checked existing `config.ResponseFormat` usage (the image endpoint's `b64_json` goes through a different code path)
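The `::=` detection mentioned above can be illustrated with a toy check. This mirrors the idea behind vLLM's `grammar_is_likely_lark()`, not its actual implementation: GBNF/BNF productions use `::=`, while Lark productions use a bare `:`.

```python
def looks_like_gbnf(grammar: str) -> bool:
    # Toy heuristic for illustration only: GBNF/BNF rules are written
    # 'root ::= ...', whereas Lark rules are written 'start: ...'.
    return "::=" in grammar
```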
Notes for Reviewers

- `backend.proto` is committed
- Compatible with both vLLM <=0.8.x (`GuidedDecodingParams`/`guided_decoding`) and latest (`StructuredOutputsParams`/`structured_outputs`)
- Signed commits
@mudler