fix(backends): raise error when OpenAI backend receives content=None #1062
planetf1 wants to merge 2 commits into
Conversation
Thinking-mode models (e.g. Qwen3 served via vLLM with --reasoning-parser qwen3) return content=None with finish_reason=stop and non-zero completion_tokens when they emit only reasoning tokens. The OpenAI backend silently skipped content accumulation, leaving the caller with an empty string and no indication that anything went wrong.

Detect this case in post_processing and raise RuntimeError with a message suggesting enable_thinking: False via model_options. The check covers both streaming and non-streaming paths (post_processing is shared) and skips when tool_calls are present, since those legitimately permit empty content.

Closes generative-computing#1060

Assisted-by: Claude Code
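A minimal sketch of the guard this describes, assuming the variable names (content, choice_response, usage, mot) seen in the diff snippets quoted later in the review; the real post_processing has more context around it.

```python
# Sketch only: detect a thinking-mode empty response and raise instead of
# returning "". Names here are assumptions taken from the review snippets.
finish_reason = choice_response.get("finish_reason")
completion_tokens = usage.get("completion_tokens", 0) if usage else 0

if (
    content is None
    and finish_reason == "stop"
    and completion_tokens > 0
    and not mot.tool_calls  # tool calls legitimately permit empty content
):
    raise RuntimeError(
        "Model returned no content but consumed completion tokens. This usually "
        "means a thinking-mode model emitted only reasoning. Try setting "
        "enable_thinking: False via model_options."
    )
```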
force-pushed from b37b587 to 601a224
The PR description has been updated. Please fill out the template for your PR to be reviewed.
…y-content raise

Before this commit, the RuntimeError for thinking-mode empty responses was raised mid-post_processing, leaving the OTel span unclosed (the cleanup block runs later). base.py only catches generation-time exceptions from the chunk queue, not exceptions from _post_process itself, so the span leaked on every thinking-mode failure.

Changes:
- Move model/provider metadata assignment before the guard so all fields satisfy the backend telemetry contract even when raising
- Build the RuntimeError first, close the span via set_span_error/end_backend_span, then raise, so no span leaks on the error path (sketched below)
- Scope the enable_thinking hint to vLLM/Qwen3; add a generic "consult your runtime's docs" hint for other providers
- Include the reasoning content length in the error message when mot._thinking is set
- Add a streaming-path regression test (oai_chat_response_choice absent)
- Add a tool_calls bypass test
- Remove redundant @pytest.mark.asyncio decorators (asyncio_mode=auto in pyproject)
- Fix the tool_calls param type annotation to list[dict] | None

Assisted-by: Claude Code
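A rough sketch of the error-path ordering described in the second bullet; set_span_error and end_backend_span are the helper names from the commit message above, while the surrounding variables (span, mot, err_msg) are assumed for illustration.

```python
# Illustrative ordering only, not the exact upstream code.
err = RuntimeError(err_msg)   # build the error first
set_span_error(span, err)     # record the failure on the OTel span
end_backend_span(span, mot)   # close the span so it never leaks
raise err                     # only then propagate to the caller
```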
ajbozarth left a comment
A few notes inline — nothing on the error-path span cleanup before the raise since that's getting rewritten anyway.
    and not mot.tool_calls
):
    thinking_note = (
        f" Reasoning content ({len(mot._thinking)} chars) is in mot._thinking."
Nit: the error message points users at mot._thinking, which is a private attribute today. #909 item 4 is considering whether to promote it to public — happy to leave this as-is and let the message track whatever shakes out of that issue, but worth flagging.
# content=None with stop+tokens means thinking-only mode; surface it rather than returning "".
finish_reason = choice_response.get("finish_reason")
completion_tokens = usage.get("completion_tokens", 0) if usage else 0
When usage is unavailable (e.g. streaming without stream_options={"include_usage": True}, or an OpenAI-compatible provider that omits it), completion_tokens falls back to 0 and the guard silently passes — same empty-string symptom the fix is meant to surface. Not a regression, but worth a comment here or a line in the error so the next person hitting it knows why the guard didn't fire for them.
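One way that suggestion could look, as a hedged sketch; the variable names follow the snippet above, and whether to note this in a comment or in the error message itself is the author's call.

```python
# Sketch only: explain in place why the guard may not fire.
finish_reason = choice_response.get("finish_reason")
# If the provider omits usage (e.g. streaming without
# stream_options={"include_usage": True}), completion_tokens stays 0 and the
# empty-content guard below never triggers, so callers still see "".
completion_tokens = usage.get("completion_tokens", 0) if usage else 0
```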
    finish_reason: str = "stop",
    content: str | None = None,
    completion_tokens: int = 9,
    tool_calls: list[dict] | None = None,
Optional: the helper hand-assembles mot._meta["oai_chat_response"] and oai_chat_response_choice. Driving a real ChatCompletion fixture through processing() instead would catch future drift in those meta keys. Fine as a unit test; mentioning in case you want the broader coverage.
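For reference, a sketch of what that broader fixture could look like, assuming the openai client's pydantic types and the processing() entry point mentioned above; the exact wiring into the backend is an assumption.

```python
# Hedged sketch: build a real ChatCompletion and derive the meta dict from it,
# instead of hand-assembling mot._meta["oai_chat_response"] in the test helper.
from openai.types import CompletionUsage
from openai.types.chat import ChatCompletion, ChatCompletionMessage
from openai.types.chat.chat_completion import Choice

completion = ChatCompletion(
    id="chatcmpl-test",
    object="chat.completion",
    created=0,
    model="qwen3",
    choices=[
        Choice(
            index=0,
            finish_reason="stop",
            message=ChatCompletionMessage(role="assistant", content=None),
        )
    ],
    usage=CompletionUsage(prompt_tokens=5, completion_tokens=9, total_tokens=14),
)
# completion.model_dump() would then feed the backend's processing()/post_processing
# path; the exact call shape depends on the backend and is not shown here.
```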
Type of PR
Misc PR
Description
Detect silent empty-content responses in the OpenAI backend when the API returns content=None with finish_reason=stop and non-zero completion_tokens. This is the signature of a thinking-mode model (e.g. Qwen3 via vLLM --reasoning-parser qwen3) that emitted only reasoning tokens.

Raises a RuntimeError in post_processing() with an actionable message suggesting enable_thinking: False via model_options. The check covers both streaming and non-streaming paths (post_processing is shared) and skips when tool calls are present, since those legitimately permit empty content.

Before
After
Testing
Attribution
Assisted-by: Claude Code