fix(responses): stream reasoning as a live reasoning item (#9658) #10284

Open: localai-bot wants to merge 1 commit into master from fix/9658-responses-streaming-reasoning

@localai-bot

Collaborator

Closes #9658

Problem

On the /v1/responses (Responses API) streaming path, a reasoning model's <think> monologue was streamed to the client as ordinary message text (output_text.delta on a msg_ item) and only reclassified into a reasoning output item after the stream completed. Subsequent deltas also kept referencing the old msg_ id.

Root causes

  1. The reasoning gate used extractor.Reasoning(), which only reflects the Go-side ProcessToken parser and never the autoparser's ProcessChatDeltaReasoning accumulator, so autoparser-driven reasoning was never emitted live and was rebuilt only at end-of-stream.
  2. The non-tool path eagerly emitted the msg_ item before any token arrived, forcing reasoning to a later output index and mis-attributing deltas to msg_.
  3. Neither path carried the sticky preferAutoparser flag, so a content-only autoparser could leak <think> into content (see #9985, "Regression: Reasoning/thinking output provided as regular output").
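
The first root cause can be illustrated with a minimal sketch. The `extractor` type below is a hypothetical stand-in, not the actual LocalAI type: only the method names ProcessToken, ProcessChatDeltaReasoning and Reasoning come from the description above, and their signatures and the two-field layout are assumptions made for illustration.

```go
package main

import "fmt"

// extractor is a hypothetical stand-in for the reasoning extractor described
// above. It keeps two separate accumulators; that split is the assumption
// this sketch is meant to illustrate.
type extractor struct {
	tagReasoning   string // fed by the Go-side raw-tag parser
	deltaReasoning string // fed by autoparser reasoning_content deltas
}

// ProcessToken models the Go-side path (tag parsing omitted, assumed signature).
func (e *extractor) ProcessToken(reasoning string) {
	e.tagReasoning += reasoning
}

// ProcessChatDeltaReasoning models the autoparser path and returns the new delta.
func (e *extractor) ProcessChatDeltaReasoning(delta string) string {
	e.deltaReasoning += delta
	return delta
}

// Reasoning reflects only the Go-side accumulator.
func (e *extractor) Reasoning() string {
	return e.tagReasoning
}

// oldGate checks accumulated Go-side state; newGate checks the current delta.
func oldGate(e *extractor) bool          { return e.Reasoning() != "" }
func newGate(reasoningDelta string) bool { return reasoningDelta != "" }

func main() {
	e := &extractor{}
	delta := e.ProcessChatDeltaReasoning("Let me think")
	fmt.Println("old gate fires:", oldGate(e))
	fmt.Println("new gate fires:", newGate(delta))
}
```

With autoparser-driven input the old gate never sees any reasoning, while a gate on the delta itself fires on the first reasoning chunk.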

Fix

Extracted a pure streamReasoningRouter helper (mirroring chat_stream_workers.go) that gates on reasoningDelta != "", opens the message item lazily, and keeps a sticky autoparser preference. Both streaming callbacks now route reasoning deltas to the reasoning_ id, and the completed-response assembly orders reasoning -> message -> tool_calls.
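
A rough sketch of what such a router could look like follows. It is not the code from this PR: the `router`, `delta` and `decision` types and the `Route` method are hypothetical, and the sticky-preference rule shown here (trust the autoparser's split only after it has produced a reasoning delta at least once, then keep trusting it) is an assumed reading of the description, so the real helper may differ.

```go
package main

import "fmt"

// delta carries both candidate splits for one streamed chunk (hypothetical shape).
type delta struct {
	AutoReasoning string // reasoning_content from the autoparser, if any
	AutoContent   string // content from the autoparser, if any
	GoReasoning   string // reasoning split out by the Go-side tag parser
	GoContent     string // content split out by the Go-side tag parser
}

// decision is what the streaming callback would act on.
type decision struct {
	ReasoningDelta string
	ContentDelta   string
	OpenReasoning  bool // first reasoning delta: announce the reasoning item
	OpenMessage    bool // first content delta: announce the message item
}

// router holds the per-stream state; it performs no I/O, so it is easy to test.
type router struct {
	preferAutoparser bool // sticky: set once, never reset
	reasoningOpen    bool
	messageOpen      bool
}

// Route classifies one chunk and reports which items must be opened.
func (r *router) Route(d delta) decision {
	if d.AutoReasoning != "" {
		r.preferAutoparser = true // assumed trigger for the sticky flag
	}
	var out decision
	if r.preferAutoparser {
		out.ReasoningDelta, out.ContentDelta = d.AutoReasoning, d.AutoContent
	} else {
		out.ReasoningDelta, out.ContentDelta = d.GoReasoning, d.GoContent
	}
	// Gate on the delta itself, and open each item lazily on first use.
	if out.ReasoningDelta != "" && !r.reasoningOpen {
		r.reasoningOpen, out.OpenReasoning = true, true
	}
	if out.ContentDelta != "" && !r.messageOpen {
		r.messageOpen, out.OpenMessage = true, true
	}
	return out
}

func main() {
	r := &router{}
	steps := []delta{
		{AutoContent: "<think>hm", GoReasoning: "hm"}, // content-only autoparser
		{AutoReasoning: "more", GoReasoning: "more"},  // autoparser starts classifying
		{AutoContent: "Hi", GoContent: "Hi"},          // first real content
	}
	for _, s := range steps {
		fmt.Printf("%+v\n", r.Route(s))
	}
}
```

Keeping the router free of I/O means a spec can feed it a sequence of deltas and assert on the returned decisions without standing up a streaming handler.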

Behavior note

A pure-reasoning turn with no content no longer emits an empty message item.
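
The ordering and the behavior note can be sketched together. The `outputItem` type, the `assembleOutput` function, the type strings and the id suffixes below are illustrative assumptions; only the reasoning_ and msg_ prefixes and the reasoning -> message -> tool_calls order come from the description above.

```go
package main

import "fmt"

// outputItem is a hypothetical, trimmed-down output item.
type outputItem struct {
	Type string // illustrative values: "reasoning", "message", "tool_call"
	ID   string
	Text string
}

// assembleOutput orders items reasoning -> message -> tool calls and
// skips the reasoning or message item when it has no text.
func assembleOutput(reasoning, content string, toolCalls []outputItem) []outputItem {
	var out []outputItem
	if reasoning != "" {
		out = append(out, outputItem{Type: "reasoning", ID: "reasoning_1", Text: reasoning})
	}
	if content != "" {
		out = append(out, outputItem{Type: "message", ID: "msg_1", Text: content})
	}
	return append(out, toolCalls...)
}

func main() {
	// A pure-reasoning turn yields a single reasoning item and no message item.
	for _, it := range assembleOutput("thinking it through", "", nil) {
		fmt.Println(it.Type, it.ID)
	}
}
```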

Test plan

  • New Ginkgo specs for streamReasoningRouter (red -> green).
  • go test ./core/http/endpoints/openresponses/... ./core/http/endpoints/openai/... green.
  • Scoped golangci-lint --new-from-merge-base=origin/master clean.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

…9658)

In the /v1/responses streaming handler a reasoning model's thinking
monologue was streamed to the client as normal message text (a msg_
output item with output_text.delta) and only reclassified into a
reasoning item after the stream completed. Subsequent output_text.delta
events also kept referencing the old msg_ item id instead of the
reasoning_ id.

Root causes:

1. The live reasoning item was gated on extractor.Reasoning(), which is
   only updated by the Go-side raw-tag parser (ProcessToken). When the
   C++ autoparser drives reasoning through reasoning_content ChatDeltas,
   the reasoning delta is computed via ProcessChatDeltaReasoning into a
   separate accumulator, so extractor.Reasoning() stays empty and the
   gate never fired. The reasoning item was thus only reconstructed at
   end-of-stream.

2. The non-tool-call path created the message/msg_ output item eagerly
   before any token, forcing reasoning to a higher output index and
   making mis-split <think> text land on the pre-existing message item.

3. Neither path carried the sticky preferAutoparser flag, so a
   content-only autoparser (the non-jinja pure-content fallback, #9985)
   could leak <think>...</think> tokens into content.
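
The index problem in root cause 2 can be shown with a small sketch. The `indexAllocator` type is hypothetical and only assumes that output indices are handed out in the order items are first opened.

```go
package main

import "fmt"

// indexAllocator hands out output indices in the order items are first opened.
type indexAllocator struct {
	next     int
	assigned map[string]int
}

func newIndexAllocator() *indexAllocator {
	return &indexAllocator{assigned: map[string]int{}}
}

// Open returns the item's index, allocating the next free one on first use.
func (a *indexAllocator) Open(item string) int {
	if idx, ok := a.assigned[item]; ok {
		return idx
	}
	idx := a.next
	a.assigned[item] = idx
	a.next++
	return idx
}

func main() {
	// Eager: the message item is opened before any token arrives.
	eager := newIndexAllocator()
	eager.Open("message")
	fmt.Println("eager reasoning index:", eager.Open("reasoning"))

	// Lazy: each item is opened on its first delta, so reasoning comes first.
	lazy := newIndexAllocator()
	fmt.Println("lazy reasoning index:", lazy.Open("reasoning"))
	fmt.Println("lazy message index:", lazy.Open("message"))
}
```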

Extract the per-token reasoning-vs-message classification into a pure,
unit-tested streamReasoningRouter (mirroring chooseDeferredReasoning and
processStream in the chat streaming worker): it gates the reasoning item
on the reasoning delta, opens the message item lazily on the first
content delta, and keeps a sticky preferAutoparser fallback. Both
streaming paths now route reasoning deltas to the reasoning_ id and order
the reasoning item ahead of the message at completion.

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Development

Successfully merging this pull request may close these issues.

reasoning extraction only happens at end of stream
