Skip to content

fix(reasoning): stop prefilled <think> from swallowing tag-less answers#10225

Merged
mudler merged 2 commits into
masterfrom
fix/autoparser-prefill-swallow
Jun 9, 2026
Merged

fix(reasoning): stop prefilled <think> from swallowing tag-less answers#10225
mudler merged 2 commits into
masterfrom
fix/autoparser-prefill-swallow

Conversation

@localai-bot

Copy link
Copy Markdown
Collaborator

Problem

When a chat template injects the thinking start token into the prompt (so DetectThinkingStartToken returns e.g. <think>), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback — the peg-native "pure content" path (#9985) — prepends the start token so the extractor can pair it with the model's </think>.

But on a complete response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into reasoning, leaving the OpenAI content field empty.

This breaks short/direct answers — session names, JSON summaries, any terse completion where the model skips the think block — which come back with empty content. The defensive prefill extraction added by #9991 to the complete-response paths is what surfaces it.

Reproduction

A <think>-prefilled reasoning model (e.g. an imported qwen3-family GGUF) running peg-native, asked something it answers directly:

{"messages":[{"role":"user","content":"Say the single word: hello"}]}
→ {"message":{"content":"","reasoning":"hello"}}   # content swallowed

Fix

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already present in the content still extract; a tag-less direct answer stays in content.

Applied at every complete-response site:

  • applyAutoparserOverride (core/http/endpoints/openai/chat.go)
  • realtime fallback ×2 (core/http/endpoints/openai/realtime.go)
  • openresponses fallback ×3 (core/http/endpoints/openresponses/responses.go)

The streaming per-token extractor (chat.go tokenCallback) is intentionally left on ExtractReasoningWithConfig — mid-stream an as-yet-unclosed block is legitimate and must surface as reasoning deltas as they arrive.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth.

Tests

  • pkg/reasoning: ExtractReasoningComplete / ClosingTokenForStart units — prefill + tag-less ⇒ stays content; prefill + close tag ⇒ still splits; fully-tagged ⇒ splits.
  • core/http/endpoints/openai: applyAutoparserOverride regression specs for the prefilled-thinking-token path (tag-less answer, tag-less JSON, prefill-with-close, fully-tagged).
  • All affected packages pass; gofmt and go vet clean. No behavior change for jinja-enabled installs (autoparser-populated reasoning_content is still trusted untouched).

🤖 Generated with Claude Code

mudler and others added 2 commits June 8, 2026 23:20
When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mudler mudler merged commit e1ec03d into master Jun 9, 2026
59 checks passed
@mudler mudler deleted the fix/autoparser-prefill-swallow branch June 9, 2026 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants