fix(reasoning): stop prefilled <think> from swallowing tag-less answers #10225
Merged
When a chat template injects the thinking start token into the prompt (so DetectThinkingStartToken returns e.g. "<think>"), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into reasoning, leaving the OpenAI `content` field empty.

This breaks short/direct answers (session names, JSON summaries, any terse completion where the model skips the think block), which come back with empty content. Regression surfaced by #9991, which added the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already in the content still extract; tag-less content stays content. Apply it at every complete-response site (applyAutoparserOverride, realtime, openresponses).

The streaming per-token extractor is intentionally left on ExtractReasoningWithConfig: mid-stream, an as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
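The gate described in this commit message can be illustrated with a short Go sketch. It is not the code from pkg/reasoning: the lowercase names (defaultTagPairs, closingTokenForStart, splitReasoning, extractComplete) are hypothetical stand-ins, only the <think> pair is taken from the text above, and the real helpers may cover more tag pairs and edge cases.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultTagPairs stands in for the package-scope tag table the commit mentions.
// Assumption: only the <think> pair is confirmed by the text; the real list may be longer.
var defaultTagPairs = []struct{ start, end string }{
	{"<think>", "</think>"},
}

// closingTokenForStart returns the closing tag paired with start, or "" if unknown.
func closingTokenForStart(start string) string {
	for _, p := range defaultTagPairs {
		if p.start == start {
			return p.end
		}
	}
	return ""
}

// splitReasoning pulls the first start...end block out of s.
// An unclosed block swallows the rest of s, which is the behaviour a
// streaming extractor wants but a complete-response extractor must guard against.
func splitReasoning(s, start, end string) (reasoning, content string) {
	i := strings.Index(s, start)
	if i < 0 {
		return "", strings.TrimSpace(s)
	}
	rest := s[i+len(start):]
	j := strings.Index(rest, end)
	if j < 0 {
		return strings.TrimSpace(rest), strings.TrimSpace(s[:i])
	}
	return strings.TrimSpace(rest[:j]), strings.TrimSpace(s[:i] + rest[j+len(end):])
}

// extractComplete honors a prefilled start token only when the response
// contains the matching closing tag, i.e. when a reasoning block provably exists.
func extractComplete(response, prefilledStart string) (reasoning, content string) {
	start, end := defaultTagPairs[0].start, defaultTagPairs[0].end
	if e := closingTokenForStart(prefilledStart); e != "" {
		start, end = prefilledStart, e
		if !strings.Contains(response, start) && strings.Contains(response, end) {
			response = start + response
		}
	}
	return splitReasoning(response, start, end)
}

func main() {
	for _, in := range []string{
		"Weekly sync notes",                      // direct answer, no tags
		"pick a title</think>Weekly sync notes",  // prefilled block, closing tag only
		"<think>pick a title</think>Weekly sync", // fully tagged
	} {
		r, c := extractComplete(in, "<think>")
		fmt.Printf("reasoning=%q content=%q\n", r, c)
	}
}
```

In this sketch the first input keeps its text in content with empty reasoning, while the other two split at the closing tag. Keeping the closing-tag lookup next to the tag table is one way to get the "single source of truth" the commit describes.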
…gression

Adds the end-to-end case that actually broke session summaries / auto-titles and was not covered before: a request with enable_thinking=false against a <think>-capable model.

In non-thinking mode the model emits no reasoning block, so llama.cpp's autoparser returns ChatDeltas with content set and reasoning_content empty (verified against stock llama-server: same model with chat_template_kwargs.enable_thinking=false returns reasoning_content=null, content="hello"). thinkingStartToken is still "<think>" because it is detected per-model from the enable_thinking=true render, so the old code prepended it and swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
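To make the enable_thinking=false scenario concrete, the sketch below models the override step under simplifying assumptions. The parsed struct, splitThink, oldOverride and gatedOverride are illustrative names, not LocalAI's actual types or functions, and the extractor is hard-wired to the <think> pair.

```go
package main

import (
	"fmt"
	"strings"
)

// parsed is a hypothetical stand-in for one complete autoparser result.
type parsed struct {
	Content   string
	Reasoning string
}

const (
	openTag  = "<think>"
	closeTag = "</think>"
)

// splitThink extracts a <think>...</think> block; an unclosed block swallows the rest.
func splitThink(s string) (reasoning, content string) {
	i := strings.Index(s, openTag)
	if i < 0 {
		return "", s
	}
	rest := s[i+len(openTag):]
	j := strings.Index(rest, closeTag)
	if j < 0 {
		return rest, s[:i]
	}
	return rest[:j], s[:i] + rest[j+len(closeTag):]
}

// oldOverride models the pre-fix path: the detected start token is always prepended.
func oldOverride(p parsed, startToken string) parsed {
	if p.Reasoning != "" {
		return p // autoparser-populated reasoning is trusted untouched
	}
	r, c := splitThink(startToken + p.Content)
	return parsed{Content: c, Reasoning: r}
}

// gatedOverride models the fix: prepend only when a closing tag proves a block exists.
func gatedOverride(p parsed, startToken string) parsed {
	if p.Reasoning != "" {
		return p
	}
	in := p.Content
	if !strings.Contains(in, startToken) && strings.Contains(in, closeTag) {
		in = startToken + in
	}
	r, c := splitThink(in)
	return parsed{Content: c, Reasoning: r}
}

func main() {
	// enable_thinking=false: content is set, reasoning is empty,
	// yet the per-model start token is still "<think>".
	in := parsed{Content: "hello"}
	fmt.Printf("old:   %+v\n", oldOverride(in, openTag))
	fmt.Printf("gated: %+v\n", gatedOverride(in, openTag))
}
```

Under these assumptions the old path moves "hello" into Reasoning and leaves Content empty, whereas the gated path returns the input unchanged, matching what the commit reports for stock llama-server.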
Problem

When a chat template injects the thinking start token into the prompt (so `DetectThinkingStartToken` returns e.g. `<think>`), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback, the `peg-native` "pure content" path (#9985), prepends the start token so the extractor can pair it with the model's `</think>`.

But on a complete response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into `reasoning`, leaving the OpenAI `content` field empty.

This breaks short/direct answers (session names, JSON summaries, any terse completion where the model skips the think block), which come back with empty `content`. The defensive prefill extraction added by #9991 to the complete-response paths is what surfaces it.

Reproduction
A `<think>`-prefilled reasoning model (e.g. an imported qwen3-family GGUF) running peg-native, asked something it answers directly:

Fix
Add `reasoning.ExtractReasoningComplete`: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already present in the content still extract; a tag-less direct answer stays in `content`.

Applied at every complete-response site:

- `applyAutoparserOverride` (`core/http/endpoints/openai/chat.go`)
- realtime (`core/http/endpoints/openai/realtime.go`)
- openresponses (`core/http/endpoints/openresponses/responses.go`)

The streaming per-token extractor (`chat.go` `tokenCallback`) is intentionally left on `ExtractReasoningWithConfig`: mid-stream, an as-yet-unclosed block is legitimate and must surface as reasoning deltas as they arrive.

Also adds `reasoning.ClosingTokenForStart` and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth.

Tests
- `pkg/reasoning`: `ExtractReasoningComplete` / `ClosingTokenForStart` unit tests (prefill + tag-less ⇒ stays content; prefill + close tag ⇒ still splits; fully-tagged ⇒ splits).
- `core/http/endpoints/openai`: `applyAutoparserOverride` regression specs for the prefilled-thinking-token path (tag-less answer, tag-less JSON, prefill-with-close, fully-tagged).
- `go vet` clean.

No behavior change for jinja-enabled installs (autoparser-populated `reasoning_content` is still trusted untouched).

🤖 Generated with Claude Code