feat(serve): NDJSON streaming for Ollama /api/chat + /api/generate — true drop-in (PMAT-928)#2222
Merged
…true drop-in (PMAT-928)

PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the real apr serve routers but always returned a SINGLE coalesced (done:true) body, because the adapter forced stream:false. Ollama clients default to stream:true and expect newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false} chunks, then a terminal {...,done:true,...stats}. A default Ollama client therefore saw a non-streaming body from apr serve.

This makes an Ollama request with stream != false (Ollama's wire default; the serde default is now default_stream()==true, not bool::default()==false) respond with a chunked application/x-ndjson body: one done:false object per token, then a terminal done:true object carrying done_reason/prompt_eval_count/eval_count/total_duration/eval_duration. stream:false keeps the existing single coalesced object (no regression).

Streaming reuses the SAME incremental token stream the OpenAI /v1/chat/completions SSE path uses: the APR-CPU router's spawn_cpu_streaming_task → generate_with_cache_streaming → mpsc channel. A new spawn_cpu_token_text_stream sends each token's decoded text through that channel, and ollama_ndjson_stream reshapes each into an NDJSON line. Only the wire framing differs (NDJSON vs SSE data: events); it is not a re-decode of a batch result.

Wired into all real apr serve routers (APR-CPU per-token; GPU-fallback, WGPU, and both SafeTensors routers honor stream:true with NDJSON framing over their batch generate_with_cache result via reshape_openai_to_ollama_ndjson).

Falsifier (crates/apr-cli/tests/ollama_ndjson_streaming.rs) drives the REAL APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (scripted tokens through the identical mpsc + NDJSON pipeline): POST /api/chat stream:true asserts application/x-ndjson AND multiple newline-delimited objects with intermediate done:false + a final done:true (NOT one object); stream:false stays a single coalesced object.

RED on the old coalesced code (single application/json object even with stream:true), GREEN on the fix. Mutation-verified: forcing the stream branch off collapses the body back to one object → falsifier RED.

Contract: contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs; pv validate + pv lint contracts/ pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
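
For reviewers unfamiliar with the framing, the token-channel-to-NDJSON reshaping described in this commit message can be illustrated with a minimal, std-only sketch. It is not the code in this PR: `wants_stream`, `ndjson_lines`, and `json_escape` are hypothetical names, the JSON is built by hand rather than with serde, and the terminal object carries only `done_reason` and `eval_count` (the other stats fields and any timestamp are omitted for brevity).

```rust
use std::sync::mpsc;
use std::thread;

/// Minimal JSON string escaping (std-only; a real adapter would use a JSON library).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical stand-in for the serde default: an absent `stream` field means true.
fn wants_stream(stream: Option<bool>) -> bool {
    stream.unwrap_or(true)
}

/// Drain a channel of decoded token texts into NDJSON lines:
/// one done:false object per token, then a terminal done:true object.
fn ndjson_lines(model: &str, rx: mpsc::Receiver<String>) -> Vec<String> {
    let mut lines = Vec::new();
    let mut eval_count = 0u64;
    for text in rx {
        eval_count += 1;
        lines.push(format!(
            "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}\n",
            json_escape(model),
            json_escape(&text)
        ));
    }
    // The channel closed: emit the terminal object with (a subset of) the stats.
    lines.push(format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"\"}},\"done\":true,\"done_reason\":\"stop\",\"eval_count\":{}}}\n",
        json_escape(model),
        eval_count
    ));
    lines
}

fn main() {
    // Omitted `stream` behaves like stream:true; only an explicit false disables it.
    assert!(wants_stream(None));
    assert!(!wants_stream(Some(false)));

    let (tx, rx) = mpsc::channel();
    // Scripted token source standing in for the generation task.
    let producer = thread::spawn(move || {
        for tok in ["Hel", "lo", "!"] {
            tx.send(tok.to_string()).unwrap();
        }
        // `tx` is dropped here, which closes the channel and ends the stream.
    });
    let lines = ndjson_lines("demo", rx);
    producer.join().unwrap();
    for line in &lines {
        print!("{line}");
    }
}
```

The sketch collects the lines into a `Vec` to stay self-contained; a streaming response would instead write each line to the HTTP body as soon as its token arrives.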
Summary
PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the 5 real apr serve routers but always returned a single coalesced (done:true) body, because the adapter forced stream:false. Ollama clients default to stream:true and expect newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false} chunks, then a terminal {...,done:true,...stats}. So a default Ollama client streaming from apr serve saw a non-streaming body.

This PR makes an Ollama request with stream != false (Ollama's wire default) respond with a chunked application/x-ndjson body: one done:false object per token, then a terminal done:true object with stats. stream:false keeps the existing single coalesced object (no regression).

Streaming is real (reuses the /v1 token stream)
The /v1/chat/completions handler on the APR-CPU router already does true per-token SSE streaming via spawn_cpu_streaming_task → generate_with_cache_streaming → an mpsc channel of token IDs. The Ollama NDJSON path reuses that same incremental stream: a new spawn_cpu_token_text_stream sends each token's decoded text through the channel, and ollama_ndjson_stream reshapes each token into an NDJSON line. Only the wire framing differs (NDJSON lines vs SSE data: events); it is NOT a re-decode of a coalesced batch result.

Wired into all real apr serve routers (the #2216 lesson: apr serve does NOT mount realizar's create_router):

- APR-CPU (build_apr_cpu_router): true per-token NDJSON.
- GPU-fallback (build_gpu_router), WGPU, SafeTensors x2: batch backends (generate_with_cache), so they honor stream:true with correct NDJSON framing (one content chunk + terminal done:true) over their coalesced result via reshape_openai_to_ollama_ndjson. This is honest about granularity and keeps the wire shape correct.

Falsifier
crates/apr-cli/tests/ollama_ndjson_streaming.rs drives the REAL APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (a scripted token sequence flows through the IDENTICAL mpsc + NDJSON reshape pipeline the transformer uses; only the token source is faked, never the wire framing):

- stream:true → asserts application/x-ndjson AND multiple newline-delimited objects with intermediate done:false token chunks + a final done:true (NOT one object); token chunks reassemble to the full generation.
- stream:false → exactly one coalesced object (application/json, done:true).
- omitted stream → streams (Ollama default).

RED on the old coalesced code (single application/json object even with stream:true); GREEN on the fix. Mutation-verified: forcing the stream branch off internally collapses the body back to one coalesced object → the multi-line falsifiers flip RED (verified locally).

Contract
contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs. pv validate + pv lint contracts/ pass (0 errors).

Tests

- cargo test -p apr-cli --lib serve::ollama: 14 pass (incl. 7 new NDJSON unit tests)
- cargo test -p apr-cli --test ollama_ndjson_streaming: 4 pass (new falsifier)
- cargo test -p apr-cli --test ollama_api_serve_compat: 3 pass (PMAT-923, no regression)
- cargo test -p aprender-serve --lib: 15450 pass
- cargo clippy -p apr-cli --all-targets --features inference: clean on all touched files

🤖 Generated with Claude Code
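
The wire-shape distinction the falsifier checks (several newline-delimited objects ending in done:true versus one coalesced object) can be sketched as a small classifier. This is an illustrative, std-only approximation and not the test in crates/apr-cli/tests/ollama_ndjson_streaming.rs: `Framing` and `classify` are hypothetical names, and the substring checks assume compact serialization with no whitespace around the colon, where a real test would parse each line as JSON.

```rust
/// Outcome of inspecting a response body for Ollama-style framing.
#[derive(Debug, PartialEq)]
enum Framing {
    /// Several newline-delimited objects: `chunks` done:false lines, then done:true.
    Streamed { chunks: usize },
    /// Exactly one object, marked done:true.
    Coalesced,
    /// Anything else (empty body, missing terminal object, stray lines).
    Malformed,
}

/// Classify a response body by its newline-delimited structure.
/// Assumption: objects are serialized compactly, so `"done":true` matches verbatim.
fn classify(body: &str) -> Framing {
    let lines: Vec<&str> = body.lines().filter(|l| !l.trim().is_empty()).collect();
    let (last, head) = match lines.split_last() {
        Some(parts) => parts,
        None => return Framing::Malformed,
    };
    if !last.contains("\"done\":true") {
        return Framing::Malformed;
    }
    if head.is_empty() {
        return Framing::Coalesced;
    }
    if head.iter().all(|l| l.contains("\"done\":false")) {
        Framing::Streamed { chunks: head.len() }
    } else {
        Framing::Malformed
    }
}

fn main() {
    let streamed = "{\"message\":{\"content\":\"Hel\"},\"done\":false}\n\
                    {\"message\":{\"content\":\"lo\"},\"done\":false}\n\
                    {\"done\":true,\"done_reason\":\"stop\"}\n";
    let coalesced = "{\"message\":{\"content\":\"Hello\"},\"done\":true}";

    // The stream:true case must not collapse into a single object.
    assert_eq!(classify(streamed), Framing::Streamed { chunks: 2 });
    // The stream:false case stays one coalesced object.
    assert_eq!(classify(coalesced), Framing::Coalesced);
    println!("{:?} / {:?}", classify(streamed), classify(coalesced));
}
```

Under this sketch, the mutation described in the Falsifier section (forcing the stream branch off) would turn the first result into `Coalesced`, which is the RED condition.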