
feat(serve): NDJSON streaming for Ollama /api/chat + /api/generate — true drop-in (PMAT-928)#2222

Merged
noahgift merged 1 commit into main from beat/ollama-ndjson-streaming
Jun 25, 2026

Conversation

@noahgift

Contributor

Summary

PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the 5 real apr serve routers but always returned a single coalesced (done:true) body — the adapter forced stream:false. Ollama clients default to stream:true and expect newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false} chunks then a terminal {...,done:true,...stats}. So a default Ollama client streaming from apr serve saw a non-streaming body.
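The wire shape described above can be sketched in plain Rust. This is a minimal illustration, not the PR's code: `json_escape`, `chat_chunk_line`, and `chat_done_line` are hypothetical names, the JSON is hand-formatted to keep the sketch dependency-free (a real handler would presumably use a JSON serializer), and only a subset of the terminal stats fields is shown.

```rust
/// Escape a string for use inside a JSON string literal.
/// Minimal: handles quotes, backslashes, and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One intermediate /api/chat chunk: a single token with done:false,
/// terminated by exactly one newline (the NDJSON delimiter).
fn chat_chunk_line(model: &str, token: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}\n",
        json_escape(model),
        json_escape(token)
    )
}

/// The terminal object: done:true plus stats (only two fields shown here).
fn chat_done_line(model: &str, eval_count: u64) -> String {
    format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"\"}},\"done\":true,\"done_reason\":\"stop\",\"eval_count\":{}}}\n",
        json_escape(model),
        eval_count
    )
}
```

A streaming body is then the concatenation of one `chat_chunk_line` per token followed by a single `chat_done_line`.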

This PR makes an Ollama request with stream != false (Ollama's wire default) respond with a chunked application/x-ndjson body — one done:false object per token then a terminal done:true object with stats. stream:false keeps the existing single coalesced object (no regression).
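The stream != false rule amounts to treating an absent field as true. A minimal sketch, assuming the request's stream field is modeled as an `Option<bool>`; `wants_stream` and `content_type_for` are hypothetical helper names, not functions from this PR:

```rust
/// Resolve the effective streaming mode from the optional request field.
/// An absent field means "stream", matching Ollama's wire default.
fn wants_stream(stream: Option<bool>) -> bool {
    stream.unwrap_or(true)
}

/// Pick the response content type for the resolved mode.
fn content_type_for(stream: Option<bool>) -> &'static str {
    if wants_stream(stream) {
        "application/x-ndjson"
    } else {
        "application/json"
    }
}
```

Only an explicit `stream:false` selects the single coalesced `application/json` object.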

Streaming is real (reuses the /v1 token stream)

The /v1/chat/completions handler on the APR-CPU router already does true per-token SSE streaming via spawn_cpu_streaming_task → generate_with_cache_streaming → an mpsc channel of token IDs. The Ollama NDJSON path reuses that same incremental stream: a new spawn_cpu_token_text_stream sends each token's decoded text through the channel, and ollama_ndjson_stream reshapes each token into an NDJSON line. Only the wire framing differs (NDJSON lines vs SSE data: events); it is NOT a re-decode of a coalesced batch result.
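The channel-to-NDJSON reshape can be illustrated with the standard library's blocking `mpsc` channel. This is a sketch under stated assumptions: the real handlers are presumably async and their signatures are not shown in this PR, so `spawn_scripted_tokens` and `ndjson_lines` are hypothetical stand-ins, and the escaping covers only backslashes and quotes.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical token producer: sends each decoded token's text through a
/// channel from a background thread, one message per token.
fn spawn_scripted_tokens(tokens: Vec<String>) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for t in tokens {
            // Stop early if the receiver (the client side) went away.
            if tx.send(t).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel and ends the stream.
    });
    rx
}

/// Drain the channel, emitting one done:false line per token as it arrives,
/// then a terminal done:true line once the channel closes.
fn ndjson_lines(rx: Receiver<String>) -> Vec<String> {
    let mut lines = Vec::new();
    let mut count = 0u64;
    for token in rx {
        // Sketch-level escaping: backslashes and quotes only.
        let escaped = token.replace('\\', "\\\\").replace('"', "\\\"");
        lines.push(format!(
            "{{\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}",
            escaped
        ));
        count += 1;
    }
    lines.push(format!("{{\"done\":true,\"eval_count\":{}}}", count));
    lines
}
```

Each line is produced when its token is received rather than after generation completes, which is the property that distinguishes this path from reshaping a coalesced result.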

Wired into all real apr serve routers (the #2216 lesson — apr serve does NOT mount realizar's create_router):

  • APR-CPU (build_apr_cpu_router): true per-token NDJSON.
  • GPU-fallback (build_gpu_router), WGPU, SafeTensors x2: these are batch backends (generate_with_cache), so they honor stream:true by framing their coalesced result as NDJSON (one content chunk plus a terminal done:true object) via reshape_openai_to_ollama_ndjson. The granularity is a single chunk rather than per token, but the wire shape is correct.
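For the batch backends, the framing reduces to two lines. A minimal sketch of that shape, assuming a hypothetical `coalesced_to_ndjson` helper (the actual reshape_openai_to_ollama_ndjson signature is not shown in this PR) and sketch-level escaping:

```rust
/// Frame an already-complete generation as NDJSON: one content chunk with
/// done:false, then a terminal done:true object.
fn coalesced_to_ndjson(model: &str, content: &str, eval_count: u64) -> String {
    // Sketch-level escaping: backslashes, quotes, and newlines only.
    let esc = |s: &str| {
        s.replace('\\', "\\\\")
            .replace('"', "\\\"")
            .replace('\n', "\\n")
    };
    let chunk = format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}",
        esc(model),
        esc(content)
    );
    let done = format!(
        "{{\"model\":\"{}\",\"done\":true,\"eval_count\":{}}}",
        esc(model),
        eval_count
    );
    format!("{}\n{}\n", chunk, done)
}
```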

Falsifier

crates/apr-cli/tests/ollama_ndjson_streaming.rs drives the REAL APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (a scripted token sequence flows through the IDENTICAL mpsc + NDJSON reshape pipeline the transformer uses — only the token source is faked, never the wire framing):

  • stream:true → asserts application/x-ndjson AND multiple newline-delimited objects with intermediate done:false token chunks + a final done:true (NOT one object); token chunks reassemble to the full generation.
  • stream:false → exactly one coalesced object (application/json, done:true).
  • absent stream → streams (Ollama default).
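The shape these assertions check can be sketched as a small body validator. This is an illustration, not the test file's code: `count_token_chunks` is a hypothetical name, and the substring checks assume compact serialization (no whitespace after the colon), where the actual falsifier presumably parses each line as JSON.

```rust
/// Validate an NDJSON streaming body: at least two objects, every line but
/// the last a done:false chunk, and the last a done:true terminal.
/// Returns the number of token chunks on success.
fn count_token_chunks(body: &str) -> Result<usize, String> {
    let lines: Vec<&str> = body.lines().filter(|l| !l.trim().is_empty()).collect();
    if lines.len() < 2 {
        return Err(format!(
            "expected multiple NDJSON objects, got {}",
            lines.len()
        ));
    }
    let (last, chunks) = lines.split_last().expect("length checked above");
    if !last.contains("\"done\":true") {
        return Err("final object is not done:true".to_string());
    }
    for (i, line) in chunks.iter().enumerate() {
        if !line.contains("\"done\":false") {
            return Err(format!("line {} is not a done:false chunk", i));
        }
    }
    Ok(chunks.len())
}
```

A single coalesced object fails the first check, which is the failure the old code would have produced for `stream:true`.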

RED on the old coalesced code (single application/json object even with stream:true); GREEN on the fix. Mutation-verified: forcing the stream branch off internally collapses the body back to one coalesced object → the multi-line falsifiers flip RED (verified locally).

Contract

contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs. pv validate + pv lint contracts/ pass (0 errors).

Tests

  • cargo test -p apr-cli --lib serve::ollama — 14 pass (incl. 7 new NDJSON unit tests)
  • cargo test -p apr-cli --test ollama_ndjson_streaming — 4 pass (new falsifier)
  • cargo test -p apr-cli --test ollama_api_serve_compat — 3 pass (PMAT-923, no regression)
  • cargo test -p aprender-serve --lib — 15450 pass
  • cargo clippy -p apr-cli --all-targets --features inference clean on all touched files

🤖 Generated with Claude Code

…true drop-in (PMAT-928)

PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the real apr serve
routers but always returned a SINGLE coalesced (done:true) body — the adapter
forced stream:false. Ollama clients default to stream:true and expect
newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false}
chunks then a terminal {...,done:true,...stats}. A default Ollama client therefore
saw a non-streaming body from apr serve.

This makes an Ollama request with stream != false (Ollama's wire default — the
serde default is now default_stream()==true, not bool::default()==false) respond
with a chunked application/x-ndjson body: one done:false object per token, then a
terminal done:true object carrying done_reason/prompt_eval_count/eval_count/
total_duration/eval_duration. stream:false keeps the existing single coalesced
object (no regression).

Streaming reuses the SAME incremental token stream the OpenAI /v1/chat/completions
SSE path uses: the APR-CPU router's spawn_cpu_streaming_task →
generate_with_cache_streaming → mpsc channel. A new spawn_cpu_token_text_stream
sends each token's decoded text through that channel and ollama_ndjson_stream
reshapes each into an NDJSON line — only the wire framing differs (NDJSON vs SSE
data: events), not a re-decode of a batch result. Wired into all real apr serve
routers (APR-CPU per-token; GPU-fallback, WGPU, and both SafeTensors routers honor
stream:true with NDJSON framing over their batch generate_with_cache result via
reshape_openai_to_ollama_ndjson).

Falsifier (crates/apr-cli/tests/ollama_ndjson_streaming.rs) drives the REAL
APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (scripted
tokens through the identical mpsc + NDJSON pipeline): POST /api/chat stream:true
asserts application/x-ndjson AND multiple newline-delimited objects with
intermediate done:false + a final done:true (NOT one object); stream:false stays a
single coalesced object. RED on the old coalesced code (single application/json
object even with stream:true), GREEN on the fix. Mutation-verified: forcing the
stream branch off collapses the body back to one object → falsifier RED.

Contract: contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds
OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs; pv validate + pv lint
contracts/ pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge June 24, 2026 17:55
@noahgift noahgift disabled auto-merge June 25, 2026 08:22
@noahgift noahgift enabled auto-merge June 25, 2026 08:33
@noahgift noahgift added this pull request to the merge queue Jun 25, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026
@noahgift noahgift added this pull request to the merge queue Jun 25, 2026
Merged via the queue into main with commit 3d19fe4 Jun 25, 2026
15 of 21 checks passed
@noahgift noahgift deleted the beat/ollama-ndjson-streaming branch June 25, 2026 09:54
