feat(serve): NDJSON streaming for Ollama /api/chat + /api/generate — true drop-in (PMAT-928) by noahgift · Pull Request #2222 · paiml/aprender

noahgift · 2026-06-24T17:55:06Z

Summary

PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the five real apr serve routers, but every response was a single coalesced (done:true) body because the adapter forced stream:false. Ollama clients default to stream:true and expect newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false} chunks, then a terminal {...,done:true,...stats}. A default Ollama client talking to apr serve therefore received one non-streaming body instead of a token stream.

With this PR, any Ollama request whose stream field is not explicitly false (absent or true; streaming is Ollama's wire default) gets a chunked application/x-ndjson body: one done:false object per token, then a terminal done:true object with stats. stream:false keeps the existing single coalesced object (no regression).
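The dispatch rule above can be sketched in a few lines. This is a minimal illustration, not the adapter's code: `wants_stream` and `response_content_type` are hypothetical names, and the request's stream field is modelled as an `Option<bool>` where `None` means the client omitted it.

```rust
/// Hypothetical helper: an absent `stream` field counts as `true`
/// (Ollama's wire default), so only an explicit `false` selects the
/// single coalesced response.
fn wants_stream(stream: Option<bool>) -> bool {
    stream.unwrap_or(true)
}

/// Hypothetical helper: the response Content-Type for the chosen mode.
fn response_content_type(stream: Option<bool>) -> &'static str {
    if wants_stream(stream) {
        "application/x-ndjson"
    } else {
        "application/json"
    }
}
```

Modelling the field as an `Option` keeps "absent" and "false" distinguishable, which is the distinction the old forced stream:false behavior lost.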

Streaming is real (reuses the /v1 token stream)

The /v1/chat/completions handler on the APR-CPU router already does true per-token SSE streaming via spawn_cpu_streaming_task → generate_with_cache_streaming → an mpsc channel of token IDs. The Ollama NDJSON path reuses that same incremental stream: a new spawn_cpu_token_text_stream sends each token's decoded text through the channel, and ollama_ndjson_stream reshapes each token into an NDJSON line. Only the wire framing differs (NDJSON lines vs SSE data: events); the tokens are NOT re-decoded from a coalesced batch result.
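To make the "same stream, different framing" point concrete, here is a small std-only sketch. It assumes a blocking `std::sync::mpsc` channel and hand-built JSON; the real handlers are not shown in this PR text and presumably use an async channel and a JSON library. All function names here are hypothetical, the SSE payload is simplified, and the terminal object carries only a subset of the stats fields.

```rust
use std::sync::mpsc;
use std::thread;

/// Minimal JSON string escaping for the sketch (quotes, backslashes, control chars).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// One intermediate /api/chat chunk: a single token, `done:false`.
fn ndjson_token_line(model: &str, token: &str) -> String {
    format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}\n",
        json_escape(model),
        json_escape(token)
    )
}

/// The same token framed as an SSE event, for comparison (payload simplified).
fn sse_token_event(token: &str) -> String {
    format!("data: {{\"content\":\"{}\"}}\n\n", json_escape(token))
}

/// Producer standing in for the generation task: sends token text as it is produced.
fn spawn_scripted_tokens(tokens: Vec<&'static str>) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for t in tokens {
            if tx.send(t.to_string()).is_err() {
                break; // receiver hung up (client disconnected)
            }
        }
    });
    rx
}

/// Drain the channel into NDJSON lines, then append the terminal `done:true` line.
fn collect_ndjson(model: &str, rx: mpsc::Receiver<String>) -> String {
    let mut body = String::new();
    let mut eval_count = 0usize;
    for token in rx {
        body.push_str(&ndjson_token_line(model, &token));
        eval_count += 1;
    }
    body.push_str(&format!(
        "{{\"model\":\"{}\",\"done\":true,\"done_reason\":\"stop\",\"eval_count\":{}}}\n",
        json_escape(model),
        eval_count
    ));
    body
}
```

A real handler would write each line to the response as it arrives rather than collecting a `String`; the collection here only keeps the sketch easy to inspect.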

Wired into all real apr serve routers (the #2216 lesson: apr serve does NOT mount realizar's create_router):

APR-CPU (build_apr_cpu_router): true per-token NDJSON.
GPU-fallback (build_gpu_router), WGPU, SafeTensors x2: these are batch backends (generate_with_cache), so they honor stream:true by framing their coalesced result as NDJSON via reshape_openai_to_ollama_ndjson (one content chunk + terminal done:true). The wire shape is correct, but granularity is per response, not per token.
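For the batch backends, the framing described above amounts to emitting exactly two lines. The sketch below is a hypothetical stand-in (`batch_to_ndjson` is not the signature of reshape_openai_to_ollama_ndjson, which this PR text does not show), built with std only and a subset of the stats fields.

```rust
/// Minimal JSON string escaping, enough for this sketch.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical stand-in for the batch reshape: the whole generation becomes one
/// `done:false` content chunk, followed by a terminal `done:true` object.
fn batch_to_ndjson(model: &str, full_text: &str, eval_count: usize) -> String {
    let m = json_escape(model);
    let chunk = format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"{}\"}},\"done\":false}}",
        m,
        json_escape(full_text)
    );
    let terminal = format!(
        "{{\"model\":\"{}\",\"message\":{{\"role\":\"assistant\",\"content\":\"\"}},\"done\":true,\"done_reason\":\"stop\",\"eval_count\":{}}}",
        m, eval_count
    );
    format!("{chunk}\n{terminal}\n")
}
```

Newlines inside the generated text must be escaped, otherwise a multi-line completion would split one NDJSON object across several lines and break line-based clients.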

Falsifier

crates/apr-cli/tests/ollama_ndjson_streaming.rs drives the REAL APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (a scripted token sequence flows through the IDENTICAL mpsc + NDJSON reshape pipeline the transformer uses; only the token source is faked, never the wire framing):

stream:true → asserts application/x-ndjson AND multiple newline-delimited objects with intermediate done:false token chunks + a final done:true (NOT one object); token chunks reassemble to the full generation.
stream:false → exactly one coalesced object (application/json, done:true).
absent stream → streams (Ollama default).
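The shape checks listed above can be approximated with a small std-only helper. This is a sketch of the idea, not the test file's code: `check_streamed_body` and `content_field` are hypothetical names, and the substring scanning stands in for the proper JSON parsing a real test would use.

```rust
/// Extract the raw (still-escaped) value of the first `"content":"..."` field.
/// Naive scanner for the sketch; a real test would parse the JSON.
fn content_field(line: &str) -> Option<&str> {
    let key = "\"content\":\"";
    let start = line.find(key)? + key.len();
    let rest = &line[start..];
    let mut escaped = false;
    for (i, c) in rest.char_indices() {
        match c {
            '\\' if !escaped => escaped = true,
            '"' if !escaped => return Some(&rest[..i]),
            _ => escaped = false,
        }
    }
    None
}

/// Check the streamed-body shape and return the reassembled (still-escaped) text.
fn check_streamed_body(content_type: &str, body: &str) -> Result<String, String> {
    if content_type != "application/x-ndjson" {
        return Err(format!("unexpected content type: {content_type}"));
    }
    let lines: Vec<&str> = body.lines().filter(|l| !l.is_empty()).collect();
    if lines.len() < 2 {
        return Err("expected multiple NDJSON objects, got a coalesced body".into());
    }
    let (last, chunks) = lines.split_last().expect("length checked above");
    if !last.contains("\"done\":true") {
        return Err("final object is not done:true".into());
    }
    let mut text = String::new();
    for chunk in chunks {
        if !chunk.contains("\"done\":false") {
            return Err("intermediate object is not done:false".into());
        }
        text.push_str(content_field(chunk).ok_or("chunk has no content field")?);
    }
    Ok(text)
}
```

A single coalesced object fails the line-count check, which is the property that distinguishes the old behavior from the new one.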

RED on the old coalesced code (single application/json object even with stream:true); GREEN on the fix. Mutation-verified: forcing the stream branch off internally collapses the body back to one coalesced object → the multi-line falsifiers flip RED (verified locally).

Contract

contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs. pv validate + pv lint contracts/ pass (0 errors).

Tests

cargo test -p apr-cli --lib serve::ollama — 14 pass (incl. 7 new NDJSON unit tests)
cargo test -p apr-cli --test ollama_ndjson_streaming — 4 pass (new falsifier)
cargo test -p apr-cli --test ollama_api_serve_compat — 3 pass (PMAT-923, no regression)
cargo test -p aprender-serve --lib — 15450 pass
cargo clippy -p apr-cli --all-targets --features inference clean on all touched files

🤖 Generated with Claude Code

…true drop-in (PMAT-928)

PMAT-923 (#2216) wired Ollama /api/chat + /api/generate into the real apr serve routers but always returned a SINGLE coalesced (done:true) body because the adapter forced stream:false. Ollama clients default to stream:true and expect newline-delimited JSON: a sequence of {...,message:{role,content:<token>},done:false} chunks then a terminal {...,done:true,...stats}. A default Ollama client therefore saw a non-streaming body from apr serve.

This makes an Ollama request with stream != false (Ollama's wire default; the serde default is now default_stream()==true, not bool::default()==false) respond with a chunked application/x-ndjson body: one done:false object per token, then a terminal done:true object carrying done_reason/prompt_eval_count/eval_count/total_duration/eval_duration. stream:false keeps the existing single coalesced object (no regression).

Streaming reuses the SAME incremental token stream the OpenAI /v1/chat/completions SSE path uses: the APR-CPU router's spawn_cpu_streaming_task → generate_with_cache_streaming → mpsc channel. A new spawn_cpu_token_text_stream sends each token's decoded text through that channel and ollama_ndjson_stream reshapes each into an NDJSON line. Only the wire framing differs (NDJSON vs SSE data: events); it is not a re-decode of a batch result.

Wired into all real apr serve routers (APR-CPU per-token; GPU-fallback, WGPU, and both SafeTensors routers honor stream:true with NDJSON framing over their batch generate_with_cache result via reshape_openai_to_ollama_ndjson).

Falsifier (crates/apr-cli/tests/ollama_ndjson_streaming.rs) drives the REAL APR-CPU serve router via build_demo_streaming_apr_cpu_router_for_test (scripted tokens through the identical mpsc + NDJSON pipeline): POST /api/chat stream:true asserts application/x-ndjson AND multiple newline-delimited objects with intermediate done:false + a final done:true (NOT one object); stream:false stays a single coalesced object.

RED on the old coalesced code (single application/json object even with stream:true), GREEN on the fix. Mutation-verified: forcing the stream branch off collapses the body back to one object → falsifier RED.

Contract: contracts/apr-serve-openai-compat-v1.yaml v1.14.0 → v1.15.0 adds OBLIG-OLLAMA-NDJSON-STREAMING + 3 single-line falsifier refs; pv validate + pv lint contracts/ pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

noahgift enabled auto-merge June 24, 2026 17:55

noahgift disabled auto-merge June 25, 2026 08:22

noahgift enabled auto-merge June 25, 2026 08:33

noahgift added this pull request to the merge queue Jun 25, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026

noahgift added this pull request to the merge queue Jun 25, 2026

Merged via the queue into main with commit 3d19fe4 Jun 25, 2026
15 of 21 checks passed

noahgift deleted the beat/ollama-ndjson-streaming branch June 25, 2026 09:54

noahgift merged 1 commit into main from beat/ollama-ndjson-streaming
