feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut by andreinknv · Pull Request #604 · withcatai/node-llama-cpp

andreinknv · 2026-05-14T22:39:57Z

Summary

Flips the default itemPrioritizationStrategy on LlamaContext from "maximumParallelism" to "firstInFirstOut". The strategy is documented and remains explicitly selectable; only the default changes.

The firstInFirstOutStrategy is already implemented in this repo at src/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.ts and exercised by tests that pass it explicitly. This PR doesn't add code — it just changes which one fires when callers don't opt in.

Why

The current default, maximumParallelism, divides every decode batch evenly across all in-flight sequences so they all complete around the same time. That's a fair scheduler, but it produces a flat wall-clock latency floor: every concurrent caller waits roughly the slowest prompt's full duration before seeing any tokens.

firstInFirstOut matches llama-server's slot-iteration pattern: finish earlier prompts first, free their slots, accept new work. Earlier callers see their tokens within seconds; later callers don't pay any throughput cost (same total work in the same wall-clock window). For most concurrent use cases — chat sessions sharing a context, agentic loops, embedding/rerank pipelines — this is the behavior callers expect from a context with N sequences.

Bench

Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M, Apple Silicon (MacBook M-series, Metal backend), contextSize: 8192, sequences: 8, flashAttention: true. Identical prompts at 1500-token prompt + 256-token decode, no sampling temperature variance.

c=8, p=1500 (concurrent prompts, prompt length)

metric	`maximumParallelism` (default)	`firstInFirstOut`	delta
ttft_min	31955 ms	5790 ms	5.5× faster
ttft_avg	33214 ms	19204 ms	1.7× faster
total wall (8 prompts)	~33500 ms	~33500 ms	unchanged

c=12 stress

Stable; no regression vs c=8. Latency distribution remains FIFO-shaped.

The total work moved through the GPU is identical between the two strategies — the only thing that changes is when each caller sees its tokens.

Compatibility

BatchingOptions.itemPrioritizationStrategy is unchanged in shape and documented values. Callers who rely on maximumParallelism behavior can keep it by passing batching: { itemPrioritizationStrategy: "maximumParallelism" } explicitly.
No public API changes.

Test plan

Existing tests pass locally (npm test).
Diff is one line.
CI on this PR.

🤖 Generated with Claude Code

…tOut Concurrent prompt throughput on a single LlamaContext with N sequences improves substantially when batched items are prioritized in arrival order rather than spread evenly across slots. The existing maximumParallelism strategy divides each batch evenly across all in-flight prompts so they all finish around the same time. That is fair, but it produces a flat wall-clock latency floor for every concurrent caller equal to roughly the slowest prompt's full duration. The firstInFirstOut strategy (already implemented in this repo at src/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.ts) matches llama-server's slot-iteration pattern: finish earlier prompts first, free their slots, accept new work. Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M on Apple Silicon (MacBook M-series, Metal backend, contextSize=8192, sequences=8, flashAttention=true, identical prompts at 1500 tokens prompt + 256 tokens decode): c=8, p=1500 maximumParallelism (current default) ttft_min 31955 ms firstInFirstOut ttft_min 5790 ms (5.5x faster) maximumParallelism ttft_avg 33214 ms firstInFirstOut ttft_avg 19204 ms c=12 stress firstInFirstOut stable; no regression vs c=8 Total prompt throughput is unchanged (same total work in the same wall-clock window). What changes is the latency distribution: with FIFO, earlier callers see their tokens within seconds; with maximumParallelism every caller waits for the whole batch to finish. For most concurrent use cases the FIFO behavior is what callers expect from a context with multiple sequences. The flag is unchanged and documented, so users who want maximumParallelism can still opt in explicitly via `batching.itemPrioritizationStrategy`. No new tests — this is a one-line default flip. The firstInFirstOutStrategy was already exercised by existing tests that pass it explicitly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604

feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604
andreinknv wants to merge 1 commit into
withcatai:masterfrom
andreinknv:feat/fifo-batching-default

andreinknv commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

andreinknv commented May 14, 2026

Summary

Why

Bench

c=8, p=1500 (concurrent prompts, prompt length)

c=12 stress

Compatibility

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant