Skip to content

feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604

Open
andreinknv wants to merge 1 commit into
withcatai:masterfrom
andreinknv:feat/fifo-batching-default
Open

feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604
andreinknv wants to merge 1 commit into
withcatai:masterfrom
andreinknv:feat/fifo-batching-default

Conversation

@andreinknv
Copy link
Copy Markdown

Summary

Flips the default itemPrioritizationStrategy on LlamaContext from "maximumParallelism" to "firstInFirstOut". The strategy is documented and remains explicitly selectable; only the default changes.

The firstInFirstOutStrategy is already implemented in this repo at src/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.ts and exercised by tests that pass it explicitly. This PR doesn't add code — it just changes which one fires when callers don't opt in.

Why

The current default, maximumParallelism, divides every decode batch evenly across all in-flight sequences so they all complete around the same time. That's a fair scheduler, but it produces a flat wall-clock latency floor: every concurrent caller waits roughly the slowest prompt's full duration before seeing any tokens.

firstInFirstOut matches llama-server's slot-iteration pattern: finish earlier prompts first, free their slots, accept new work. Earlier callers see their tokens within seconds; later callers don't pay any throughput cost (same total work in the same wall-clock window). For most concurrent use cases — chat sessions sharing a context, agentic loops, embedding/rerank pipelines — this is the behavior callers expect from a context with N sequences.

Bench

Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M, Apple Silicon (MacBook M-series, Metal backend), contextSize: 8192, sequences: 8, flashAttention: true. Identical prompts at 1500-token prompt + 256-token decode, no sampling temperature variance.

c=8, p=1500 (concurrent prompts, prompt length)

metric maximumParallelism (default) firstInFirstOut delta
ttft_min 31955 ms 5790 ms 5.5× faster
ttft_avg 33214 ms 19204 ms 1.7× faster
total wall (8 prompts) ~33500 ms ~33500 ms unchanged

c=12 stress

Stable; no regression vs c=8. Latency distribution remains FIFO-shaped.

The total work moved through the GPU is identical between the two strategies — the only thing that changes is when each caller sees its tokens.

Compatibility

  • BatchingOptions.itemPrioritizationStrategy is unchanged in shape and documented values. Callers who rely on maximumParallelism behavior can keep it by passing batching: { itemPrioritizationStrategy: "maximumParallelism" } explicitly.
  • No public API changes.

Test plan

  • Existing tests pass locally (npm test).
  • Diff is one line.
  • CI on this PR.

🤖 Generated with Claude Code

…tOut

Concurrent prompt throughput on a single LlamaContext with N sequences
improves substantially when batched items are prioritized in arrival
order rather than spread evenly across slots.

The existing maximumParallelism strategy divides each batch evenly across
all in-flight prompts so they all finish around the same time. That is
fair, but it produces a flat wall-clock latency floor for every concurrent
caller equal to roughly the slowest prompt's full duration. The
firstInFirstOut strategy (already implemented in this repo at
src/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.ts)
matches llama-server's slot-iteration pattern: finish earlier prompts
first, free their slots, accept new work.

Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M on Apple Silicon
(MacBook M-series, Metal backend, contextSize=8192, sequences=8,
flashAttention=true, identical prompts at 1500 tokens prompt + 256
tokens decode):

  c=8, p=1500
    maximumParallelism (current default)  ttft_min  31955 ms
    firstInFirstOut                       ttft_min   5790 ms  (5.5x faster)
    maximumParallelism                    ttft_avg  33214 ms
    firstInFirstOut                       ttft_avg  19204 ms

  c=12 stress
    firstInFirstOut                       stable; no regression vs c=8

Total prompt throughput is unchanged (same total work in the same
wall-clock window). What changes is the latency distribution: with
FIFO, earlier callers see their tokens within seconds; with
maximumParallelism every caller waits for the whole batch to finish.

For most concurrent use cases the FIFO behavior is what callers
expect from a context with multiple sequences. The flag is
unchanged and documented, so users who want maximumParallelism can
still opt in explicitly via `batching.itemPrioritizationStrategy`.

No new tests — this is a one-line default flip. The
firstInFirstOutStrategy was already exercised by existing tests that
pass it explicitly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant