feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604
Open
andreinknv wants to merge 1 commit into
Open
feat(LlamaContext): default itemPrioritizationStrategy to firstInFirstOut#604andreinknv wants to merge 1 commit into
andreinknv wants to merge 1 commit into
Conversation
…tOut
Concurrent prompt throughput on a single LlamaContext with N sequences
improves substantially when batched items are prioritized in arrival
order rather than spread evenly across slots.
The existing maximumParallelism strategy divides each batch evenly across
all in-flight prompts so they all finish around the same time. That is
fair, but it produces a flat wall-clock latency floor for every concurrent
caller equal to roughly the slowest prompt's full duration. The
firstInFirstOut strategy (already implemented in this repo at
src/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.ts)
matches llama-server's slot-iteration pattern: finish earlier prompts
first, free their slots, accept new work.
Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M on Apple Silicon
(MacBook M-series, Metal backend, contextSize=8192, sequences=8,
flashAttention=true, identical prompts at 1500 tokens prompt + 256
tokens decode):
c=8, p=1500
maximumParallelism (current default) ttft_min 31955 ms
firstInFirstOut ttft_min 5790 ms (5.5x faster)
maximumParallelism ttft_avg 33214 ms
firstInFirstOut ttft_avg 19204 ms
c=12 stress
firstInFirstOut stable; no regression vs c=8
Total prompt throughput is unchanged (same total work in the same
wall-clock window). What changes is the latency distribution: with
FIFO, earlier callers see their tokens within seconds; with
maximumParallelism every caller waits for the whole batch to finish.
For most concurrent use cases the FIFO behavior is what callers
expect from a context with multiple sequences. The flag is
unchanged and documented, so users who want maximumParallelism can
still opt in explicitly via `batching.itemPrioritizationStrategy`.
No new tests — this is a one-line default flip. The
firstInFirstOutStrategy was already exercised by existing tests that
pass it explicitly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Flips the default
itemPrioritizationStrategyonLlamaContextfrom"maximumParallelism"to"firstInFirstOut". The strategy is documented and remains explicitly selectable; only the default changes.The
firstInFirstOutStrategyis already implemented in this repo atsrc/evaluator/LlamaContext/_strategies/firstInFirstOutStrategy.tsand exercised by tests that pass it explicitly. This PR doesn't add code — it just changes which one fires when callers don't opt in.Why
The current default,
maximumParallelism, divides every decode batch evenly across all in-flight sequences so they all complete around the same time. That's a fair scheduler, but it produces a flat wall-clock latency floor: every concurrent caller waits roughly the slowest prompt's full duration before seeing any tokens.firstInFirstOutmatches llama-server's slot-iteration pattern: finish earlier prompts first, free their slots, accept new work. Earlier callers see their tokens within seconds; later callers don't pay any throughput cost (same total work in the same wall-clock window). For most concurrent use cases — chat sessions sharing a context, agentic loops, embedding/rerank pipelines — this is the behavior callers expect from a context with N sequences.Bench
Measured on Qwen2.5-Coder-3B-Instruct Q4_K_M, Apple Silicon (MacBook M-series, Metal backend),
contextSize: 8192,sequences: 8,flashAttention: true. Identical prompts at 1500-token prompt + 256-token decode, no sampling temperature variance.c=8, p=1500 (concurrent prompts, prompt length)
maximumParallelism(default)firstInFirstOutc=12 stress
Stable; no regression vs c=8. Latency distribution remains FIFO-shaped.
The total work moved through the GPU is identical between the two strategies — the only thing that changes is when each caller sees its tokens.
Compatibility
BatchingOptions.itemPrioritizationStrategyis unchanged in shape and documented values. Callers who rely onmaximumParallelismbehavior can keep it by passingbatching: { itemPrioritizationStrategy: "maximumParallelism" }explicitly.Test plan
npm test).🤖 Generated with Claude Code