10 changes: 10 additions & 0 deletions examples/models/qwen3_5_moe/README.md
@@ -243,6 +243,16 @@ Each `done` event reports
(`new`/`exact_prefix`/`dirty`/`mismatch`/`equal`) for measuring the hit rate.
`--no-warm-resume` forces a full prefill every request (for A/B comparison).

**Tool-call turns (token-ID continuation):** an assistant turn re-rendered from
its parsed tool call rarely re-tokenizes to the tokens the model actually
generated, so plain warm resume misses on agent loops. The server stores the
exact generated token ids per session and, on the next turn, sends the prompt as
segments (`{"text"}` / `{"ids"}`) that splice those ids back in for prior
assistant turns instead of re-rendering them — so the resident state stays an
exact token prefix and resume hits. Tool *results* remain text (re-tokenized
deterministically). The worker's exact-token check still backstops everything, so
a mismatch just falls back to a full prefill.
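
The segment splicing described above can be sketched roughly as follows. This is an illustrative sketch only, not the server's actual code: `build_segments`, `flatten`, `is_exact_prefix`, the `stored_ids` mapping, and the `render`/`tokenize` callables are all hypothetical names, and a real chat template would also need to handle role markers around each turn.

```python
from typing import Callable, Dict, List, Union

Segment = Dict[str, Union[str, List[int]]]
Message = Dict[str, str]


def build_segments(
    messages: List[Message],
    stored_ids: Dict[int, List[int]],
    render: Callable[[Message], str],
) -> List[Segment]:
    """Build a prompt as {"text"} / {"ids"} segments (hypothetical helper).

    Assistant turns whose generated ids were stored are spliced back in as
    {"ids"}; every other turn is rendered to text. Adjacent text is merged.
    """
    segments: List[Segment] = []
    for index, message in enumerate(messages):
        ids = stored_ids.get(index) if message["role"] == "assistant" else None
        if ids is not None:
            # Reuse the exact tokens the model generated for this turn.
            segments.append({"ids": list(ids)})
            continue
        text = render(message)
        if segments and "text" in segments[-1]:
            segments[-1]["text"] += text
        else:
            segments.append({"text": text})
    return segments


def flatten(segments: List[Segment], tokenize: Callable[[str], List[int]]) -> List[int]:
    """Turn segments into one token list; only text segments are tokenized."""
    tokens: List[int] = []
    for segment in segments:
        if "ids" in segment:
            tokens.extend(segment["ids"])
        else:
            tokens.extend(tokenize(segment["text"]))
    return tokens


def is_exact_prefix(resident: List[int], prompt: List[int]) -> bool:
    """Backstop check: resume only if the resident tokens prefix the prompt."""
    return len(resident) <= len(prompt) and prompt[: len(resident)] == resident
```

If `is_exact_prefix` returns false, the caller in this sketch would simply run a full prefill, mirroring the fallback described above.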

This is **isolation + warm resume, not concurrency**: execution is still
synchronous (one in-flight request; `--num-runners > 1` is rejected since more
workers would duplicate the weights). Fair interleaving across in-flight requests
11 changes: 11 additions & 0 deletions extension/llm/server/cpp/CMakeLists.txt
@@ -95,3 +95,14 @@ target_include_directories(
test_worker_prefill_plan PUBLIC ${_common_include_directories}
)
add_test(NAME worker_prefill_plan COMMAND test_worker_prefill_plan)

# Worker-loop harness (worker_handle_request + WorkerSessions) driven by a
# scriptable fake LLMSession/Tokenizer/LLMEngine -- no model/GPU. It includes
# the full worker_loop.h, so it needs the JSON include + the runtime/tokenizer
# libs.
add_executable(test_worker_loop test_worker_loop.cpp)
target_include_directories(
test_worker_loop PUBLIC ${_common_include_directories} ${_json_include}
)
target_link_libraries(test_worker_loop PUBLIC ${link_libraries})
add_test(NAME worker_loop COMMAND test_worker_loop)