pytorch · mergennachin · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026 · Jun 10, 2026
diff --git a/examples/models/qwen3_5_moe/README.md b/examples/models/qwen3_5_moe/README.md
@@ -243,6 +243,16 @@ Each `done` event reports
 (`new`/`exact_prefix`/`dirty`/`mismatch`/`equal`) for measuring the hit rate.
 `--no-warm-resume` forces a full prefill every request (for A/B comparison).
 
+**Tool-call turns (token-ID continuation):** an assistant turn re-rendered from
+its parsed tool call rarely re-tokenizes to the tokens the model actually
+generated, so plain warm resume misses on agent loops. The server stores the
+exact generated token ids per session and, on the next turn, sends the prompt as
+segments (`{"text"}` / `{"ids"}`) that splice those ids back in for prior
+assistant turns instead of re-rendering them — so the resident state stays an
+exact token prefix and resume hits. Tool *results* remain text (re-tokenized
+deterministically). The worker's exact-token check still backstops everything, so
+a mismatch just falls back to a full prefill.
+
 This is **isolation + warm resume, not concurrency**: execution is still
 synchronous (one in-flight request; `--num-runners > 1` is rejected since more
 workers would duplicate the weights). Fair interleaving across in-flight requests

@@ -95,3 +95,14 @@ target_include_directories(
   test_worker_prefill_plan PUBLIC ${_common_include_directories}
 )
 add_test(NAME worker_prefill_plan COMMAND test_worker_prefill_plan)
+
+# Worker-loop harness (worker_handle_request + WorkerSessions) driven by a
+# scriptable fake LLMSession/Tokenizer/LLMEngine -- no model/GPU. It includes
+# the full worker_loop.h, so it needs the JSON include + the runtime/tokenizer
+# libs.
+add_executable(test_worker_loop test_worker_loop.cpp)
+target_include_directories(
+  test_worker_loop PUBLIC ${_common_include_directories} ${_json_include}
+)
+target_link_libraries(test_worker_loop PUBLIC ${link_libraries})
+add_test(NAME worker_loop COMMAND test_worker_loop)
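
The token-ID continuation described in the README hunk can be sketched as follows. This is a minimal illustration, not the server's or worker's actual code: `toy_tokenize`, `flatten_segments`, and `reusable_prefix_len` are hypothetical names, and the one-id-per-character tokenizer is an assumption made only so the example runs without a model. Only the `{"text"}` / `{"ids"}` segment shapes come from the README text.

```python
from typing import Callable, Dict, List, Sequence, Union

# A prompt segment is either {"text": str} or {"ids": [int, ...]}.
Segment = Dict[str, Union[str, List[int]]]


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in for the real tokenizer: one id per character.
    return [ord(ch) for ch in text]


def flatten_segments(
    segments: Sequence[Segment],
    tokenize: Callable[[str], List[int]] = toy_tokenize,
) -> List[int]:
    # Text segments are tokenized; id segments are spliced in verbatim.
    out: List[int] = []
    for seg in segments:
        if "ids" in seg:
            out.extend(seg["ids"])
        else:
            out.extend(tokenize(seg["text"]))
    return out


def reusable_prefix_len(resident: Sequence[int], prompt: Sequence[int]) -> int:
    # Exact-token check: reuse the resident state only if it is a strict
    # token-for-token prefix of the new prompt; otherwise reuse nothing,
    # which corresponds to falling back to a full prefill.
    n = len(resident)
    if n <= len(prompt) and list(prompt[:n]) == list(resident):
        return n
    return 0


# Turn 1: the prompt text plus the ids the model actually generated.
turn1_text = "user: weather?\nassistant: "
generated = [2000001, 2000002, 2000003]  # made-up ids for the tool-call turn
resident = toy_tokenize(turn1_text) + generated

# Turn 2: the tool result arrives as text.
tool_result = "\ntool: 21C\nassistant: "

# (a) Splice the stored ids back in for the prior assistant turn.
spliced = flatten_segments(
    [{"text": turn1_text}, {"ids": generated}, {"text": tool_result}]
)

# (b) Re-render the assistant turn from its parsed tool call instead.
rerendered = flatten_segments(
    [{"text": turn1_text + "get_weather()" + tool_result}]
)

hit = reusable_prefix_len(resident, spliced)
miss = reusable_prefix_len(resident, rerendered)
```

With the spliced prompt, `hit` equals `len(resident)`, so only the tool result would need prefilling. With the re-rendered prompt, the assistant turn tokenizes differently from what was generated, so `miss` is `0` and the whole prompt would be prefilled.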