Keep the KV cache on cancelled llama generations#513
Conversation
A cancelled generation was misrouted as a runtime error: LlamaRuntimeManager surfaces an outer-Task cancellation as LlamaRuntimeError.cancelled, which fell through to the cache-resetting LlamaRuntimeError handler instead of the quiet CancellationError path. That synchronously wiped the native KV sequence on the main actor and forced a full prompt re-decode on the next keystroke. On the slower base models nearly every keystroke supersedes the in-flight generation, so this fired ~twice a second during typing. Because the active accept event tap shares the main run loop the reset was stalling, those wipes surfaced as input lag. The cooperative cancel inside LlamaRuntimeCore.generate already unwinds cleanly (its KV-trim defer restores prompt-only state), so the cache stays valid. Add a dedicated catch LlamaRuntimeError.cancelled that throws SuggestionClientError.cancelled without resetting. Extract a narrow LlamaRuntimeGenerating seam so the engine's failure routing can be unit-tested against a fake runtime.
| /// Behavior-shaped view of the llama runtime that `LlamaSuggestionEngine` depends on: run one | ||
| /// generation and drop the native KV cache. Extracted so the engine's failure handling — in | ||
| /// particular the invariant that a *cancelled* generation must NOT reset the cache (resetting it on | ||
| /// every superseded keystroke was the base-model input-lag regression) — can be unit-tested against | ||
| /// a fake runtime instead of loading a real model. `LlamaRuntimeManager` is the production conformer. | ||
| @MainActor | ||
| protocol LlamaRuntimeGenerating: AnyObject { | ||
| func generate(prompt: String, cachedPrefixBytes: Int?, options: LlamaGenerationOptions) async throws -> String | ||
| func resetPromptCache() | ||
| } |
There was a problem hiding this comment.
Protocol lives outside its natural file scope
SuggestionSubsystemContracts.swift declares itself in its file header as defining contracts that SuggestionCoordinator depends on, but LlamaRuntimeGenerating is consumed exclusively by LlamaSuggestionEngine. In isolation this isn't a bug, but as this file grows it becomes less clear which protocols belong to the coordinator boundary and which are internal seams elsewhere. A dedicated LlamaRuntimeContracts.swift (alongside the existing Runtime/ files) or a comment noting this protocol is for the engine layer would keep the boundary explicit for future readers.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| func test_pureCancellationError_doesNotResetCache_andThrowsCancelled() async { | ||
| // Guards the pre-existing clean path so a future refactor cannot regress it either. | ||
| let runtime = FakeLlamaRuntime() | ||
| runtime.generateResult = .failure(CancellationError()) | ||
| let engine = LlamaSuggestionEngine(runtimeManager: runtime) | ||
|
|
||
| await assertThrowsCancelled(engine) | ||
| XCTAssertEqual(runtime.resetCount, 0) | ||
| } |
There was a problem hiding this comment.
CancellationError test exercises a path the real runtime cannot produce
LlamaRuntimeManager.generate always wraps any CancellationError it sees into LlamaRuntimeError.cancelled before rethrowing, so the engine will never receive a raw CancellationError from generate. The only source of a raw CancellationError to the engine is the try Task.checkCancellation() call on line 66 of LlamaSuggestionEngine.swift (the post-generation check). The test is still useful as a guard that CancellationError is not accidentally routed to the cache-resetting handler if the runtime contract changes, but the comment ("Guards the pre-existing clean path") could mention that this exercises the post-generation check rather than the generation call itself, to prevent future maintainers from misreading the intent.
Summary
Typing became laggy after the base-model migration. Root cause: a cancelled llama
generation was misrouted as a runtime error and synchronously wiped the native KV cache on the
main actor, then forced a full prompt re-decode on the next keystroke.
LlamaRuntimeManagersurfaces an outer-Task cancellation as
LlamaRuntimeError.cancelled, which fell through to thecache-resetting
LlamaRuntimeErrorhandler instead of the quietCancellationErrorpath.On the slower base models nearly every keystroke supersedes the in-flight generation, so the wipe
fired ~twice a second while typing; because the active accept event tap shares the main run loop the
reset was stalling, it surfaced as input lag. The cooperative cancel inside
LlamaRuntimeCore.generatealready unwinds cleanly (its KV-trim
deferrestores prompt-only state), so the cache is still valid —this adds a dedicated
catch LlamaRuntimeError.cancelledthat throwsSuggestionClientError.cancelledwithout resetting. A narrow
LlamaRuntimeGeneratingseam makes the failure routing unit-testable.Validation
App-hosted tests are run unsigned: the default-signed test bundle hits a Team ID mismatch on this
machine (
...different Team IDs), so signing is disabled for the local run.build-for-testing(default signing) still succeeds.
Linked issues
No tracking issue was filed; this was surfaced from on-device input-lag logs after the base-model
migration. Refs the OSS base-model cutover (#497) that exposed the latent misclassification.
Risk / rollout notes
hint. This is the intended fix — the cache is valid after a cooperative cancel — and it restores
prompt-prefix reuse across keystrokes (far fewer full re-decodes). Genuine runtime errors still
reset exactly once (covered by a test).
the OSS/base-model path. No effect on the Apple Foundation Models path.
Cotabby.xcodeprojwithxcodegen generateto register the new testfile; the pbxproj diff is only that file reference.
still living on the main run loop, per-keystroke Chrome AX-walk cost, and the caret-geometry cache
removed in Add new changes #321 — predate this regression and are tracked separately as follow-ups.
Greptile Summary
Fixes an input-lag regression introduced with the base-model migration: a cancelled llama generation was falling through to the generic
LlamaRuntimeErrorcatch handler, which synchronously wiped the native KV cache on the main actor — firing roughly twice per second during fast typing and forcing a full prompt re-decode on every subsequent keystroke. The fix adds a dedicatedcatch LlamaRuntimeError.cancelledbranch that routes cancelled generations to the quietSuggestionClientError.cancelledpath without touching the cache.LlamaSuggestionEngine.swift): inserts a specific catch clause forLlamaRuntimeError.cancelledbefore the genericLlamaRuntimeErrorhandler; narrowingruntimeManagerto the newLlamaRuntimeGeneratingprotocol enables unit-testing the routing without loading a real model.SuggestionSubsystemContracts.swift+LlamaRuntimeManager.swift): addsLlamaRuntimeGenerating(wrappinggenerateandresetPromptCache) and satisfies it with an empty extension onLlamaRuntimeManager.LlamaSuggestionEngineCancellationTests.swift): four tests using aFakeLlamaRuntimedouble pin all four routing outcomes — runtime cancel (no reset),CancellationError(no reset), genuine runtime error (reset exactly once), and success (no reset).Confidence Score: 4/5
Safe to merge; the cancellation routing change is narrow, the catch-clause order is correct, and four targeted regression tests guard the fix.
The core change is a single added catch clause in a well-understood error chain. Catch ordering is correct, the
LlamaRuntimeGeneratingprotocol extraction is clean, and four regression tests cover all relevant routing outcomes. Two style-level observations were noted but neither affects runtime behaviour.No files require special attention;
LlamaSuggestionEngine.swiftis the only file with runtime-behaviour impact and the change there is minimal and well-covered by the new tests.Important Files Changed
catch LlamaRuntimeError.cancelledclause before the genericLlamaRuntimeErrorhandler so cancellations skip the KV-cache reset; also narrowsruntimeManagerto the newLlamaRuntimeGeneratingprotocol.LlamaRuntimeGeneratingprotocol (generate + resetPromptCache) as the test seam betweenLlamaSuggestionEngineand the real runtime; placement in this file is a minor organizational choice since the protocol servesLlamaSuggestionEngine, notSuggestionCoordinatordirectly.LlamaRuntimeGenerating; existinggenerateandresetPromptCachemethods already satisfy the protocol requirements exactly.FakeLlamaRuntimedouble covering the fixed path (LlamaRuntimeError.cancelled→ no reset), theCancellationErrorpath, genuine runtime errors (reset exactly once), and success (no reset).Sequence Diagram
sequenceDiagram participant KS as Keystroke participant SE as LlamaSuggestionEngine participant RM as LlamaRuntimeManager participant RC as LlamaRuntimeCore KS->>SE: generateSuggestion(request) SE->>RM: generate(prompt, cachedPrefixBytes, options) RM->>RC: core.generate(...) [detached Task] Note over KS,RC: Next keystroke supersedes request RC-->>RM: partial buffer (cooperative cancel) RM->>RM: Task.checkCancellation() throws CancellationError RM->>RM: catch CancellationError, throw LlamaRuntimeError.cancelled RM-->>SE: throws LlamaRuntimeError.cancelled alt BEFORE this PR (bug) SE->>SE: catch LlamaRuntimeError (generic) SE->>RM: resetPromptCache() wipes KV cache SE-->>KS: throws SuggestionClientError.unavailable else AFTER this PR (fix) SE->>SE: catch LlamaRuntimeError.cancelled (new specific branch) Note over SE: No resetPromptCache() call, KV cache preserved SE-->>KS: throws SuggestionClientError.cancelled endReviews (1): Last reviewed commit: "Keep the KV cache on cancelled llama gen..." | Re-trigger Greptile