Keep the KV cache on cancelled llama generations by FuJacob · Pull Request #513 · FuJacob/cotabby

FuJacob · 2026-06-01T23:18:18Z

Summary

Typing became laggy after the base-model migration. Root cause: a cancelled llama
generation was misrouted as a runtime error and synchronously wiped the native KV cache on the
main actor, then forced a full prompt re-decode on the next keystroke. LlamaRuntimeManager
surfaces an outer-Task cancellation as LlamaRuntimeError.cancelled, which fell through to the
cache-resetting LlamaRuntimeError handler instead of the quiet CancellationError path.

On the slower base models nearly every keystroke supersedes the in-flight generation, so the wipe
fired ~twice a second while typing. Because the active accept event tap shares the main run loop that
the reset was stalling, the wipes surfaced as input lag. The cooperative cancel inside
LlamaRuntimeCore.generate already unwinds cleanly (its KV-trim defer restores prompt-only state), so
the cache is still valid. This PR adds a dedicated catch LlamaRuntimeError.cancelled that throws
SuggestionClientError.cancelled without resetting. A narrow LlamaRuntimeGenerating seam makes the
failure routing unit-testable.
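
The routing described above depends on catch-clause order. The following is a minimal sketch of that idea, not the engine's actual code: `RoutingSketch`, `route`, and the `decodeFailed` case are hypothetical names introduced for illustration, and the two error enums are simplified stand-ins for the real Cotabby types.

```swift
// Simplified stand-ins; the real Cotabby types may carry more cases or payloads.
enum LlamaRuntimeError: Error {
    case cancelled
    case decodeFailed   // hypothetical name for "any genuine runtime failure"
}

enum SuggestionClientError: Error, Equatable {
    case cancelled
    case unavailable
}

final class RoutingSketch {
    // Counts how often the (imaginary) KV cache would have been reset.
    private(set) var resetCount = 0

    /// Runs `work` and maps its failure onto a client-facing error.
    /// Only a genuine runtime failure bumps `resetCount`.
    func route(_ work: () throws -> String) throws -> String {
        do {
            return try work()
        } catch is CancellationError {
            // Quiet path: plain task cancellation, cache untouched.
            throw SuggestionClientError.cancelled
        } catch LlamaRuntimeError.cancelled {
            // Must come before the generic clause below, otherwise a
            // cancellation would be treated as a failure and reset the cache.
            throw SuggestionClientError.cancelled
        } catch is LlamaRuntimeError {
            resetCount += 1
            throw SuggestionClientError.unavailable
        }
    }
}
```

Swift evaluates catch clauses top to bottom, so moving the `LlamaRuntimeError.cancelled` clause below the generic `is LlamaRuntimeError` clause would reproduce the misrouting this PR fixes.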

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData
# ** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' test \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/LlamaPromptCacheHintTrackerTests
# ** TEST SUCCEEDED ** — Executed 9 tests, with 0 failures
#   LlamaSuggestionEngineCancellationTests 4/4:
#     - runtime cancel (LlamaRuntimeError.cancelled) => no cache reset, throws .cancelled
#     - pure CancellationError                       => no cache reset, throws .cancelled
#     - genuine runtime error                        => resets cache exactly once, throws .unavailable
#     - successful generation                        => no cache reset

swiftlint lint --quiet
# exit 0 (clean)

App-hosted tests are run unsigned: the default-signed test bundle hits a Team ID mismatch on this
machine (...different Team IDs), so signing is disabled for the local run. build-for-testing
(default signing) still succeeds.

Linked issues

No tracking issue was filed; this was surfaced from on-device input-lag logs after the base-model
migration. Refs the OSS base-model cutover (#497) that exposed the latent misclassification.

Risk / rollout notes

Behavior change: a cancelled generation no longer resets the llama KV cache or the prompt-cache
hint. This is the intended fix — the cache is valid after a cooperative cancel — and it restores
prompt-prefix reuse across keystrokes (far fewer full re-decodes). Genuine runtime errors still
reset exactly once (covered by a test).
Performance: removes ~2 synchronous main-actor KV-cache wipes per second during fast typing on
the OSS/base-model path. No effect on the Apple Foundation Models path.
Project file: regenerated Cotabby.xcodeproj with xcodegen generate to register the new test
file; the pbxproj diff is only that file reference.
Out of scope: pre-existing amplifiers found during the investigation — the active accept tap
still living on the main run loop, per-keystroke Chrome AX-walk cost, and the caret-geometry cache
removed in Add new changes #321 — predate this regression and are tracked separately as follow-ups.

Greptile Summary

Fixes an input-lag regression introduced with the base-model migration: a cancelled llama generation was falling through to the generic LlamaRuntimeError catch handler, which synchronously wiped the native KV cache on the main actor — firing roughly twice per second during fast typing and forcing a full prompt re-decode on every subsequent keystroke. The fix adds a dedicated catch LlamaRuntimeError.cancelled branch that routes cancelled generations to the quiet SuggestionClientError.cancelled path without touching the cache.

Engine fix (LlamaSuggestionEngine.swift): inserts a specific catch clause for LlamaRuntimeError.cancelled before the generic LlamaRuntimeError handler; narrowing runtimeManager to the new LlamaRuntimeGenerating protocol enables unit-testing the routing without loading a real model.
Protocol seam (SuggestionSubsystemContracts.swift + LlamaRuntimeManager.swift): adds LlamaRuntimeGenerating (wrapping generate and resetPromptCache) and satisfies it with an empty extension on LlamaRuntimeManager.
Regression tests (LlamaSuggestionEngineCancellationTests.swift): four tests using a FakeLlamaRuntime double pin all four routing outcomes — runtime cancel (no reset), CancellationError (no reset), genuine runtime error (reset exactly once), and success (no reset).
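
For readers unfamiliar with the test-double pattern mentioned above, here is a rough sketch of what a fake conforming to `LlamaRuntimeGenerating` could look like. The protocol signature is copied from this PR's diff; the empty `LlamaGenerationOptions` struct is a placeholder assumption, and the body of `FakeLlamaRuntime` is a guess based only on the `generateResult` and `resetCount` members visible in the quoted test, so the real double may differ.

```swift
// Placeholder assumption: the real options type has fields not shown in this PR.
struct LlamaGenerationOptions {}

// Signature as shown in the PR diff.
@MainActor
protocol LlamaRuntimeGenerating: AnyObject {
    func generate(prompt: String, cachedPrefixBytes: Int?, options: LlamaGenerationOptions) async throws -> String
    func resetPromptCache()
}

// Hypothetical reconstruction of the test double.
@MainActor
final class FakeLlamaRuntime: LlamaRuntimeGenerating {
    // The outcome the next generate() call will produce.
    var generateResult: Result<String, Error> = .success("")
    // Number of times the engine asked for a cache reset.
    private(set) var resetCount = 0

    func generate(prompt: String, cachedPrefixBytes: Int?, options: LlamaGenerationOptions) async throws -> String {
        // Returns the canned value or throws the canned error.
        try generateResult.get()
    }

    func resetPromptCache() {
        resetCount += 1
    }
}
```

Because the fake records resets rather than performing them, a test can assert on `resetCount` after each routing outcome without loading a model.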

Confidence Score: 4/5

Safe to merge; the cancellation routing change is narrow, the catch-clause order is correct, and four targeted regression tests guard the fix.

The core change is a single added catch clause in a well-understood error chain. Catch ordering is correct, the LlamaRuntimeGenerating protocol extraction is clean, and four regression tests cover all relevant routing outcomes. Two style-level observations were noted but neither affects runtime behaviour.

No files require special attention; LlamaSuggestionEngine.swift is the only file with runtime-behaviour impact and the change there is minimal and well-covered by the new tests.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Adds a dedicated `catch LlamaRuntimeError.cancelled` clause before the generic `LlamaRuntimeError` handler so cancellations skip the KV-cache reset; also narrows `runtimeManager` to the new `LlamaRuntimeGenerating` protocol. |
| Cotabby/Models/SuggestionSubsystemContracts.swift | Adds `LlamaRuntimeGenerating` protocol (generate + resetPromptCache) as the test seam between `LlamaSuggestionEngine` and the real runtime; placement in this file is a minor organizational choice since the protocol serves `LlamaSuggestionEngine`, not `SuggestionCoordinator` directly. |
| Cotabby/Services/Runtime/LlamaRuntimeManager.swift | Empty conformance extension to `LlamaRuntimeGenerating`; existing `generate` and `resetPromptCache` methods already satisfy the protocol requirements exactly. |
| CotabbyTests/LlamaSuggestionEngineCancellationTests.swift | Four focused regression tests using a `FakeLlamaRuntime` double covering the fixed path (`LlamaRuntimeError.cancelled` → no reset), the `CancellationError` path, genuine runtime errors (reset exactly once), and success (no reset). |
| Cotabby.xcodeproj/project.pbxproj | Mechanical xcodegen output registering the new test file; no manual edits or unexpected changes. |

Sequence Diagram

sequenceDiagram
    participant KS as Keystroke
    participant SE as LlamaSuggestionEngine
    participant RM as LlamaRuntimeManager
    participant RC as LlamaRuntimeCore

    KS->>SE: generateSuggestion(request)
    SE->>RM: generate(prompt, cachedPrefixBytes, options)
    RM->>RC: core.generate(...) [detached Task]
    Note over KS,RC: Next keystroke supersedes request
    RC-->>RM: partial buffer (cooperative cancel)
    RM->>RM: Task.checkCancellation() throws CancellationError
    RM->>RM: catch CancellationError, throw LlamaRuntimeError.cancelled
    RM-->>SE: throws LlamaRuntimeError.cancelled

    alt BEFORE this PR (bug)
        SE->>SE: catch LlamaRuntimeError (generic)
        SE->>RM: resetPromptCache() wipes KV cache
        SE-->>KS: throws SuggestionClientError.unavailable
    else AFTER this PR (fix)
        SE->>SE: catch LlamaRuntimeError.cancelled (new specific branch)
        Note over SE: No resetPromptCache() call, KV cache preserved
        SE-->>KS: throws SuggestionClientError.cancelled
    end


Greptile also left 2 inline comments on this PR.

A cancelled generation was misrouted as a runtime error: LlamaRuntimeManager surfaces an outer-Task cancellation as LlamaRuntimeError.cancelled, which fell through to the cache-resetting LlamaRuntimeError handler instead of the quiet CancellationError path. That synchronously wiped the native KV sequence on the main actor and forced a full prompt re-decode on the next keystroke.

On the slower base models nearly every keystroke supersedes the in-flight generation, so this fired ~twice a second during typing. Because the active accept event tap shares the main run loop that the reset was stalling, those wipes surfaced as input lag.

The cooperative cancel inside LlamaRuntimeCore.generate already unwinds cleanly (its KV-trim defer restores prompt-only state), so the cache stays valid. Add a dedicated catch LlamaRuntimeError.cancelled that throws SuggestionClientError.cancelled without resetting. Extract a narrow LlamaRuntimeGenerating seam so the engine's failure routing can be unit-tested against a fake runtime.

greptile-apps · 2026-06-01T23:23:10Z

+/// Behavior-shaped view of the llama runtime that `LlamaSuggestionEngine` depends on: run one
+/// generation and drop the native KV cache. Extracted so the engine's failure handling — in
+/// particular the invariant that a *cancelled* generation must NOT reset the cache (resetting it on
+/// every superseded keystroke was the base-model input-lag regression) — can be unit-tested against
+/// a fake runtime instead of loading a real model. `LlamaRuntimeManager` is the production conformer.
+@MainActor
+protocol LlamaRuntimeGenerating: AnyObject {
+    func generate(prompt: String, cachedPrefixBytes: Int?, options: LlamaGenerationOptions) async throws -> String
+    func resetPromptCache()
+}


Protocol lives outside its natural file scope

SuggestionSubsystemContracts.swift declares itself in its file header as defining contracts that SuggestionCoordinator depends on, but LlamaRuntimeGenerating is consumed exclusively by LlamaSuggestionEngine. In isolation this isn't a bug, but as this file grows it becomes less clear which protocols belong to the coordinator boundary and which are internal seams elsewhere. A dedicated LlamaRuntimeContracts.swift (alongside the existing Runtime/ files) or a comment noting this protocol is for the engine layer would keep the boundary explicit for future readers.


greptile-apps · 2026-06-01T23:23:11Z

+    func test_pureCancellationError_doesNotResetCache_andThrowsCancelled() async {
+        // Guards the pre-existing clean path so a future refactor cannot regress it either.
+        let runtime = FakeLlamaRuntime()
+        runtime.generateResult = .failure(CancellationError())
+        let engine = LlamaSuggestionEngine(runtimeManager: runtime)
+
+        await assertThrowsCancelled(engine)
+        XCTAssertEqual(runtime.resetCount, 0)
+    }


CancellationError test exercises a path the real runtime cannot produce

LlamaRuntimeManager.generate always wraps any CancellationError it sees into LlamaRuntimeError.cancelled before rethrowing, so the engine will never receive a raw CancellationError from generate. The only source of a raw CancellationError to the engine is the try Task.checkCancellation() call on line 66 of LlamaSuggestionEngine.swift (the post-generation check). The test is still useful as a guard that CancellationError is not accidentally routed to the cache-resetting handler if the runtime contract changes, but the comment ("Guards the pre-existing clean path") could mention that this exercises the post-generation check rather than the generation call itself, to prevent future maintainers from misreading the intent.
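
The wrapping behaviour this comment refers to can be sketched roughly as follows. This is an illustration under stated assumptions, not the manager's real implementation: `managerGenerate` is a hypothetical free function standing in for `LlamaRuntimeManager.generate`, the `core` closure stands in for the detached `LlamaRuntimeCore.generate` call, and the error enum is reduced to the one case that matters here.

```swift
// Reduced stand-in for the real error type.
enum LlamaRuntimeError: Error {
    case cancelled
}

/// Hypothetical stand-in for the manager: any CancellationError raised while
/// generating is converted to LlamaRuntimeError.cancelled before it leaves.
func managerGenerate(_ core: () async throws -> String) async throws -> String {
    do {
        let text = try await core()
        // Cooperative check after the core returns its (possibly partial) buffer.
        try Task.checkCancellation()
        return text
    } catch is CancellationError {
        throw LlamaRuntimeError.cancelled
    }
}
```

Under this shape a caller never sees a raw `CancellationError` from the generate call itself; it can only come from a cancellation check the caller performs on its own afterwards, which is the case the reviewed test pins down.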

FuJacob merged commit 237ddf0 into main Jun 1, 2026
4 checks passed

FuJacob deleted the fix/cancelled-generation-keeps-kv-cache branch June 1, 2026 23:22

greptile-apps Bot reviewed Jun 1, 2026
