Skip to content

Keep the KV cache on cancelled llama generations#513

Merged
FuJacob merged 1 commit into
mainfrom
fix/cancelled-generation-keeps-kv-cache
Jun 1, 2026
Merged

Keep the KV cache on cancelled llama generations#513
FuJacob merged 1 commit into
mainfrom
fix/cancelled-generation-keeps-kv-cache

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented Jun 1, 2026

Summary

Typing became laggy after the base-model migration. Root cause: a cancelled llama
generation was misrouted as a runtime error and synchronously wiped the native KV cache on the
main actor, then forced a full prompt re-decode on the next keystroke. LlamaRuntimeManager
surfaces an outer-Task cancellation as LlamaRuntimeError.cancelled, which fell through to the
cache-resetting LlamaRuntimeError handler instead of the quiet CancellationError path.

On the slower base models nearly every keystroke supersedes the in-flight generation, so the wipe
fired ~twice a second while typing; because the active accept event tap shares the main run loop the
reset was stalling, it surfaced as input lag. The cooperative cancel inside LlamaRuntimeCore.generate
already unwinds cleanly (its KV-trim defer restores prompt-only state), so the cache is still valid —
this adds a dedicated catch LlamaRuntimeError.cancelled that throws SuggestionClientError.cancelled
without resetting. A narrow LlamaRuntimeGenerating seam makes the failure routing unit-testable.

Validation

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData
# ** BUILD SUCCEEDED **

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' test \
  -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests \
  -only-testing:CotabbyTests/LlamaPromptCacheHintTrackerTests
# ** TEST SUCCEEDED ** — Executed 9 tests, with 0 failures
#   LlamaSuggestionEngineCancellationTests 4/4:
#     - runtime cancel (LlamaRuntimeError.cancelled) => no cache reset, throws .cancelled
#     - pure CancellationError                       => no cache reset, throws .cancelled
#     - genuine runtime error                        => resets cache exactly once, throws .unavailable
#     - successful generation                        => no cache reset

swiftlint lint --quiet
# exit 0 (clean)

App-hosted tests are run unsigned: the default-signed test bundle hits a Team ID mismatch on this
machine (...different Team IDs), so signing is disabled for the local run. build-for-testing
(default signing) still succeeds.

Linked issues

No tracking issue was filed; this was surfaced from on-device input-lag logs after the base-model
migration. Refs the OSS base-model cutover (#497) that exposed the latent misclassification.

Risk / rollout notes

  • Behavior change: a cancelled generation no longer resets the llama KV cache or the prompt-cache
    hint. This is the intended fix — the cache is valid after a cooperative cancel — and it restores
    prompt-prefix reuse across keystrokes (far fewer full re-decodes). Genuine runtime errors still
    reset exactly once (covered by a test).
  • Performance: removes ~2 synchronous main-actor KV-cache wipes per second during fast typing on
    the OSS/base-model path. No effect on the Apple Foundation Models path.
  • Project file: regenerated Cotabby.xcodeproj with xcodegen generate to register the new test
    file; the pbxproj diff is only that file reference.
  • Out of scope: pre-existing amplifiers found during the investigation — the active accept tap
    still living on the main run loop, per-keystroke Chrome AX-walk cost, and the caret-geometry cache
    removed in Add new changes #321 — predate this regression and are tracked separately as follow-ups.

Greptile Summary

Fixes an input-lag regression introduced with the base-model migration: a cancelled llama generation was falling through to the generic LlamaRuntimeError catch handler, which synchronously wiped the native KV cache on the main actor — firing roughly twice per second during fast typing and forcing a full prompt re-decode on every subsequent keystroke. The fix adds a dedicated catch LlamaRuntimeError.cancelled branch that routes cancelled generations to the quiet SuggestionClientError.cancelled path without touching the cache.

  • Engine fix (LlamaSuggestionEngine.swift): inserts a specific catch clause for LlamaRuntimeError.cancelled before the generic LlamaRuntimeError handler; narrowing runtimeManager to the new LlamaRuntimeGenerating protocol enables unit-testing the routing without loading a real model.
  • Protocol seam (SuggestionSubsystemContracts.swift + LlamaRuntimeManager.swift): adds LlamaRuntimeGenerating (wrapping generate and resetPromptCache) and satisfies it with an empty extension on LlamaRuntimeManager.
  • Regression tests (LlamaSuggestionEngineCancellationTests.swift): four tests using a FakeLlamaRuntime double pin all four routing outcomes — runtime cancel (no reset), CancellationError (no reset), genuine runtime error (reset exactly once), and success (no reset).

Confidence Score: 4/5

Safe to merge; the cancellation routing change is narrow, the catch-clause order is correct, and four targeted regression tests guard the fix.

The core change is a single added catch clause in a well-understood error chain. Catch ordering is correct, the LlamaRuntimeGenerating protocol extraction is clean, and four regression tests cover all relevant routing outcomes. Two style-level observations were noted but neither affects runtime behaviour.

No files require special attention; LlamaSuggestionEngine.swift is the only file with runtime-behaviour impact and the change there is minimal and well-covered by the new tests.

Important Files Changed

Filename Overview
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift Adds a dedicated catch LlamaRuntimeError.cancelled clause before the generic LlamaRuntimeError handler so cancellations skip the KV-cache reset; also narrows runtimeManager to the new LlamaRuntimeGenerating protocol.
Cotabby/Models/SuggestionSubsystemContracts.swift Adds LlamaRuntimeGenerating protocol (generate + resetPromptCache) as the test seam between LlamaSuggestionEngine and the real runtime; placement in this file is a minor organizational choice since the protocol serves LlamaSuggestionEngine, not SuggestionCoordinator directly.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift Empty conformance extension to LlamaRuntimeGenerating; existing generate and resetPromptCache methods already satisfy the protocol requirements exactly.
CotabbyTests/LlamaSuggestionEngineCancellationTests.swift Four focused regression tests using a FakeLlamaRuntime double covering the fixed path (LlamaRuntimeError.cancelled → no reset), the CancellationError path, genuine runtime errors (reset exactly once), and success (no reset).
Cotabby.xcodeproj/project.pbxproj Mechanical xcodegen output registering the new test file; no manual edits or unexpected changes.

Sequence Diagram

sequenceDiagram
    participant KS as Keystroke
    participant SE as LlamaSuggestionEngine
    participant RM as LlamaRuntimeManager
    participant RC as LlamaRuntimeCore

    KS->>SE: generateSuggestion(request)
    SE->>RM: generate(prompt, cachedPrefixBytes, options)
    RM->>RC: core.generate(...) [detached Task]
    Note over KS,RC: Next keystroke supersedes request
    RC-->>RM: partial buffer (cooperative cancel)
    RM->>RM: Task.checkCancellation() throws CancellationError
    RM->>RM: catch CancellationError, throw LlamaRuntimeError.cancelled
    RM-->>SE: throws LlamaRuntimeError.cancelled

    alt BEFORE this PR (bug)
        SE->>SE: catch LlamaRuntimeError (generic)
        SE->>RM: resetPromptCache() wipes KV cache
        SE-->>KS: throws SuggestionClientError.unavailable
    else AFTER this PR (fix)
        SE->>SE: catch LlamaRuntimeError.cancelled (new specific branch)
        Note over SE: No resetPromptCache() call, KV cache preserved
        SE-->>KS: throws SuggestionClientError.cancelled
    end
Loading

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Keep the KV cache on cancelled llama gen..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

A cancelled generation was misrouted as a runtime error: LlamaRuntimeManager surfaces an outer-Task cancellation as LlamaRuntimeError.cancelled, which fell through to the cache-resetting LlamaRuntimeError handler instead of the quiet CancellationError path. That synchronously wiped the native KV sequence on the main actor and forced a full prompt re-decode on the next keystroke.

On the slower base models nearly every keystroke supersedes the in-flight generation, so this fired ~twice a second during typing. Because the active accept event tap shares the main run loop the reset was stalling, those wipes surfaced as input lag.

The cooperative cancel inside LlamaRuntimeCore.generate already unwinds cleanly (its KV-trim defer restores prompt-only state), so the cache stays valid. Add a dedicated catch LlamaRuntimeError.cancelled that throws SuggestionClientError.cancelled without resetting.

Extract a narrow LlamaRuntimeGenerating seam so the engine's failure routing can be unit-tested against a fake runtime.
@FuJacob FuJacob merged commit 237ddf0 into main Jun 1, 2026
4 checks passed
@FuJacob FuJacob deleted the fix/cancelled-generation-keeps-kv-cache branch June 1, 2026 23:22
Comment on lines +85 to +94
/// Behavior-shaped view of the llama runtime that `LlamaSuggestionEngine` depends on: run one
/// generation and drop the native KV cache. Extracted so the engine's failure handling — in
/// particular the invariant that a *cancelled* generation must NOT reset the cache (resetting it on
/// every superseded keystroke was the base-model input-lag regression) — can be unit-tested against
/// a fake runtime instead of loading a real model. `LlamaRuntimeManager` is the production conformer.
@MainActor
protocol LlamaRuntimeGenerating: AnyObject {
func generate(prompt: String, cachedPrefixBytes: Int?, options: LlamaGenerationOptions) async throws -> String
func resetPromptCache()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Protocol lives outside its natural file scope

SuggestionSubsystemContracts.swift declares itself in its file header as defining contracts that SuggestionCoordinator depends on, but LlamaRuntimeGenerating is consumed exclusively by LlamaSuggestionEngine. In isolation this isn't a bug, but as this file grows it becomes less clear which protocols belong to the coordinator boundary and which are internal seams elsewhere. A dedicated LlamaRuntimeContracts.swift (alongside the existing Runtime/ files) or a comment noting this protocol is for the engine layer would keep the boundary explicit for future readers.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Codex Fix in Claude Code

Comment on lines +28 to +36
func test_pureCancellationError_doesNotResetCache_andThrowsCancelled() async {
// Guards the pre-existing clean path so a future refactor cannot regress it either.
let runtime = FakeLlamaRuntime()
runtime.generateResult = .failure(CancellationError())
let engine = LlamaSuggestionEngine(runtimeManager: runtime)

await assertThrowsCancelled(engine)
XCTAssertEqual(runtime.resetCount, 0)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 CancellationError test exercises a path the real runtime cannot produce

LlamaRuntimeManager.generate always wraps any CancellationError it sees into LlamaRuntimeError.cancelled before rethrowing, so the engine will never receive a raw CancellationError from generate. The only source of a raw CancellationError to the engine is the try Task.checkCancellation() call on line 66 of LlamaSuggestionEngine.swift (the post-generation check). The test is still useful as a guard that CancellationError is not accidentally routed to the cache-resetting handler if the runtime contract changes, but the comment ("Guards the pre-existing clean path") could mention that this exercises the post-generation check rather than the generation call itself, to prevent future maintainers from misreading the intent.

Fix in Codex Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant