
Generation-time quality controls: token masks, single-line, mid-word continuation #488

Merged
FuJacob merged 2 commits into feat/output-safety-gates from feat/generation-quality-controls on Jun 1, 2026

@FuJacob FuJacob commented May 31, 2026

Summary

Wires generation-time quality controls into the app so unwanted tokens are stopped at the sampler instead of cleaned up after the fact. Builds on the engine changes in FuJacob/cotabbyinference#8. Second of two stacked PRs, based on the feat/output-safety-gates branch (#485).

  • Nonprintable token mask (free on the pin bump): the engine masks control, chat-template, and unused tokens from sampling (EOG stays sampleable so natural stopping still works), so they can never appear as ghost text. No app code beyond the dependency bump.
  • single_line: set from the focused field (LlamaGenerationOptions.singleLine = !request.isMultiLineEnabled), so single-line fields never receive a multi-line completion in the first place, rather than having one truncated after the fact.
  • forceWordContinuation: MidWordContinuationPolicy fires only when the caret sits strictly inside a word, so the engine constrains the first sampled token to continue the current word. At a normal word end it does not fire, so ordinary next-word predictions are unchanged.
  • Confidence suppression (off by default): the engine now reports a per-token log-probability. LlamaRuntimeCore averages it and, when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity, suppresses low-confidence completions via the pure ConfidenceSuppressionPolicy. Disabled by default, so no behavior change until a caller opts in.
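The mid-word trigger described above can be illustrated with a minimal sketch. It assumes the rule stated in the review summary (the character before the caret and the character after it are both letters or digits). The type name `MidWordContinuationSketch` and its parameter labels are hypothetical, and the real MidWordContinuationPolicy may differ.

```swift
// Hypothetical sketch of the mid-word trigger; not the shipped MidWordContinuationPolicy.
enum MidWordContinuationSketch {
    /// Returns true only when the caret sits strictly inside a word, i.e. the
    /// character before the caret and the character after it are both letters
    /// or digits. An empty side (start or end of text) never triggers.
    static func shouldForceContinuation(preceding: String, trailing: String) -> Bool {
        guard let before = preceding.last, let after = trailing.first else {
            return false
        }
        return isWordCharacter(before) && isWordCharacter(after)
    }

    /// Assumption: "word character" means a Unicode letter or number.
    private static func isWordCharacter(_ character: Character) -> Bool {
        character.isLetter || character.isNumber
    }
}
```

With this shape, a caret between "hel" and "lo" triggers the constraint, while a caret at the end of "hello" or just after a space does not, which matches the intent that ordinary next-word predictions stay unchanged.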

Threading: singleLine / forceWordContinuation / confidenceFloor flow through LlamaGenerationOptions into LlamaRuntimeCore (sampling config, setForceWordContinuation before each decodePrompt, and the logprob accumulation in the sample loop).
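As a rough illustration of that threading, the sketch below shows how the three options might be derived per request. Only the option names and the `singleLine = !isMultiLineEnabled` rule come from this PR. The struct names, the `caretIsInsideWord` input, and the `false` defaults are assumptions made for the example.

```swift
// Hypothetical stand-in for LlamaGenerationOptions; the defaults are assumptions,
// except confidenceFloor, which the PR states defaults to -infinity.
struct GenerationOptionsSketch {
    var singleLine: Bool = false
    var forceWordContinuation: Bool = false
    var confidenceFloor: Double = -.infinity  // -infinity disables suppression
}

// Hypothetical request shape carrying only what this example needs.
struct FieldRequestSketch {
    var isMultiLineEnabled: Bool
    var caretIsInsideWord: Bool  // assumed to be the mid-word policy's result
}

/// Builds per-request options: singleLine mirrors the focused field, and
/// forceWordContinuation mirrors the mid-word decision. confidenceFloor keeps
/// its default, so suppression stays off unless a caller raises it.
func makeOptions(for request: FieldRequestSketch) -> GenerationOptionsSketch {
    GenerationOptionsSketch(
        singleLine: !request.isMultiLineEnabled,
        forceWordContinuation: request.caretIsInsideWord
    )
}
```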

Validation

swiftlint lint --strict --quiet <touched files>          # exit 0
xcodebuild ... build         -derivedDataPath ...         # ** BUILD SUCCEEDED **
xcodebuild ... build-for-testing -derivedDataPath ...     # ** TEST BUILD SUCCEEDED **
xcodebuild test ... CODE_SIGNING_ALLOWED=NO \
  -only-testing:CotabbyTests/ConfidenceSuppressionPolicyTests \
  -only-testing:CotabbyTests/MidWordContinuationPolicyTests \
  -only-testing:CotabbyTests/TrailingDuplicationFilterTests \
  -only-testing:CotabbyTests/InsertionSafetyGateTests \
  -only-testing:CotabbyTests/SentenceBoundaryClassifierTests
# ** TEST SUCCEEDED **  Executed 27 tests, with 0 failures

Package pin resolves to cotabbyinference @ feat/generation-quality-controls (be64365). The engine side was separately validated with swift test against a local gemma-3-1b model: 20 tests, 0 failures (see the engine PR).

Linked issues

Depends on FuJacob/cotabbyinference#8 (engine).

Risk / rollout notes

  • Merge order: project.yml pins the engine feature branch. Merge the engine PR to cotabbyinference main first, then flip this PR's project.yml back to branch: main, re-resolve, and merge. Kept as a draft until then.
  • Behavior: the nonprintable mask changes sampling for all local models but only removes never-visible tokens (greedy determinism preserved, verified by the engine's interleaved test). single_line only affects single-line fields. forceWordContinuation uses a narrow trigger. Confidence suppression is off by default.
  • Follow-ups (deliberate): wiring a Settings control for confidenceFloor, full multi-candidate N-best ranking, and the base-model prompt path (covered by the in-flight #438, "Feed instruct models their own chat template; write both prompt paths as prose"). These need on-device tuning that green CI cannot stand in for.
  • Typo suppression is handled by #353 ("Suppress completions on typo'd word and offer context-aware correction"), not here.

Greptile Summary

This PR wires three generation-time quality controls into the app: single_line token masking (derived from the focused field's multi-line capability), forceWordContinuation (applied only when the caret sits strictly inside a word), and an opt-in confidence suppression layer that averages per-token log-probabilities and discards low-confidence completions. All three flow through a new LlamaGenerationOptions extension into LlamaRuntimeCore.

  • singleLine is passed as part of SamplingConfig on sequence creation and correctly flips from field metadata; however, LlamaRuntimeCore.SamplingFingerprint does not capture it, so switching between a single-line and multi-line field that share a prompt prefix can silently reuse a sequence built with the wrong single_line value.
  • forceWordContinuation is applied via setForceWordContinuation before each decodePrompt on both the fresh and reuse paths; when the reuse path has no remaining tokens to decode (remaining.isEmpty), the call is skipped and a stale seed token from the prior decode may carry the wrong constraint.
  • Confidence suppression (ConfidenceSuppressionPolicy) and mid-word policy (MidWordContinuationPolicy) are clean pure functions with good test coverage and safe defaults.

Confidence Score: 3/5

Merging now introduces a silent correctness gap: users switching between single-line and multi-line fields that share a prompt prefix can receive completions constrained by the wrong field type, with no visible error or log warning beyond the wrong output.

The singleLine flag is baked into a sequence at creation time but is absent from the fingerprint that gates KV-cache reuse. After the sequence is created correctly for a single-line field, a subsequent request from a multi-line field with the same prompt prefix will silently reuse it, masking newline tokens in a context where they should be allowed. This is an observable behavioral regression on a path users hit naturally by tabbing between fields. The rest of the feature is well-structured and tested.

LlamaRuntimeCore.swift (the SamplingFingerprint struct near the bottom of the file) and LlamaSuggestionEngine.swift (the parallel SamplingFingerprint in the hint tracker) both need singleLine added before this is safe to merge.
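To make the reuse gap concrete, here is a minimal sketch of a fingerprint that includes singleLine. The `temperature` and `topK` fields are hypothetical placeholders for whatever sampling knobs the real SamplingFingerprint already captures. Only the singleLine field reflects the change the review asks for.

```swift
// Hypothetical fingerprint; temperature and topK are placeholder knobs.
struct SamplingFingerprintSketch: Equatable {
    var temperature: Float
    var topK: Int
    var singleLine: Bool  // the field the review asks to add
}

/// A cached sequence may be reused only when every sampling knob that was baked
/// in at sequence creation matches the incoming request.
func canReuseSequence(cached: SamplingFingerprintSketch,
                      requested: SamplingFingerprintSketch) -> Bool {
    cached == requested
}
```

Because the synthesized Equatable conformance compares every stored property, a request from a multi-line field no longer matches a sequence created for a single-line field, even when the prompt prefix is identical.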

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds sumLogprob accumulation and confidence suppression after generation, plus setForceWordContinuation on both the fresh and reuse paths. SamplingFingerprint does not capture singleLine, allowing KV cache to be reused with the wrong single_line sampling config when switching field types. |
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Wires singleLine and forceWordContinuation into LlamaGenerationOptions; the hint-tracker SamplingFingerprint does not include singleLine, which can feed a stale cache hint to the runtime when the user switches between field types. |
| Cotabby/Support/MidWordContinuationPolicy.swift | New pure policy that fires only when both the preceding and trailing characters are letters or digits; correctly avoids triggering at word boundaries and is well-tested. |
| Cotabby/Support/ConfidenceSuppressionPolicy.swift | New pure policy that suppresses completions below an average log-probability floor; disabled by default via -infinity sentinel and correctly tested for boundary conditions. |
| Cotabby/Models/LlamaRuntimeModels.swift | Adds singleLine, forceWordContinuation, and confidenceFloor to LlamaGenerationOptions with safe defaults; no issues. |
| CotabbyTests/MidWordContinuationPolicyTests.swift | Covers the four key boundary cases (mid-word, word end, space before caret, non-word character after caret); clean and focused. |
| CotabbyTests/ConfidenceSuppressionPolicyTests.swift | Tests all four relevant cases including the at-floor boundary (should not suppress) and the disabled floor sentinel; correct and complete. |
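The confidence rule summarized above (suppress below the floor, keep at the floor, never suppress when the floor is -infinity) can be sketched as follows. The names are hypothetical, `Double` is an assumed numeric type, and the empty-completion handling is an assumption rather than something the PR specifies.

```swift
// Hypothetical sketch; not the shipped ConfidenceSuppressionPolicy.
enum ConfidenceSuppressionSketch {
    /// Suppresses only when the average is strictly below the floor. With a
    /// floor of -infinity nothing compares below it, so suppression is disabled.
    static func shouldSuppress(averageLogprob: Double, floor: Double) -> Bool {
        averageLogprob < floor
    }

    /// Averages the summed per-token log-probabilities from the sample loop.
    /// Returns nil for an empty completion (an assumption for this sketch).
    static func averageLogprob(sum: Double, tokenCount: Int) -> Double? {
        guard tokenCount > 0 else { return nil }
        return sum / Double(tokenCount)
    }
}
```

Using a strict comparison keeps the at-floor case unsuppressed and lets the -infinity sentinel act as an "off" switch without a separate flag.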

Sequence Diagram

sequenceDiagram
    participant SE as LlamaSuggestionEngine
    participant RC as LlamaRuntimeCore
    participant ENG as CotabbyInference Engine

    SE->>SE: MidWordContinuationPolicy.shouldForceContinuation(preceding, trailing)
    SE->>RC: generate(prompt, options)
    RC->>RC: SamplingFingerprint(options) — singleLine NOT captured
    RC->>RC: obtainAutocompleteSequence()
    alt KV cache reusable
        RC->>ENG: setForceWordContinuation [only if remaining non-empty]
        RC->>ENG: decodePrompt(seqID, remaining)
    else Fresh sequence
        RC->>ENG: createSequence(SamplingConfig incl. single_line)
        RC->>ENG: setForceWordContinuation(seqID, flag)
        RC->>ENG: decodePrompt(seqID, tokens)
    end
    loop up to maxPredictionTokens
        RC->>ENG: sampleNext(seqID)
        ENG-->>RC: SampleResult with logprob
        RC->>RC: "sumLogprob += logprob"
    end
    RC->>RC: ConfidenceSuppressionPolicy.shouldSuppress(avg, floor)
    RC-->>SE: generatedText or empty string
    SE->>SE: SuggestionTextNormalizer.normalize(raw, request)

Comments Outside Diff (3)

  1. Cotabby/Services/Runtime/LlamaRuntimeCore.swift, lines 509-527

    P1 singleLine missing from LlamaRuntimeCore.SamplingFingerprint

    singleLine is encoded into SamplingConfig at sequence-creation time (createSequence(config)) and cannot be updated on an existing sequence. However, SamplingFingerprint — which gates KV-cache reuse — doesn't capture it. This means that if a user alternates between a multi-line field and a single-line field that share the same prompt prefix, the runtime may reuse a sequence created with the wrong single_line value. Concretely: after visiting a single-line field (sequence created with single_line: true), returning to a multi-line field with the same preceding text causes the cached sequence (with line-break tokens masked) to be reused, silently suppressing newline completions in the multi-line context.

    singleLine should be added alongside the other sampling knobs in this struct.


  2. Cotabby/Services/Runtime/LlamaSuggestionEngine.swift, lines 229-247

    P2 singleLine absent from the hint-tracker SamplingFingerprint

    This fingerprint controls when cachedPrefixBytes is reset to nil before being passed to the runtime. Without singleLine, switching between field types with the same prompt prefix passes a non-nil hint to the runtime even though the SamplingConfig changed. The runtime's own fingerprint check (also currently missing singleLine) then allows cache reuse. Adding singleLine here closes the hint-tracker's side of the gap in concert with fixing the runtime's fingerprint.


  3. Cotabby/Services/Runtime/LlamaRuntimeCore.swift, lines 404-426

    P2 Stale forceWordContinuation seed when remaining.isEmpty

    setForceWordContinuation is guarded inside if !remaining.isEmpty, so it is only called when there are new tokens to decode. When remaining.isEmpty (the full prompt is already in the KV cache), no new decodePrompt is issued, and the seed token — sampled at the end of the previous decodePrompt — is reused unchanged. Because forceWordContinuation depends on trailingText (which is not part of the prompt), a user can reach this state with a different trailing-text boundary: for example, they were mid-word when the first decode happened, then deleted the trailing letters so the caret is now at a word end. In that request remaining.isEmpty is true (same preceding text = same prompt), but the stale seed token may carry the word-continuation constraint from the prior decode, causing the first generated token to be joined to the word even though prediction should start fresh.
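One way to reason about the stale-seed case in the third comment is to treat the previously applied forceWordContinuation value as part of the reuse decision. The sketch below is a hypothetical decision helper, not code from LlamaRuntimeCore. How a "refresh" would actually be carried out (for example, falling back to a fresh sequence) depends on engine APIs that are not shown in this PR.

```swift
// Hypothetical decision helper; names and cases are assumptions for illustration.
enum SeedActionSketch: Equatable {
    case decodeRemaining  // new tokens exist: set the flag, then decodePrompt
    case reuseSeed        // nothing to decode and the constraint is unchanged
    case refreshSeed      // nothing to decode but the constraint changed
}

/// Decides what to do with the cached seed token. When the whole prompt is
/// already cached, the seed is only safe to reuse if it was sampled under the
/// same forceWordContinuation value that the current request asks for.
func seedAction(remainingIsEmpty: Bool,
                previousForceWordContinuation: Bool,
                requestedForceWordContinuation: Bool) -> SeedActionSketch {
    guard remainingIsEmpty else { return .decodeRemaining }
    return previousForceWordContinuation == requestedForceWordContinuation
        ? .reuseSeed
        : .refreshSeed
}
```

In the scenario from the comment (mid-word on the first decode, then at a word end with the same preceding text), this helper returns `.refreshSeed` instead of silently reusing the constrained seed.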


Reviews (1): Last reviewed commit: "Add confidence-based suppression (off by..."

@FuJacob FuJacob force-pushed the feat/output-safety-gates branch from 023d913 to c42c482 on May 31, 2026 19:13
FuJacob added 2 commits May 31, 2026 12:13
Point the CotabbyInference package at the engine branch that adds the token masks,
mid-word continuation, and KV snapshot APIs, and use them:

- The always-on nonprintable token mask now applies automatically (control,
  chat-template, and unused tokens can no longer be emitted as visible text), with
  no app code beyond the pin bump.
- single_line is set from the focused field (LlamaGenerationOptions.singleLine =
  !isMultiLineEnabled) so single-line fields never receive a multi-line completion
  at the source instead of being truncated after the fact.
- forceWordContinuation fires only when the caret sits strictly inside a word
  (MidWordContinuationPolicy), so the engine constrains the first token to continue
  that word without affecting ordinary next-word predictions.

Threads singleLine / forceWordContinuation through LlamaGenerationOptions into
LlamaRuntimeCore (sampling config + setForceWordContinuation before each
decodePrompt, fresh and reuse paths). Adds MidWordContinuationPolicy + tests.

Use the engine's new per-token logprob to drop completions the model itself was
unsure about. LlamaRuntimeCore accumulates the average per-token log-probability and,
when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity,
suppresses completions below the floor. ConfidenceSuppressionPolicy holds the pure
decision and is unit-tested. Disabled by default, so behavior is unchanged until a
caller opts in; wiring a Settings control and full multi-candidate N-best ranking
remain follow-ups.
@FuJacob FuJacob force-pushed the feat/generation-quality-controls branch from 937f8c6 to ca10419 on May 31, 2026 19:19
@FuJacob FuJacob marked this pull request as ready for review June 1, 2026 04:08
@FuJacob FuJacob merged commit 9d9db99 into feat/output-safety-gates Jun 1, 2026