Generation-time quality controls: token masks, single-line, mid-word continuation #488
Merged
FuJacob merged 2 commits on Jun 1, 2026
023d913 to c42c482
Point the CotabbyInference package at the engine branch that adds the token masks, mid-word continuation, and KV snapshot APIs, and use them:

- The always-on nonprintable token mask now applies automatically (control, chat-template, and unused tokens can no longer be emitted as visible text), with no app code beyond the pin bump.
- single_line is set from the focused field (LlamaGenerationOptions.singleLine = !isMultiLineEnabled), so single-line fields never receive a multi-line completion at the source, instead of having one truncated after the fact.
- forceWordContinuation fires only when the caret sits strictly inside a word (MidWordContinuationPolicy), so the engine constrains the first token to continue that word without affecting ordinary next-word predictions.

Threads singleLine / forceWordContinuation through LlamaGenerationOptions into LlamaRuntimeCore (sampling config + setForceWordContinuation before each decodePrompt, fresh and reuse paths). Adds MidWordContinuationPolicy + tests.
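As a rough illustration of the "strictly inside a word" trigger, here is a minimal sketch. The method name shouldForceContinuation(preceding:trailing:) follows the name used in the review's sequence diagram; the type name MidWordContinuationPolicySketch and the definition of a word character (letters and digits) are assumptions, and the real MidWordContinuationPolicy may draw the boundary differently.

```swift
// Hypothetical sketch, not the PR's implementation.
enum MidWordContinuationPolicySketch {
    /// Returns true only when the caret sits strictly inside a word:
    /// the character just before it and the character just after it
    /// are both word characters.
    static func shouldForceContinuation(preceding: String, trailing: String) -> Bool {
        guard let before = preceding.last, let after = trailing.first else {
            // Empty text on either side means the caret is at an edge, not mid-word.
            return false
        }
        return isWordCharacter(before) && isWordCharacter(after)
    }

    /// Assumption: letters and digits form words; anything else is a boundary.
    private static func isWordCharacter(_ character: Character) -> Bool {
        character.isLetter || character.isNumber
    }
}
```

Under this sketch, a caret between "hel" and "lo" triggers continuation, while a caret at the end of "hello" or just before a space does not, which matches the commit's statement that ordinary next-word predictions are unaffected.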
Use the engine's new per-token logprob to drop completions the model itself was unsure about. LlamaRuntimeCore accumulates the average per-token log-probability and, when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity, suppresses completions below the floor. ConfidenceSuppressionPolicy holds the pure decision and is unit-tested. Disabled by default, so behavior is unchanged until a caller opts in; wiring a Settings control and full multi-candidate N-best ranking remain follow-ups.
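The decision described in this commit can be pictured as a small pure function. The sketch below is illustrative only: the name shouldSuppress echoes the call shown in the review's sequence diagram, but the type name, parameter labels, and the averaging helper are assumptions rather than the PR's actual code.

```swift
// Hypothetical sketch, not the PR's implementation.
enum ConfidenceSuppressionPolicySketch {
    /// Averages accumulated per-token log-probabilities.
    /// Returns nil when no tokens were sampled, so callers never divide by zero.
    static func averageLogprob(sum: Double, tokenCount: Int) -> Double? {
        guard tokenCount > 0 else { return nil }
        return sum / Double(tokenCount)
    }

    /// Suppresses a completion whose average log-probability is below the floor.
    /// With a floor of -infinity the comparison is never true, so nothing is suppressed.
    static func shouldSuppress(averageLogprob: Double, floor: Double) -> Bool {
        averageLogprob < floor
    }
}
```

Because the comparison is strict, a floor of -infinity keeps every completion, which is how an off-by-default floor leaves behavior unchanged until a caller raises it.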
937f8c6 to ca10419
Summary
Wires generation-time quality controls into the app so unwanted tokens are stopped at the sampler instead of cleaned up after the fact. Builds on the engine changes in FuJacob/cotabbyinference#8. Second of two stacked PRs, based on the feat/output-safety-gates branch (#485).

- single_line: set from the focused field (LlamaGenerationOptions.singleLine = !request.isMultiLineEnabled). Single-line fields never receive a multi-line completion at the source.
- forceWordContinuation: MidWordContinuationPolicy fires only when the caret sits strictly inside a word, so the engine constrains the first sampled token to continue the current word. At a normal word end it does not fire, so ordinary next-word predictions are unchanged.
- Confidence suppression: LlamaRuntimeCore averages the engine's per-token log-probability and, when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity, suppresses low-confidence completions via the pure ConfidenceSuppressionPolicy. Disabled by default, so no behavior change until a caller opts in.

Threading: singleLine / forceWordContinuation / confidenceFloor flow through LlamaGenerationOptions into LlamaRuntimeCore (sampling config, setForceWordContinuation before each decodePrompt, and the logprob accumulation in the sample loop).

Validation
Package pin resolves to cotabbyinference @ feat/generation-quality-controls (be64365). The engine side was separately validated with swift test against a local gemma-3-1b model: 20 tests, 0 failures (see the engine PR).

Linked issues
Depends on FuJacob/cotabbyinference#8 (engine).
Risk / rollout notes
- project.yml pins the engine feature branch. Merge the engine PR to cotabbyinference main first, then flip this PR's project.yml back to branch: main, re-resolve, and merge. Kept as a draft until then.
- single_line only affects single-line fields. forceWordContinuation uses a narrow trigger. Confidence suppression is off by default.
- Not in this PR: a Settings control for confidenceFloor, full multi-candidate N-best ranking, and the base-model prompt path (covered by the in-flight "Feed instruct models their own chat template; write both prompt paths as prose" #438). These need on-device tuning that green CI cannot stand in for.

Greptile Summary
This PR wires three generation-time quality controls into the app: single_line token masking (derived from the focused field's multi-line capability), forceWordContinuation (applied only when the caret sits strictly inside a word), and an opt-in confidence suppression layer that averages per-token log-probabilities and discards low-confidence completions. All three flow through a new LlamaGenerationOptions extension into LlamaRuntimeCore.

- singleLine is passed as part of SamplingConfig on sequence creation and correctly flips from field metadata; however, LlamaRuntimeCore.SamplingFingerprint does not capture it, so switching between a single-line and multi-line field that share a prompt prefix can silently reuse a sequence built with the wrong single_line value.
- forceWordContinuation is applied via setForceWordContinuation before each decodePrompt on both the fresh and reuse paths; when the reuse path has no remaining tokens to decode (remaining.isEmpty), the call is skipped and a stale seed token from the prior decode may carry the wrong constraint.
- The confidence policy (ConfidenceSuppressionPolicy) and mid-word policy (MidWordContinuationPolicy) are clean pure functions with good test coverage and safe defaults.

Confidence Score: 3/5
Merging now introduces a silent correctness gap: users switching between single-line and multi-line fields that share a prompt prefix can receive completions constrained by the wrong field type, with no visible error or log warning beyond the wrong output.
The singleLine flag is baked into a sequence at creation time but is absent from the fingerprint that gates KV-cache reuse. After the sequence is created correctly for a single-line field, a subsequent request from a multi-line field with the same prompt prefix will silently reuse it, masking newline tokens in a context where they should be allowed. This is an observable behavioral regression on a path users hit naturally by tabbing between fields. The rest of the feature is well-structured and tested.

LlamaRuntimeCore.swift (the SamplingFingerprint struct near the bottom of the file) and LlamaSuggestionEngine.swift (the parallel SamplingFingerprint in the hint tracker) both need singleLine added before this is safe to merge.

Important Files Changed
Sequence Diagram
sequenceDiagram
    participant SE as LlamaSuggestionEngine
    participant RC as LlamaRuntimeCore
    participant ENG as CotabbyInference Engine
    SE->>SE: MidWordContinuationPolicy.shouldForceContinuation(preceding, trailing)
    SE->>RC: generate(prompt, options)
    RC->>RC: SamplingFingerprint(options) — singleLine NOT captured
    RC->>RC: obtainAutocompleteSequence()
    alt KV cache reusable
        RC->>ENG: setForceWordContinuation [only if remaining non-empty]
        RC->>ENG: decodePrompt(seqID, remaining)
    else Fresh sequence
        RC->>ENG: createSequence(SamplingConfig incl. single_line)
        RC->>ENG: setForceWordContinuation(seqID, flag)
        RC->>ENG: decodePrompt(seqID, tokens)
    end
    loop up to maxPredictionTokens
        RC->>ENG: sampleNext(seqID)
        ENG-->>RC: SampleResult with logprob
        RC->>RC: "sumLogprob += logprob"
    end
    RC->>RC: ConfidenceSuppressionPolicy.shouldSuppress(avg, floor)
    RC-->>SE: generatedText or empty string
    SE->>SE: SuggestionTextNormalizer.normalize(raw, request)

Comments Outside Diff (3)
Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 509-527: singleLine missing from LlamaRuntimeCore.SamplingFingerprint

singleLine is encoded into SamplingConfig at sequence-creation time (createSequence(config)) and cannot be updated on an existing sequence. However, SamplingFingerprint, which gates KV-cache reuse, doesn't capture it. This means that if a user alternates between a multi-line field and a single-line field that share the same prompt prefix, the runtime may reuse a sequence created with the wrong single_line value. Concretely: after visiting a single-line field (sequence created with single_line: true), returning to a multi-line field with the same preceding text causes the cached sequence (with line-break tokens masked) to be reused, silently suppressing newline completions in the multi-line context. singleLine should be added alongside the other sampling knobs in this struct.

Cotabby/Services/Runtime/LlamaSuggestionEngine.swift, line 229-247: singleLine absent from the hint-tracker SamplingFingerprint

This fingerprint controls when cachedPrefixBytes is reset to nil before being passed to the runtime. Without singleLine, switching between field types with the same prompt prefix passes a non-nil hint to the runtime even though the SamplingConfig changed. The runtime's own fingerprint check (also currently missing singleLine) then allows cache reuse. Adding singleLine here closes the hint-tracker's side of the gap in concert with fixing the runtime's fingerprint.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 404-426: forceWordContinuation seed when remaining.isEmpty

setForceWordContinuation is guarded inside if !remaining.isEmpty, so it is only called when there are new tokens to decode. When remaining.isEmpty (the full prompt is already in the KV cache), no new decodePrompt is issued, and the seed token, sampled at the end of the previous decodePrompt, is reused unchanged. Because forceWordContinuation depends on trailingText (which is not part of the prompt), a user can reach this state with a different trailing-text boundary: for example, they were mid-word when the first decode happened, then deleted the trailing letters so the caret is now at a word end. In that request remaining.isEmpty is true (same preceding text = same prompt), but the stale seed token may carry the word-continuation constraint from the prior decode, causing the first generated token to be joined to the word even though prediction should start fresh.

Reviews (1): Last reviewed commit: "Add confidence-based suppression (off by..."
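The fingerprint gap raised in the first two comments can be pictured with a small sketch. Everything here is hypothetical apart from the singleLine field itself: temperature and topK stand in for whatever sampling knobs the real SamplingFingerprint tracks, and canReuseSequence is an illustrative helper, not a function from the PR.

```swift
// Hypothetical sketch of a reuse fingerprint that includes singleLine.
struct SamplingFingerprintSketch: Equatable {
    var temperature: Float   // placeholder knob (assumption)
    var topK: Int            // placeholder knob (assumption)
    var singleLine: Bool     // the field the review asks to add
}

/// A cached sequence is reusable only if every setting that was baked in
/// at creation time matches the incoming request.
func canReuseSequence(cached: SamplingFingerprintSketch,
                      requested: SamplingFingerprintSketch) -> Bool {
    cached == requested
}
```

With singleLine part of the synthesized Equatable comparison, a request from a multi-line field no longer matches a sequence created for a single-line field, so the runtime would fall back to creating a fresh sequence instead of reusing one with newline tokens masked.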