Add deterministic constrained decoder behind a default-off flag #503
Introduces a logit-level constrained decoding path for the open-source llama runtime as an alternative to the engine's built-in stochastic sampler. Per step it reads the raw next-token logits, masks structural and control tokens via a per-model token profile, deterministically selects the highest-logit admissible token, and commits it manually with acceptToken. The result is reproducible, leak-free continuations: no chat or control markers can surface as visible completion text, and the same prompt always yields the same suggestion. New pure, unit-tested helpers carry the decision logic: TokenProfile (per-token bytes plus control/EOG/whitespace/newline flags, built once per model from the vocab) and ConstrainedSampler (deterministic argmax with exclusion, optional admissibility, top-k pool bound, and a stable single-step log-prob for confidence). The runtime builds the profile lazily on first use and caches it per model. Routing is gated by the hidden cotabbyConstrainedDecoderEnabled UserDefaults flag (default off), so the shipping sampleNext path is byte-for-byte unchanged until the constrained decoder is validated on device. The generate() lifecycle, KV reuse, cancellation, and the manager's task handling are untouched; only the inner decode loop branches on the flag. Bumps the CotabbyInference pin to main to consume the logits-read, token-accept, and vocab-introspection primitives.
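The per-step selection described above (mask excluded tokens, then take the highest remaining logit) can be sketched as follows. This is a minimal illustration, not the PR's ConstrainedSampler: the function name `selectAdmissibleArgmax` and the `excluded` parameter are hypothetical, and ties are assumed to resolve to the lowest token index so the result stays deterministic.

```swift
/// Hypothetical sketch: returns the index of the highest-logit token that is
/// not in `excluded`, or nil when every token is masked out.
/// Ties resolve to the lowest index, so identical logits always give the same answer.
func selectAdmissibleArgmax(logits: [Float], excluded: Set<Int>) -> Int? {
    var best: Int? = nil
    for index in logits.indices {
        // Masked tokens (for example control markers) can never be chosen.
        if excluded.contains(index) { continue }
        // NaN compares false against everything; skip it explicitly.
        if logits[index].isNaN { continue }
        if let current = best {
            // Strict comparison keeps the earliest index on ties.
            if logits[index] > logits[current] { best = index }
        } else {
            best = index
        }
    }
    return best
}
```

Because there is no random draw anywhere in the loop, the same logits and the same exclusion set always produce the same token, which is the reproducibility property the description relies on.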
    var generatedBytes: [UInt8] = []
    var tokensGenerated = 0
    var sumLogprob = 0.0
    var stopReason = "budget_exhausted"
    var logits = [Float](repeating: 0, count: vocabSize)
Skipped log-prob steps silently inflate the confidence score
When ConstrainedSampler.logProb returns nil, tokensGenerated is still incremented but sumLogprob is not. The subsequent shouldSuppress call divides sumLogprob by tokensGenerated, so any skipped step makes the average log-probability closer to zero (artificially higher confidence), risking a low-quality completion passing the floor check. In practice logProb cannot be nil here (the token index always comes from selectToken which bounds candidates to 0..<logits.count), but the defensive if let means a future refactor that breaks that invariant would silently miscalibrate suppression rather than surfacing a crash. Consider logging a fault or asserting non-nil, or using a known-safe fallback value instead of silently skipping the accumulation.
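One way to keep the average honest is to count only the steps that were actually scored and to record skipped steps separately, so the caller can log a fault. The sketch below assumes a log-softmax definition of the single-step log-prob; `stableLogProb` and `ConfidenceAccumulator` are hypothetical names, not the PR's API.

```swift
import Foundation

/// Hypothetical sketch: numerically stable log-softmax probability of `token`.
/// Returns nil when the index is out of range or `logits` is empty.
func stableLogProb(of token: Int, in logits: [Float]) -> Double? {
    guard logits.indices.contains(token), let maxLogit = logits.max() else { return nil }
    // Subtract the maximum before exponentiating so exp() cannot overflow.
    let sumExp = logits.reduce(0.0) { $0 + exp(Double($1 - maxLogit)) }
    return Double(logits[token] - maxLogit) - log(sumExp)
}

/// Hypothetical sketch: accumulates log-probs and tracks skipped steps explicitly.
struct ConfidenceAccumulator {
    private(set) var sumLogprob = 0.0
    private(set) var scoredSteps = 0
    private(set) var skippedSteps = 0

    mutating func record(_ logProb: Double?) {
        guard let value = logProb else {
            // Surface the broken invariant instead of silently dropping the step.
            skippedSteps += 1
            return
        }
        sumLogprob += value
        scoredSteps += 1
    }

    /// Average over scored steps only, so a skipped step cannot pull it toward zero.
    var averageLogprob: Double? {
        scoredSteps > 0 ? sumLogprob / Double(scoredSteps) : nil
    }
}
```

With steps scored at -1.0 and -3.0 plus one skipped step, this average is -2.0, whereas dividing by all three steps would report about -1.33 and look more confident than the evidence supports.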
    let profile = try autocompleteTokenProfile()
    let vocabSize = profile.vocabSize
    guard vocabSize > 0 else {
        throw LlamaRuntimeError.generationFailed("Vocabulary unavailable for constrained decoding.")
    }
    // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
    // zero/negative request still yields a full-vocab argmax rather than an empty pool.
    let topK = options.topK > 0 ? options.topK : vocabSize
Unreachable guard: autocompleteTokenProfile() already guarantees a positive vocab size
autocompleteTokenProfile() throws LlamaRuntimeError.generationFailed when engine.getVocabSize() is zero and never stores a zero-entry profile in the cache, so profile.vocabSize can never be zero when the call returns without throwing. The guard here is dead code and the duplicated error message adds maintenance surface.
    let profile = try autocompleteTokenProfile()
    let vocabSize = profile.vocabSize
    // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
    // zero/negative request still yields a full-vocab argmax rather than an empty pool.
    let topK = options.topK > 0 ? options.topK : vocabSize
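The invariant this comment relies on (the profile accessor either throws or returns a cached, non-empty profile) could look roughly like the sketch below. All names here (`ProfileCache`, `TokenProfileSketch`, `ProfileError`, the `modelID` key) are hypothetical stand-ins; the real runtime's types and cache key are not shown in this PR excerpt.

```swift
/// Hypothetical error type standing in for the runtime's own error.
enum ProfileError: Error { case vocabularyUnavailable }

/// Hypothetical stand-in for the per-model token profile.
struct TokenProfileSketch { let vocabSize: Int }

/// Hypothetical sketch: builds the profile lazily and caches it per model.
final class ProfileCache {
    private var cached: (modelID: String, profile: TokenProfileSketch)?
    private(set) var buildCount = 0

    func profile(forModel modelID: String, vocabSize: () -> Int) throws -> TokenProfileSketch {
        // Reuse the cached profile while the model is unchanged.
        if let hit = cached, hit.modelID == modelID { return hit.profile }
        let size = vocabSize()
        // Throw before caching, so a zero-entry profile is never stored.
        guard size > 0 else { throw ProfileError.vocabularyUnavailable }
        let profile = TokenProfileSketch(vocabSize: size)
        cached = (modelID: modelID, profile: profile)
        buildCount += 1
        return profile
    }

    /// Drop the cache, for example on shutdown.
    func invalidate() { cached = nil }
}
```

Under this shape, any caller that gets a profile back can assume `vocabSize > 0`, which is why a second guard at the call site would never fire.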
Summary
Adds a deterministic, logit-level constrained decoding path for the open-source llama runtime, gated behind a hidden default-off flag. Instead of the engine's stochastic sampler, it reads the raw next-token logits, masks structural / control tokens via a per-model token profile, argmax-selects the highest admissible token, and commits it manually with acceptToken. This produces reproducible, leak-free completions: no chat or control markers can surface as visible text, and the same prompt always yields the same suggestion. Default behavior is unchanged.
Validation
ConstrainedSamplerTests (selection, control exclusion, admissibility, top-k pool bounds, tie-breaking, determinism, single-step + averaged log-prob) and TokenProfileTests.
xcodebuild test (macOS) job (local Team-ID signing prevents the xctest host from loading here).
Linked issues
None.
Risk / rollout notes
Active only when the hidden cotabbyConstrainedDecoderEnabled UserDefaults flag (no UI) is true. The shipping sampleNext path is byte-for-byte unchanged; generate() only branches its inner decode loop, and the flag is excluded from the KV-reuse fingerprint.
project.yml + the generated pbxproj move the CotabbyInference pin from the feat/generation-quality-controls branch to main to consume the constrained-generation primitives (logits read, token accept, vocab introspection). Package.resolved is gitignored; CI resolves branch: main to latest.
Greptile Summary
Adds a deterministic, logit-level constrained decode path to the llama runtime, gated behind the cotabbyConstrainedDecoderEnabled UserDefaults flag (default off). The existing sampleNext path is byte-for-byte unchanged; the new path reads raw logits each step, masks control/structural tokens via a lazily-built, per-model TokenProfile, and commits the highest-admissible-logit token with acceptToken.
LlamaRuntimeCore splits generate() into runEngineSampledDecode and runConstrainedDecode, sharing a single shouldSuppress helper; TokenProfile is built once on the first constrained request and cached until the model changes or shutdown() is called.
ConstrainedSampler is a pure, RNG-free utility (argmax + optional admissibility set + top-K bounding) covered by a comprehensive unit-test suite; the CotabbyInference dependency is re-pinned from a feature branch to main to consume the new logit/accept primitives.
Confidence Score: 4/5
Safe to merge as-is; the constrained decoder is default-off and the shipping sampleNext path is unchanged.
The default-off flag means no user-visible behaviour changes on merge. The new constrained path has a minor confidence-score miscalibration risk when logProb is nil, a redundant vocabSize guard, a bare-CR gap in the single-line stop check, and a per-step O(N log N) full-vocabulary sort. None of these affect the shipping path or block enabling the feature, but they are worth addressing before the flag is promoted to default-on.
LlamaRuntimeCore.swift (constrained decode loop and token-profile caching) and TokenProfile.swift (isNewline CR handling) deserve a second look before the flag is turned on by default.
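For the isNewline CR handling flagged here, a byte-level check that treats both LF and CR as line breaks might look like the sketch below. The free-function form and its signature are assumptions for illustration; the real TokenProfile API is not shown in this excerpt.

```swift
/// Hypothetical sketch: true when the token's bytes contain LF (0x0A) or CR (0x0D),
/// so a bare carriage return also ends single-line decoding.
func isNewline(_ bytes: [UInt8]) -> Bool {
    bytes.contains(0x0A) || bytes.contains(0x0D)
}
```

A token consisting only of `\r` then stops single-line decoding immediately, rather than being appended and waiting for a later LF-carrying token.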
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.suggest] -->|reads UserDefaults| B{cotabbyConstrainedDecoderEnabled?}
    B -->|false - default| C[LlamaRuntimeCore.generate]
    B -->|true| C
    C --> D[Prepare KV sequence / prompt tokens]
    D --> E{options.useConstrainedDecoder?}
    E -->|false| F[runEngineSampledDecode]
    F --> G[engine.sampleNext loop]
    G --> H[extractPiece / accumulate text]
    H --> I{EOS or budget?}
    I -->|no| G
    I -->|yes| J[shouldSuppress check]
    E -->|true| K[runConstrainedDecode]
    K --> L[autocompleteTokenProfile - lazy build + cache]
    L --> M[engine.getNextTokenLogits]
    M --> N[ConstrainedSampler.selectToken - argmax with exclusions]
    N --> O{EOS / singleLine / nil?}
    O -->|stop| P[shouldSuppress check]
    O -->|continue| Q[engine.acceptToken]
    Q --> R[accumulate bytes + logProb]
    R --> M
    J --> S[return generated text or empty]
    P --> S
Comments Outside Diff (2)
Cotabby/Support/TokenProfile.swift, line 704-724
\r token won't stop single-line decoding
isNewline only tests for LF (0x0A). A token whose bytes are solely [0x0D] (carriage return), or that ends with \r without a following \n, would pass the single-line stop check undetected, so a stray carriage return could be appended to the completion before the next \n-carrying token eventually terminates the loop. isASCIIWhitespace already includes 0x0D, so the byte-level knowledge is present; isNewline could simply union the CR check: bytes.contains(0x0A) || bytes.contains(0x0D).
Cotabby/Support/ConstrainedSampler.swift, line 559-571
When limit < count, the entire range 0..<count is sorted descending by logit before taking the prefix(limit). For a 32 K–128 K vocabulary with typical topK values of 40–100, that is ~480 K–2.2 M comparisons per decode step just for the candidate pool. A min-heap of size limit (or a partial sort) would reduce that to O(N log K), roughly 5–6× less work at K=40. On device the per-step wall time is dominated by getNextTokenLogits and acceptToken, so this is not a blocker, but the sort overhead compounds across all constrained-decode steps and is worth revisiting before the flag is turned on by default.
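The bounded selection suggested in this comment can be sketched with a min-heap that never holds more than `limit` candidates. This is an illustrative sketch only: `topKIndices` is a hypothetical name, logits are assumed to be finite (no NaN), and ties are assumed to prefer the lower token index.

```swift
/// Hypothetical sketch: indices of the `limit` highest logits, best first,
/// found in O(N log K) with a size-bounded min-heap instead of a full sort.
func topKIndices(logits: [Float], limit: Int) -> [Int] {
    guard limit > 0, !logits.isEmpty else { return [] }
    // `worse(a, b)`: candidate a ranks below b (lower logit, or same logit and higher index).
    func worse(_ a: Int, _ b: Int) -> Bool {
        logits[a] != logits[b] ? logits[a] < logits[b] : a > b
    }
    var heap: [Int] = []  // min-heap of token indices; heap[0] is the worst kept candidate
    heap.reserveCapacity(min(limit, logits.count))

    func siftUp(_ start: Int) {
        var child = start
        while child > 0 {
            let parent = (child - 1) / 2
            guard worse(heap[child], heap[parent]) else { break }
            heap.swapAt(child, parent)
            child = parent
        }
    }
    func siftDown(_ start: Int) {
        var parent = start
        while true {
            let left = 2 * parent + 1
            let right = left + 1
            var smallest = parent
            if left < heap.count, worse(heap[left], heap[smallest]) { smallest = left }
            if right < heap.count, worse(heap[right], heap[smallest]) { smallest = right }
            if smallest == parent { break }
            heap.swapAt(parent, smallest)
            parent = smallest
        }
    }

    for index in logits.indices {
        if heap.count < limit {
            heap.append(index)
            siftUp(heap.count - 1)
        } else if worse(heap[0], index) {
            // The new candidate beats the worst kept one: replace the root.
            heap[0] = index
            siftDown(0)
        }
    }
    // Only the K survivors are sorted, best first.
    return heap.sorted { worse($1, $0) }
}
```

Most vocabulary entries fail the single comparison against the heap root and are discarded without touching the heap, which is where the savings over a full descending sort come from.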