Add deterministic constrained decoder behind a default-off flag by FuJacob · Pull Request #503 · FuJacob/cotabby

FuJacob · 2026-06-01T15:08:03Z

Summary

Adds a deterministic, logit-level constrained decoding path for the open-source llama runtime, gated behind a hidden default-off flag. Instead of the engine's stochastic sampler, it reads the raw next-token logits, masks structural / control tokens via a per-model token profile, argmax-selects the highest admissible token, and commits it manually with acceptToken. This produces reproducible, leak-free completions — no chat or control markers can surface as visible text, and the same prompt always yields the same suggestion. Default behavior is unchanged.
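The selection step described above can be sketched as follows. This is a minimal illustration, not the PR's ConstrainedSampler: the function name, the `excluded` set standing in for the token profile's control mask, and the lowest-index tie-break are all assumptions made for the example.

```swift
/// Hypothetical stand-in for the selection step: return the index of the
/// highest-logit token that is not excluded, or nil if nothing is admissible.
/// Assumption: ties resolve to the lowest token index so the result is deterministic.
func selectAdmissibleArgmax(logits: [Float], excluded: Set<Int>) -> Int? {
    var best: Int? = nil
    for (index, logit) in logits.enumerated() where !excluded.contains(index) && !logit.isNaN {
        // Keep the current best unless this logit is strictly greater.
        if let current = best, logits[current] >= logit { continue }
        best = index
    }
    return best
}
```

Because no random number generator is involved, calling this twice with the same logits and the same exclusion set returns the same index, which is the property the description relies on.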

Validation

rm -rf build/DerivedData
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData
# ** TEST BUILD SUCCEEDED **  (app + test targets, clean)

swiftlint lint --quiet
# 0 violations

New pure logic is unit-tested: ConstrainedSamplerTests (selection, control exclusion, admissibility, top-k pool bounds, tie-breaking, determinism, single-step + averaged log-prob) and TokenProfileTests.
Not validated end-to-end: decode quality can only be judged with a real model in a real text field, which cannot run headlessly. That on-device check is exactly what flipping the flag is for; the path stays default-off until then.
App-hosted unit tests execute on CI's xcodebuild test (macOS) job (local Team-ID signing prevents the xctest host from loading here).

Linked issues

None.

Risk / rollout notes

Default-off. Generation is routed through the constrained decoder only when the cotabbyConstrainedDecoderEnabled UserDefaults flag (no UI) is true. The shipping sampleNext path is byte-for-byte unchanged; generate() only branches its inner decode loop, and the flag is excluded from the KV-reuse fingerprint.
Dependency pin. project.yml + the generated pbxproj move the CotabbyInference pin from the feat/generation-quality-controls branch to main to consume the constrained-generation primitives (logits read, token accept, vocab introspection). Package.resolved is gitignored; CI resolves branch: main to latest.
Performance. On the first constrained request per model, a token profile is built once (one detokenize per vocab token) and cached, then freed on model unload. No effect when the flag is off.
Follow-up. Prefix-admissibility and best-of-N / multi-branch decode are intentionally out of scope here (they need multi-sequence engine support) and will land separately; this PR is the single-sequence deterministic path.
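For reference, a default-off UserDefaults flag of the kind described under "Default-off" can be read as in the sketch below. Only the key name comes from this PR; the helper function and its parameter are hypothetical, and the real plumbing lives in LlamaSuggestionEngine.

```swift
import Foundation

// Key name taken from the PR description; everything else here is illustrative.
let constrainedDecoderKey = "cotabbyConstrainedDecoderEnabled"

/// Hypothetical helper: reports whether the constrained decoder is switched on.
func isConstrainedDecoderEnabled(in defaults: UserDefaults = .standard) -> Bool {
    // bool(forKey:) returns false when the key has never been set,
    // which is what makes the flag default-off without any registration step.
    defaults.bool(forKey: constrainedDecoderKey)
}
```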

Greptile Summary

Adds a deterministic, logit-level constrained decode path to the llama runtime, gated behind the cotabbyConstrainedDecoderEnabled UserDefaults flag (default off). The existing sampleNext path is byte-for-byte unchanged; the new path reads raw logits each step, masks control/structural tokens via a lazily-built, per-model TokenProfile, and commits the highest-admissible-logit token with acceptToken.

LlamaRuntimeCore splits generate() into runEngineSampledDecode and runConstrainedDecode, sharing a single shouldSuppress helper; TokenProfile is built once on the first constrained request and cached until the model changes or shutdown() is called.
ConstrainedSampler is a pure, RNG-free utility (argmax + optional admissibility set + top-K bounding) covered by a comprehensive unit-test suite; CotabbyInference dependency is re-pinned from a feature branch to main to consume the new logit/accept primitives.

Confidence Score: 4/5

Safe to merge as-is; the constrained decoder is default-off and the shipping sampleNext path is unchanged.

The default-off flag means no user-visible behaviour changes on merge. The new constrained path has a minor confidence-score miscalibration risk when logProb is nil, a redundant vocabSize guard, a bare-CR gap in the single-line stop check, and a per-step O(N log N) full-vocabulary sort. None of these affect the shipping path or block enabling the feature, but they are worth addressing before the flag is promoted to default-on.

LlamaRuntimeCore.swift (constrained decode loop and token-profile caching) and TokenProfile.swift (isNewline CR handling) deserve a second look before the flag is turned on by default.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Core change: splits generate() into engine-sampled and constrained decode paths, extracts shouldSuppress as a shared static, adds lazy token-profile caching cleared on shutdown. Minor: sumLogprob/tokensGenerated mismatch under nil logProb and redundant vocabSize guard.
Cotabby/Support/ConstrainedSampler.swift	New pure-logic file: deterministic argmax selection with control-token exclusion, admissibility filtering, and top-K bounding. Well-tested; per-step full-vocab sort is O(N log N) rather than optimal O(N log K).
Cotabby/Support/TokenProfile.swift	New pure-logic file: per-token metadata table (bytes, control, EOG, whitespace, newline flags). Defensive out-of-range handling is solid; isNewline only checks LF (0x0A), missing bare CR stop for single-line mode.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Reads cotabbyConstrainedDecoderEnabled UserDefaults key and plumbs it into LlamaGenerationOptions; change is minimal and default-off.
project.yml	Moves CotabbyInference pin from feat/generation-quality-controls to branch: main; deliberate per PR description but a floating branch reference means CI resolves to latest HEAD on every clean build.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.suggest] -->|reads UserDefaults| B{cotabbyConstrainedDecoderEnabled?}
    B -->|false - default| C[LlamaRuntimeCore.generate]
    B -->|true| C

    C --> D[Prepare KV sequence / prompt tokens]
    D --> E{options.useConstrainedDecoder?}

    E -->|false| F[runEngineSampledDecode]
    F --> G[engine.sampleNext loop]
    G --> H[extractPiece / accumulate text]
    H --> I{EOS or budget?}
    I -->|no| G
    I -->|yes| J[shouldSuppress check]

    E -->|true| K[runConstrainedDecode]
    K --> L[autocompleteTokenProfile - lazy build + cache]
    L --> M[engine.getNextTokenLogits]
    M --> N[ConstrainedSampler.selectToken - argmax with exclusions]
    N --> O{EOS / singleLine / nil?}
    O -->|stop| P[shouldSuppress check]
    O -->|continue| Q[engine.acceptToken]
    Q --> R[accumulate bytes + logProb]
    R --> M

    J --> S[return generated text or empty]
    P --> S

Comments Outside Diff (2)

Cotabby/Support/TokenProfile.swift, line 704-724 (link)

Bare \r token won't stop single-line decoding

isNewline only tests for LF (0x0A). A token whose bytes are solely [0x0D] (carriage return), or that ends with \r without a following \n, would pass the single-line stop check undetected, so a stray carriage return can be appended to the completion before the next \n-carrying token eventually terminates the loop. isASCIIWhitespace already includes 0x0D, so the byte-level knowledge is present; isNewline could simply union the CR check: bytes.contains(0x0A) || bytes.contains(0x0D).
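A minimal sketch of the suggested check, written as a free function over raw token bytes. The function name is hypothetical; the actual isNewline flag in TokenProfile.swift may be computed differently.

```swift
/// Hypothetical helper mirroring the suggestion: a token counts as a line
/// break if its bytes contain LF (0x0A) or CR (0x0D).
func containsLineBreak(_ bytes: [UInt8]) -> Bool {
    bytes.contains(0x0A) || bytes.contains(0x0D)
}
```

With this definition a bare `[0x0D]` token and a `\r\n` token both stop single-line decoding, while ordinary text bytes do not.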

Cotabby/Support/ConstrainedSampler.swift, line 559-571 (link)

Per-step full-vocabulary sort is O(N log N) when O(N log K) would suffice

When limit < count, the entire range 0..<count is sorted descending by logit before taking the prefix(limit). For a 32 K–128 K vocabulary with typical topK values of 40–100, that is ~480 K–2.2 M comparisons per decode step just for the candidate pool. A min-heap of size limit (or Array.partialSort) would reduce that to O(N log K) — roughly 5–6× less work at K=40. On device the per-step wall time is dominated by getNextTokenLogits and acceptToken, so this is not a blocker, but the sort overhead compounds across all constrained-decode steps and is worth revisiting before the flag is turned on by default.
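One way to get the O(N log K) bound is a size-bounded min-heap, sketched below. This is an illustration under assumptions, not the code in ConstrainedSampler.swift: the function name and the tie-break rule (lower index wins) are made up for the example.

```swift
/// Hypothetical top-K selector: returns the indices of the `k` largest logits,
/// highest first. It keeps a min-heap of at most `k` candidates, so the scan
/// costs O(N log K) instead of the O(N log N) of a full sort.
func topKIndices(logits: [Float], k: Int) -> [Int] {
    guard k > 0, !logits.isEmpty else { return [] }

    // `a` ranks below `b` when its logit is smaller; on equal logits the higher
    // index ranks lower, which keeps the final order deterministic.
    func ranksBelow(_ a: Int, _ b: Int) -> Bool {
        logits[a] != logits[b] ? logits[a] < logits[b] : a > b
    }

    var heap: [Int] = []  // min-heap: heap[0] is the weakest candidate kept so far

    func siftDown(from start: Int) {
        var parent = start
        while true {
            let left = 2 * parent + 1
            let right = left + 1
            var smallest = parent
            if left < heap.count, ranksBelow(heap[left], heap[smallest]) { smallest = left }
            if right < heap.count, ranksBelow(heap[right], heap[smallest]) { smallest = right }
            if smallest == parent { return }
            heap.swapAt(parent, smallest)
            parent = smallest
        }
    }

    for index in logits.indices {
        if heap.count < k {
            // Fill phase: append the candidate and sift it up.
            heap.append(index)
            var child = heap.count - 1
            while child > 0 {
                let parent = (child - 1) / 2
                if !ranksBelow(heap[child], heap[parent]) { break }
                heap.swapAt(child, parent)
                child = parent
            }
        } else if ranksBelow(heap[0], index) {
            // The new candidate beats the weakest kept one: replace the root.
            heap[0] = index
            siftDown(from: 0)
        }
    }
    // Ordering the K survivors costs O(K log K), small next to the scan.
    return heap.sorted { ranksBelow($1, $0) }
}
```

When `k` is at least the vocabulary size, the function degenerates to a full ranking, so a caller that clamps a non-positive topK to the vocabulary size still gets a full-vocab ordering.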

Greptile also left 2 inline comments on this PR.

Introduces a logit-level constrained decoding path for the open-source llama runtime as an alternative to the engine's built-in stochastic sampler. Per step it reads the raw next-token logits, masks structural and control tokens via a per-model token profile, deterministically selects the highest-logit admissible token, and commits it manually with acceptToken. The result is reproducible, leak-free continuations: no chat or control markers can surface as visible completion text, and the same prompt always yields the same suggestion.

New pure, unit-tested helpers carry the decision logic: TokenProfile (per-token bytes plus control/EOG/whitespace/newline flags, built once per model from the vocab) and ConstrainedSampler (deterministic argmax with exclusion, optional admissibility, top-k pool bound, and a stable single-step log-prob for confidence). The runtime builds the profile lazily on first use and caches it per model.

Routing is gated by the hidden cotabbyConstrainedDecoderEnabled UserDefaults flag (default off), so the shipping sampleNext path is byte-for-byte unchanged until the constrained decoder is validated on device. The generate() lifecycle, KV reuse, cancellation, and the manager's task handling are untouched; only the inner decode loop branches on the flag.

Bumps the CotabbyInference pin to main to consume the logits-read, token-accept, and vocab-introspection primitives.
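The "stable single-step log-prob" mentioned above is not spelled out in this message. One common reading is a numerically stable log-softmax that subtracts the maximum logit before exponentiating; the sketch below assumes that reading, and the function name and signature are hypothetical rather than the ConstrainedSampler API.

```swift
import Foundation

/// Hypothetical log-softmax for a single token:
/// logProb(t) = (logit[t] - max) - log(sum over i of exp(logit[i] - max)).
/// Subtracting the maximum keeps exp() from overflowing on large logits.
/// Returns nil when the token index is outside the logits array.
func stableLogProb(of token: Int, in logits: [Float]) -> Double? {
    guard logits.indices.contains(token), let maxLogit = logits.max() else { return nil }
    let shifted = logits.map { Double($0 - maxLogit) }
    let logSumExp = log(shifted.reduce(0.0) { $0 + exp($1) })
    return shifted[token] - logSumExp
}
```

For two equal logits the result is log(0.5) regardless of their magnitude, which is the behaviour a naive exp-then-normalise implementation loses once the logits grow large.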

greptile-apps · 2026-06-01T15:14:29Z

+
+        var generatedBytes: [UInt8] = []
+        var tokensGenerated = 0
+        var sumLogprob = 0.0
+        var stopReason = "budget_exhausted"
+        var logits = [Float](repeating: 0, count: vocabSize)


Logprob gap silently inflates the confidence score

When ConstrainedSampler.logProb returns nil, tokensGenerated is still incremented but sumLogprob is not. The subsequent shouldSuppress call divides sumLogprob by tokensGenerated, so any skipped step makes the average log-probability closer to zero (artificially higher confidence), risking a low-quality completion passing the floor check. In practice logProb cannot be nil here (the token index always comes from selectToken which bounds candidates to 0..<logits.count), but the defensive if let means a future refactor that breaks that invariant would silently miscalibrate suppression rather than surfacing a crash. Consider logging a fault or asserting non-nil, or using a known-safe fallback value instead of silently skipping the accumulation.
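The effect is easy to see with made-up numbers. The sketch below is an illustration only: the function and its `countUnscoredSteps` switch are hypothetical and do not reproduce the PR's decode loop or shouldSuppress.

```swift
/// Hypothetical illustration of the averaging issue. `steps` holds one optional
/// log-prob per generated token; nil marks a step where no log-prob was recorded.
/// With countUnscoredSteps == true the denominator counts every token, which is
/// the behaviour described in the comment above.
func averageLogProb(_ steps: [Double?], countUnscoredSteps: Bool) -> Double? {
    var sum = 0.0
    var scored = 0
    for case let value? in steps {
        sum += value
        scored += 1
    }
    let denominator = countUnscoredSteps ? steps.count : scored
    return denominator > 0 ? sum / Double(denominator) : nil
}
```

For steps of -2.0, nil, -2.0, dividing by all three tokens gives about -1.33, while dividing by the two scored tokens gives -2.0. The first figure looks more confident even though no additional evidence supports it.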

greptile-apps · 2026-06-01T15:14:30Z

+        let profile = try autocompleteTokenProfile()
+        let vocabSize = profile.vocabSize
+        guard vocabSize > 0 else {
+            throw LlamaRuntimeError.generationFailed("Vocabulary unavailable for constrained decoding.")
+        }
+        // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
+        // zero/negative request still yields a full-vocab argmax rather than an empty pool.
+        let topK = options.topK > 0 ? options.topK : vocabSize


Unreachable guard: autocompleteTokenProfile() already guarantees a positive vocab size

autocompleteTokenProfile() throws LlamaRuntimeError.generationFailed when engine.getVocabSize() is zero and never stores a zero-entry profile in the cache, so profile.vocabSize can never be zero when the call returns without throwing. The guard here is dead code and the duplicated error message adds maintenance surface.

Suggested change

         let profile = try autocompleteTokenProfile()
         let vocabSize = profile.vocabSize
-        guard vocabSize > 0 else {
-            throw LlamaRuntimeError.generationFailed("Vocabulary unavailable for constrained decoding.")
-        }
         // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
         // zero/negative request still yields a full-vocab argmax rather than an empty pool.
         let topK = options.topK > 0 ? options.topK : vocabSize

FuJacob merged commit 4e9a10a into main Jun 1, 2026
4 checks passed

greptile-apps Bot reviewed Jun 1, 2026

FuJacob mentioned this pull request Jun 1, 2026

Add no-repeat-ngram repetition guard to the constrained decoder #504

Merged

FuJacob merged 1 commit into main from experimental/constrained-engine
