
Add deterministic constrained decoder behind a default-off flag #503

Merged
FuJacob merged 1 commit into main from experimental/constrained-engine on Jun 1, 2026
@FuJacob FuJacob commented Jun 1, 2026

Summary

Adds a deterministic, logit-level constrained decoding path for the open-source llama runtime, gated behind a hidden default-off flag. Instead of the engine's stochastic sampler, it reads the raw next-token logits, masks structural / control tokens via a per-model token profile, argmax-selects the highest admissible token, and commits it manually with acceptToken. This produces reproducible, leak-free completions — no chat or control markers can surface as visible text, and the same prompt always yields the same suggestion. Default behavior is unchanged.
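The per-step selection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's ConstrainedSampler: the `ControlMask` type, the `selectAdmissibleArgmax` function, and the lowest-index tie-breaking rule are all assumptions made for the example.

```swift
/// Hypothetical stand-in for the per-model token profile; only the control flags matter here.
struct ControlMask {
    /// isControl[i] is true when token i is a structural/control token.
    let isControl: [Bool]
}

/// Returns the index of the highest-logit token that is not masked, or nil when
/// every token is masked. Ties resolve to the lowest index (an assumed rule), and
/// no random numbers are involved, so identical logits always give the same result.
func selectAdmissibleArgmax(logits: [Float], mask: ControlMask) -> Int? {
    var best: Int? = nil
    for (index, logit) in logits.enumerated() {
        // Tokens outside the mask's range are treated as inadmissible (defensive assumption).
        guard index < mask.isControl.count, !mask.isControl[index] else { continue }
        // Skip NaN logits so comparisons stay well defined.
        if logit.isNaN { continue }
        if let current = best {
            // Strict ">" keeps the earlier index on ties.
            if logit > logits[current] { best = index }
        } else {
            best = index
        }
    }
    return best
}

let logits: [Float] = [0.1, 4.0, 2.5, 4.0]
let mask = ControlMask(isControl: [false, true, false, false])
// Token 1 has the top logit but is masked, so the admissible maximum is token 3.
let choice = selectAdmissibleArgmax(logits: logits, mask: mask)
```

Because a control token can never be returned, it can never be committed, which is the property the summary calls leak-free.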

Validation

rm -rf build/DerivedData
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData
# ** TEST BUILD SUCCEEDED **  (app + test targets, clean)

swiftlint lint --quiet
# 0 violations
  • New pure logic is unit-tested: ConstrainedSamplerTests (selection, control exclusion, admissibility, top-k pool bounds, tie-breaking, determinism, single-step + averaged log-prob) and TokenProfileTests.
  • Not validated end-to-end: decode quality can only be judged with a real model in a real text field, which cannot run headlessly. That on-device check is exactly what flipping the flag is for; the path stays default-off until then.
  • App-hosted unit tests execute on CI's xcodebuild test (macOS) job (local Team-ID signing prevents the xctest host from loading here).

Linked issues

None.

Risk / rollout notes

  • Default-off. Generation is routed through the constrained decoder only when the cotabbyConstrainedDecoderEnabled UserDefaults flag (no UI) is true. The shipping sampleNext path is byte-for-byte unchanged; generate() only branches its inner decode loop, and the flag is excluded from the KV-reuse fingerprint.
  • Dependency pin. project.yml + the generated pbxproj move the CotabbyInference pin from the feat/generation-quality-controls branch to main to consume the constrained-generation primitives (logits read, token accept, vocab introspection). Package.resolved is gitignored; CI resolves branch: main to latest.
  • Performance. On the first constrained request per model, a token profile is built once (one detokenize per vocab token) and cached, then freed on model unload. No effect when the flag is off.
  • Follow-up. Prefix-admissibility and best-of-N / multi-branch decode are intentionally out of scope here (they need multi-sequence engine support) and will land separately; this PR is the single-sequence deterministic path.
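As a rough illustration of the default-off gating described in the first bullet, the sketch below reads the flag from a UserDefaults instance and picks a decode path. Only the key name comes from this PR; `isConstrainedDecoderEnabled`, `DecodePath`, and `decodePath` are hypothetical names, and the real plumbing goes through LlamaGenerationOptions.

```swift
import Foundation

/// Key name taken from the PR description.
let constrainedDecoderKey = "cotabbyConstrainedDecoderEnabled"

/// Hypothetical helper. `bool(forKey:)` returns false when the key is absent,
/// so an unset flag means the constrained decoder stays off.
func isConstrainedDecoderEnabled(in defaults: UserDefaults = .standard) -> Bool {
    defaults.bool(forKey: constrainedDecoderKey)
}

/// Hypothetical enum naming the two inner decode loops.
enum DecodePath {
    case engineSampled
    case constrained
}

/// Chooses the decode path from the flag; everything else in generation is shared.
func decodePath(in defaults: UserDefaults = .standard) -> DecodePath {
    isConstrainedDecoderEnabled(in: defaults) ? .constrained : .engineSampled
}
```

Passing the UserDefaults instance as a parameter is a choice made for this sketch so the gate can be exercised against a throwaway suite rather than the app's real defaults.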

Greptile Summary

Adds a deterministic, logit-level constrained decode path to the llama runtime, gated behind the cotabbyConstrainedDecoderEnabled UserDefaults flag (default off). The existing sampleNext path is byte-for-byte unchanged; the new path reads raw logits each step, masks control/structural tokens via a lazily-built, per-model TokenProfile, and commits the highest-admissible-logit token with acceptToken.

  • LlamaRuntimeCore splits generate() into runEngineSampledDecode and runConstrainedDecode, sharing a single shouldSuppress helper; TokenProfile is built once on the first constrained request and cached until the model changes or shutdown() is called.
  • ConstrainedSampler is a pure, RNG-free utility (argmax + optional admissibility set + top-K bounding) covered by a comprehensive unit-test suite; CotabbyInference dependency is re-pinned from a feature branch to main to consume the new logit/accept primitives.
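The build-once, cache-per-model behaviour summarised above might look roughly like this. The types `TokenInfo`, `SketchTokenProfile`, and `ProfileCache`, the string model identifier, and the closure parameters standing in for engine calls are all assumptions for illustration; the actual TokenProfile and its cache in LlamaRuntimeCore may be shaped differently.

```swift
/// Hypothetical, simplified per-token metadata.
struct TokenInfo {
    let bytes: [UInt8]
    let isControl: Bool
    /// True when the token is non-empty and every byte is ASCII whitespace.
    let isWhitespace: Bool
}

struct SketchTokenProfile {
    let tokens: [TokenInfo]
    var vocabSize: Int { tokens.count }
}

/// Caches one profile for the currently loaded model.
final class ProfileCache {
    private var cachedModelID: String?
    private var cachedProfile: SketchTokenProfile?
    /// Counts real builds so the caching behaviour is observable.
    private(set) var buildCount = 0

    /// `isControl` and `detokenize` stand in for engine vocab-introspection calls.
    func profile(modelID: String,
                 vocabSize: Int,
                 isControl: (Int) -> Bool,
                 detokenize: (Int) -> [UInt8]) -> SketchTokenProfile {
        if cachedModelID == modelID, let cached = cachedProfile {
            return cached
        }
        let asciiWhitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D]
        let tokens = (0..<vocabSize).map { id -> TokenInfo in
            let bytes = detokenize(id)  // one detokenize call per vocab token
            let allWhitespace = !bytes.isEmpty && bytes.allSatisfy { asciiWhitespace.contains($0) }
            return TokenInfo(bytes: bytes, isControl: isControl(id), isWhitespace: allWhitespace)
        }
        let built = SketchTokenProfile(tokens: tokens)
        cachedModelID = modelID
        cachedProfile = built
        buildCount += 1
        return built
    }

    /// Drops the cached profile, for example on model unload or shutdown.
    func invalidate() {
        cachedModelID = nil
        cachedProfile = nil
    }
}
```

The first constrained request pays for one pass over the vocabulary; later requests for the same model reuse the cached table until `invalidate()` is called.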

Confidence Score: 4/5

Safe to merge as-is; the constrained decoder is default-off and the shipping sampleNext path is unchanged.

The default-off flag means no user-visible behaviour changes on merge. The new constrained path has a minor confidence-score miscalibration risk when logProb is nil, a redundant vocabSize guard, a bare-CR gap in the single-line stop check, and a per-step O(N log N) full-vocabulary sort. None of these affect the shipping path or block enabling the feature, but they are worth addressing before the flag is promoted to default-on.

LlamaRuntimeCore.swift (constrained decode loop and token-profile caching) and TokenProfile.swift (isNewline CR handling) deserve a second look before the flag is turned on by default.
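To make the CR handling concern concrete, here is a minimal sketch of a newline check that treats both LF and CR as line breaks. The function name `isLineBreakToken` is hypothetical and the check is shown stand-alone; in the PR the flag lives on TokenProfile as isNewline.

```swift
/// Hypothetical stand-alone version of the newline flag computed from a token's bytes.
func isLineBreakToken(_ bytes: [UInt8]) -> Bool {
    // 0x0A is LF and 0x0D is CR. Accepting either also covers CRLF and a bare CR.
    bytes.contains(0x0A) || bytes.contains(0x0D)
}

let lfToken = isLineBreakToken(Array("a\n".utf8))   // contains LF
let bareCR = isLineBreakToken([0x0D])               // CR with no LF
let plain = isLineBreakToken(Array("abc".utf8))     // no line break bytes
```

With an LF-only check the second case would return false, which is how a token made up of a lone carriage return could get past a single-line stop condition.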

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Core change: splits generate() into engine-sampled and constrained decode paths, extracts shouldSuppress as a shared static, adds lazy token-profile caching cleared on shutdown. Minor: sumLogprob/tokensGenerated mismatch under nil logProb and redundant vocabSize guard. |
| Cotabby/Support/ConstrainedSampler.swift | New pure-logic file: deterministic argmax selection with control-token exclusion, admissibility filtering, and top-K bounding. Well-tested; per-step full-vocab sort is O(N log N) rather than optimal O(N log K). |
| Cotabby/Support/TokenProfile.swift | New pure-logic file: per-token metadata table (bytes, control, EOG, whitespace, newline flags). Defensive out-of-range handling is solid; isNewline only checks LF (0x0A), missing bare CR stop for single-line mode. |
| Cotabby/Services/Runtime/LlamaSuggestionEngine.swift | Reads cotabbyConstrainedDecoderEnabled UserDefaults key and plumbs it into LlamaGenerationOptions; change is minimal and default-off. |
| project.yml | Moves CotabbyInference pin from feat/generation-quality-controls to branch: main; deliberate per PR description but a floating branch reference means CI resolves to latest HEAD on every clean build. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.suggest] -->|reads UserDefaults| B{cotabbyConstrainedDecoderEnabled?}
    B -->|false - default| C[LlamaRuntimeCore.generate]
    B -->|true| C

    C --> D[Prepare KV sequence / prompt tokens]
    D --> E{options.useConstrainedDecoder?}

    E -->|false| F[runEngineSampledDecode]
    F --> G[engine.sampleNext loop]
    G --> H[extractPiece / accumulate text]
    H --> I{EOS or budget?}
    I -->|no| G
    I -->|yes| J[shouldSuppress check]

    E -->|true| K[runConstrainedDecode]
    K --> L[autocompleteTokenProfile - lazy build + cache]
    L --> M[engine.getNextTokenLogits]
    M --> N[ConstrainedSampler.selectToken - argmax with exclusions]
    N --> O{EOS / singleLine / nil?}
    O -->|stop| P[shouldSuppress check]
    O -->|continue| Q[engine.acceptToken]
    Q --> R[accumulate bytes + logProb]
    R --> M

    J --> S[return generated text or empty]
    P --> S

Comments Outside Diff (2)

  1. Cotabby/Support/TokenProfile.swift, line 704-724 (link)

    P2 Bare \r token won't stop single-line decoding

    isNewline only tests for LF (0x0A). A token whose bytes are solely [0x0D] (carriage return) or that ends with \r without a following \n would pass the single-line stop check undetected, allowing a partial carriage return to be appended to the completion before the next \n-carrying token eventually terminates the loop. isASCIIWhitespace already includes 0x0D, so the byte-level knowledge is present; isNewline could simply union the CR check: bytes.contains(0x0A) || bytes.contains(0x0D).


  2. Cotabby/Support/ConstrainedSampler.swift, line 559-571 (link)

    P2 Per-step full-vocabulary sort is O(N log N) when O(N log K) would suffice

    When limit < count, the entire range 0..<count is sorted descending by logit before taking the prefix(limit). For a 32 K–128 K vocabulary with typical topK values of 40–100, that is ~480 K–2.2 M comparisons per decode step just for the candidate pool. A min-heap of size limit (or Array.partialSort) would reduce that to O(N log K) — roughly 5–6× less work at K=40. On device the per-step wall time is dominated by getNextTokenLogits and acceptToken, so this is not a blocker, but the sort overhead compounds across all constrained-decode steps and is worth revisiting before the flag is turned on by default.
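One way to bound the candidate pool without sorting the whole vocabulary is a size-K min-heap, sketched below. This is an illustrative, dependency-free version rather than the PR's code: `topKIndices` is a hypothetical name, the heap is hand-rolled, logits are assumed to be finite, and ties are assumed to favour the lower token index.

```swift
/// Returns the indices of the `k` largest logits, ordered by descending logit
/// (ties broken by lower index), using a size-k min-heap instead of a full sort.
func topKIndices(logits: [Float], k: Int) -> [Int] {
    let limit = min(max(k, 0), logits.count)
    guard limit > 0 else { return [] }

    // True when candidate `a` is a worse pick than candidate `b`.
    func ranksLower(_ a: Int, _ b: Int) -> Bool {
        logits[a] != logits[b] ? logits[a] < logits[b] : a > b
    }

    var heap: [Int] = []  // min-heap: heap[0] is the worst candidate kept so far
    heap.reserveCapacity(limit)

    func siftUp(_ start: Int) {
        var child = start
        while child > 0 {
            let parent = (child - 1) / 2
            guard ranksLower(heap[child], heap[parent]) else { break }
            heap.swapAt(child, parent)
            child = parent
        }
    }

    func siftDown(_ start: Int) {
        var parent = start
        while true {
            let left = 2 * parent + 1
            let right = left + 1
            var smallest = parent
            if left < heap.count, ranksLower(heap[left], heap[smallest]) { smallest = left }
            if right < heap.count, ranksLower(heap[right], heap[smallest]) { smallest = right }
            if smallest == parent { return }
            heap.swapAt(parent, smallest)
            parent = smallest
        }
    }

    for index in logits.indices {
        if heap.count < limit {
            heap.append(index)
            siftUp(heap.count - 1)
        } else if ranksLower(heap[0], index) {
            heap[0] = index  // evict the current worst candidate
            siftDown(0)
        }
    }
    // Only the k survivors are sorted, which costs O(k log k).
    return heap.sorted { ranksLower($1, $0) }
}

let pool = topKIndices(logits: [0.5, 3.0, -1.0, 3.0, 2.0], k: 3)
```

Most tokens fail the single comparison against `heap[0]` and never touch the heap, so the per-step cost is dominated by one pass over the logits.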

Reviews (1): Last reviewed commit: "Add deterministic constrained decoder be..."

Greptile also left 2 inline comments on this PR.

Introduces a logit-level constrained decoding path for the open-source
llama runtime as an alternative to the engine's built-in stochastic
sampler. Per step it reads the raw next-token logits, masks structural
and control tokens via a per-model token profile, deterministically
selects the highest-logit admissible token, and commits it manually with
acceptToken. The result is reproducible, leak-free continuations: no
chat or control markers can surface as visible completion text, and the
same prompt always yields the same suggestion.

New pure, unit-tested helpers carry the decision logic: TokenProfile
(per-token bytes plus control/EOG/whitespace/newline flags, built once
per model from the vocab) and ConstrainedSampler (deterministic argmax
with exclusion, optional admissibility, top-k pool bound, and a stable
single-step log-prob for confidence). The runtime builds the profile
lazily on first use and caches it per model.

Routing is gated by the hidden cotabbyConstrainedDecoderEnabled
UserDefaults flag (default off), so the shipping sampleNext path is
byte-for-byte unchanged until the constrained decoder is validated on
device. The generate() lifecycle, KV reuse, cancellation, and the
manager's task handling are untouched; only the inner decode loop
branches on the flag.

Bumps the CotabbyInference pin to main to consume the logits-read,
token-accept, and vocab-introspection primitives.
@FuJacob FuJacob merged commit 4e9a10a into main Jun 1, 2026
4 checks passed
Comment on lines +276 to +281

var generatedBytes: [UInt8] = []
var tokensGenerated = 0
var sumLogprob = 0.0
var stopReason = "budget_exhausted"
var logits = [Float](repeating: 0, count: vocabSize)

P2 Logprob gap silently inflates the confidence score

When ConstrainedSampler.logProb returns nil, tokensGenerated is still incremented but sumLogprob is not. The subsequent shouldSuppress call divides sumLogprob by tokensGenerated, so any skipped step makes the average log-probability closer to zero (artificially higher confidence), risking a low-quality completion passing the floor check. In practice logProb cannot be nil here (the token index always comes from selectToken which bounds candidates to 0..<logits.count), but the defensive if let means a future refactor that breaks that invariant would silently miscalibrate suppression rather than surfacing a crash. Consider logging a fault or asserting non-nil, or using a known-safe fallback value instead of silently skipping the accumulation.
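A hedged sketch of one way to keep the average honest: compute the single-step log-probability with a max-shifted log-softmax, and divide by the number of steps that actually contributed rather than by the number of tokens generated. `stableLogProb` and `LogProbAccumulator` are hypothetical names for this example, not the PR's ConstrainedSampler.logProb or the variables in runConstrainedDecode.

```swift
import Foundation

/// Numerically stable log-probability of `token` under softmax(logits).
/// Returns nil when the token index is out of range or the logits are empty.
func stableLogProb(logits: [Float], token: Int) -> Double? {
    guard logits.indices.contains(token), let maxLogit = logits.max() else { return nil }
    // Subtracting the max before exponentiating keeps exp() from overflowing.
    let sumExp = logits.reduce(0.0) { $0 + exp(Double($1 - maxLogit)) }
    return Double(logits[token] - maxLogit) - log(sumExp)
}

/// Tracks the running sum together with the number of steps that were actually scored.
struct LogProbAccumulator {
    private(set) var sum = 0.0
    private(set) var scoredSteps = 0

    mutating func record(_ logProb: Double?) {
        guard let value = logProb else {
            // Surfaces a broken invariant in debug builds instead of silently skipping it.
            assertionFailure("logProb was nil for a selected token")
            return
        }
        sum += value
        scoredSteps += 1
    }

    /// The divisor matches the number of accumulated terms, so a skipped step
    /// cannot pull the mean toward zero.
    var mean: Double? {
        scoredSteps > 0 ? sum / Double(scoredSteps) : nil
    }
}

var accumulator = LogProbAccumulator()
accumulator.record(stableLogProb(logits: [1, 1, 1, 1], token: 0))  // uniform over 4 tokens
accumulator.record(stableLogProb(logits: [1, 1, 1, 1], token: 3))
```

Whether to assert, log a fault, or substitute a pessimistic fallback value is a design choice left open by the comment above; the sketch only shows the assert-and-skip variant with a matching divisor.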


Comment on lines +268 to +275
let profile = try autocompleteTokenProfile()
let vocabSize = profile.vocabSize
guard vocabSize > 0 else {
    throw LlamaRuntimeError.generationFailed("Vocabulary unavailable for constrained decoding.")
}
// `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
// zero/negative request still yields a full-vocab argmax rather than an empty pool.
let topK = options.topK > 0 ? options.topK : vocabSize

P2 Unreachable guard: autocompleteTokenProfile() already guarantees a positive vocab size

autocompleteTokenProfile() throws LlamaRuntimeError.generationFailed when engine.getVocabSize() is zero and never stores a zero-entry profile in the cache, so profile.vocabSize can never be zero when the call returns without throwing. The guard here is dead code and the duplicated error message adds maintenance surface.

Suggested change
- let profile = try autocompleteTokenProfile()
- let vocabSize = profile.vocabSize
- guard vocabSize > 0 else {
-     throw LlamaRuntimeError.generationFailed("Vocabulary unavailable for constrained decoding.")
- }
- // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
- // zero/negative request still yields a full-vocab argmax rather than an empty pool.
- let topK = options.topK > 0 ? options.topK : vocabSize
+ let profile = try autocompleteTokenProfile()
+ let vocabSize = profile.vocabSize
+ // `topK` bounds the candidate pool the selector ranks; clamp to a sane positive value so a
+ // zero/negative request still yields a full-vocab argmax rather than an empty pool.
+ let topK = options.topK > 0 ? options.topK : vocabSize

