Add no-repeat-ngram repetition guard to the constrained decoder#504
Conversation
The constrained decoder selects each token by raw-logit argmax, which has no inherent resistance to repetition: a base model can loop on a phrase or a single token, and the engine's repetition penalty lives in its sampler and never reaches this raw-logit path. This adds a pure no-repeat-ngram guard (RepetitionGuard) that blocks any token which would close a 3-gram already present in the output, and threads a per-step block-list through ConstrainedSampler.selectToken (default empty, so the existing callers and tests are unaffected). The guard reasons over token ids, so it is independent of detokenization and exhaustively unit-tested. Only the constrained decode path is affected, and that path is still gated by the default-off cotabbyConstrainedDecoderEnabled flag.
| import Foundation | ||
|
|
||
| /// File overview: |
There was a problem hiding this comment.
import Foundation is not used in this file — all types (Array, Set, Int) are from the Swift standard library. Removing the import makes the file's dependencies clearer and avoids a small unnecessary overhead.
| import Foundation | |
| /// File overview: | |
| /// File overview: |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| ) else { | ||
| stopReason = "no_admissible_token" | ||
| break |
There was a problem hiding this comment.
Ambiguous
stopReason when repetition guard exhausts all candidates
"no_admissible_token" is already emitted when the byte-prefix constraint returns an empty admissible set; now the same string is logged when the repetition guard blocks every surviving candidate. A post-hoc log search won't distinguish between the two cases. Consider a distinct value such as "repetition_guard_exhausted" so decode diagnostics can tell apart a structural constraint failure from a repetition block.
Summary
The constrained decoder (added in #503) selects each token by raw-logit argmax, which has no inherent resistance to repetition — a base model can loop on a phrase ("I think that I think that …") or a single token. The engine's
repetition_penaltylives in its own sampler and never reaches this raw-logit path. This adds a pure no-repeat-ngram guard that blocks any token which would close a 3-gram already present in the output.Validation
RepetitionGuardTests(ngram-size floor, short history, repeated-prefix follower blocking, single-token runs, multiple followers, bigram order, non-pending prefixes) and two newConstrainedSamplerTestscases for the block-list path.Linked issues
Refs #503.
Risk / rollout notes
RepetitionGuardis new and pure.ConstrainedSampler.selectTokengains ablockedTokenIDsparameter that defaults to empty, so every existing caller and test is unchanged. OnlyrunConstrainedDecodepasses a non-empty set.cotabbyConstrainedDecoderEnabledis set; the shipping sampler path is untouched.Greptile Summary
This PR adds a pure no-repeat-ngram repetition guard to the constrained decoder path introduced in #503. Because that path selects tokens by raw-logit argmax and bypasses the engine's sampler (which carries its own
repetition_penalty), base models can fall into phrase loops; the guard closes that gap by blocking any token that would close an n-gram already present in the output.RepetitionGuardis a new stateless enum that, given a token-id history and an n-gram order, returns the set of token IDs that must be skipped this step. The O(n × prefixLength) scan is correct, bounds-safe, and covered by eight focused unit tests.ConstrainedSampler.selectTokengains ablockedTokenIDsparameter defaulting to empty, so all existing callers are unaffected; the new block-list check is correctly ordered before the admissible-set check.runConstrainedDecodenow builds ablockedTokenIDsset each step and passes it toselectToken;generatedTokenIDsis tracked in parallel withgeneratedBytesfor this purpose. The feature remains default-off behindcotabbyConstrainedDecoderEnabled.Confidence Score: 4/5
Safe to merge — the repetition guard is pure and default-off; the only issues are a superfluous import and a log string that conflates two distinct stop conditions.
The algorithm is correct and bounds-safe. The
generatedTokenIDstracking is placed consistently alongside the existinggeneratedBytesappend. The newblockedTokenIDsparameter defaults to empty, leaving all existing call sites untouched. The two minor findings are the unusedFoundationimport inRepetitionGuard.swiftand the"no_admissible_token"stop reason being reused for a new stop condition, which could make future log-based diagnosis harder but does not affect runtime behavior.No files require special attention;
LlamaRuntimeCore.swiftis the only place worth a second look if theno_admissible_tokenlog ambiguity is ever actioned.Important Files Changed
import Foundationis unused.blockedTokenIDsparameter with a default of empty set, preserving all existing call sites. Check order (excluded → blocked → admissible) is correct and consistent.Sequence Diagram
sequenceDiagram participant D as runConstrainedDecode participant E as LlamaEngine participant RG as RepetitionGuard participant CS as ConstrainedSampler loop each token step D->>E: getNextTokenLogits(sequenceID) E-->>D: logits[vocabSize] D->>RG: blockedTokens(history: generatedTokenIDs, ngramSize: 3) RG-->>D: "blockedTokenIDs (Set<Int>)" D->>CS: selectToken(logits, profile, admissibleTokenIDs: nil, topK, blockedTokenIDs) CS-->>D: tokenID? (nil stops decode) D->>D: append bytes + tokenID to history D->>E: acceptToken(sequenceID, tokenID) E-->>D: .ok / error endReviews (1): Last reviewed commit: "Add no-repeat-ngram repetition guard to ..." | Re-trigger Greptile