feat: content-aware deduplication pre-pass in gradient transform #60
Merged
Conversation
Add deduplicateToolOutputs() as a pre-pass before gradient layer selection. Detects repeated tool outputs (same file read multiple times, identical command results) and replaces earlier occurrences with compact annotations, keeping only the latest. Two dedup levels: exact content hash match and same-file-path reads with different content (edit between reads). This reduces token pressure before layer selection, potentially keeping sessions at lower (less lossy) gradient layers. A 500-line file read appearing 3 times costs ~15K tokens; after dedup: ~5.1K tokens. Inspired by Dirac's ContextManager file-read deduplication approach.
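The two dedup levels described above can be sketched in TypeScript. Everything here is illustrative: the message/part shape, the annotation wording, and the hash function are assumptions; only the function names (`deduplicateToolOutputs`, `dedupAnnotation`, `simpleHash`) come from the PR's file list.

```typescript
// Sketch only: this part shape and annotation text are assumed, not the PR's actual types.
type ToolPart = {
  toolName: string;
  input?: { filePath?: string };
  state: { status: "completed" | "pending"; output: string };
};

// djb2-style rolling hash; the real simpleHash() may differ.
function simpleHash(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) h = ((h << 5) + h + s.charCodeAt(i)) | 0;
  return h;
}

// Compact replacement text for a superseded output (wording assumed).
function dedupAnnotation(kind: "exact" | "stale-read", keptIdx: number): string {
  return `[dedup:${kind}; superseded by occurrence #${keptIdx}]`;
}

function deduplicateToolOutputs(parts: ToolPart[]): ToolPart[] {
  // First pass: record the LAST index seen for each content hash and file path.
  const lastByHash = new Map<number, number>();
  const lastByPath = new Map<string, number>();
  parts.forEach((p, i) => {
    if (p.state.status !== "completed") return;
    lastByHash.set(simpleHash(p.state.output), i);
    if (p.input?.filePath) lastByPath.set(p.input.filePath, i);
  });
  // Second pass: annotate every earlier occurrence. Only state.output is
  // touched, so tool_use/tool_result pairing stays intact.
  return parts.map((p, i) => {
    if (p.state.status !== "completed") return p;
    const byHash = lastByHash.get(simpleHash(p.state.output))!;
    if (byHash > i) {
      // Level 1: byte-identical output repeated later.
      return { ...p, state: { ...p.state, output: dedupAnnotation("exact", byHash) } };
    }
    const byPath = p.input?.filePath ? lastByPath.get(p.input.filePath)! : i;
    if (byPath > i) {
      // Level 2: same file re-read later with different content (edit between reads).
      return { ...p, state: { ...p.state, output: dedupAnnotation("stale-read", byPath) } };
    }
    return p;
  });
}
```

The two-pass shape keeps the function O(n): one scan to find the latest occurrence per key, one scan to rewrite earlier occurrences in place.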
Summary
Motivation
Inspired by Dirac's ContextManager approach: in long coding sessions, the same file is often read 2-5 times (explore → edit → verify). Each read stores the full content as tokens. A 500-line file appearing 3 times costs ~15K tokens; after dedup: ~5.1K. Those saved tokens can be the difference between layer 1 (clean window eviction, prompt caching preserved) and layer 2 (tool stripping, cache busted).
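As a sanity check on the arithmetic above (the ~10 tokens per line and ~50 tokens per annotation are rough assumed constants, not measurements from the codebase):

```typescript
// Assumed cost model: ~10 tokens per line, ~50 tokens per dedup annotation.
const TOKENS_PER_LINE = 10;
const ANNOTATION_TOKENS = 50;

// Cost of storing every read of an N-line file in full.
function naiveCost(lines: number, reads: number): number {
  return lines * TOKENS_PER_LINE * reads;
}

// After dedup: the latest read stays intact; earlier reads
// collapse to compact annotations.
function dedupedCost(lines: number, reads: number): number {
  return lines * TOKENS_PER_LINE + (reads - 1) * ANNOTATION_TOKENS;
}

console.log(naiveCost(500, 3));   // 15000 (the ~15K figure)
console.log(dedupedCost(500, 3)); // 5100 (the ~5.1K figure)
```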
Design
- deduplicateToolOutputs(messages, currentTurnIdx) scans all completed tool parts, groups them by content hash and by file path, keeps the latest occurrence, and replaces earlier ones with dedupAnnotation().
- Only state.output is replaced (preserves tool_use/tool_result pairing).
- Annotations follow the existing toolStripAnnotation() pattern.

Files Changed
- src/gradient.ts: deduplicateToolOutputs(), dedupAnnotation(), simpleHash(), extractFilePath(), plus integration into transformInner()
- test/gradient.test.ts

Testing
- Known vectorSearch test isolation issue (from feat: multi-provider embeddings, distillation vector search, and cross-project recall #58; not introduced here)
- npx tsc --noEmit clean
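For the second dedup level, same-path grouping needs a file path pulled out of heterogeneous tool inputs. A minimal sketch of what extractFilePath() might look like; the key names checked here are assumptions, not the PR's actual tool schema:

```typescript
// Hypothetical tool-call shape; real inputs depend on each tool's schema.
type ToolCall = { toolName: string; input: Record<string, unknown> };

function extractFilePath(call: ToolCall): string | undefined {
  // Different tools name their path argument differently (assumed candidates).
  const candidates = ["filePath", "file_path", "path"];
  for (const key of candidates) {
    const v = call.input[key];
    if (typeof v === "string" && v.length > 0) return v;
  }
  // Non-file tools (e.g. shell commands) fall back to hash-only dedup.
  return undefined;
}
```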