
feat: content-aware deduplication pre-pass in gradient transform (#60)

Merged — BYK merged 1 commit into main from feat/gradient-content-dedup, Apr 9, 2026

Conversation

BYK (Owner) commented Apr 9, 2026

Summary

  • Content-aware deduplication in the gradient transform pipeline: detects repeated tool outputs (same file read multiple times, identical command results) and replaces earlier occurrences with compact annotations, keeping only the latest.
  • Two dedup levels: exact content hash (identical outputs) and same-file-path reads (different content from edits between reads).
  • Runs as a pre-pass before layer selection (between layer 0 and layer 1), reducing token pressure so sessions can stay at lower, less lossy gradient layers.

Motivation

Inspired by Dirac's ContextManager approach: in long coding sessions, the same file is often read 2-5 times (explore → edit → verify). Each read stores the full content as tokens. A 500-line file appearing 3 times costs ~15K tokens; after dedup: ~5.1K. Those saved tokens can be the difference between layer 1 (clean window eviction, prompt caching preserved) and layer 2 (tool stripping, cache busted).

Design

  • deduplicateToolOutputs(messages, currentTurnIdx) scans all completed tool parts, groups by content hash and file path, keeps the latest occurrence, replaces earlier ones with dedupAnnotation().
  • Current turn is sacred — never touched.
  • Tool parts are never removed — only state.output is replaced (preserves tool_use/tool_result pairing).
  • Small outputs skipped — outputs below 600 chars aren't deduplicated (annotation would cost more than the original).
  • Zero-cost no-op — returns original array reference when no duplicates exist.
  • Follows existing toolStripAnnotation() pattern.
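The rules above can be sketched in miniature. This is not the real implementation from src/gradient.ts — the `ToolPart` shape, the annotation text, and the hash are simplified stand-ins — but it shows how the grouping, latest-wins selection, and zero-cost no-op fit together:

```typescript
// Simplified stand-in for the real tool-part shape in src/gradient.ts.
interface ToolPart {
  output: string;      // completed tool output text
  turnIdx: number;     // which conversation turn produced it
  filePath?: string;   // extracted file path, when the tool read a file
}

const MIN_DEDUP_CHARS = 600; // below this, an annotation costs more than it saves

// djb2-style string hash, standing in for simpleHash().
function simpleHash(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  return h;
}

// Stand-in for dedupAnnotation(): compact marker pointing at the later copy.
function dedupAnnotation(part: ToolPart): string {
  return `[output elided: superseded by a later ${part.filePath ?? "tool"} result]`;
}

function deduplicateToolOutputs(parts: ToolPart[], currentTurnIdx: number): ToolPart[] {
  // Group candidates by file path (level 2) or exact content hash (level 1).
  const groups = new Map<string, number[]>();
  parts.forEach((p, i) => {
    if (p.turnIdx === currentTurnIdx) return;      // current turn is sacred
    if (p.output.length < MIN_DEDUP_CHARS) return; // small outputs skipped
    const key = p.filePath ?? `hash:${simpleHash(p.output)}`;
    const bucket = groups.get(key);
    if (bucket) bucket.push(i);
    else groups.set(key, [i]);
  });

  let changed = false;
  const result = parts.slice();
  for (const indices of groups.values()) {
    // Keep only the latest occurrence; annotate all earlier ones.
    // Only output is replaced, so tool_use/tool_result pairing survives.
    for (const i of indices.slice(0, -1)) {
      result[i] = { ...result[i], output: dedupAnnotation(result[i]) };
      changed = true;
    }
  }
  // Zero-cost no-op: return the original array reference when nothing changed.
  return changed ? result : parts;
}
```

In this sketch a file-path key subsumes the exact-hash case for file reads; the real pre-pass tracks the two dedup levels separately.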

Files Changed

| File | Change |
| --- | --- |
| `src/gradient.ts` | `deduplicateToolOutputs()`, `dedupAnnotation()`, `simpleHash()`, `extractFilePath()` + integration into `transformInner()` |
| `test/gradient.test.ts` | 7 new tests: exact-match, same-file, current-turn protection, small-output skip, no-change passthrough, bash dedup, triple-read |

BYK enabled auto-merge (squash) April 9, 2026 18:40
BYK merged commit 2ec106f into main Apr 9, 2026
1 check passed
BYK deleted the feat/gradient-content-dedup branch April 9, 2026 18:40