feat: content-aware deduplication pre-pass in gradient transform #60
Merged
Conversation
Add deduplicateToolOutputs() as a pre-pass before gradient layer selection. Detects repeated tool outputs (same file read multiple times, identical command results) and replaces earlier occurrences with compact annotations, keeping only the latest. Two dedup levels: exact content hash match and same-file-path reads with different content (edit between reads). This reduces token pressure before layer selection, potentially keeping sessions at lower (less lossy) gradient layers. A 500-line file read appearing 3 times costs ~15K tokens; after dedup: ~5.1K tokens. Inspired by Dirac's ContextManager file-read deduplication approach.
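The two dedup levels described above can be sketched in TypeScript. Everything here is illustrative: the message/part shape, the annotation wording, and the hash function are assumptions; only the function names (`deduplicateToolOutputs`, `dedupAnnotation`, `simpleHash`) come from the PR's file list.

```typescript
// Sketch only: this part shape and annotation text are assumed, not the PR's actual types.
type ToolPart = {
  toolName: string;
  input?: { filePath?: string };
  state: { status: "completed" | "pending"; output: string };
};

// djb2-style rolling hash; the real simpleHash() may differ.
function simpleHash(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) h = ((h << 5) + h + s.charCodeAt(i)) | 0;
  return h;
}

// Compact replacement text for a superseded output (wording assumed).
function dedupAnnotation(kind: "exact" | "stale-read", keptIdx: number): string {
  return `[dedup:${kind}; superseded by occurrence #${keptIdx}]`;
}

function deduplicateToolOutputs(parts: ToolPart[]): ToolPart[] {
  // First pass: record the LAST index seen for each content hash and file path.
  const lastByHash = new Map<number, number>();
  const lastByPath = new Map<string, number>();
  parts.forEach((p, i) => {
    if (p.state.status !== "completed") return;
    lastByHash.set(simpleHash(p.state.output), i);
    if (p.input?.filePath) lastByPath.set(p.input.filePath, i);
  });
  // Second pass: annotate every earlier occurrence. Only state.output is
  // touched, so tool_use/tool_result pairing stays intact.
  return parts.map((p, i) => {
    if (p.state.status !== "completed") return p;
    const byHash = lastByHash.get(simpleHash(p.state.output))!;
    if (byHash > i) {
      // Level 1: byte-identical output repeated later.
      return { ...p, state: { ...p.state, output: dedupAnnotation("exact", byHash) } };
    }
    const byPath = p.input?.filePath ? lastByPath.get(p.input.filePath)! : i;
    if (byPath > i) {
      // Level 2: same file re-read later with different content (edit between reads).
      return { ...p, state: { ...p.state, output: dedupAnnotation("stale-read", byPath) } };
    }
    return p;
  });
}
```

The two-pass shape keeps the function O(n): one scan to find the latest occurrence per key, one scan to rewrite earlier occurrences in place.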
Summary
Motivation
Inspired by Dirac's ContextManager approach: in long coding sessions, the same file is often read 2-5 times (explore → edit → verify). Each read stores the full content as tokens. A 500-line file appearing 3 times costs ~15K tokens; after dedup: ~5.1K. Those saved tokens can be the difference between layer 1 (clean window eviction, prompt caching preserved) and layer 2 (tool stripping, cache busted).
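As a sanity check on the arithmetic above (the ~10 tokens per line and ~50 tokens per annotation are rough assumed constants, not measurements from the codebase):

```typescript
// Assumed cost model: ~10 tokens per line, ~50 tokens per dedup annotation.
const TOKENS_PER_LINE = 10;
const ANNOTATION_TOKENS = 50;

// Cost of storing every read of an N-line file in full.
function naiveCost(lines: number, reads: number): number {
  return lines * TOKENS_PER_LINE * reads;
}

// After dedup: the latest read stays intact; earlier reads
// collapse to compact annotations.
function dedupedCost(lines: number, reads: number): number {
  return lines * TOKENS_PER_LINE + (reads - 1) * ANNOTATION_TOKENS;
}

console.log(naiveCost(500, 3));   // 15000 (the ~15K figure)
console.log(dedupedCost(500, 3)); // 5100 (the ~5.1K figure)
```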
Design
- deduplicateToolOutputs(messages, currentTurnIdx) scans all completed tool parts, groups them by content hash and by file path, keeps the latest occurrence, and replaces earlier ones with dedupAnnotation().
- Only state.output is replaced (preserves tool_use/tool_result pairing).
- Annotations follow the existing toolStripAnnotation() pattern.

Files Changed
- src/gradient.ts: deduplicateToolOutputs(), dedupAnnotation(), simpleHash(), extractFilePath(), plus integration into transformInner()
- test/gradient.test.ts

Testing
- Known vectorSearch test isolation issue (from feat: multi-provider embeddings, distillation vector search, and cross-project recall #58; not introduced here)
- npx tsc --noEmit clean
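For the second dedup level, same-path grouping needs a file path pulled out of heterogeneous tool inputs. A minimal sketch of what extractFilePath() might look like; the key names checked here are assumptions, not the PR's actual tool schema:

```typescript
// Hypothetical tool-call shape; real inputs depend on each tool's schema.
type ToolCall = { toolName: string; input: Record<string, unknown> };

function extractFilePath(call: ToolCall): string | undefined {
  // Different tools name their path argument differently (assumed candidates).
  const candidates = ["filePath", "file_path", "path"];
  for (const key of candidates) {
    const v = call.input[key];
    if (typeof v === "string" && v.length > 0) return v;
  }
  // Non-file tools (e.g. shell commands) fall back to hash-only dedup.
  return undefined;
}
```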