
feat: multi-provider embeddings, distillation vector search, and cross-project recall #58

Merged
BYK merged 1 commit into main from feat/embedding-providers-and-cross-project on Apr 9, 2026
Conversation

@BYK BYK (Owner) commented Apr 9, 2026

Summary

  • Embedding provider abstraction: Refactored embedding.ts from hardcoded Voyage AI to an EmbeddingProvider interface with Voyage and OpenAI implementations. Config gets a provider field ("voyage" | "openai"), each reading its own env var. Fully backward-compatible — existing configs default to "voyage".

  • Distillation vector search: Schema migration 9 adds embedding BLOB to distillations table. Distillations are embedded fire-and-forget on store (both gen-0 and meta-distillation). Brute-force cosine similarity search feeds into the recall tool's RRF alongside FTS results, improving semantic recall over session history.

  • Cross-project knowledge discovery: The recall tool now searches knowledge entries from other projects when scope is "all". Results are tagged with the source project name (e.g., [knowledge/Architecture from: other-project]) and naturally rank lower via RRF since they're a separate list. This surfaces relevant knowledge you've captured in project A when working in project B.
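The provider abstraction described above could look roughly like the following sketch. The interface, class, and factory names here are illustrative assumptions, not the actual contents of `src/embedding.ts`:

```typescript
// Hypothetical sketch of the EmbeddingProvider abstraction; the real
// interface in src/embedding.ts may differ in shape and naming.
interface EmbeddingProvider {
  // Embed a batch of texts into float vectors.
  embed(texts: string[]): Promise<number[][]>;
}

type ProviderName = "voyage" | "openai";

// Each provider reads its own API key from the environment.
class VoyageProvider implements EmbeddingProvider {
  readonly apiKey = process.env.VOYAGE_API_KEY ?? "";
  async embed(texts: string[]): Promise<number[][]> {
    // ... POST to the Voyage embeddings endpoint (elided) ...
    return texts.map(() => []);
  }
}

class OpenAIProvider implements EmbeddingProvider {
  readonly apiKey = process.env.OPENAI_API_KEY ?? "";
  async embed(texts: string[]): Promise<number[][]> {
    // ... POST to the OpenAI embeddings endpoint (elided) ...
    return texts.map(() => []);
  }
}

// Backward-compatible factory: a missing config value defaults to Voyage.
function makeProvider(name: ProviderName = "voyage"): EmbeddingProvider {
  return name === "openai" ? new OpenAIProvider() : new VoyageProvider();
}
```

Keeping the provider choice behind a single factory is what makes the change backward-compatible: existing configs with no `provider` field resolve to the Voyage path untouched.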

Motivation

Inspired by analysis of MemPalace's benchmark approach: raw verbatim text + embedding search scores 96.6% on LongMemEval vs ~70% for BM25/keyword search. The 26pp gap is entirely embedding quality. Extending Lore's existing embedding infrastructure to distillations (semantically rich summaries, ~10-50 per project) is the highest-value improvement at lowest cost.

Files Changed

| File | Change |
|------|--------|
| `src/embedding.ts` | Provider interface + Voyage/OpenAI classes, distillation vector search/embed/backfill |
| `src/config.ts` | `provider` field in embeddings config |
| `src/db.ts` | Migration 9 (distillation embedding BLOB), `projectName()` helper |
| `src/distillation.ts` | Embed distillations on store |
| `src/ltm.ts` | `searchScoredOtherProjects()` |
| `src/reflect.ts` | Distillation vector search + cross-project discovery in recall tool |
| `src/index.ts` | Distillation embedding backfill on startup |
| `test/db.test.ts` | Schema version 8 → 9 |
| `test/embedding.test.ts` | `resetProvider()` for test isolation |
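As a rough illustration of how the brute-force cosine search and RRF merge fit together (function names and the `k = 60` constant are assumptions for this sketch, not values taken from the codebase):

```typescript
// Hypothetical sketch: cosine similarity over raw vectors, plus
// Reciprocal Rank Fusion to merge vector and FTS result lists.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// RRF: each ranked list contributes 1 / (k + rank) per document, so an
// item ranked highly in any one list still surfaces in the fused order.
function rrf(lists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}
```

Because cross-project results arrive as a separate list, RRF gives them fewer contributing lists than same-project hits, which is why they "naturally rank lower" without any explicit penalty term.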

Testing

  • All 294 tests pass (bun test)
  • TypeScript compilation clean (npx tsc --noEmit)

@BYK BYK enabled auto-merge (squash) April 9, 2026 13:21
…s-project recall

- Abstract embedding provider interface with Voyage AI and OpenAI support.
  Config gets a 'provider' field (default: voyage, backward-compatible).
  Each provider reads its own env var (VOYAGE_API_KEY, OPENAI_API_KEY).

- Extend vector search to distillations: schema migration 9 adds embedding
  BLOB to distillations table, fire-and-forget embed on store, brute-force
  cosine search feeds into recall RRF alongside FTS results.

- Cross-project knowledge discovery in recall tool: when scope is 'all',
  searches knowledge entries from other projects and surfaces them tagged
  with the source project name.
@BYK BYK force-pushed the feat/embedding-providers-and-cross-project branch from bf195a7 to 6f616ec on April 9, 2026 13:23
@BYK BYK merged commit 0d74697 into main Apr 9, 2026
1 check passed
@BYK BYK deleted the feat/embedding-providers-and-cross-project branch April 9, 2026 13:24
@craft-deployer craft-deployer bot mentioned this pull request Apr 9, 2026
BYK added a commit that referenced this pull request Apr 9, 2026
## Summary

- **Content-aware deduplication** in the gradient transform pipeline:
detects repeated tool outputs (same file read multiple times, identical
command results) and replaces earlier occurrences with compact
annotations, keeping only the latest.
- Two dedup levels: **exact content hash** (identical outputs) and
**same-file-path reads** (different content from edits between reads).
- Runs as a pre-pass before layer selection (between layer 0 and layer
1), reducing token pressure so sessions can stay at lower, less lossy
gradient layers.

## Motivation

Inspired by Dirac's ContextManager approach: in long coding sessions,
the same file is often read 2-5 times (explore → edit → verify). Each
read stores the full content as tokens. A 500-line file appearing 3
times costs ~15K tokens; after dedup: ~5.1K. Those saved tokens can be
the difference between layer 1 (clean window eviction, prompt caching
preserved) and layer 2 (tool stripping, cache busted).

## Design

- `deduplicateToolOutputs(messages, currentTurnIdx)` scans all completed
tool parts, groups by content hash and file path, keeps the latest
occurrence, replaces earlier ones with `dedupAnnotation()`.
- **Current turn is sacred** — never touched.
- **Tool parts are never removed** — only `state.output` is replaced
(preserves tool_use/tool_result pairing).
- **Small outputs skipped** — outputs below 600 chars aren't
deduplicated (annotation would cost more than the original).
- **Zero-cost no-op** — returns original array reference when no
duplicates exist.
- Follows existing `toolStripAnnotation()` pattern.
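A minimal sketch of the exact-content-hash dedup level, assuming a simplified `ToolPart` shape and a hypothetical 600-char threshold constant (the real implementation also groups by file path and uses `simpleHash()` rather than raw strings as map keys):

```typescript
// Hypothetical sketch of exact-match tool-output dedup; the real
// deduplicateToolOutputs() in src/gradient.ts differs in detail.
interface ToolPart { output: string }

const MIN_DEDUP_CHARS = 600; // assumed threshold: small outputs are skipped

function dedupAnnotation(): string {
  return "[duplicate tool output elided; see later occurrence]";
}

// Keep only the latest occurrence of each identical output; replace
// earlier ones with a compact annotation. Parts at or beyond
// currentTurnIdx (the current turn) are never touched.
function deduplicateToolOutputs(
  parts: ToolPart[],
  currentTurnIdx: number,
): ToolPart[] {
  const lastSeen = new Map<string, number>();
  for (let i = 0; i < currentTurnIdx; i++) {
    if (parts[i].output.length >= MIN_DEDUP_CHARS) {
      lastSeen.set(parts[i].output, i);
    }
  }
  let changed = false;
  const out = parts.map((p, i) => {
    const last = lastSeen.get(p.output);
    if (i < currentTurnIdx && last !== undefined && last > i) {
      changed = true;
      return { output: dedupAnnotation() }; // only output replaced, part kept
    }
    return p;
  });
  // Zero-cost no-op: hand back the original array reference when
  // nothing was deduplicated.
  return changed ? out : parts;
}
```

Replacing only `output` while leaving the part in place is what preserves tool_use/tool_result pairing; removing the part entirely would desynchronize the message structure.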

## Files Changed

| File | Change |
|------|--------|
| `src/gradient.ts` | `deduplicateToolOutputs()`, `dedupAnnotation()`, `simpleHash()`, `extractFilePath()` + integration into `transformInner()` |
| `test/gradient.test.ts` | 7 new tests: exact-match, same-file, current-turn protection, small-output skip, no-change passthrough, bash dedup, triple-read |

## Testing

- All 301 tests run, 298 pass, 3 fail (pre-existing `vectorSearch` test
isolation issue from #58, not introduced here)
- `npx tsc --noEmit` clean
