enzodevs/code-context-v2
Code Context v2

Semantic code search for CLI-driven agent workflows. Index your codebases, search with natural language, get precise results.

Built for local agent workflows in Claude Code and similar CLI environments.

Why

LLMs work better with the right context. Grep finds text; this finds meaning. Ask for "authentication middleware" and get the actual auth logic, not every file that mentions "auth".

How it works:

  1. Index your codebase with tree-sitter AST parsing (functions, classes, methods — not arbitrary line splits)
  2. Embed chunks with Voyage AI (voyage-4-large for documents, voyage-4-lite for queries — same embedding space, asymmetric retrieval)
  3. Search the embedded LanceDB vector store by default, then rerank with rerank-2.5 for precision
  4. Query indexed content from the CLI with semantic search commands

Architecture

Agent / CLI
    │
    ├── cc2.sh
    └── uv run code-context-manage
             │
      RetrievalPipeline
        │            │
        ▼            ▼
   Voyage AI    LanceDB local
 voyage-4-lite  vector store
  (query embed) ~/.local/share/cc2
 rerank-2.5
  (reranking)

Retrieval pipeline:

  1. Embed query with voyage-4-lite (fast, shared space with indexed docs)
  2. Retrieve candidates from embedded LanceDB (or PostgreSQL/pgvector when CC2_CODE_BACKEND=postgres)
  3. Rerank: rerank-2.5 + relative threshold (max(score_floor, top_score * factor))
  4. Dedup: overlap/containment + Jaccard similarity filtering
  5. Return: Markdown-formatted chunks with file paths, line numbers, relevance scores
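
Steps 3 and 4 above can be sketched in plain Python. The function names and data shapes here are illustrative, not the project's actual API; the defaults mirror CC2_RERANK_SCORE_FLOOR and CC2_RERANK_RELATIVE_FACTOR:

```python
def apply_relative_threshold(scored, score_floor=0.40, factor=0.75):
    """Keep results scoring at least max(score_floor, top_score * factor).

    `scored` is a list of (chunk, rerank_score) pairs, highest score first.
    """
    if not scored:
        return []
    top_score = scored[0][1]
    cutoff = max(score_floor, top_score * factor)
    return [(c, s) for c, s in scored if s >= cutoff]


def jaccard(a, b):
    """Jaccard similarity between two chunks' line-number sets,
    used to filter near-duplicate overlapping chunks."""
    lines_a, lines_b = set(a), set(b)
    union = lines_a | lines_b
    return len(lines_a & lines_b) / len(union) if union else 0.0
```

With a top score of 0.9, the cutoff becomes max(0.40, 0.675) = 0.675, so a 0.7 result survives and a 0.5 result is dropped.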

Code layout:

  • src/code_context/retrieval/ — retrieval façade plus focused helpers for intent resolution, result controls, cross-file context, and quality logging
  • src/code_context/db/ — DatabasePool façade delegating to domain-specific stores for code, projects, memory, and books
  • src/code_context/chunking/ — tree-sitter parsing, chunk models, and post-processing/splitting helpers
  • src/code_context/indexing/ — filesystem and indexing support helpers used by Indexer
  • cli/ — user-facing entrypoints plus shared runtime/search/watcher helpers

The public APIs stay centered on RetrievalPipeline, DatabasePool, Indexer, and the CLI entrypoints; the internal modules are split to keep those surfaces stable while reducing coupling.

Quick Start

Prerequisites

  • uv (Python package manager)
  • Voyage AI API key (free tier available)
  • Docker only if you opt into PostgreSQL-backed memory/books/legacy code search

1. Clone and configure

git clone https://github.com/YOUR_USER/code-context-v2.git
cd code-context-v2

cp .env.example .env
# Edit .env — set CC2_VOYAGE_API_KEY

The default local backend is embedded LanceDB at ~/.local/share/cc2/lancedb, so Docker is not required for normal CLI code indexing/search or memory indexing/search. PostgreSQL remains available for legacy code storage, books, graph, and MCP paths.

2. Install dependencies

uv sync

3. Index a project

uv run code-context-manage --index /path/to/your/project

4. Use the search commands

Use either the direct Python entrypoint or the shell wrapper:

# List indexed projects
uv run code-context-manage --list

# Semantic search from inside an indexed repo (project inferred from cwd)
./cc2.sh search "auth middleware"

# Explicit project override when running outside the repo or targeting another project
./cc2.sh search "auth middleware" -p my-project

# Opt in to graph expansion from dense chunk hits
./cc2.sh search "auth middleware" -p my-project --graph

# Search within one file (also infers project from cwd)
./cc2.sh search-file src/auth.ts "token validation"

# Search indexed literature
./cc2.sh search-lit "dependency injection"

Search Controls

The CLI search commands support optional output-shaping controls:

| Flag | Default | Effect |
| --- | --- | --- |
| --max-tokens | unset | Per-request token budget override. Clamped inside the retrieval pipeline. |
| --include-tests | off | Include test/spec files in results. |
| --graph | off | Opt CLI search into graph expansion. By default expansion favors deterministic high-value edges such as CALLS, REFERENCES, TESTS, DOCUMENTS, IMPORTS, and USES_TABLE, and avoids broad SAME_FILE fanout. Can also be enabled for CLI search with CC2_GRAPH_SEARCH_ENABLED=true. |
| --file-type | unset | Restrict results to code, docs, or all. |
| --directory | unset | Restrict results to a directory prefix. |
| --json | off | Emit machine-readable output for agent/tool consumption. |
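
Conceptually, the graph expansion behind --graph is a one-hop walk that only follows the high-value edge types. A minimal sketch, with illustrative names and data shapes (not the project's internal API):

```python
# Edge types the README describes as deterministic and high-value.
HIGH_VALUE_EDGES = {"CALLS", "REFERENCES", "TESTS", "DOCUMENTS", "IMPORTS", "USES_TABLE"}


def expand_one_hop(seed_chunks, edges, allowed=HIGH_VALUE_EDGES):
    """Return seed chunks plus chunks reachable via allowed edge types.

    `edges` is a list of (src_chunk, edge_type, dst_chunk) triples.
    SAME_FILE edges are excluded by default to avoid broad fanout.
    """
    expanded = set(seed_chunks)
    for src, edge_type, dst in edges:
        if src in seed_chunks and edge_type in allowed:
            expanded.add(dst)
    return expanded
```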

Recommended defaults for agent workflows:

  • Keep include_tests=false unless the user explicitly asks about tests.
  • Start with --max-tokens between 1800 and 3200 for typical coding tasks.
  • Use --json when another tool or agent will post-process the results.

Project Resolution

For CLI code search commands, -p/--project is optional when your current working directory is inside an indexed project root.

  • cc2.sh passes the caller's cwd through to the Python CLI.
  • cc2 resolves the project by finding the indexed project_root that contains the cwd.
  • If multiple indexed roots match, cc2 picks the longest matching root.
  • Use -p when running outside the repo, targeting another indexed project, or overriding cwd-based resolution.
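
The longest-matching-root rule above can be sketched as follows; the function name and signature are illustrative, not cc2's actual code:

```python
from pathlib import Path


def resolve_project(cwd, indexed_roots):
    """Pick the indexed project whose root contains cwd.

    `indexed_roots` maps project id -> project root path. When several
    roots contain cwd, the longest (most specific) root wins.
    """
    cwd = Path(cwd).resolve()
    matches = [
        (pid, Path(root).resolve())
        for pid, root in indexed_roots.items()
        if cwd == Path(root).resolve() or Path(root).resolve() in cwd.parents
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: len(str(m[1])))[0]
```

In a monorepo with both the repo root and a nested service indexed, running from inside the service resolves to the service project, not the repo root.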

Search Intent Guide

Use --intent to steer reranking precision:

| Intent | Best for |
| --- | --- |
| implementation | Concrete runtime logic you will modify to ship a feature |
| definition | Types/interfaces/schemas/contracts/config declarations |
| usage | Call sites, integration points, consumer code |
| debug | Error paths, retries, fallbacks, validation failures, observability clues |
| security | Auth/authz, secret handling, sanitization, injection defenses |
| performance | Hot paths, caching, batching, query shape, contention points |
| architecture | Boundaries, adapters, orchestration, cross-module flow |

Default intent is implementation when omitted.

Benchmarking

Retrieval changes should be measured against benchmark suites in benchmarks/retrieval/*.json.

# List benchmark-enabled projects
uv run python -m scripts.benchmark_retrieval --list

# Run one project
uv run python -m scripts.benchmark_retrieval code-context-v2

# Compare against a saved baseline
uv run python -m scripts.benchmark_retrieval cardify --compare pre-hybrid-cardify

# Run all benchmark suites and save a combined baseline
uv run python -m scripts.benchmark_retrieval all --save hybrid-v1

# Run with graph expansion enabled
uv run python -m scripts.benchmark_retrieval cardify --graph

# A/B compare dense-only vs dense + graph expansion in one run
# Prints graph-derived candidate/final counts, added/removed expected files,
# worsened top results, surviving edge types, and token impact.
uv run python -m scripts.benchmark_retrieval cardify --compare-graph

Benchmark JSON can set indexed_project when the suite name differs from the cc2 project id. For example, fluxomind-platform.json targets indexed project fluxomind-src.

CLI

# Index a project (auto-generates ID from folder name)
uv run code-context-manage --index /path/to/project

# Index with custom ID
uv run code-context-manage --index /path/to/project --id my-project

# Check what changed (dry-run)
uv run code-context-manage --check my-project

# Sync only changed files
uv run code-context-manage --sync my-project

# Force full reindex
uv run code-context-manage --index /path/to/project --force

# Show statistics
uv run code-context-manage --stats

# Watch for changes (background daemon)
uv run code-context-manage --watch /path/to/project

# List indexed books
uv run code-context-manage --list-books

# Initialize additive graph tables (does not reindex code)
uv run code-context-manage graph init

# Build the phase-1 graph from the existing code index
uv run code-context-manage graph build --project my-project
uv run code-context-manage graph build --project my-project --phase existing-index

# Add Phase 2 deterministic source relations incrementally (no code reindex)
uv run code-context-manage graph build --project my-project --phase deterministic

# Check graph backfill status and edge counts by type
uv run code-context-manage graph status --project my-project

# Create a project memory root with a MEMORY.md hub
./cc2.sh memory init

# Index a Markdown memory root
./cc2.sh memory index .pi/memory --project my-project

# Search indexed memory
./cc2.sh memory search "refresh token rotation" --project my-project

A standalone memory entrypoint also exists:

uv run code-context-memory init
uv run code-context-memory index .pi/memory --project my-project
uv run code-context-memory search "refresh token rotation" --project my-project

See docs/memory.md for the memory layout, MEMORY.md hub pattern, and filters.

There's also cc2.sh — a bash wrapper with a gum-based TUI plus non-interactive commands like search, search-file, and search-lit.

Supported Languages

| Language | Extensions | Parser |
| --- | --- | --- |
| TypeScript | .ts, .tsx | tree-sitter-typescript |
| JavaScript | .js, .jsx, .mjs, .cjs | tree-sitter-javascript |
| Python | .py, .pyi | tree-sitter-python |
| Java | .java | tree-sitter-java |

Adding a new language requires a tree-sitter grammar and chunk type mappings in src/code_context/chunking/parser.py.

Configuration

All settings use the CC2_ prefix via environment variables or the repository .env file:

| Variable | Default | Description |
| --- | --- | --- |
| CC2_CODE_BACKEND | lancedb | Local index backend for code + memory: lancedb or legacy postgres |
| CC2_LANCEDB_URI | ~/.local/share/cc2/lancedb | Embedded LanceDB storage directory for code and memory index data |
| CC2_DATABASE_URL | local dev Postgres | PostgreSQL connection string for the legacy code/memory backend, books, graph, and MCP paths |
| CC2_VOYAGE_API_KEY | (none) | Voyage AI API key (required) |
| CC2_EMBEDDING_MODEL_INDEX | voyage-4-large | Embedding model for indexing |
| CC2_EMBEDDING_MODEL_QUERY | voyage-4-lite | Embedding model for queries |
| CC2_VOYAGE_MAX_REQUESTS_PER_MINUTE | 1950 | Global request pacing guardrail for the Voyage API |
| CC2_VOYAGE_MAX_TOKENS_PER_MINUTE | 2700000 | Global token pacing guardrail for the Voyage API |
| CC2_VOYAGE_MAX_IN_FLIGHT_REQUESTS | 32 | Global max concurrent Voyage API calls |
| CC2_INDEX_EMBEDDING_FLUSH_CHUNKS | 5000 | Chunks to accumulate for cross-file embedding batches during project indexing/sync |
| CC2_LANCEDB_WRITE_BATCH_FILES | 500 | Files per LanceDB write transaction during project indexing/sync |
| CC2_VOYAGE_RETRY_MAX_ATTEMPTS | 5 | Max retries for transient/rate-limit Voyage failures |
| CC2_VOYAGE_RETRY_BASE_DELAY_MS | 250 | Initial exponential-backoff delay |
| CC2_VOYAGE_RETRY_MAX_DELAY_MS | 5000 | Retry delay ceiling |
| CC2_VOYAGE_RETRY_JITTER_MS | 250 | Extra random jitter to avoid retry bursts |
| CC2_RERANK_MODEL | rerank-2.5 | Reranking model |
| CC2_RERANK_TOP_K_OUTPUT | 8 | Max final results returned by code search tools |
| CC2_RERANK_RELATIVE_FACTOR | 0.75 | Relative cutoff factor (threshold = top_score * factor) |
| CC2_RERANK_SCORE_FLOOR | 0.40 | Absolute minimum rerank score |
| CC2_RERANK_FILE_SUPPORT_WEIGHT | 0.06 | Small post-rerank boost for files with many strong retrieved chunks |
| CC2_RESULT_MAX_TOKENS | 8000 | Token budget for results |
| CC2_SEARCH_LOG_PATH | unset | Optional JSONL path for retrieval quality logs |
| CC2_HYBRID_SEARCH_ENABLED | true | Enable dense + LanceDB FTS + exact-symbol candidate fusion before rerank |
| CC2_HYBRID_LEXICAL_K | 50 | Max lexical candidates retrieved for hybrid search |
| CC2_HYBRID_RRF_RANK_CONSTANT | 60 | Reciprocal rank fusion constant used to merge candidate channels |
| CC2_HYBRID_DENSE_WEIGHT | 1.0 | Dense retrieval weight in hybrid fusion |
| CC2_HYBRID_LEXICAL_WEIGHT | 0.8 | Lexical retrieval weight in hybrid fusion |
| CC2_HYBRID_EXACT_SYMBOL_WEIGHT | 1.2 | Exact-symbol retrieval weight in hybrid fusion |
| CC2_EXACT_SYMBOL_SEARCH_ENABLED | true | Allow exact symbol-name candidate retrieval inside hybrid search |
| CC2_EXACT_SYMBOL_MIN_LENGTH | 3 | Minimum identifier length for exact symbol candidate extraction |
| CC2_LOG_LEVEL | INFO | Logging verbosity |
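
The CC2_HYBRID_* settings combine candidate channels with weighted reciprocal rank fusion. A minimal sketch of that scoring, with illustrative names and the defaults from the table above:

```python
def rrf_fuse(channels, rank_constant=60):
    """Merge ranked candidate lists with weighted reciprocal rank fusion.

    `channels` maps channel name -> (weight, ranked_ids). Each candidate
    contributes weight / (rank_constant + rank) for every channel it
    appears in; candidates ranked high in multiple channels win.
    """
    scores = {}
    for weight, ranked in channels.values():
        for rank, cid in enumerate(ranked, start=1):
            scores[cid] = scores.get(cid, 0.0) + weight / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in the dense, lexical, and exact-symbol channels outranks a chunk that tops only one channel, which is the point of fusing before the reranker sees the candidates.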

See src/code_context/config.py for all available settings.

Performance

  • Vector search: <50ms
  • Reranking: <100ms
  • Total CLI search response: <200ms
  • Initial indexing: ~5-10 min for 1000 files
  • Incremental sync: <2s per changed file
  • Storage: ~100MB per 100k chunks

How Indexing Works

  1. Walk the project tree (skips node_modules, vendor, .git, dist, Laravel runtime/generated dirs, etc.)
  2. Hash each file with BLAKE3 — skip unchanged files
  3. Parse with tree-sitter into semantic chunks (functions, classes, methods)
  4. Small files (<200 lines) still extract symbol-level chunks; generic file chunks are dropped when symbol chunks exist
  5. Embed chunks with voyage-4-large in batches
  6. Store chunk metadata and vectors in embedded LanceDB by default
  7. File reindex operations replace old chunks before adding fresh chunks; rerun cc2 sync if an interrupted process leaves a project partially indexed
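
Step 2's change detection can be sketched like this, using hashlib.blake2b as a stdlib stand-in for BLAKE3 (the real indexer uses BLAKE3; the function name and signature here are illustrative):

```python
import hashlib


def changed_files(paths, stored_hashes):
    """Yield (path, new_hash) for files whose content hash changed.

    `stored_hashes` maps path -> hex digest recorded by the previous
    index run. Unchanged files are skipped, which is what makes
    incremental sync cheap.
    """
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.blake2b(f.read()).hexdigest()
        if stored_hashes.get(path) != digest:
            yield path, digest
```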

Supported code languages include TypeScript, JavaScript, Python, Java, Go, Rust, SQL, PHP, and Vue single-file components. PHP files get class/function/method chunks; Vue SFCs are indexed as file-level chunks.

Laravel defaults skip Composer dependencies, runtime/cache output, built frontend assets, PHPUnit cache files, Pi/local MCP scratch, and generated Wayfinder route/action files.

License

MIT
