Local-first semantic search for your repos, markdown notes, docs, and knowledge vaults.
Index your repos, vaults, markdown files, and docs, then search them in plain language. No cloud required.
# Build the binary
go build -o sem ./cmd/sem
# Move it to a directory on your PATH (use sudo if needed)
mv sem /usr/local/bin/sem
sem --help
# Initialize
sem init
# Add a source
sem source add ~/my-notes --name notes
# Build the index
sem index
# Search
sem search "how do I configure authentication?"
Requirements: Go 1.21+
git clone https://github.com/dusmel/sem.git
cd sem
go build -o sem ./cmd/sem
Move the binary somewhere in your PATH:
mv sem /usr/local/bin/
- ripgrep (`rg`): required for exact and hybrid search modes. Install with `brew install ripgrep` (macOS) or `apt install ripgrep` (Linux).
- (optional) ONNX Runtime: required for real embeddings. Install with `brew install onnxruntime` (macOS). Without it, sem falls back to hash-based embeddings (fast but less accurate).
Run sem doctor to check your setup.
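To see why hash-based embeddings are fast but less accurate, here is a minimal sketch of the general feature-hashing idea. It is an illustration only: the function name `hash_embed`, the 256-dimension size, and the tokenizer are assumptions, and sem's actual fallback may work differently.

```python
import hashlib
import math
import re


def hash_embed(text, dim=256):
    """Map text to a unit-length vector by hashing each token into a bucket.

    Hypothetical illustration of feature hashing, not sem's implementation.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which dimension to touch
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # signed to soften collisions
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave an all-zero vector untouched (text with no tokens).
    return [v / norm for v in vec] if norm else vec
```

Because each token lands in a bucket determined only by its spelling, two texts look similar only when they share words. "login flow" and "sign in" share none, so a scheme like this cannot relate them, which is the gap a learned embedding model closes.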
sem search "database connection pooling"
sem search "error handling best practices"
sem search "meeting notes from last week"
Limit results:
sem search "authentication" --limit 20
sem supports three search modes via --mode:
| Mode | What it does | When to use |
|---|---|---|
| `hybrid` | Combines semantic + exact search (default) | Most queries, best overall results |
| `semantic` | Vector similarity only | Conceptual queries, finding related ideas |
| `exact` | ripgrep text search only | Finding specific terms, function names, error messages |
sem search "login flow" --mode semantic # Finds "authentication", "sign in", etc.
sem search "HandleLogin" --mode exact # Finds exact text matches
sem search "login flow" --mode hybrid # Best of both (default)
If ripgrep isn't installed, hybrid and exact modes fall back to semantic automatically.
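The fallback rule can be written down in a few lines. This is a sketch of the behaviour described above, not sem's code: `resolve_mode` and its `rg_available` parameter are hypothetical names, and checking for `rg` on the PATH with `shutil.which` is an assumption about how availability might be detected.

```python
import shutil

VALID_MODES = ("hybrid", "semantic", "exact")


def resolve_mode(requested="hybrid", rg_available=None):
    """Return the mode that would actually run, given ripgrep availability."""
    if requested not in VALID_MODES:
        raise ValueError(f"unknown mode: {requested!r}")
    if rg_available is None:
        # Assumption: ripgrep counts as installed when `rg` is on the PATH.
        rg_available = shutil.which("rg") is not None
    if requested in ("hybrid", "exact") and not rg_available:
        return "semantic"  # both modes need ripgrep, so fall back
    return requested
```

For example, `resolve_mode("exact", rg_available=False)` returns `"semantic"`, while `resolve_mode("exact", rg_available=True)` returns `"exact"`.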
Narrow down results with these flags:
--source searches within a specific source:
sem search "config" --source notes
--language filters by programming language:
sem search "error handling" --language go
sem search "data pipeline" --language python,rust
--kind filters by file type:
sem search "setup instructions" --kind markdown
sem search "database pool" --kind code
--dir filters by subdirectory:
sem search "api endpoint" --dir src/
sem search "deployment" --dir docs/
Stack filters to get precise results:
# Search Go files in src/ from the my-app source
sem search "error handling" --source my-app --language go --dir src/
# Find markdown docs about deployment
sem search "deploy pipeline" --kind markdown --source docs
# Search TypeScript files only
sem search "authentication middleware" --language typescript --kind code
Use --json to get structured output:
sem search "api design" --json
Output structure:
{
"query": "api design",
"mode": "hybrid",
"filters": {
"source": "",
"language": "",
"kind": ""
},
"results": [
{
"chunk_id": "abc123...",
"file_path": "/Users/hadad/notes/api-design.md",
"snippet": "REST API design principles include...",
"score": 0.892,
"source_name": "notes",
"metadata": {
"file_kind": "markdown",
"language": "markdown",
"title": "API Design Principles",
"start_line": 1,
"end_line": 15
}
}
],
"total": 1,
"elapsed_ms": 45
}
Pipe it to jq for scripting:
# Get just the file paths
sem search "auth" --json | jq -r '.results[].file_path'
# Get top result snippet
sem search "auth" --json | jq -r '.results[0].snippet'
# Filter by score threshold
sem search "auth" --json | jq '.results[] | select(.score > 0.7)'
Build the index:
sem index
Incremental sync (only re-indexes changed files):
sem sync
sem sync is what you'll run most often. It detects new, modified, and deleted files since the last index and updates the index accordingly.
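One common way to detect new, modified, and deleted files is to compare a manifest of content hashes from the last run against the current state. The sketch below shows that idea under stated assumptions: `fingerprint` and `diff_manifests` are hypothetical names, and sem may use a different signal (such as modification times) internally.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to tell whether a file changed."""
    return hashlib.sha256(content).hexdigest()


def diff_manifests(previous, current):
    """Compare two {path: fingerprint} maps.

    Returns (new, modified, deleted) as sorted lists of paths.
    """
    new = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return new, modified, deleted
```

Only the paths in `new` and `modified` need re-chunking and re-embedding, and the chunks for `deleted` paths can be dropped, which is why an incremental sync is much cheaper than a full rebuild.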
Full rebuild:
sem index --full
Use --full when you've changed embedding modes or chunking settings, or when you want a clean rebuild.
Index a specific source:
sem index --source notes
sem sync --source my-app
Config lives at ~/.sem/config.toml. The defaults work well for most cases.
| Mode | Model | Speed | Quality | Best for |
|---|---|---|---|---|
| `light` | MiniLM | Fastest | Good | Quick searches, low-resource machines |
| `balanced` | BGE Small | Fast | Better | Daily use (default) |
| `quality` | BGE Base | Moderate | Best | When accuracy matters most |
| `nomic` | Nomic Embed | Fast | Great for code | Code-heavy repositories |
Change the mode in your config:
[embedding]
mode = "quality"
Or set it during init. The model downloads automatically on first use.
[chunking]
max_chars = 2200 # Maximum chunk size
overlap_chars = 300 # Overlap between chunks
min_chars = 400 # Minimum chunk size (tiny chunks are skipped)
respect_headings = true # Split markdown by headings
Code files are split by function/class boundaries. Markdown files are split by headings when respect_headings is true.
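To make the four chunking settings concrete, here is a rough sketch of how they could interact for a markdown file. The function names are hypothetical and the logic is an assumption about a typical heading-then-window chunker, not sem's actual implementation; function/class splitting for code files is not covered.

```python
import re


def split_by_headings(text):
    """Split markdown into sections, each starting at a heading line."""
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if re.match(r"#{1,6} ", line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def window(text, max_chars, overlap_chars):
    """Cut text into pieces of at most max_chars; neighbours share overlap_chars."""
    if overlap_chars >= max_chars:
        raise ValueError("overlap_chars must be smaller than max_chars")
    step = max_chars - overlap_chars
    pieces, start = [], 0
    while True:
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            return pieces
        start += step


def chunk_markdown(text, max_chars=2200, overlap_chars=300,
                   min_chars=400, respect_headings=True):
    """Apply heading splits, then size limits, then drop tiny chunks."""
    sections = split_by_headings(text) if respect_headings else [text]
    chunks = []
    for section in sections:
        for piece in window(section, max_chars, overlap_chars):
            if len(piece.strip()) >= min_chars:  # tiny chunks are skipped
                chunks.append(piece)
    return chunks
```

With the defaults, a 3,005-character section becomes two chunks (2,200 and 1,105 characters) that share 300 characters, while a 105-character section is skipped. The heading regex is deliberately naive and would also match `#` lines inside code blocks.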
[ignore]
use_gitignore = true
default_patterns = [".git", "node_modules", "target", "dist", "build", "vendor"]
sem respects .gitignore files by default. Add patterns to default_patterns for global ignores.
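As an illustration of what a global ignore pattern does, the sketch below skips any path that has a component matching one of the defaults. The matching rule (per path component, shell-style wildcards) and the name `is_ignored` are assumptions for the example; sem's real matching semantics may differ, and `.gitignore` handling is not shown.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

DEFAULT_PATTERNS = [".git", "node_modules", "target", "dist", "build", "vendor"]


def is_ignored(relative_path, patterns=DEFAULT_PATTERNS):
    """True if any component of the path matches any ignore pattern."""
    parts = PurePosixPath(relative_path).parts
    return any(
        fnmatchcase(part, pattern) for part in parts for pattern in patterns
    )
```

Under this rule `web/dist/app.js` is skipped because of its `dist` component, while `src/builder.go` and `.gitignore` are kept, since a pattern has to match a whole component rather than a prefix.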
Run sem doctor to validate your setup:
sem doctor
It checks:
- ripgrep installation
- ONNX Runtime availability
- Cached model files
- Configuration validity
- Source path accessibility
- Bundle status
sem runs a 5-step pipeline:
- Scan: walks your source directories, respecting `.gitignore` and ignore patterns
- Chunk: splits documents into semantic units (by headings for markdown, by functions for code)
- Embed: runs local ONNX models to generate vector embeddings for each chunk
- Store: saves everything in portable Parquet bundles (the canonical source of truth)
- Search: finds similar chunks using cosine similarity, optionally combined with ripgrep exact matching via Reciprocal Rank Fusion
The bundle is canonical — you can delete ~/.sem/backends/ and rebuild from the bundle anytime with sem index --full.
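The search step in the pipeline above rests on two small formulas, sketched here in plain Python. Cosine similarity scores a chunk vector against the query vector, and Reciprocal Rank Fusion merges the semantic and exact rankings by summing `1 / (k + rank)` per chunk. The function names and `k = 60` (a commonly used constant) are assumptions; sem's own parameters are not documented here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) to a chunk's score, with ranks
    starting at 1. Ties are broken by chunk id for a stable order.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


semantic_hits = ["a", "b", "c"]  # ordered by cosine similarity
exact_hits = ["b", "d"]          # ordered by ripgrep match quality
fused = reciprocal_rank_fusion([semantic_hits, exact_hits])
# fused == ["b", "a", "d", "c"]
```

Chunk `b` wins because it appears in both lists, even though it is not first in the semantic ranking. Fusing by rank rather than raw score means the two searches never need comparable score scales.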
Most search tools match exact words. That works fine if you're searching for "authentication" in a file that literally says "authentication." But what if your file says "login flow" or "sign in process"? Traditional search misses it entirely.
Semantic search understands that "authentication," "login," and "sign in" are related concepts. It finds what you mean, not just what you type.
And everything runs locally. Your notes stay on your machine. No cloud, no API keys, no sending your data anywhere. The models run on your hardware via ONNX Runtime.
I built this because I was tired of grep-ing through my notes and coming up empty on queries whose answers I knew were in there somewhere. sem is the tool I wanted: fast, local, and able to actually find what I'm looking for.
MIT