Offline-first RAG system. Your documents, your models, your machine.
LocalRAG ingests your local documents, stores embeddings in a local ChromaDB database, and answers questions using Ollama (or OpenAI / Anthropic) models. No cloud required by default.
```mermaid
flowchart TD
    userReq[User Request] --> apiLayer[FastAPI Endpoints]
    apiLayer --> queryJson["POST /query (JSON)"]
    apiLayer --> queryStream["POST /query/stream (SSE)"]
    apiLayer --> agentQuery["POST /agent/query"]
    queryJson --> ragEngine[RAG Engine]
    queryStream --> ragEngine
    agentQuery --> agentService[Agent Service]
    agentService -->|search_documents| ragEngine
    agentService -->|answer_directly| llmProvider[LLM Provider]
    ragEngine --> llmProvider
    llmProvider --> providers["Ollama | OpenAI | Anthropic"]
    ragEngine --> vectorStore[(ChromaDB)]
    apiLayer --> metrics["GET /metrics (Prometheus)"]
    metrics --> prometheus[Prometheus]
    prometheus --> grafana[Grafana]
```
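The RAG Engine in the diagram embeds the question, retrieves the top-k chunks from ChromaDB, and hands them to the LLM provider as context. Here is a minimal sketch of that retrieve-then-generate loop, assuming the `chromadb` and `ollama` Python clients and the defaults from the configuration table below; the function and variable names are illustrative, not LocalRAG's actual internals:

```python
# Illustrative retrieve-then-generate sketch, not LocalRAG's real engine code.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_or_create_collection("localrag")

def answer(question: str, top_k: int = 5) -> str:
    # Embed the question with the same model used at ingest time.
    q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    # Retrieve the top-k most similar chunks from ChromaDB.
    hits = collection.query(query_embeddings=[q_emb], n_results=top_k)
    context = "\n\n".join(hits["documents"][0])
    # Ask the chat model to answer using only the retrieved context.
    reply = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return reply["message"]["content"]
```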
- Install Ollama — ollama.com/download. See docs/ollama.md.
- Install dependencies:

```bash
uv sync
```

- Start Ollama and pull models:

```bash
ollama serve
ollama pull nomic-embed-text
ollama pull llama3.2
```

- Copy the example env file:

```bash
cp .env.example .env
```

- Ingest documents and query:

```bash
uv run localrag ingest ./docs
uv run localrag query "What are the key topics in these documents?"
```

That's it — no cloud API keys needed for local Ollama mode.

Start the API server:

```bash
uv run uvicorn localrag.api.main:app --reload
```

Open http://127.0.0.1:8000/docs for interactive API docs.
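Once the server is up, you can exercise the JSON endpoint from Python. A sketch using `httpx`, assuming a request body with a `question` field and a response carrying `answer` and `sources` (the exact schema may differ; check the interactive docs at /docs):

```python
# Hypothetical client call; verify the request/response schema at /docs.
import httpx

resp = httpx.post(
    "http://127.0.0.1:8000/query",
    json={"question": "What are the key topics in these documents?"},
    timeout=60.0,
)
resp.raise_for_status()
data = resp.json()
print(data.get("answer"))
print(data.get("sources"))
```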
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Readiness check (Ollama + ChromaDB) |
| `POST` | `/ingest` | Ingest a single file |
| `POST` | `/ingest/directory` | Ingest a directory recursively |
| `POST` | `/query` | JSON answer with sources and latency |
| `POST` | `/query/stream` | SSE token stream |
| `POST` | `/agent/query` | Agentic RAG (Anthropic tool-use) |
| `GET` | `/metrics` | Prometheus metrics |
| `GET` | `/collections` | List Chroma collections |
| `DELETE` | `/collections/{name}` | Delete a collection |
| `POST` | `/collections/rebuild` | Re-embed all stored sources |
All endpoints except `/health` and `/metrics` require an `X-API-Key` header when `API_KEY` is set in `.env`.
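To consume the SSE token stream from `/query/stream`, a sketch with `httpx`; the `question` field and the event payload format are assumptions, so treat the parsing below as illustrative:

```python
# Hypothetical SSE consumer; adjust the payload field and event parsing to the real schema.
import httpx

with httpx.stream(
    "POST",
    "http://127.0.0.1:8000/query/stream",
    json={"question": "How does chunking work?"},
    headers={"X-API-Key": "change-me"},  # placeholder; only needed when API_KEY is set
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        # SSE frames arrive as "data: <token>" lines; print tokens as they stream in.
        if line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
```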
Copy `.env.example` to `.env` and adjust values:

```bash
cp .env.example .env
```

Key settings:
| Variable | Default | Description |
|---|---|---|
| `API_KEY` | (empty) | Require `X-API-Key` header (leave empty to disable auth) |
| `LLM_BACKEND` | `ollama` | LLM provider: `ollama`, `openai`, or `anthropic` |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Embedding model |
| `OLLAMA_LLM_MODEL` | `llama3.2` | Chat model for Ollama backend |
| `OPENAI_API_KEY` | (empty) | OpenAI key (required for `openai` backend) |
| `OPENAI_MODEL` | `gpt-4o-mini` | OpenAI model tag |
| `ANTHROPIC_API_KEY` | (empty) | Anthropic key (required for `anthropic` backend or agent) |
| `ANTHROPIC_MODEL` | `claude-haiku-4-5` | Anthropic model tag |
| `CHROMA_PERSIST_PATH` | `./data/chroma` | Where ChromaDB stores vectors |
| `CHROMA_COLLECTION_NAME` | `localrag` | ChromaDB collection name |
| `RAG_TOP_K` | `5` | Chunks retrieved per query |
| `LOG_LEVEL` | `INFO` | Logging level (JSON in production, colored in TTY) |
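These variables are plain environment settings, so they map naturally onto a pydantic-settings model. A hedged sketch of how such a settings class could look; LocalRAG's actual settings module may be organized differently:

```python
# Illustrative settings model; field names mirror the table above, not necessarily the real code.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    api_key: str = ""                       # API_KEY: empty disables auth
    llm_backend: str = "ollama"             # LLM_BACKEND: ollama | openai | anthropic
    ollama_base_url: str = "http://localhost:11434"
    ollama_embed_model: str = "nomic-embed-text"
    ollama_llm_model: str = "llama3.2"
    chroma_persist_path: str = "./data/chroma"
    chroma_collection_name: str = "localrag"
    rag_top_k: int = 5
    log_level: str = "INFO"

settings = Settings()  # reads .env, falls back to the defaults above
```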
```bash
uv run localrag --help

# Ingest
uv run localrag ingest ./docs
uv run localrag ingest-dir ./docs --recursive

# Query
uv run localrag query "How does chunking work?"

# Eval
uv run localrag eval --offline

# Collections
uv run localrag collections list
uv run localrag collections rebuild
```

To run the full stack with Docker Compose:

```bash
docker compose up --build
```

Starts: `localrag-api`, `ollama`, `chromadb`, `prometheus`, `grafana`.
Pull models in the Ollama container after startup:

```bash
docker exec -it <ollama_container_name> ollama pull nomic-embed-text
docker exec -it <ollama_container_name> ollama pull llama3.2
```

Then open:

- API: http://localhost:8000/docs
- Grafana: http://localhost:3000 (admin / admin)
- Prometheus: http://localhost:9090
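A quick way to confirm the stack is up, assuming the default ports above (a convenience script, not part of LocalRAG; the Prometheus and Grafana health paths are their standard built-in endpoints):

```python
# Hypothetical smoke check for the composed stack; ports follow the defaults above.
import httpx

for name, url in [
    ("localrag-api", "http://localhost:8000/health"),
    ("prometheus", "http://localhost:9090/-/ready"),
    ("grafana", "http://localhost:3000/api/health"),
]:
    try:
        status = httpx.get(url, timeout=5.0).status_code
        print(f"{name}: HTTP {status}")
    except httpx.HTTPError as exc:
        print(f"{name}: unreachable ({exc})")
```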
Run the offline evaluation suite against the bundled dataset:

```bash
uv run localrag eval --offline
```

Results are written to `evals/results/`. The nightly GitHub Actions workflow (`.github/workflows/evals.yml`) also runs the evals automatically.
The eval dataset (`evals/dataset.json`) contains 20 balanced Q/A/context triplets covering in-scope and out-of-scope cases. Metric targets on the bundled dataset:
| Metric | Target |
|---|---|
| faithfulness | ≥ 0.7 |
| answer_relevancy | ≥ 0.7 |
| context_precision | ≥ 0.6 |
| context_recall | ≥ 0.6 |
Run `uv run localrag eval --offline` to get current numbers.
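Each dataset entry pairs a question with a reference answer and its supporting context. A hedged sketch of loading and sanity-checking the bundled file, assuming a top-level list with `question` / `answer` / `context` keys (the real schema of `evals/dataset.json` may differ):

```python
# Illustrative loader; adjust key names to the actual schema of evals/dataset.json.
import json
from pathlib import Path

entries = json.loads(Path("evals/dataset.json").read_text())
assert len(entries) == 20, "README states 20 balanced Q/A/context triplets"
for entry in entries[:3]:
    # Preview the first few question/answer pairs.
    print(entry.get("question"), "->", (entry.get("answer") or "")[:60])
```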
Apply the manifests under `k8s/`:

```bash
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml
```

Edit `k8s/secret.yaml` to add your actual API keys before applying.
```bash
uv sync
uv run pytest
uv run ruff check .
uv run ruff format .
uv run mypy localrag/ --ignore-missing-imports --no-strict-optional
```

Install pre-commit hooks:

```bash
uv run pre-commit install
```

See docs/agent-navigation.md for codebase navigation and docs/architecture.md for the full architecture description.
- docs/ollama.md — Installing Ollama
- docs/architecture.md — Architecture deep-dive
- docs/agent-navigation.md — Fast codebase orientation for agents
- docs/adr/ — Architecture Decision Records