A production-grade Retrieval-Augmented Generation (RAG) API for the Climate Academy student book. Users ask questions and receive answers grounded strictly in the book's content, with inline section citations (§x.y.z).
- The HTML book is parsed into sections, chunked, and embedded into a vector database (one-time ingestion)
- A user sends a question via the frontend or API
- Flask dispatches the question to a Celery background worker
- The worker embeds the question, finds the most relevant book passages via ChromaDB, and sends them to Ollama (local LLM)
- The LLM generates a grounded answer citing section numbers
- The result is returned to the client via polling
Internet
│ HTTPS :443
▼
Nginx reverse proxy, SSL, rate limiting
│ Unix socket
▼
Gunicorn production WSGI server (4 async workers)
│
Flask REST API — session management, task dispatch
│
Celery async RAG pipeline execution
│
┌─────────────────────────────────────────────┐
│ Redis message broker + session store │
│ ChromaDB persistent vector database │
│ MiniLM local embedding model │
│ Ollama local LLM (Llama 3.1 8B / A100) │
└─────────────────────────────────────────────┘
| Layer | Technology | Purpose |
|---|---|---|
| Web framework | Flask + Gunicorn | REST API server |
| Async tasks | Celery | RAG pipeline execution |
| Message broker | Redis | Flask ↔ Celery communication |
| Session store | Redis | Conversation history per user |
| Vector database | ChromaDB | Semantic chunk retrieval |
| Embedding model | all-MiniLM-L6-v2 | Local text → vector (384-dim) |
| LLM runtime | Ollama — Llama 3.1 8B | Answer generation on GPU |
| Reverse proxy | Nginx | SSL, rate limiting, routing |
| Book parser | BeautifulSoup4 | HTML → section records → chunks |
| Package manager | uv | Dependency management + venv |
Climate-Chatbot/
├── app/
│ ├── __init__.py Flask app factory + global error handlers
│ ├── routes.py API endpoint definitions
│ ├── tasks.py Celery task — RAG pipeline orchestration
│ ├── retriever.py ChromaDB semantic search
│ ├── embedder.py MiniLM embedding wrapper (singleton)
│ ├── llm.py Ollama LLM call wrapper
│ ├── session.py Redis session management
│ └── logger.py Shared structured logging
├── deploy/
│ ├── nginx.conf.example Reference Nginx config
│ ├── gunicorn.service.example Reference systemd service
│ ├── celery.service.example Reference systemd service
│ └── DEPLOY.md Full server deployment guide
├── input/
│ └── climate_academy.html Source HTML book
├── html_sectioning.py HTML parser → IndexedChunk objects
├── ingest.py One-time ingestion pipeline script
├── config.py All environment variable configuration
├── run.py Development server entry point
├── run_production.sh Production startup script (used by systemd)
├── gunicorn.conf.py Gunicorn production configuration
├── pyproject.toml Project dependencies
├── uv.lock Locked dependency versions
└── .env.example Environment variable template
| Method | Endpoint | Description | Response |
|---|---|---|---|
GET |
/health |
Health check | {"status":"ok"} |
POST |
/session |
Create chat session | {"session_id":"..."} |
DELETE |
/session/<id> |
Clear session + history | {"message":"..."} |
POST |
/chat |
Send message | {"task_id":"..."} |
GET |
/result/<task_id> |
Poll for answer | {"status":"done","answer":"...","sources":[...]} |
# 1. Create session
curl -X POST http://localhost:5000/session
# → {"session_id": "abc-123"}
# 2. Send message
curl -X POST http://localhost:5000/chat \
-H "Content-Type: application/json" \
-d '{"session_id": "abc-123", "message": "What is the greenhouse effect?"}'
# → {"task_id": "xyz-456"}
# 3. Poll until done (status changes from "pending" to "done")
curl http://localhost:5000/result/xyz-456
# → {"status": "done", "answer": "The greenhouse effect...", "sources": [...]}| Tool | Install |
|---|---|
| Python 3.11+ | sudo apt install python3 / brew install python |
| uv | curl -LsSf https://astral.sh/uv/install.sh | sh |
| Redis | sudo apt install redis-server / brew install redis |
| Ollama | curl -fsSL https://ollama.com/install.sh | sh |
# 1. Clone
git clone https://github.com/your-org/climate-rag-backend.git
cd climate-rag-backend
# 2. Install all dependencies
uv sync
# 3. Configure environment
cp .env.example .env
# defaults work for local development — no changes needed
# 4. Start Redis
sudo systemctl start redis-server
redis-cli ping # should return PONG
# 5. Pull the LLM model and start Ollama
ollama pull llama3.1:8b
ollama serve &
# 6. Run ingestion (once, or whenever the book changes)
uv run python ingest.py
# 7. Start services (four separate terminals)
uv run python run.py # Terminal 1 — Flask
uv run celery -A app.tasks worker --concurrency=2 --loglevel=info # Terminal 2 — Celery
# 8. Test
curl http://localhost:5000/health
# → {"status": "ok"}If Ollama runs on a separate GPU server, use SSH port forwarding:
# Stop local Ollama to free the port
sudo systemctl stop ollama
# Open the tunnel — keep this terminal open
ssh -L 11434:localhost:11434 user@gpu-server-ip -N
# Verify the tunnel hits the GPU server (not local)
curl http://localhost:11434/api/tags
# Must show installed models — not {"models":[]}No .env changes needed. The app calls localhost:11434 which the tunnel forwards transparently.
Important: Always start the SSH tunnel before starting Celery workers.
| Variable | Default | Description |
|---|---|---|
FLASK_ENV |
development |
Flask environment |
FLASK_DEBUG |
1 |
Debug mode (set to 0 in production) |
SECRET_KEY |
— | Flask secret key (change in production) |
REDIS_URL |
redis://localhost:6379/0 |
Redis connection URL |
CHROMA_PATH |
./chroma_db |
ChromaDB storage path |
CHROMA_COLLECTION |
climate_academy |
Collection name |
EMBEDDING_MODEL |
all-MiniLM-L6-v2 |
Sentence transformer model |
OLLAMA_BASE_URL |
http://localhost:11434 |
Ollama server URL |
OLLAMA_MODEL |
llama3.1:8b |
Model name (must match ollama list) |
CHUNK_SIZE |
150 |
Words per chunk |
CHUNK_OVERLAP |
30 |
Overlap words between chunks |
TOP_K |
5 |
Chunks retrieved per query |
DISTANCE_THRESHOLD |
0.7 |
Max cosine distance for relevant chunks |
SESSION_TTL_SECONDS |
86400 |
Session expiry (24 hours) |
LOG_LEVEL |
INFO |
Logging level (DEBUG / INFO / WARNING) |
Run whenever the source book changes:
uv run python ingest.py
sudo systemctl restart celery-climate # restart workers to reload collectionCheck indexed chunks:
uv run python -c "
import chromadb; from config import Config
c = chromadb.PersistentClient(path=Config.CHROMA_PATH)
print('Chunks:', c.get_collection(Config.CHROMA_COLLECTION).count())
"Development: all logs print to stdout with format:
2026-05-21 15:30:01 | INFO | app.retriever | Retrieved 5 chunks — distance range: 0.31–0.58
Production: Gunicorn writes to /var/log/climate-rag/. View live:
sudo journalctl -u gunicorn-climate -f
sudo journalctl -u celery-climate -fSee deploy/DEPLOY.md for the complete step-by-step server deployment guide.
| Constraint | Detail |
|---|---|
| LLM concurrency | Ollama handles one inference at a time — concurrent users queue via Celery |
| Vector DB scale | ChromaDB is single-node — suitable for ≤50 concurrent users |
| SSH tunnel | Remote Ollama requires an active SSH tunnel — automated via systemd in production |
| Chunk size | MiniLM max input is ~256 tokens ≈ 180 words — CHUNK_SIZE=150 stays safely within limit |