A deterministic crawl kernel. Same input + same seed = identical crawl, identical artifacts, identical replay.
Not a crawler. Not a Wayback clone. The canonical memory layer of the web. See CLAUDE.md for the full philosophy.
```bash
# Build
cargo build --release

# Crawl a site
palimpsest crawl https://example.com -d 2 -m 50 -o ./output

# Browser capture (JS-rendered DOM + sub-resources)
palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output

# Replay a captured URL
palimpsest replay https://example.com/ --data-dir ./output

# Show capture history
palimpsest history https://example.com/ --data-dir ./output

# Extract clean text + RAG chunks
palimpsest extract https://example.com/ --data-dir ./output --json
```

```bash
# Single crawl
docker build -t palimpsest .
docker run -v ./output:/data palimpsest crawl https://example.com -d 2 -o /data

# Full stack (API + frontier + worker)
docker compose up
```

The compose file runs four services:
| Service | Purpose | Port |
|---|---|---|
| `api` | Retrieval API (`/v1/content`, `/v1/chunks`, `/v1/search`) | 8080 |
| `frontier` | Distributed frontier server | 8090 |
| `worker` | Fetch worker (connects to frontier) | -- |
| `crawl` | One-shot crawl job | -- |
```
palimpsest crawl <SEEDS>... [OPTIONS]

Options:
  -d, --depth <DEPTH>         Max crawl depth [default: 2]
  -m, --max-urls <MAX>        Max URLs to fetch [default: 100]
  -s, --seed <SEED>           Deterministic seed value [default: 42]
  -o, --output-dir <DIR>      Persist to disk (blobs + index + WARC)
      --browser               Use headless Chrome for JS rendering
      --user-agent <UA>       User-Agent string [default: PalimpsestBot/0.1]
      --politeness-ms <MS>    Per-host delay in ms [default: 1000]
  -c, --config <FILE>         Config file (TOML)
```
Reconstruct a previously captured URL from stored artifacts.

```bash
palimpsest replay <URL> --data-dir <DIR>
```

Show all captures of a URL with timestamps and content hashes.

```bash
palimpsest history <URL> --data-dir <DIR>
```
Extract clean text and RAG-ready chunks from a captured URL.

```bash
palimpsest extract <URL> --data-dir <DIR> [--json]
```

Compare Palimpsest output against legacy crawler WARC files.

```bash
palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]
```

Reads `.warc` and `.warc.gz` files. Normalizes URLs (scheme, fragments, query params) for cross-crawler comparison.
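The normalization step can be approximated in a few lines. A minimal Python sketch (illustrative only, not Palimpsest's actual normalizer), assuming fragments are dropped and query parameters are sorted:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL for cross-crawler comparison: lowercase
    scheme/host, drop the fragment, and sort query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))

print(normalize("HTTPS://Example.com/a?b=2&a=1#frag"))
# → https://example.com/a?a=1&b=2
```

Two crawlers that fetched the same page under superficially different URLs then compare equal after normalization.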
Start a distributed frontier server. Workers connect to pop URLs and push discoveries.

```bash
palimpsest serve --port 8090 --seed 42 --politeness-ms 1000
```

Connect to a frontier server and crawl.

```bash
palimpsest worker --server http://localhost:8090 --output-dir ./data
```

Start the retrieval API server (content, chunks, search, history).

```bash
palimpsest api --port 8080 --data-dir ./output
```

```bash
palimpsest stats                          # Print workspace statistics
palimpsest migrate --data-dir ./output    # Run storage migrations
```
| Crate | Responsibility |
|---|---|
| `palimpsest-core` | Types, BLAKE3 hashing, seeded PRNG, error taxonomy |
| `palimpsest-envelope` | Sealed execution context (immutable after construction) |
| `palimpsest-frontier` | Deterministic seed-driven URL scheduler with politeness |
| `palimpsest-artifact` | WARC++ records, capture groups, reader/writer |
| `palimpsest-storage` | Content-addressed blobs (memory, filesystem, S3/GCS/Azure) |
| `palimpsest-index` | Temporal graph index (in-memory + SQLite) |
| `palimpsest-fetch` | HTTP client + browser capture (CDP) + link extraction |
| `palimpsest-replay` | Reconstruct from stored artifacts |
| `palimpsest-crawl` | Orchestrator (main crawl loop) |
| `palimpsest-shadow` | Shadow comparison engine |
| `palimpsest-extract` | HTML-to-text extraction + RAG chunking with provenance |
| `palimpsest-embed` | Embedding generation, vector search, LCS change detection |
| `palimpsest-server` | HTTP frontier server + retrieval API + Prometheus metrics |
| `palimpsest-sim` | Deterministic simulation testing (6 adversarial universes) |
| `palimpsest-cli` | Command-line interface |
Every design decision bends around these invariants:
- Determinism -- Frontier ordering is seed-driven. No hidden randomness.
- Idempotence -- Same URL + same execution context = identical artifact hash.
- Content Addressability -- All artifacts are BLAKE3 hash-addressed.
- Temporal Integrity -- Every capture binds wall clock + logical clock + crawl context.
- Replay Fidelity -- Stored artifacts sufficient to reconstruct HTTP exchange, DOM, resource graph.
- Observability as Proof -- Every decision is queryable. Every failure is replayable.
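Content addressability in miniature: an artifact's identity is the hash of its bytes, so identical fetches collapse to the same blob. A Python sketch of the idea (using SHA-256 from the standard library as a stand-in for BLAKE3; the store is a plain dict, not Palimpsest's storage layer):

```python
import hashlib

def blob_address(content: bytes) -> str:
    """Address an artifact by the hash of its bytes (BLAKE3 stand-in)."""
    return hashlib.sha256(content).hexdigest()

store: dict[str, bytes] = {}

def put(content: bytes) -> str:
    addr = blob_address(content)
    store[addr] = content  # idempotent: same bytes, same slot
    return addr

a = put(b"<html>hello</html>")
b = put(b"<html>hello</html>")
assert a == b and len(store) == 1  # identical content = identical address
```

This is what makes idempotence checkable: re-fetching an unchanged page produces the same address, so a second capture is a no-op in storage.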
```bash
palimpsest crawl https://zuub.com --browser -d 0 -m 1 -o ./output
```

Launches headless Chrome via CDP (Chrome DevTools Protocol). Captures:

- Rendered DOM after JavaScript execution
- All sub-resources (CSS, JS, images, fonts) via CDP Network events
- Resource dependency graph with load ordering
- Determinism overrides: `Date.now()`, `Math.random()`, `performance.now()` seeded from `CrawlSeed`
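The seeding idea behind those overrides, in miniature: a PRNG initialized from a fixed seed yields the identical sequence on every run, which is what makes a page's `Math.random()` calls replayable. A Python sketch (illustrative of the principle, not the CDP injection itself):

```python
import random

def seeded_sequence(crawl_seed: int, n: int) -> list[float]:
    """Every run with the same seed produces the same 'random' values."""
    rng = random.Random(crawl_seed)
    return [rng.random() for _ in range(n)]

run1 = seeded_sequence(42, 5)
run2 = seeded_sequence(42, 5)
assert run1 == run2  # replayable randomness: same seed, same sequence
```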
Run a frontier server and connect N workers for horizontal scaling:

```bash
# Terminal 1: Start frontier
palimpsest serve --port 8090 --seed 42

# Terminal 2: Seed URLs
curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com"]}'

# Terminal 3+: Start workers
palimpsest worker --server http://localhost:8090 --output-dir ./data
```

Workers pop URLs from the frontier, fetch pages, and push discovered links back. The frontier maintains deterministic ordering and politeness across all workers.
The `api` server exposes captured content for AI consumption:

```bash
palimpsest api --port 8080 --data-dir ./output
```

| Endpoint | Description |
|---|---|
| `GET /v1/content?url=...` | Raw captured content |
| `GET /v1/chunks?url=...` | RAG-ready chunks with provenance |
| `GET /v1/history?url=...` | All captures with timestamps and hashes |
| `GET /v1/search?q=...` | Full-text search across captured content |
| `GET /metrics` | Prometheus-compatible metrics |
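Consuming the retrieval API from code; a minimal client sketch using only the standard library (endpoint paths come from the table above, but the JSON response shape is an assumption and may differ from the real payload):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def chunks_endpoint(base: str, page_url: str) -> str:
    """Build the /v1/chunks request URL for a captured page."""
    return f"{base.rstrip('/')}/v1/chunks?{urlencode({'url': page_url})}"

def fetch_chunks(base: str, page_url: str):
    # Assumes the endpoint returns JSON; the actual schema may differ.
    with urlopen(chunks_endpoint(base, page_url)) as resp:
        return json.load(resp)

# chunks_endpoint("http://localhost:8080", "https://example.com/")
# → "http://localhost:8080/v1/chunks?url=https%3A%2F%2Fexample.com%2F"
```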
Validate Palimpsest output against legacy crawlers:

```bash
# Crawl with both tools
wget --warc-file=legacy -r -l 1 https://example.com/
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out
```

Reads Heritrix, Warcprox, and wget WARC files (`.warc` and `.warc.gz`). Reports matched URLs, mismatches with byte-level size diffs, and coverage gaps.
Prometheus metrics are exposed at `/metrics` on the API server:

```
palimpsest_urls_fetched
palimpsest_urls_failed
palimpsest_urls_discovered
palimpsest_robots_blocked
palimpsest_bytes_stored
palimpsest_blobs_stored
palimpsest_api_requests
palimpsest_frontier_pops
palimpsest_frontier_pushes
```
```bash
# Run all tests (301 tests)
cargo test --workspace

# Simulation tests (determinism verification)
cargo test -p palimpsest-sim

# Scale tests (1K and 5K pages)
cargo test -p palimpsest-sim --test scale_test

# Stress test (10K pages across 5 adversarial universes)
cargo test -p palimpsest-sim --test stress_test
```

The simulation framework (`palimpsest-sim`) verifies determinism across six adversarial universes: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, TemporalDrift. Same seed = identical crawl results, proven at 10,000 pages with zero divergence.
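The determinism check behind these simulations boils down to: run the same crawl with the same seed twice and compare a digest over every artifact. A Python sketch of that harness shape (a toy crawl whose only inputs are seed and page count, not `palimpsest-sim` itself):

```python
import hashlib
import random

def simulated_crawl(seed: int, pages: int) -> str:
    """Toy 'crawl': derives every page body from the seed alone and
    returns a digest over all artifact hashes."""
    rng = random.Random(seed)
    h = hashlib.sha256()
    for i in range(pages):
        body = f"page-{i}-{rng.getrandbits(64)}".encode()
        h.update(hashlib.sha256(body).digest())  # hash of each artifact
    return h.hexdigest()

assert simulated_crawl(42, 10_000) == simulated_crawl(42, 10_000)  # zero divergence
assert simulated_crawl(42, 10_000) != simulated_crawl(43, 10_000)  # seed changes everything
```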
MIT OR Apache-2.0