copyleftdev/palimpsest

Palimpsest

A deterministic crawl kernel. Same input + same seed = identical crawl, identical artifacts, identical replay.

Not a crawler. Not a Wayback clone. The canonical memory layer of the web. See CLAUDE.md for the full philosophy.

Quick Start

# Build
cargo build --release

# Crawl a site
palimpsest crawl https://example.com -d 2 -m 50 -o ./output

# Browser capture (JS-rendered DOM + sub-resources)
palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output

# Replay a captured URL
palimpsest replay https://example.com/ --data-dir ./output

# Show capture history
palimpsest history https://example.com/ --data-dir ./output

# Extract clean text + RAG chunks
palimpsest extract https://example.com/ --data-dir ./output --json

Docker

# Single crawl
docker build -t palimpsest .
docker run -v ./output:/data palimpsest crawl https://example.com -d 2 -o /data

# Full stack (API + frontier + worker)
docker compose up

The compose file runs four services:

| Service  | Purpose                                                   | Port |
|----------|-----------------------------------------------------------|------|
| api      | Retrieval API (`/v1/content`, `/v1/chunks`, `/v1/search`) | 8080 |
| frontier | Distributed frontier server                               | 8090 |
| worker   | Fetch worker (connects to frontier)                       | --   |
| crawl    | One-shot crawl job                                        | --   |

CLI Reference

crawl

palimpsest crawl <SEEDS>... [OPTIONS]

Options:
  -d, --depth <DEPTH>          Max crawl depth [default: 2]
  -m, --max-urls <MAX>         Max URLs to fetch [default: 100]
  -s, --seed <SEED>            Deterministic seed value [default: 42]
  -o, --output-dir <DIR>       Persist to disk (blobs + index + WARC)
      --browser                Use headless Chrome for JS rendering
      --user-agent <UA>        User-Agent string [default: PalimpsestBot/0.1]
      --politeness-ms <MS>     Per-host delay in ms [default: 1000]
  -c, --config <FILE>          Config file (TOML)
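
For the `-c` flag, a hypothetical TOML config sketch that mirrors the CLI flags above. The key names here are illustrative assumptions, not a documented schema; check the project source for the actual config format.

```toml
# Hypothetical keys mirroring the crawl CLI flags; the real schema may differ.
depth = 2
max_urls = 100
seed = 42
output_dir = "./output"
user_agent = "PalimpsestBot/0.1"
politeness_ms = 1000
browser = false
```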

replay

Reconstruct a previously captured URL from stored artifacts.

palimpsest replay <URL> --data-dir <DIR>

history

Show all captures of a URL with timestamps and content hashes.

palimpsest history <URL> --data-dir <DIR>

extract

Extract clean text and RAG-ready chunks from a captured URL.

palimpsest extract <URL> --data-dir <DIR> [--json]

shadow-compare

Compare Palimpsest output against legacy crawler WARC files.

palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]

Reads .warc and .warc.gz files. Normalizes URLs (scheme, fragments, query params) for cross-crawler comparison.
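
One plausible reading of that normalization step, as a minimal Python sketch (illustrative only, not the `palimpsest-shadow` implementation): lowercase the scheme and host, drop the fragment, and sort query parameters so crawlers that serialize URLs differently still join on the same key.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL for cross-crawler comparison:
    lowercase scheme/host, drop the fragment, sort query params."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

# Case, fragment, and query order all collapse to one comparison key:
assert normalize("HTTP://Example.com/a?b=2&a=1#frag") == \
       normalize("http://example.com/a?a=1&b=2")
```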

serve

Start a distributed frontier server. Workers connect to pop URLs and push discoveries.

palimpsest serve --port 8090 --seed 42 --politeness-ms 1000

worker

Connect to a frontier server and crawl.

palimpsest worker --server http://localhost:8090 --output-dir ./data

api

Start the retrieval API server (content, chunks, search, history).

palimpsest api --port 8080 --data-dir ./output

stats / migrate

palimpsest stats           # Print workspace statistics
palimpsest migrate --data-dir ./output  # Run storage migrations

Architecture

| Crate | Responsibility |
|-------|----------------|
| palimpsest-core | Types, BLAKE3 hashing, seeded PRNG, error taxonomy |
| palimpsest-envelope | Sealed execution context (immutable after construction) |
| palimpsest-frontier | Deterministic seed-driven URL scheduler with politeness |
| palimpsest-artifact | WARC++ records, capture groups, reader/writer |
| palimpsest-storage | Content-addressed blobs (memory, filesystem, S3/GCS/Azure) |
| palimpsest-index | Temporal graph index (in-memory + SQLite) |
| palimpsest-fetch | HTTP client + browser capture (CDP) + link extraction |
| palimpsest-replay | Reconstruct from stored artifacts |
| palimpsest-crawl | Orchestrator (main crawl loop) |
| palimpsest-shadow | Shadow comparison engine |
| palimpsest-extract | HTML-to-text extraction + RAG chunking with provenance |
| palimpsest-embed | Embedding generation, vector search, LCS change detection |
| palimpsest-server | HTTP frontier server + retrieval API + Prometheus metrics |
| palimpsest-sim | Deterministic simulation testing (6 adversarial universes) |
| palimpsest-cli | Command-line interface |
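
The storage model in `palimpsest-storage` is content addressing: a blob's hash is its only key, so identical content deduplicates for free and a hash uniquely identifies its bytes. A toy Python sketch of the idea, using stdlib `blake2b` as a stand-in for BLAKE3 (the class and methods here are illustrative, not the crate's API):

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: key = hash(content)."""
    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.blake2b(content, digest_size=32).hexdigest()
        self._blobs[digest] = content  # idempotent: same bytes, same key
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = BlobStore()
h1 = store.put(b"<html>hello</html>")
h2 = store.put(b"<html>hello</html>")   # re-storing identical bytes is a no-op
assert h1 == h2 and len(store._blobs) == 1
assert store.get(h1) == b"<html>hello</html>"
```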

The Six Laws

Every design decision bends around these invariants:

  1. Determinism -- Frontier ordering is seed-driven. No hidden randomness.
  2. Idempotence -- Same URL + same execution context = identical artifact hash.
  3. Content Addressability -- All artifacts are BLAKE3 hash-addressed.
  4. Temporal Integrity -- Every capture binds wall clock + logical clock + crawl context.
  5. Replay Fidelity -- Stored artifacts sufficient to reconstruct HTTP exchange, DOM, resource graph.
  6. Observability as Proof -- Every decision is queryable. Every failure is replayable.
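
Law 1 in miniature: if frontier priority is a pure function of (seed, url) rather than discovery order or wall clock, two crawls with the same seed visit pages identically. A toy Python sketch of that property (not the actual `palimpsest-frontier` algorithm), with `blake2b` standing in for a keyed hash:

```python
import hashlib

def frontier_order(seed: int, urls: list[str]) -> list[str]:
    """Deterministic priority: sort by a hash of (seed, url).
    No RNG state, no dependence on insertion order."""
    def key(url: str) -> str:
        return hashlib.blake2b(f"{seed}:{url}".encode(), digest_size=16).hexdigest()
    return sorted(urls, key=key)

urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
# Same seed: identical order regardless of how the input was permuted.
assert frontier_order(42, urls) == frontier_order(42, list(reversed(urls)))
```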

Browser Capture

palimpsest crawl https://zuub.com --browser -d 0 -m 1 -o ./output

Launches headless Chrome via CDP (Chrome DevTools Protocol). Captures:

  • Rendered DOM after JavaScript execution
  • All sub-resources (CSS, JS, images, fonts) via CDP Network events
  • Resource dependency graph with load ordering
  • Determinism overrides: Date.now(), Math.random(), performance.now() seeded from CrawlSeed

Distributed Crawling

Run a frontier server and connect N workers for horizontal scaling:

# Terminal 1: Start frontier
palimpsest serve --port 8090 --seed 42

# Terminal 2: Seed URLs
curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com"]}'

# Terminal 3+: Start workers
palimpsest worker --server http://localhost:8090 --output-dir ./data

Workers pop URLs from the frontier, fetch pages, push discovered links back. The frontier maintains deterministic ordering and politeness across all workers.

Retrieval API

The api server exposes captured content for AI consumption:

palimpsest api --port 8080 --data-dir ./output

| Endpoint | Description |
|----------|-------------|
| GET /v1/content?url=... | Raw captured content |
| GET /v1/chunks?url=... | RAG-ready chunks with provenance |
| GET /v1/history?url=... | All captures with timestamps and hashes |
| GET /v1/search?q=... | Full-text search across captured content |
| GET /metrics | Prometheus-compatible metrics |

Shadow Comparison

Validate Palimpsest output against legacy crawlers:

# Crawl with both tools
wget --warc-file=legacy -r -l 1 https://example.com/
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out

Reads Heritrix, Warcprox, and wget WARC files (.warc and .warc.gz). Reports matched URLs, mismatches with byte-level size diffs, and coverage gaps.
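
At its core the comparison is a join on normalized URL. A hedged Python sketch of the report categories (matched, mismatches with byte-size diffs, coverage gaps), not the actual `palimpsest-shadow` engine; the `{url: body_size}` input shape is an assumption for illustration:

```python
def shadow_compare(legacy: dict[str, int], palimpsest: dict[str, int]) -> dict:
    """Compare two {normalized_url: body_size} maps from different crawlers."""
    common = legacy.keys() & palimpsest.keys()
    return {
        "matched":    sorted(u for u in common if legacy[u] == palimpsest[u]),
        "mismatched": {u: palimpsest[u] - legacy[u]            # byte-size diff
                       for u in common if legacy[u] != palimpsest[u]},
        "only_legacy":     sorted(legacy.keys() - palimpsest.keys()),   # coverage gap
        "only_palimpsest": sorted(palimpsest.keys() - legacy.keys()),
    }

report = shadow_compare(
    {"http://a/": 100, "http://b/": 200, "http://c/": 50},
    {"http://a/": 100, "http://b/": 180},
)
assert report["matched"] == ["http://a/"]
assert report["mismatched"] == {"http://b/": -20}
assert report["only_legacy"] == ["http://c/"]
```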

Monitoring

Prometheus metrics are exposed at /metrics on the API server:

palimpsest_urls_fetched
palimpsest_urls_failed
palimpsest_urls_discovered
palimpsest_robots_blocked
palimpsest_bytes_stored
palimpsest_blobs_stored
palimpsest_api_requests
palimpsest_frontier_pops
palimpsest_frontier_pushes

Testing

# Run all tests (301 tests)
cargo test --workspace

# Simulation tests (determinism verification)
cargo test -p palimpsest-sim

# Scale tests (1K and 5K pages)
cargo test -p palimpsest-sim --test scale_test

# Stress test (10K pages across 5 adversarial universes)
cargo test -p palimpsest-sim --test stress_test

The simulation framework (palimpsest-sim) verifies determinism across six adversarial universes: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, TemporalDrift. Same seed = identical crawl results, proven at 10,000 pages with zero divergence.
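
The verification pattern reduces to: run the same crawl twice with the same seed and assert the resulting artifacts are bit-identical. A toy Python sketch of that harness shape, with a seeded PRNG standing in for the crawl and one digest summarizing the run (the real adversarial universes live in `palimpsest-sim`):

```python
import hashlib
import random

def toy_crawl(seed: int, pages: int) -> str:
    """Stand-in for a crawl: a seeded PRNG drives every 'decision',
    and the whole run is summarized as a single hash over its artifacts."""
    rng = random.Random(seed)
    h = hashlib.blake2b()
    for i in range(pages):
        h.update(f"page-{i}:{rng.random()}".encode())
    return h.hexdigest()

# Same seed => identical run hash; different seed => divergence.
assert toy_crawl(42, 10_000) == toy_crawl(42, 10_000)
assert toy_crawl(42, 10_000) != toy_crawl(43, 10_000)
```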

License

MIT OR Apache-2.0

About

Deterministic crawl kernel in Rust — same seed, identical crawl, identical replay. Browser capture via CDP, BLAKE3 content-addressed storage, WARC++ output, shadow comparison against Heritrix/wget. Built with Claude Code in one session: 269 tests, 12 crates, zero invariant violations.
