copyleftdev/palimpsest

Palimpsest

A deterministic crawl kernel. Same input + same seed = identical crawl, identical artifacts, identical replay.

Not a crawler. Not a Wayback clone. The canonical memory layer of the web. See CLAUDE.md for the full philosophy.

Quick Start

# Build
cargo build --release

# Crawl a site
palimpsest crawl https://example.com -d 2 -m 50 -o ./output

# Browser capture (JS-rendered DOM + sub-resources)
palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output

# Replay a captured URL
palimpsest replay https://example.com/ --data-dir ./output

# Show capture history
palimpsest history https://example.com/ --data-dir ./output

# Extract clean text + RAG chunks
palimpsest extract https://example.com/ --data-dir ./output --json

Docker

# Single crawl
docker build -t palimpsest .
docker run -v ./output:/data palimpsest crawl https://example.com -d 2 -o /data

# Full stack (API + frontier + worker)
docker compose up

The compose file runs four services:

| Service  | Purpose                                                   | Port |
|----------|-----------------------------------------------------------|------|
| api      | Retrieval API (`/v1/content`, `/v1/chunks`, `/v1/search`) | 8080 |
| frontier | Distributed frontier server                               | 8090 |
| worker   | Fetch worker (connects to frontier)                       | --   |
| crawl    | One-shot crawl job                                        | --   |

CLI Reference

crawl

palimpsest crawl <SEEDS>... [OPTIONS]

Options:
  -d, --depth <DEPTH>          Max crawl depth [default: 2]
  -m, --max-urls <MAX>         Max URLs to fetch [default: 100]
  -s, --seed <SEED>            Deterministic seed value [default: 42]
  -o, --output-dir <DIR>       Persist to disk (blobs + index + WARC)
      --browser                Use headless Chrome for JS rendering
      --user-agent <UA>        User-Agent string [default: PalimpsestBot/0.1]
      --politeness-ms <MS>     Per-host delay in ms [default: 1000]
  -c, --config <FILE>          Config file (TOML)
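
For the `-c` flag, a hypothetical TOML config sketch that mirrors the CLI flags above. The key names here are illustrative assumptions, not a documented schema; check the project source for the actual config format.

```toml
# Hypothetical keys mirroring the crawl CLI flags; the real schema may differ.
depth = 2
max_urls = 100
seed = 42
output_dir = "./output"
user_agent = "PalimpsestBot/0.1"
politeness_ms = 1000
browser = false
```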

replay

Reconstruct a previously captured URL from stored artifacts.

palimpsest replay <URL> --data-dir <DIR>

history

Show all captures of a URL with timestamps and content hashes.

palimpsest history <URL> --data-dir <DIR>

extract

Extract clean text and RAG-ready chunks from a captured URL.

palimpsest extract <URL> --data-dir <DIR> [--json]

shadow-compare

Compare Palimpsest output against legacy crawler WARC files.

palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]

Reads .warc and .warc.gz files. Normalizes URLs (scheme, fragments, query params) for cross-crawler comparison.
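
One plausible reading of that normalization step, as a minimal Python sketch (illustrative only, not the `palimpsest-shadow` implementation): lowercase the scheme and host, drop the fragment, and sort query parameters so crawlers that serialize URLs differently still join on the same key.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url: str) -> str:
    """Canonicalize a URL for cross-crawler comparison:
    lowercase scheme/host, drop the fragment, sort query params."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

# Case, fragment, and query order all collapse to one comparison key:
assert normalize("HTTP://Example.com/a?b=2&a=1#frag") == \
       normalize("http://example.com/a?a=1&b=2")
```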

serve

Start a distributed frontier server. Workers connect to pop URLs and push discoveries.

palimpsest serve --port 8090 --seed 42 --politeness-ms 1000

worker

Connect to a frontier server and crawl.

palimpsest worker --server http://localhost:8090 --output-dir ./data

api

Start the retrieval API server (content, chunks, search, history).

palimpsest api --port 8080 --data-dir ./output

stats / migrate

palimpsest stats           # Print workspace statistics
palimpsest migrate --data-dir ./output  # Run storage migrations

Architecture

| Crate | Responsibility |
|-------|----------------|
| palimpsest-core | Types, BLAKE3 hashing, seeded PRNG, error taxonomy |
| palimpsest-envelope | Sealed execution context (immutable after construction) |
| palimpsest-frontier | Deterministic seed-driven URL scheduler with politeness |
| palimpsest-artifact | WARC++ records, capture groups, reader/writer |
| palimpsest-storage | Content-addressed blobs (memory, filesystem, S3/GCS/Azure) |
| palimpsest-index | Temporal graph index (in-memory + SQLite) |
| palimpsest-fetch | HTTP client + browser capture (CDP) + link extraction |
| palimpsest-replay | Reconstruct from stored artifacts |
| palimpsest-crawl | Orchestrator (main crawl loop) |
| palimpsest-shadow | Shadow comparison engine |
| palimpsest-extract | HTML-to-text extraction + RAG chunking with provenance |
| palimpsest-embed | Embedding generation, vector search, LCS change detection |
| palimpsest-server | HTTP frontier server + retrieval API + Prometheus metrics |
| palimpsest-sim | Deterministic simulation testing (6 adversarial universes) |
| palimpsest-cli | Command-line interface |
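
The storage model in `palimpsest-storage` is content addressing: a blob's hash is its only key, so identical content deduplicates for free and a hash uniquely identifies its bytes. A toy Python sketch of the idea, using stdlib `blake2b` as a stand-in for BLAKE3 (the class and methods here are illustrative, not the crate's API):

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: key = hash(content)."""
    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.blake2b(content, digest_size=32).hexdigest()
        self._blobs[digest] = content  # idempotent: same bytes, same key
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

store = BlobStore()
h1 = store.put(b"<html>hello</html>")
h2 = store.put(b"<html>hello</html>")   # re-storing identical bytes is a no-op
assert h1 == h2 and len(store._blobs) == 1
assert store.get(h1) == b"<html>hello</html>"
```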

The Six Laws

Every design decision bends around these invariants:

  1. Determinism -- Frontier ordering is seed-driven. No hidden randomness.
  2. Idempotence -- Same URL + same execution context = identical artifact hash.
  3. Content Addressability -- All artifacts are BLAKE3 hash-addressed.
  4. Temporal Integrity -- Every capture binds wall clock + logical clock + crawl context.
  5. Replay Fidelity -- Stored artifacts sufficient to reconstruct HTTP exchange, DOM, resource graph.
  6. Observability as Proof -- Every decision is queryable. Every failure is replayable.
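
Law 1 in miniature: if frontier priority is a pure function of (seed, url) rather than discovery order or wall clock, two crawls with the same seed visit pages identically. A toy Python sketch of that property (not the actual `palimpsest-frontier` algorithm), with `blake2b` standing in for a keyed hash:

```python
import hashlib

def frontier_order(seed: int, urls: list[str]) -> list[str]:
    """Deterministic priority: sort by a hash of (seed, url).
    No RNG state, no dependence on insertion order."""
    def key(url: str) -> str:
        return hashlib.blake2b(f"{seed}:{url}".encode(), digest_size=16).hexdigest()
    return sorted(urls, key=key)

urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
# Same seed: identical order regardless of how the input was permuted.
assert frontier_order(42, urls) == frontier_order(42, list(reversed(urls)))
```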

Browser Capture

palimpsest crawl https://zuub.com --browser -d 0 -m 1 -o ./output

Launches headless Chrome via CDP (Chrome DevTools Protocol). Captures:

  • Rendered DOM after JavaScript execution
  • All sub-resources (CSS, JS, images, fonts) via CDP Network events
  • Resource dependency graph with load ordering
  • Determinism overrides: Date.now(), Math.random(), performance.now() seeded from CrawlSeed

Distributed Crawling

Run a frontier server and connect N workers for horizontal scaling:

# Terminal 1: Start frontier
palimpsest serve --port 8090 --seed 42

# Terminal 2: Seed URLs
curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com"]}'

# Terminal 3+: Start workers
palimpsest worker --server http://localhost:8090 --output-dir ./data

Workers pop URLs from the frontier, fetch pages, push discovered links back. The frontier maintains deterministic ordering and politeness across all workers.

Retrieval API

The api server exposes captured content for AI consumption:

palimpsest api --port 8080 --data-dir ./output

| Endpoint | Description |
|----------|-------------|
| GET /v1/content?url=... | Raw captured content |
| GET /v1/chunks?url=... | RAG-ready chunks with provenance |
| GET /v1/history?url=... | All captures with timestamps and hashes |
| GET /v1/search?q=... | Full-text search across captured content |
| GET /metrics | Prometheus-compatible metrics |

Shadow Comparison

Validate Palimpsest output against legacy crawlers:

# Crawl with both tools
wget --warc-file=legacy -r -l 1 https://example.com/
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out

Reads Heritrix, Warcprox, and wget WARC files (.warc and .warc.gz). Reports matched URLs, mismatches with byte-level size diffs, and coverage gaps.
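
At its core the comparison is a join on normalized URL. A hedged Python sketch of the report categories (matched, mismatches with byte-size diffs, coverage gaps), not the actual `palimpsest-shadow` engine; the `{url: body_size}` input shape is an assumption for illustration:

```python
def shadow_compare(legacy: dict[str, int], palimpsest: dict[str, int]) -> dict:
    """Compare two {normalized_url: body_size} maps from different crawlers."""
    common = legacy.keys() & palimpsest.keys()
    return {
        "matched":    sorted(u for u in common if legacy[u] == palimpsest[u]),
        "mismatched": {u: palimpsest[u] - legacy[u]            # byte-size diff
                       for u in common if legacy[u] != palimpsest[u]},
        "only_legacy":     sorted(legacy.keys() - palimpsest.keys()),   # coverage gap
        "only_palimpsest": sorted(palimpsest.keys() - legacy.keys()),
    }

report = shadow_compare(
    {"http://a/": 100, "http://b/": 200, "http://c/": 50},
    {"http://a/": 100, "http://b/": 180},
)
assert report["matched"] == ["http://a/"]
assert report["mismatched"] == {"http://b/": -20}
assert report["only_legacy"] == ["http://c/"]
```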

Monitoring

Prometheus metrics are exposed at /metrics on the API server:

palimpsest_urls_fetched
palimpsest_urls_failed
palimpsest_urls_discovered
palimpsest_robots_blocked
palimpsest_bytes_stored
palimpsest_blobs_stored
palimpsest_api_requests
palimpsest_frontier_pops
palimpsest_frontier_pushes

Testing

# Run all tests (301 tests)
cargo test --workspace

# Simulation tests (determinism verification)
cargo test -p palimpsest-sim

# Scale tests (1K and 5K pages)
cargo test -p palimpsest-sim --test scale_test

# Stress test (10K pages across 5 adversarial universes)
cargo test -p palimpsest-sim --test stress_test

The simulation framework (palimpsest-sim) verifies determinism across six adversarial universes: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, TemporalDrift. Same seed = identical crawl results, proven at 10,000 pages with zero divergence.
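
The verification pattern reduces to: run the same crawl twice with the same seed and assert the resulting artifacts are bit-identical. A toy Python sketch of that harness shape, with a seeded PRNG standing in for the crawl and one digest summarizing the run (the real adversarial universes live in `palimpsest-sim`):

```python
import hashlib
import random

def toy_crawl(seed: int, pages: int) -> str:
    """Stand-in for a crawl: a seeded PRNG drives every 'decision',
    and the whole run is summarized as a single hash over its artifacts."""
    rng = random.Random(seed)
    h = hashlib.blake2b()
    for i in range(pages):
        h.update(f"page-{i}:{rng.random()}".encode())
    return h.hexdigest()

# Same seed => identical run hash; different seed => divergence.
assert toy_crawl(42, 10_000) == toy_crawl(42, 10_000)
assert toy_crawl(42, 10_000) != toy_crawl(43, 10_000)
```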

License

MIT OR Apache-2.0

About

Deterministic crawl kernel in Rust — same seed, identical crawl, identical replay. Browser capture via CDP, BLAKE3 content-addressed storage, WARC++ output, shadow comparison against Heritrix/wget. Built with Claude Code in one session: 269 tests, 12 crates, zero invariant violations.
