diff --git a/docs/design/benchmarking.md b/docs/design/benchmarking.md
new file mode 100644
index 0000000..9ca2aba
--- /dev/null
+++ b/docs/design/benchmarking.md
@@ -0,0 +1,222 @@
+# Benchmarking
+
+> Status: design rationale for the benchmark suite under [`benches/`](../../benches)
+> and shared benchmark support under [`bench-support/`](../../bench-support).
+> Companion to [`design.md`](design.md) §10 and the benchmark reference docs.
+
+cachekit benchmarks are designed to answer cache questions, not just produce
+fast-looking numbers. A cache policy can be excellent on uniform keys and weak
+under scans, or fast on micro-operations and poor at preserving hit rate. The
+benchmark suite therefore separates micro-operation cost, policy effectiveness,
+trace-shaped workloads, reporting, and machine-readable artifacts.
+
+## Goals
+
+- Compare policies under workload shapes that resemble real cache traffic.
+- Keep measured loops free of allocator noise and dynamic dispatch.
+- Produce both human-readable reports and stable JSON artifacts.
+- Preserve enough metadata to reproduce a run: git commit, branch, dirty bit,
+  rustc version, host triple, CPU model, capacity, universe, operations, seed.
+- Make adding a policy or workload a registry edit, not a benchmark rewrite.
+
+## Benchmark Layers
+
+The benchmark suite has four layers:
+
+| Layer | Files | Purpose |
+|---|---|---|
+| Criterion measurements | `benches/workloads.rs`, `benches/ops.rs`, `benches/comparison.rs`, `benches/policy/*.rs` | statistically sampled latency and throughput |
+| Console reports | `benches/reports.rs` | fast, readable tables without Criterion overhead |
+| JSON artifact runner | `benches/runner.rs` | structured output for docs, charts, CI, historical comparison |
+| Shared support crate | `bench-support/` | policy registry, workloads, metrics, JSON schema, doc renderer |
+
+This split is deliberate. Criterion is good for micro-benchmark statistics; the
+artifact runner is good for automation; console reports are good while tuning a
+policy locally. No single binary is forced to serve every audience.
+
+## Monomorphic Policy Registry
+
+Benchmarks iterate policies through `for_each_policy!` in
+[`bench-support/src/registry.rs`](../../bench-support/src/registry.rs):
+
+```rust,ignore
+for_each_policy! {
+    with |policy_id, display_name, make_cache| {
+        let mut cache = make_cache(CAPACITY);
+        // measured workload...
+    }
+}
+```
+
+The macro expands to one block per concrete policy type. This avoids dynamic
+dispatch in the measured loop while keeping policy iteration centralized.
+`POLICIES` in the same module provides presentation metadata (stable id,
+display name, chart color) for renderers and reports.
+
+The trade-off is that adding a policy touches the macro and metadata table. A
+test (`policies_metadata_matches_macro`) keeps the two from drifting. This is
+the same explicit-boilerplate-over-magic choice as `DynCache`: more arms in
+source, fewer surprises in hot code.
+
+## Workload Registry
+
+Workload definitions live in `bench-support/src/registry.rs`; generators live in
+[`bench-support/src/workload.rs`](../../bench-support/src/workload.rs). The
+current standard workloads cover:
+
+- Uniform random keys for raw overhead baselines.
+- Hot-set access for explicit skew.
+- Sequential scan for scan-pollution stress.
+- Zipfian and scrambled Zipfian for power-law access.
+- Latest / recency-biased access.
+- Shifting hotspots and flash crowds for adaptation.
+- Composite scan-resistance mixes.
+
+[`docs/benchmarks/workloads.md`](../benchmarks/workloads.md) is the catalog. It
+also lists a large roadmap of planned workloads, which should not be confused
+with implemented cases. New workloads should land first in the support crate,
+then in the docs, then in reports.
+
+## Value Construction Discipline
+
+`benches/runner.rs` pre-allocates one `Arc<u64>` per key in the universe and
+passes a closure that returns an `Arc::clone` of the stored value:
+
+```rust,ignore
+fn preallocate_values() -> Vec<Arc<u64>> {
+    (0..UNIVERSE).map(Arc::new).collect()
+}
+```
+
+The rule is: **do not allocate values inside the measured operation loop**.
+Allocating on every miss makes the benchmark measure the allocator and value
+constructor, not the policy. A cheap `Arc::clone` isolates hit/miss behaviour,
+eviction order, and policy metadata overhead.
+
+This is especially important because policies store values differently:
+`FastLru` stores `V` directly, while LRU / LFU / Heap-LFU use `Arc<V>` in some
+paths. Pre-allocation keeps those representation differences from dominating
+the benchmark.
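+
+A minimal sketch of the rule, using a `HashMap` as a stand-in for a policy
+(the `run_workload` helper and the `universe` parameter are illustrative
+assumptions, not the code in `benches/runner.rs`):
+
+```rust
+use std::collections::HashMap;
+use std::sync::Arc;
+
+/// Builds one shared value per key before any measurement starts.
+fn preallocate_values(universe: u64) -> Vec<Arc<u64>> {
+    (0..universe).map(Arc::new).collect()
+}
+
+/// Replays `keys` against a stand-in cache and returns the hit count.
+/// Hypothetical helper: a real benchmark would drive a cachekit policy.
+fn run_workload(keys: &[u64], values: &[Arc<u64>]) -> usize {
+    let mut cache: HashMap<u64, Arc<u64>> = HashMap::new();
+    let mut hits = 0;
+    for &key in keys {
+        if cache.contains_key(&key) {
+            hits += 1;
+        } else {
+            // Miss path: a refcount bump, not a fresh value allocation.
+            cache.insert(key, Arc::clone(&values[key as usize]));
+        }
+    }
+    hits
+}
+```
+
+Only the value construction moves out of the loop; any allocation the cache
+itself performs remains part of what is measured.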
+
+## Artifact Schema
+
+`bench-support/src/json_results.rs` defines the stable JSON schema for results:
+
+- `SCHEMA_VERSION` follows semantic-versioning rules.
+- Removing or renaming a required field is a major bump.
+- Adding an optional field is a minor bump.
+- Renderers accept any artifact whose major version matches their own.
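+
+The compatibility rule can be sketched as follows, assuming the version is
+stored as a dotted `MAJOR.MINOR` string (the storage format and both function
+names are assumptions for illustration, not taken from `json_results.rs`):
+
+```rust
+/// Extracts the major component of a dotted version string.
+fn major(version: &str) -> Option<u64> {
+    version.split('.').next()?.parse().ok()
+}
+
+/// Hypothetical renderer check: accept an artifact only on a major match.
+fn renderer_accepts(renderer_version: &str, artifact_version: &str) -> bool {
+    match (major(renderer_version), major(artifact_version)) {
+        (Some(r), Some(a)) => r == a,
+        // An unparseable version is rejected rather than guessed at.
+        _ => false,
+    }
+}
+```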
+
+Each `BenchmarkArtifact` contains:
+
+- `metadata`: timestamp, git commit, branch, dirty bit, rustc, host, CPU,
+  benchmark config.
+- `results`: rows keyed by policy, workload, and `case_id`.
+- `metrics`: optional typed sections for hit rate, throughput, latency,
+  eviction, scan resistance, adaptation speed.
+
+The schema is presentation-neutral. Markdown tables and charts are rendered
+later by `bench-support/src/bin/render_docs.rs`, so measurement and presentation
+can evolve independently.
+
+## Case IDs
+
+Use `case_id::*` constants from `json_results.rs` instead of string literals:
+
+- `hit_rate`
+- `comprehensive`
+- `scan_resistance`
+- `adaptation`
+
+This catches typos at compile time and prevents a result section from silently
+disappearing from rendered docs. Adding a new case means adding a constant,
+teaching the runner to populate it, and teaching the renderer how to display it.
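+
+As a sketch of why constants help, assume the module looks roughly like this
+(the constant names and the `section_title` lookup are hypothetical; only the
+string values come from the list above):
+
+```rust
+/// Assumed constant names; the string values are the case ids listed above.
+pub mod case_id {
+    pub const HIT_RATE: &str = "hit_rate";
+    pub const COMPREHENSIVE: &str = "comprehensive";
+    pub const SCAN_RESISTANCE: &str = "scan_resistance";
+    pub const ADAPTATION: &str = "adaptation";
+}
+
+/// Hypothetical renderer lookup: maps a case id to a section title.
+pub fn section_title(id: &str) -> Option<&'static str> {
+    match id {
+        case_id::HIT_RATE => Some("Hit rate"),
+        case_id::COMPREHENSIVE => Some("Comprehensive"),
+        case_id::SCAN_RESISTANCE => Some("Scan resistance"),
+        case_id::ADAPTATION => Some("Adaptation"),
+        // Unknown ids are skipped, which is how a typo in a literal hides a section.
+        _ => None,
+    }
+}
+```
+
+A misspelled constant path fails to compile, whereas a misspelled string
+literal only falls through to the `None` arm at render time.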
+
+## What Each Benchmark Answers
+
+| Benchmark | Question |
+|---|---|
+| `ops.rs` | What is the raw cost of `get` / `insert` / policy-specific operations? |
+| `workloads.rs` | Which policies preserve hit rate under standard workloads? |
+| `comparison.rs` | How does cachekit compare with external crates (`lru`, `quick_cache`)? |
+| `policy/*.rs` | What is the cost of each policy's unique operations? |
+| `reports.rs` | What should a human inspect while tuning? |
+| `runner.rs` | What should CI and docs consume? |
+
+Do not overload one benchmark to answer all questions. If you need policy
+micro-cost, use `ops.rs`; if you need hit rate under scans, use `workloads.rs`
+or `runner.rs`.
+
+## Reproducibility Rules
+
+- Seed every workload. Default seed is 42 unless a benchmark is explicitly
+  sweeping seeds.
+- Record the git dirty bit. Dirty runs are useful locally but should not be
+  published as release baselines without a note.
+- Keep capacity, universe, and operation count visible in the artifact.
+- Prefer `ScrambledZipfian` over raw `Zipfian` for cross-policy comparison when
+  hardware prefetch could bias hot-key locality.
+- Do not compare results across machines without CPU metadata. Tail latency and
+  pointer-heavy policy cost are machine-sensitive.
+
+## CI and Documentation Flow
+
+The docs pipeline runs the benchmark suite, writes
+`target/benchmarks/<run-id>/results.json`, and renders
+`docs/benchmarks/latest/` plus charts. Release-tag snapshots live under
+`docs/benchmarks/vX.Y.Z/`.
+
+Manual workflow:
+
+```bash
+cargo bench --bench runner
+./scripts/update_benchmark_docs.sh
+```
+
+The script is the high-level path for refreshing published benchmark docs. Use
+individual benches (`cargo bench --bench ops`, `cargo bench --bench reports -- scan`)
+while developing a policy.
+
+## Adding a Policy to Benchmarks
+
+1. Add the policy to `for_each_policy!` with a concrete constructor.
+2. Add matching `PolicyMeta` in `POLICIES`.
+3. Run the registry drift test.
+4. Run `cargo bench --bench reports -- hit_rate` for a quick sanity check.
+5. Run `cargo bench --bench runner` before publishing docs.
+
+Keep constructors comparable. If one policy needs `Arc<u64>` and another stores
+`u64`, choose the value shape that preserves fairness and document the exception
+in the registry comment.
+
+## Adding a Workload
+
+1. Implement the generator in `bench-support/src/workload.rs`.
+2. Add a `WorkloadCase` in the registry with stable id and display name.
+3. Add docs in [`docs/benchmarks/workloads.md`](../benchmarks/workloads.md).
+4. Add renderer support if the workload needs a custom section.
+5. Run at least one policy family expected to behave differently (for example,
+   LRU vs S3-FIFO for scan-heavy workloads).
+
+Do not add a workload just because it is mathematically interesting. It should
+answer a policy-selection question.
+
+## Non-goals
+
+- Benchmarks are not formal proofs of policy optimality.
+- Benchmarks are not stable ABI. The JSON schema is versioned, but Criterion
+  names and report formatting can change.
+- Benchmarks do not hide hardware effects. They record enough metadata for the
+  reader to judge them.
+- Benchmarks do not replace fuzzing or invariant tests; they measure behaviour
+  under selected workloads.
+
+## See Also
+
+- [Design overview](design.md) - §10 frames benchmarking at the principles level
+- [Metrics](metrics.md) - recorder / snapshot / exporter split
+- [Benchmark docs](../benchmarks/README.md)
+- [Workload catalog](../benchmarks/workloads.md)
+- [`bench-support/src/registry.rs`](../../bench-support/src/registry.rs)
+- [`bench-support/src/json_results.rs`](../../bench-support/src/json_results.rs)
+- [`benches/runner.rs`](../../benches/runner.rs)
diff --git a/docs/design/builder-and-dyn-dispatch.md b/docs/design/builder-and-dyn-dispatch.md
new file mode 100644
index 0000000..c7d8032
--- /dev/null
+++ b/docs/design/builder-and-dyn-dispatch.md
@@ -0,0 +1,454 @@
+# Builder and Runtime Dispatch
+
+> Status: design rationale for [`CacheBuilder`](../../src/builder.rs),
+> [`CachePolicy`](../../src/builder.rs), and [`DynCache<K, V>`](../../src/builder.rs).
+> Companion to [`design.md`](design.md) §13, [`trait-hierarchy.md`](trait-hierarchy.md),
+> and [`concurrency.md`](concurrency.md).
+
+cachekit ships 18 implemented eviction policies. The runtime dispatcher
+currently wires 17 of them; CAR exists as a concrete policy but is not yet a
+`CachePolicy` / `DynCache` variant. Most application code wants to pick a
+policy — possibly at runtime, based on configuration — without writing one
+monomorphized call site per policy. This document explains why that runtime
+choice is delivered through an enum dispatcher rather than a `Box<dyn Cache>`,
+what the user-visible cost is, and how to extend the surface when a new policy
+lands.
+
+## The problem
+
+A user with a `policy: String` configuration value wants to write:
+
+```rust,ignore
+let mut cache = build_cache_from_config(config);
+cache.insert(key, value);
+cache.get(&key);
+```
+
+without enumerating every builder-wired policy at each call site. The cache
+type must therefore be **uniform across policies** — the concrete type the
+caller holds cannot depend on which policy was chosen.
+
+Two Rust mechanisms give a uniform type:
+
+1. **Trait objects** — `Box<dyn Cache<K, V>>`, with dispatch through a
+   vtable per method call.
+2. **Enum dispatch** — a closed sum of every policy, with dispatch
+   through a `match` per method call.
+
+cachekit picks mechanism 2. The rest of this document explains why and
+what it costs.
+
+## Enum dispatch vs `Box<dyn Cache>`
+
+`Cache<K, V>` is deliberately object-safe (see
+[`trait-hierarchy.md`](trait-hierarchy.md#object-safety)) precisely so
+`Box<dyn Cache<K, V>>` *can* be used; cachekit consumers can still take
+that route in their own code. But the **library-provided** runtime
+dispatcher is an enum. The table compares the two on six properties:
+
+| Property | `Box<dyn Cache<K, V>>` | `DynCache<K, V>` (enum) |
+|---|---|---|
+| Dispatch cost per call | Indirect call via vtable | Branch-predicted `match` |
+| Devirtualization | No (opaque) | Yes (compiler sees the arm) |
+| Inlining of policy body | No | Yes when the arm is statically reachable |
+| Heap allocation per cache | One `Box` per cache | None (enum lives inline) |
+| Closed vs open extension | Open (any `impl Cache`) | Closed (`#[non_exhaustive]` enum) |
+| API stability for new policies | Adding a method is a breaking change | Adding a variant is a non-breaking change with `#[non_exhaustive]` |
+
+The dominant terms are dispatch cost and devirtualization. A `match` on
+an enum tag is a single branch that predicts well in tight loops; the
+optimizer often hoists it out entirely when the enum tag is invariant
+across a benchmark inner loop. A vtable call cannot be devirtualized
+without inlining context and forces the policy body to live behind an
+opaque indirection.
+
+The cost is in extensibility. `Box<dyn Cache>` accepts any
+out-of-tree policy that implements `Cache<K, V>`; `DynCache` does not.
+Users with their own policy implementations still use them directly —
+`MyCache::new(…)` returns a concrete `MyCache<K, V>` and works with any
+code generic over `Cache<K, V>`. The enum is only the **library-provided
+dispatcher**, not a general substrate.
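+
+The two mechanisms can be compared on a toy scale. The sketch below is not
+cachekit code: `TinyCache`, `Unbounded`, `SingleSlot`, and `TinyDyn` are
+made-up names standing in for `Cache<K, V>`, two policies, and `DynCache`.
+
+```rust
+use std::collections::HashMap;
+
+/// Toy stand-in for `Cache<K, V>`; object-safe so `Box<dyn TinyCache>` works.
+trait TinyCache {
+    fn insert(&mut self, key: u64, value: u64) -> Option<u64>;
+    fn len(&self) -> usize;
+}
+
+/// Toy policy: never evicts.
+struct Unbounded(HashMap<u64, u64>);
+
+/// Toy policy: remembers only the most recent entry.
+struct SingleSlot(Option<(u64, u64)>);
+
+impl TinyCache for Unbounded {
+    fn insert(&mut self, key: u64, value: u64) -> Option<u64> {
+        self.0.insert(key, value)
+    }
+    fn len(&self) -> usize {
+        self.0.len()
+    }
+}
+
+impl TinyCache for SingleSlot {
+    fn insert(&mut self, key: u64, value: u64) -> Option<u64> {
+        // Return the previous value only when the same key is overwritten.
+        let previous = match self.0 {
+            Some((k, v)) if k == key => Some(v),
+            _ => None,
+        };
+        self.0 = Some((key, value));
+        previous
+    }
+    fn len(&self) -> usize {
+        usize::from(self.0.is_some())
+    }
+}
+
+/// Closed enum dispatcher: one `match` arm per policy, stored inline.
+enum TinyDyn {
+    Unbounded(Unbounded),
+    SingleSlot(SingleSlot),
+}
+
+impl TinyDyn {
+    fn insert(&mut self, key: u64, value: u64) -> Option<u64> {
+        match self {
+            TinyDyn::Unbounded(c) => c.insert(key, value),
+            TinyDyn::SingleSlot(c) => c.insert(key, value),
+        }
+    }
+    fn len(&self) -> usize {
+        match self {
+            TinyDyn::Unbounded(c) => c.len(),
+            TinyDyn::SingleSlot(c) => c.len(),
+        }
+    }
+}
+```
+
+A caller holding `TinyDyn` pays a `match` per call and no allocation; a caller
+holding `Box<dyn TinyCache>` pays one `Box` and a vtable call, but can store
+any type that implements the trait.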
+
+## `CachePolicy` — config-carrying tag
+
+`CachePolicy` ([`src/builder.rs`](../../src/builder.rs)) is the
+user-facing enum that selects a policy. It is **separate** from the
+internal `CacheInner` enum, and it carries per-policy configuration:
+
+```rust,ignore
+#[non_exhaustive]
+#[derive(Debug, Clone, Copy, PartialEq)]
+pub enum CachePolicy {
+    Fifo,
+    Lru,
+    FastLru,
+    LruK { k: usize },
+    Lfu { bucket_hint: Option<usize> },
+    HeapLfu,
+    TwoQ { probation_frac: f64 },
+    S3Fifo { small_ratio: f64, ghost_ratio: f64 },
+    Arc,
+    Lifo,
+    Mfu,
+    Mru,
+    Random,
+    Slru { probationary_frac: f64 },
+    Clock,
+    ClockPro,
+    Nru,
+}
+```
+
+Three design decisions are worth naming:
+
+- **`#[non_exhaustive]`.** Adding a new variant (e.g. when LIRS lands
+  off the roadmap) is a **minor** version bump rather than a major
+  one. Downstream `match` expressions over `CachePolicy` must include a
+  `_ =>` arm, which is the standard `non_exhaustive` discipline.
+- **Config carried inline.** `LruK { k }` rather than `LruK` + separate
+  `set_k`. The variant is the place where the parameter is
+  type-checked, and `CachePolicy` stays `Copy` because every payload
+  is `Copy`. This makes `let policy: CachePolicy = config.into();`
+  trivial and lets callers pass `CachePolicy` by value without
+  ceremony.
+- **Tag separated from implementation.** `CachePolicy::Lru` is a
+  user-facing intent; `CacheInner::Lru(LruCore<K, V>)` is the
+  internal storage. Keeping them separate means the internal type can
+  change (e.g. swap `LruCore` for a new implementation) without
+  touching the public enum.
+
+## `DynCache<K, V>` — uniform runtime type
+
+The public dispatcher:
+
+```rust,ignore
+pub struct DynCache<K, V>
+where
+    K: Copy + Eq + Hash + Ord,
+    V: Clone + Debug,
+{
+    inner: CacheInner<K, V>,
+}
+
+enum CacheInner<K, V> /* same bounds */ {
+    #[cfg(feature = "policy-fifo")]    Fifo(FifoCache<K, V>),
+    #[cfg(feature = "policy-lru")]     Lru(LruCore<K, V>),
+    #[cfg(feature = "policy-fast-lru")] FastLru(FastLru<K, V>),
+    #[cfg(feature = "policy-lru-k")]   LruK(LrukCache<K, V>),
+    #[cfg(feature = "policy-lfu")]     Lfu(LfuCache<K, V>),
+    #[cfg(feature = "policy-heap-lfu")] HeapLfu(HeapLfuCache<K, V>),
+    #[cfg(feature = "policy-two-q")]   TwoQ(TwoQCore<K, V>),
+    #[cfg(feature = "policy-s3-fifo")] S3Fifo(S3FifoCache<K, V>),
+    #[cfg(feature = "policy-arc")]     Arc(ArcCore<K, V>),
+    #[cfg(feature = "policy-lifo")]    Lifo(LifoCore<K, V>),
+    #[cfg(feature = "policy-mfu")]     Mfu(MfuCore<K, V>),
+    #[cfg(feature = "policy-mru")]     Mru(MruCore<K, V>),
+    #[cfg(feature = "policy-random")]  Random(RandomCore<K, V>),
+    #[cfg(feature = "policy-slru")]    Slru(SlruCore<K, V>),
+    #[cfg(feature = "policy-clock")]   Clock(ClockCache<K, V>),
+    #[cfg(feature = "policy-clock-pro")] ClockPro(ClockProCache<K, V>),
+    #[cfg(feature = "policy-nru")]     Nru(NruCache<K, V>),
+}
+```
+
+`CacheInner` is **private**. Users only see `DynCache`. Two consequences:
+
+- Internal policy structs (`LruCore`, `S3FifoCache`, …) do not leak
+  into the public type system through the dispatcher. They can be
+  refactored without breaking SemVer.
+- Pattern-matching on the variant from outside the crate is
+  impossible, which forces feature requests through method additions
+  rather than match-arm proliferation in user code.
+
+### CAR builder gap
+
+CAR is implemented as a concrete policy (`src/policy/car.rs`) and has a
+`policy-car` feature flag, but this branch does **not** currently expose it
+through `CachePolicy` / `DynCache`. Users who want CAR instantiate the concrete
+`CarCore<K, V>` type directly. Closing the gap means adding a
+`CachePolicy::Car` variant, a `CacheInner::Car(CarCore<K, V>)` variant, and the
+usual method / builder / test arms listed in [Adding a new policy](#adding-a-new-policy).
+
+Until that lands, read "implemented policies" and "`DynCache` variants" as two
+different sets:
+
+- **Implemented concrete policies:** 18.
+- **Runtime-dispatch variants:** 17.
+
+## Type bounds: heavier than `Cache<K, V>`
+
+`Cache<K, V>` requires only what each individual policy implementation
+needs (typically `K: Eq + Hash`, sometimes `K: Copy`). `DynCache`
+requires the **union** of all policies' bounds:
+
+```rust,ignore
+K: Copy + Eq + Hash + Ord
+V: Clone + Debug
+```
+
+Each bound exists because at least one variant needs it:
+
+- `K: Copy` — many policies rely on cheap key copies in eviction paths.
+- `K: Eq + Hash` — every hashmap-backed lookup.
+- `K: Ord` — `HeapLfuCache` orders keys in a min-heap.
+- `V: Clone` — variants that store `Arc<V>` internally (LRU, LFU,
+  HeapLFU) fall back to `(*arc).clone()` when `Arc::try_unwrap` fails
+  on `insert` / `remove` (see below).
+- `V: Debug` — `DynCache: Debug` delegates to the variant's `Debug`.
+
+This is the **library-provided dispatcher tax**. Users who do not want
+to pay `K: Ord` can call `LruCore::new(…)` directly and bypass
+`DynCache`; the tax only applies when crossing the runtime-dispatch
+boundary. The tax is documented at the `DynCache` doc comment so
+users picking the dispatcher route know what to expect.
+
+If a future policy adds a heavier bound (e.g. `K: Serialize` for a
+persistent-cache policy), it forces every `DynCache` user to satisfy
+that bound. The mitigation, when that happens, is a separate
+dispatcher type (`DynPersistentCache<K, V>`) rather than tightening
+the existing `DynCache` bounds — preserving SemVer for users who
+don't need persistence.
+
+## The `Arc<V>` round-trip
+
+Three policies — `LruCore`, `LfuCache`, `HeapLfuCache` — internally
+store `Arc<V>` rather than `V`. The rationale lives in those modules
+(zero-copy sharing between `peek` and `get`, predictable eviction-time
+move, alignment with the concurrent wrappers' `Arc<V>` returns). At
+the `DynCache` boundary this creates a small impedance:
+
+```rust,ignore
+CacheInner::Lru(lru) => {
+    let arc_value = Arc::new(value);
+    lru.insert(key, arc_value)
+        .map(|arc| Arc::try_unwrap(arc).unwrap_or_else(|arc| (*arc).clone()))
+},
+```
+
+`insert` wraps the value in `Arc` for the policy and tries to unwrap
+the returned `Arc<V>` on the way out. `try_unwrap` succeeds without
+copying when the strong count is 1 (the common case for sequential
+`DynCache`); it falls back to `(*arc).clone()` only when another
+reference is still alive, which happens on iteration paths where the
+policy held a secondary reference. The fallback is the reason
+`V: Clone` is required on `DynCache`.
+
+The cost is one `Arc::new` per insert and one branch (`try_unwrap`) per
+return on Arc-storing variants. It does not affect FIFO, LIFO, MFU,
+MRU, 2Q, S3-FIFO, ARC, Clock, Clock-PRO, NRU, Random, SLRU, LRU-K,
+or FastLru, which store `V` directly. Users sensitive to this round
+trip should pick a `V`-storing policy or use the concrete type
+directly.
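+
+The unwrap-or-clone step can be isolated in a few lines. This is a sketch of
+the pattern only; `owned_or_cloned` is a hypothetical helper name, not a
+function in `src/builder.rs`.
+
+```rust
+use std::sync::Arc;
+
+/// Recovers an owned `V` from an `Arc<V>`.
+/// Moves the value out when this is the only strong reference,
+/// and clones it when another reference is still alive.
+fn owned_or_cloned<V: Clone>(arc: Arc<V>) -> V {
+    Arc::try_unwrap(arc).unwrap_or_else(|shared| (*shared).clone())
+}
+```
+
+Recent Rust releases also provide `Arc::unwrap_or_clone` in the standard
+library with the same behaviour.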
+
+## Feature gating discipline
+
+Every `CachePolicy` variant, every `CacheInner` variant, every match
+arm in every `DynCache` method, every `CacheBuilder::build` arm, and
+every `validate_policy` arm is gated by `#[cfg(feature = "policy-X")]`.
+The discipline:
+
+- A user building with `default-features = false, features = ["policy-lru"]`
+  gets a `CachePolicy` enum with **one variant** and a `CacheInner` enum
+  with **one variant**. Match exhaustiveness still holds because
+  every arm vanishes with its variant.
+- The internal `match` in each `DynCache` method is **always
+  exhaustive** at the active feature set, because every arm and every
+  variant share the same set of `cfg` predicates.
+- `policy-all` is a convenience feature that turns on every
+  `policy-*` feature at once. The default is a curated subset
+  (`policy-s3-fifo`, `policy-lru`, `policy-fast-lru`, `policy-lru-k`,
+  `policy-clock`) chosen to cover the most-recommended workloads from
+  [`docs/policies/README.md`](../policies/README.md).
+
+The cost is that adding a new policy involves edits in several
+synchronized locations (see the nine-step checklist in
+[Adding a new policy](#adding-a-new-policy)). The benefit is that a
+"policy-lru-only" build is genuinely small: no other policy is compiled in.
+
+## Validation: panic vs `Result`
+
+`CacheBuilder::build` panics on invalid configuration:
+
+```rust,ignore
+assert!(self.capacity > 0, "cache capacity must be greater than 0");
+// …
+match policy {
+    CachePolicy::LruK { k } => assert!(*k > 0, "LruK: k must be greater than 0"),
+    CachePolicy::TwoQ { probation_frac } =>
+        check_frac("TwoQ: probation_frac", *probation_frac),
+    // …
+}
+```
+
+This is consistent with cachekit's broader error model
+([`src/error.rs`](../../src/error.rs)): panics for **programming
+errors** (programmer hands the builder a `k = 0`, which has no sensible
+behavior), `Result<_, ConfigError>` reserved for **user-supplied
+configuration** that arrives through deserialization or external
+input.
+
+Callers that need to validate untrusted configuration before calling
+`build` should branch on the `CachePolicy` variant and inspect the
+payload themselves, or use the per-policy fallible constructors
+(`S3FifoCache::try_with_ratios`, future `LrukCache::try_with_k`)
+directly. The builder deliberately does not provide a `try_build` —
+adding one would split the API surface for marginal gain when the
+panic path already catches the bug at the call site.
+
+## `Send + Sync` is conditional
+
+`DynCache<K, V>: Send + Sync` is **not** unconditional. The
+`FastLru` policy uses `NonNull<Node>` for single-threaded performance
+and is therefore `!Send + !Sync`. A compile-time check in
+[`src/builder.rs`](../../src/builder.rs) encodes the `Send` half of this:
+
+```rust,ignore
+#[cfg(all(feature = "policy-lru", not(feature = "policy-fast-lru")))]
+const _: () = {
+    fn assert_send<T: Send>() {}
+    fn check() { assert_send::<DynCache<u64, String>>(); }
+};
+```
+
+In words: `DynCache<K, V>` is `Send + Sync` whenever no
+`!Send`-or-`!Sync` policy variant is enabled. With the default feature
+set (which includes `policy-fast-lru`), `DynCache` is **not**
+`Send + Sync`. Users who want a sendable `DynCache` should disable
+`policy-fast-lru` and use `policy-lru` for the LRU path.
+
+This is a known sharp edge. The alternative — making `FastLru: Send`
+via an unsafe impl — would invalidate `FastLru`'s entire design
+premise (raw-pointer recency list without atomics). The current
+trade prioritises `FastLru`'s single-threaded speed over `DynCache`'s
+universal sendability, on the grounds that callers wanting concurrent
+access should use a `Concurrent*` wrapper directly (see
+[`concurrency.md`](concurrency.md)), not `DynCache`.
+
+## Maintenance cost
+
+The dispatcher's runtime cost is small. The **maintenance** cost is
+real:
+
+- **17 inner variants** × **~10 `DynCache` methods** = **~170 match
+  arms** that must stay in sync today. CAR will make this 18 variants
+  once it is wired into the dispatcher.
+- A `Debug` impl, a `default()` (where applicable), and a
+  `validate_policy` arm per variant.
+- A `Cargo.toml` feature flag per variant.
+- A documentation entry per variant in `docs/policies/`.
+
+The mitigations in place:
+
+1. **A single regression test** (`test_all_policies_basic_ops` in
+   [`src/builder.rs`](../../src/builder.rs)) loops over every enabled
+   policy and exercises `insert` / `get` / `contains` / `len` /
+   update / `clear`. Adding a variant immediately surfaces if any arm
+   was missed.
+2. **Compile-time exhaustiveness** in the inner `match`. Forgetting an
+   arm is a build error, not a runtime bug.
+3. **`#[non_exhaustive]` on `CachePolicy`** keeps downstream code
+   from depending on the full set of variants.
+
+Even with those, the line count of `src/builder.rs` (~1300) is
+disproportionate to its semantic content. A `macro_rules!` to generate
+per-method dispatchers has been considered and rejected — the
+explicit `match` is grep-friendly, readable in source review, and
+each arm sometimes diverges from the boilerplate (the `Arc<V>` round
+trip is the visible case; future TTL integration is another). Macros
+would compress the file but obscure the points where the dispatcher
+intervenes.
+
+## Adding a new policy
+
+Checklist for landing a new policy, ordered to minimise compile-time
+churn:
+
+1. Implement the policy core: `MyPolicyCache<K, V>` with a `Cache<K, V>`
+   impl. Add `MyPolicyCache::new(capacity: usize)` and any config
+   constructors. Land this with its own tests.
+2. Add a `policy-my-policy` feature in [`Cargo.toml`](../../Cargo.toml).
+   Add it to `policy-all`. Decide whether it joins `default = […]`.
+3. Add the `CachePolicy::MyPolicy { … }` variant, gated by the new
+   feature. Include any config fields as inline payload.
+4. Add the `CacheInner::MyPolicy(MyPolicyCache<K, V>)` variant under
+   the same `cfg`.
+5. Add a match arm in every `DynCache` method (`insert`, `get`, `peek`,
+   `contains`, `len`, `capacity`, `remove`, `clear`, `Debug` impl).
+6. Add a `CachePolicy::MyPolicy { … } => CacheInner::MyPolicy(…)` arm
+   in `CacheBuilder::build`. Add validation in `validate_policy` if
+   the variant has constraints (frac in 0..=1, non-zero K, etc.).
+7. Add the variant to `all_enabled_policies()` in the test module so
+   the regression sweep covers it.
+8. Document the policy in `docs/policies/my-policy.md`; if it's a
+   roadmap policy graduating to implementation, move the doc from
+   `docs/policies/roadmap/` per the rule in
+   [`docs/policies/roadmap/README.md`](../policies/roadmap/README.md).
+9. Update [`docs/policies/README.md`](../policies/README.md) and
+   [`docs/guides/choosing-a-policy.md`](../guides/choosing-a-policy.md).
+
+The work is mechanical. A CR template that lists these nine steps as
+checkboxes would reduce the chance of missed updates further.
+
+## Future: `DynExpiringCache<K, V>`
+
+When the `ttl` feature lands ([`ttl.md`](ttl.md) §4(c)), TTL **will
+not** modify `DynCache`. Instead, `with_default_ttl` on the builder
+will return a sibling type:
+
+```rust,ignore
+let mut cache = CacheBuilder::new(1024)
+    .with_default_ttl(Duration::from_secs(60))
+    .build::<u64, String>(CachePolicy::Lru);
+// `cache: DynExpiringCache<u64, String>`, not DynCache.
+```
+
+`DynExpiringCache<K, V>` mirrors `DynCache`'s match-arm boilerplate
+one level out: each method threads the expiry check through the
+inner policy's `Cache` call. The key design choice — argued in detail
+in [`ttl.md`](ttl.md) §1, §4(c) — is that `DynExpiringCache` is a
+**distinct type**, not `impl Cache for DynCache` plus a wrapper.
+Distinctness makes `Expiring<Expiring<DynCache>>` structurally
+unrepresentable, which prevents the "two clocks, two indexes"
+double-wrapping bug at the type level.
+
+The duplication is real: roughly 170 parallel arms at today's variant
+count, rising as dispatcher variants are added. It is bounded (one type per
+cross-cutting capability) and the trade favours type-level safety over deduplication.
+
+## When not to use `DynCache`
+
+`DynCache` is the right tool when:
+
+- The policy is chosen at runtime from configuration.
+- The caller wants a single concrete type that can hold any policy.
+- The dispatch cost is amortised over enough work that the `match`
+  doesn't dominate.
+
+It is the wrong tool when:
+
+- The policy is known at compile time. Use the concrete type
+  (`LruCore::new(…)`, `S3FifoCache::new(…)`) and let monomorphization
+  do its work.
+- The hottest inner loop is `get`-bound and devirtualization matters
+  beyond what enum dispatch provides. Concrete types still win for
+  raw throughput on benchmarks (see
+  [`benches/comparison.rs`](../../benches/comparison.rs)).
+- The caller needs `Send + Sync` and the build includes
+  `policy-fast-lru`. See [`Send + Sync`](#send--sync-is-conditional)
+  above; use the relevant `Concurrent*` wrapper instead.
+- A user wants to plug in their own policy. `DynCache` is closed;
+  generic code over `Cache<K, V>` is open.
+
+## See also
+
+- [Design overview](design.md) — §13 frames compile-time and runtime
+  composition at the principles level
+- [Cache trait hierarchy](trait-hierarchy.md) — kernel trait and
+  capability traits
+- [Concurrency](concurrency.md) — `Send + Sync` interaction, why
+  `Concurrent*` is a separate path
+- [TTL design](ttl.md) — `DynExpiringCache` as a worked extension of
+  the dispatcher pattern
+- [Error model](../../src/error.rs) — `ConfigError` vs panic discipline
+- [`src/builder.rs`](../../src/builder.rs) — the canonical
+  implementation
diff --git a/docs/design/concurrency.md b/docs/design/concurrency.md
new file mode 100644
index 0000000..e418976
--- /dev/null
+++ b/docs/design/concurrency.md
@@ -0,0 +1,366 @@
+# Concurrency
+
+> Status: design rationale for the concurrent surface that ships today
+> behind the `concurrency` feature flag. Companion to the cross-cutting
+> principles in [`docs/design/design.md`](design.md) §3 and the trait
+> rationale in [`docs/design/trait-hierarchy.md`](trait-hierarchy.md).
+
+cachekit's default surface is single-threaded. Concurrency is opt-in,
+delivered through a parallel set of types and traits gated by the
+`concurrency` Cargo feature. This document explains why the concurrent
+surface looks the way it does, what invariants the wrappers promise,
+and where the gaps are.
+
+## Non-goals
+
+- **`no_std`.** Concurrency relies on `parking_lot`, `std::sync::Arc`,
+  and `std::sync::atomic`. No `loom`/`no_std` support is planned.
+- **Lock-free policies.** Mostly-lock-free or strictly lock-free
+  policies are out of scope today; see [Future directions](#future-directions).
+- **Async-native traits.** `AsyncCacheFuture` is a Phase 2 placeholder
+  ([`src/traits.rs`](../../src/traits.rs)); no policy implements it
+  meaningfully yet.
+
+## The dominant pattern: sequential core, concurrent wrapper
+
+cachekit's concurrent types all keep the sequential core unaware of locking,
+but they do **not** all have the same struct shape. There are three families.
+
+### Cloneable policy handles
+
+Policy-level wrappers are shared handles around a locked policy core:
+
+```text
+ConcurrentPolicy<K, V> { inner: Arc<RwLock<Policy<K, V>>> }
+```
+
+This shape is used by:
+
+- `ConcurrentLruCache` — [`src/policy/lru.rs`](../../src/policy/lru.rs)
+- `ConcurrentFifoCache` — [`src/policy/fifo.rs`](../../src/policy/fifo.rs)
+- `ConcurrentS3FifoCache` — [`src/policy/s3_fifo.rs`](../../src/policy/s3_fifo.rs)
+
+These types implement `Clone` via `Arc::clone`, so callers can hand cheap
+handles to threads. They expose owned / `Arc<V>` returns instead of borrowed
+`&V` because no reference can safely outlive the lock guard it came from.
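+
+A minimal sketch of the handle shape, with assumptions stated up front:
+`SharedHandle` is an illustrative type, `std::sync::RwLock` stands in for the
+`parking_lot` lock, and a plain `HashMap` stands in for a policy core with no
+eviction modelled.
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, RwLock};
+
+/// Illustrative cloneable handle; not a cachekit type.
+#[derive(Clone)]
+pub struct SharedHandle {
+    inner: Arc<RwLock<HashMap<u64, Arc<String>>>>,
+}
+
+impl SharedHandle {
+    pub fn new() -> Self {
+        Self { inner: Arc::new(RwLock::new(HashMap::new())) }
+    }
+
+    /// Takes `&self`: mutation goes through the write lock.
+    pub fn insert(&self, key: u64, value: String) -> Option<Arc<String>> {
+        self.inner.write().unwrap().insert(key, Arc::new(value))
+    }
+
+    /// Returns an owned `Arc<String>`, so no borrow escapes the read guard.
+    pub fn get(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.read().unwrap().get(key).cloned()
+    }
+}
+```
+
+Cloning the handle copies only the outer `Arc`, so every clone observes the
+same underlying map.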
+
+### Owning store and data-structure wrappers
+
+Store and data-structure wrappers usually own the lock directly:
+
+```text
+ConcurrentX<K, V> { inner: RwLock<X<K, V>>, ... }
+```
+
+Examples:
+
+- `ConcurrentHashMapStore` — [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- `ConcurrentSlabStore` — [`src/store/slab.rs`](../../src/store/slab.rs)
+- `ConcurrentWeightStore` — [`src/store/weight.rs`](../../src/store/weight.rs)
+- `ConcurrentHandleStore` — [`src/store/handle.rs`](../../src/store/handle.rs)
+- `ConcurrentSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- `ConcurrentIntrusiveList` — [`src/ds/intrusive_list.rs`](../../src/ds/intrusive_list.rs)
+- `ConcurrentClockRing` — [`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs)
+
+These wrappers are not necessarily cloneable handles. If a caller wants shared
+ownership, they can wrap the whole type in `Arc<_>`. Keeping the `Arc` out of
+the struct avoids an unnecessary refcount on users who only need a single owner.
+
+### Sharded primitives
+
+Sharded types own multiple independently locked shards:
+
+```text
+ShardedX<K, V> {
+    shards: Vec<RwLock<ShardState<K, V>>>,
+    selector: ShardSelector,
+}
+```
+
+Examples:
+
+- `ShardedSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- `ShardedFrequencyBuckets` — [`src/ds/frequency_buckets.rs`](../../src/ds/frequency_buckets.rs)
+- `ShardedHashMapStore` — [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+
+The common design is not "`Arc<RwLock<_>>` everywhere"; it is **lock at the
+wrapper boundary and keep the sequential core free of locks**. The exact ownership
+shape depends on whether the type is intended to be a cloneable cache handle,
+an owning concurrent store, or a sharded primitive.
+
+## Why `Concurrent*` does not implement `Cache<K, V>`
+
+`Cache<K, V>` is the sequential trait. Its method signatures encode
+sequential ownership:
+
+```rust
+fn peek(&self, key: &K) -> Option<&V>;
+fn get(&mut self, key: &K) -> Option<&V>;
+fn insert(&mut self, key: K, value: V) -> Option<V>;
+```
+
+None of these signatures can be implemented as written on
+`Arc<RwLock<…>>`, for two reasons:
+
+- **`peek` and `get` return `&V`.** A borrowed reference cannot
+  outlive the `RwLockReadGuard`/`RwLockWriteGuard` it was extracted
+  from. There is no safe lifetime that ties `&V` to `&self` rather
+  than to the (anonymous) guard. Returning `&V` would force the
+  caller to hold the lock across the borrow, which blocks writers
+  (and, for `get`, every other thread) for as long as the reference
+  lives.
+- **`get` and `insert` take `&mut self`.** With shared ownership through
+  `Arc<RwLock<…>>` the wrapper only ever holds `&self`. Forcing
+  `&mut self` would require `Arc::make_mut` or external locking,
+  defeating the point of the inner lock.
+
+The concurrent wrappers therefore expose their own concrete API:
+
+```rust
+pub fn get(&self, key: &K) -> Option<Arc<V>>;
+pub fn peek(&self, key: &K) -> Option<Arc<V>>;
+pub fn insert(&self, key: K, value: V) -> Option<Arc<V>>;
+pub fn insert_arc(&self, key: K, value: Arc<V>) -> Option<Arc<V>>;
+pub fn remove(&self, key: &K) -> Option<Arc<V>>;
+```
+
+Returning `Arc<V>` is the contract. It costs one atomic refcount bump
+on hit, which is cheap relative to the lock acquisition itself, and it
+lets callers hold the value past lock release, send it across threads,
+or stash it in another structure without lifetime gymnastics.
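+
+A minimal sketch of that contract, assuming a toy `ToyConcurrentMap` (a hypothetical name, not a cachekit type) and using `std::sync::RwLock` as a dependency-free stand-in for `parking_lot::RwLock`:
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, RwLock};
+
+/// Hypothetical toy wrapper; not cachekit's real type.
+/// `std::sync::RwLock` stands in for `parking_lot::RwLock`.
+pub struct ToyConcurrentMap {
+    inner: RwLock<HashMap<u64, Arc<String>>>,
+}
+
+impl ToyConcurrentMap {
+    pub fn new() -> Self {
+        Self { inner: RwLock::new(HashMap::new()) }
+    }
+
+    /// Takes `&self`: the lock, not the borrow, provides mutability.
+    pub fn insert(&self, key: u64, value: String) -> Option<Arc<String>> {
+        self.inner.write().unwrap().insert(key, Arc::new(value))
+    }
+
+    /// Clones the `Arc` (one refcount bump); the guard is dropped on return.
+    pub fn peek(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.read().unwrap().get(key).cloned()
+    }
+
+    pub fn remove(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.write().unwrap().remove(key)
+    }
+}
+
+/// Shows that a returned `Arc` stays valid after the entry is removed.
+pub fn value_outlives_entry() -> String {
+    let map = ToyConcurrentMap::new();
+    map.insert(1, "hello".to_string());
+    let held = map.peek(&1).expect("just inserted");
+    map.remove(&1);
+    // The lock is free and the entry is gone, yet `held` is still usable.
+    (*held).clone()
+}
+```
+
+The `.unwrap()` calls exist only because `std` locks can be poisoned; with `parking_lot` the guards are returned directly.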
+
+For uniformity across the store layer there is a parallel trait family
+that **does** model the `&self` + `Arc<V>` shape:
+
+| Sequential ([`src/store/traits.rs`](../../src/store/traits.rs)) | Concurrent ([`src/store/traits.rs`](../../src/store/traits.rs)) |
+|---|---|
+| `StoreRead` (`&mut self`, `&V`) | `ConcurrentStoreRead` (`&self`, `Arc<V>`) |
+| `StoreMut` (`&mut self`) | `ConcurrentStore` (`&self`) |
+| `StoreFactory` | `ConcurrentStoreFactory` |
+
+The policy layer does not yet have a counterpart family — see
+[Future directions](#future-directions).
+
+## Lock primitive choice
+
+Every concurrent wrapper uses **`parking_lot::RwLock`**. Two things
+drove this:
+
+- **Reader / writer split matches the access pattern.** `peek` /
+  `contains` / `len` only need shared access. `get` (which mutates
+  recency or frequency state) and `insert` / `remove` need exclusive
+  access. `Mutex` would serialize all of these.
+- **Fairness and uncontended speed.** `parking_lot::RwLock` is small
+  (one `AtomicUsize` on 64-bit), uncontended-fast, and tunable via
+  fairness traits. The `RwLock<HashMap<K, Arc<V>>>` and
+  `RwLock<SlotArena<T>>` shapes throughout the codebase rely on this.
+
+`Mutex` is intentionally absent from the wrappers. The few `Mutex`
+references in the source tree are in doctests and rustdoc prose
+describing how a user would wrap a non-concurrent cache themselves —
+they are not on any hot path.
+
+The `parking_lot` choice is **not** absolute. On Rust 1.85+ the
+futex-based `std::sync::Mutex` is competitive for the uncontended
+single-writer case on Linux/macOS, and revisiting this is reasonable
+if `parking_lot` ever becomes a build burden. The `RwLock` advantage
+is more durable: `std::sync::RwLock` still has writer-starvation
+hazards on some platforms that `parking_lot` avoids by default.
+
+## The `get` / `peek` lock-level asymmetry
+
+`peek` and `get` both look up by key, but they differ in what they
+mutate:
+
+- **`peek`** is side-effect-free. The wrapper takes a **read lock**
+  and clones the `Arc<V>`. Multiple readers proceed in parallel.
+- **`get`** updates policy state (LRU recency, LFU frequency, Clock
+  reference bit, …). The wrapper takes a **write lock**. Only one
+  thread proceeds.
+
+This asymmetry is the single most important reason `peek` and `get`
+are distinct methods at all (see
+[`trait-hierarchy.md`](trait-hierarchy.md) for the rationale at the
+trait level). Without `peek`, every read would serialize through the
+write lock. With `peek`, read-heavy workloads (buffer pools, immutable
+metadata caches) can proceed in parallel across cores instead of queueing.
+
+The cost is that callers must choose, and choosing `get` on a
+read-heavy workload silently kills scalability. The rustdoc on each
+wrapper's `peek` and `get` says so explicitly; benchmarks under
+[`benches/`](../../benches) compare the two.
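+
+The asymmetry can be sketched with a toy recency counter. `RecencyCore` and `ToyConcurrentCache` are hypothetical names, and `std::sync::RwLock` again stands in for `parking_lot::RwLock`:
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, RwLock};
+
+/// Sequential toy core: knows nothing about locks.
+struct RecencyCore {
+    tick: u64,
+    entries: HashMap<u64, (Arc<String>, u64)>, // (value, last-used tick)
+}
+
+impl RecencyCore {
+    /// Side-effect-free lookup: `&self` is enough.
+    fn peek(&self, key: &u64) -> Option<Arc<String>> {
+        self.entries.get(key).map(|(v, _)| Arc::clone(v))
+    }
+
+    /// Lookup that records recency: needs `&mut self`.
+    fn get(&mut self, key: &u64) -> Option<Arc<String>> {
+        self.tick += 1;
+        let tick = self.tick;
+        self.entries.get_mut(key).map(|(v, last)| {
+            *last = tick;
+            Arc::clone(v)
+        })
+    }
+}
+
+pub struct ToyConcurrentCache {
+    inner: RwLock<RecencyCore>,
+}
+
+impl ToyConcurrentCache {
+    pub fn new() -> Self {
+        Self { inner: RwLock::new(RecencyCore { tick: 0, entries: HashMap::new() }) }
+    }
+
+    pub fn insert(&self, key: u64, value: String) {
+        let mut core = self.inner.write().unwrap();
+        core.tick += 1;
+        let tick = core.tick;
+        core.entries.insert(key, (Arc::new(value), tick));
+    }
+
+    /// Read lock: many `peek` callers may hold it at once.
+    pub fn peek(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.read().unwrap().peek(key)
+    }
+
+    /// Write lock: the recency update needs exclusive access.
+    pub fn get(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.write().unwrap().get(key)
+    }
+
+    /// Test helper: the tick at which `key` was last touched.
+    pub fn last_used(&self, key: &u64) -> Option<u64> {
+        self.inner.read().unwrap().entries.get(key).map(|(_, t)| *t)
+    }
+}
+```
+
+In this sketch the lock level falls out of the core's own signatures: a `&self` method on the core can run under a read guard, while a `&mut self` method requires the write guard.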
+
+## Atomic check-and-act
+
+Compound operations must stay inside a single lock acquisition. The
+rule is **check, decide, mutate, release** — all under the same write
+lock. Splitting the steps across two acquisitions allows a concurrent
+writer to invalidate the decision between them.
+
+The pattern shows up in three places worth naming:
+
+- **Insert-on-full.** Capacity check + eviction + insert must be one
+  critical section. `WeightStore::try_insert` and the policy `insert`
+  methods both follow this.
+- **Replace-and-return.** `insert` returns the previous value if one
+  existed. The "did this key exist?" check and the replace must
+  happen under the same write lock; otherwise two concurrent inserts
+  can both observe "key absent" and both return `None`.
+- **Future: expiry + remove.** TTL (see
+  [`docs/design/ttl.md`](ttl.md) §4(e)) requires the expiry check and
+  the removal to be one atomic operation under a write lock. A
+  read-locked fast path that observes `expires_at <= now` and
+  escalates to a write lock is safe **only** if the write-locked
+  path re-checks the deadline before acting, because a concurrent
+  `set_ttl` may have renewed the entry in between.
+
+The atomicity rule is a wrapper-level discipline, not a trait-level
+one. The single-threaded core can't enforce it because it doesn't
+know about locks.
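+
+The re-check rule from the TTL bullet can be sketched as follows. `ToyTtlIndex` is a hypothetical stand-in (it maps keys to logical deadlines only), time is passed in as a plain `u64` so the example is deterministic, and `std::sync::RwLock` replaces `parking_lot::RwLock`:
+
+```rust
+use std::collections::HashMap;
+use std::sync::RwLock;
+
+/// Hypothetical toy index: key -> logical expiry deadline.
+pub struct ToyTtlIndex {
+    inner: RwLock<HashMap<u64, u64>>,
+}
+
+impl ToyTtlIndex {
+    pub fn new() -> Self {
+        Self { inner: RwLock::new(HashMap::new()) }
+    }
+
+    /// Inserts the key or renews its deadline.
+    pub fn set_ttl(&self, key: u64, expires_at: u64) {
+        self.inner.write().unwrap().insert(key, expires_at);
+    }
+
+    pub fn contains(&self, key: u64) -> bool {
+        self.inner.read().unwrap().contains_key(&key)
+    }
+
+    /// Removes `key` if it has expired at `now`; returns whether it did.
+    pub fn remove_if_expired(&self, key: u64, now: u64) -> bool {
+        // Fast path under the read lock: most entries are still live.
+        let looks_expired = {
+            let guard = self.inner.read().unwrap();
+            matches!(guard.get(&key), Some(&deadline) if deadline <= now)
+        }; // Read guard dropped here; a concurrent `set_ttl` may run now.
+
+        if !looks_expired {
+            return false;
+        }
+
+        // Slow path: check, decide, mutate, release under ONE write lock.
+        let mut guard = self.inner.write().unwrap();
+        let still_expired = matches!(guard.get(&key), Some(&deadline) if deadline <= now);
+        if still_expired {
+            guard.remove(&key);
+        }
+        // `false` here means the entry was renewed or removed in between.
+        still_expired
+    }
+}
+```
+
+Dropping the second `matches!` would reintroduce the race described above: the decision made under the read lock would be acted on after another writer had a chance to invalidate it.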
+
+## Cloning the wrapper
+
+Every policy-level `Concurrent*` cache implements `Clone` via
+`Arc::clone(&self.inner)`; the owning store and data-structure wrappers
+are not necessarily cloneable. Cloning the handle is cheap (one atomic
+increment) and produces a second handle to the **same** underlying cache.
+This is the intended way to share a cache across threads:
+
+```rust,ignore
+let cache = ConcurrentLruCache::<u64, String>::new(1_000);
+let cache2 = cache.clone();
+std::thread::spawn(move || {
+    cache2.insert(1, "hello".into());
+});
+cache.get(&1);
+```
+
+There is no separate `Arc<ConcurrentLruCache<…>>` wrapping needed;
+the inner `Arc` is the sharing primitive. Callers who want
+`Arc<dyn ConcurrentCache>` for type erasure are still free to wrap, but
+in practice the concrete clone is what's used in the codebase.
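+
+For readers without the crate at hand, the same handle shape can be sketched with the standard library alone. `ToyHandle` is a hypothetical type, and `std::sync::RwLock` stands in for `parking_lot::RwLock`:
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, RwLock};
+use std::thread;
+
+/// Hypothetical cloneable handle mirroring the `Arc<RwLock<_>>` shape.
+pub struct ToyHandle {
+    inner: Arc<RwLock<HashMap<u64, Arc<String>>>>,
+}
+
+impl Clone for ToyHandle {
+    fn clone(&self) -> Self {
+        // One atomic increment; both handles point at the same map.
+        Self { inner: Arc::clone(&self.inner) }
+    }
+}
+
+impl ToyHandle {
+    pub fn new() -> Self {
+        Self { inner: Arc::new(RwLock::new(HashMap::new())) }
+    }
+
+    pub fn insert(&self, key: u64, value: String) {
+        self.inner.write().unwrap().insert(key, Arc::new(value));
+    }
+
+    pub fn peek(&self, key: &u64) -> Option<Arc<String>> {
+        self.inner.read().unwrap().get(key).cloned()
+    }
+}
+
+/// Writes through one handle on another thread, reads through the original.
+pub fn insert_from_other_thread() -> Option<Arc<String>> {
+    let cache = ToyHandle::new();
+    let cache2 = cache.clone();
+    // Join so the read below is ordered after the write.
+    thread::spawn(move || cache2.insert(1, "hello".to_string()))
+        .join()
+        .expect("writer thread panicked");
+    cache.peek(&1)
+}
+```
+
+The `join` matters in this sketch: without it the read may run before the spawned thread has inserted anything, and the lookup would legitimately miss.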
+
+## `ConcurrentCache`: marker trait, not capability trait
+
+`ConcurrentCache` lives in [`src/traits.rs`](../../src/traits.rs) and
+is declared `unsafe trait ConcurrentCache: Send + Sync {}`. It has
+no methods. Its job is to **promise**, at the type system level,
+that "this type is safe to share across threads in the cache sense" —
+specifically that its `Cache`-like operations (whatever those happen
+to be — concrete `Concurrent*` types do not implement `Cache<K, V>`)
+take care of internal synchronization.
+
+The `unsafe` is load-bearing. Implementing `ConcurrentCache`
+incorrectly cannot be caught by the type system; it's an
+implementer-side soundness claim, which is why only the wrappers
+implement it (`ConcurrentFifoCache`, `ConcurrentS3FifoCache` today;
+`ConcurrentLruCache` is a candidate but does not yet have the impl).
+
+Users writing generic code that requires a thread-safe cache should
+bound on `ConcurrentCache`; its `Send + Sync` supertraits come along
+automatically. They should **not** bound on
+`Cache<K, V> + Send + Sync` and expect that to suffice: that bound
+is satisfied by single-threaded caches whose user is responsible for
+external locking.
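+
+A sketch of how such a marker is declared, implemented, and used as a bound. `ToyConcurrentCache` and `ToyCounterCache` are hypothetical names that only mirror the shape described above; they are not the definitions in `src/traits.rs`:
+
+```rust
+use std::sync::{Arc, RwLock};
+use std::thread;
+
+/// Hypothetical stand-in for the marker trait: no methods, only a promise.
+///
+/// # Safety
+/// Implementers promise that every shared-access operation synchronizes
+/// internally.
+pub unsafe trait ToyConcurrentCache: Send + Sync {}
+
+pub struct ToyCounterCache {
+    hits: RwLock<u64>,
+}
+
+impl ToyCounterCache {
+    pub fn new() -> Self {
+        Self { hits: RwLock::new(0) }
+    }
+
+    pub fn record_hit(&self) {
+        *self.hits.write().unwrap() += 1;
+    }
+
+    pub fn hits(&self) -> u64 {
+        *self.hits.read().unwrap()
+    }
+}
+
+// SAFETY: every method takes `&self` and goes through the inner `RwLock`.
+unsafe impl ToyConcurrentCache for ToyCounterCache {}
+
+/// Generic code bounds on the marker; `Send + Sync` follow from the
+/// supertraits, so they need not be repeated.
+pub fn run_on_threads<C>(cache: Arc<C>, threads: usize, work: fn(&C))
+where
+    C: ToyConcurrentCache + 'static,
+{
+    let handles: Vec<_> = (0..threads)
+        .map(|_| {
+            let cache = Arc::clone(&cache);
+            thread::spawn(move || work(&cache))
+        })
+        .collect();
+    for handle in handles {
+        handle.join().expect("worker panicked");
+    }
+}
+```
+
+Because the trait has no methods, the compiler checks nothing beyond `Send + Sync`; the `unsafe impl` line is where a reviewer has to confirm the synchronization claim by hand.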
+
+## Sharded primitives
+
+For data structures where a single `RwLock` becomes the bottleneck,
+cachekit ships sharded variants:
+
+- **`ShardedHashMapStore<K, V, S>`** — N independent shards, each its
+  own `RwLock<HashMap<…>>`. Shard selected by hashing the key with
+  the store's `BuildHasher`.
+- **`ShardedSlotArena<T>`** — N independent arenas with sharded
+  `SlotId`s. Same shape, applied to slab-style storage.
+- **`ShardedFrequencyBuckets<K>`** — N independent frequency-bucket
+  shards for LFU-family policies that want concurrent frequency
+  updates.
+
+Sharding lives at the **data-structure** layer, not the policy layer,
+because the shard count, hash function, and shard-aware key type
+(`ShardedSlotId`) all need to be visible to the policy that uses the
+primitive. A `ShardedLruCache` does not yet exist as a single type;
+it would be built by composing a `ShardedHashMapStore` with sharded
+recency lists, and that composition is roadmap.
+
+When sharding is **not** what you want:
+
+- A single concurrent wrapper is simpler and faster for caches that
+  fit on one or two cores' worth of contention.
+- Sharding multiplies the working-set fragmentation across shards.
+  A 1 M-entry cache split 16 ways has 16 caches of ~62 K each, and
+  evictions on one shard cannot rescue items on another.
+- Per-shard eviction is correct for capacity bookkeeping (each shard
+  tracks its own capacity) but **not** globally optimal: at equal total
+  capacity, a single-shard LRU typically matches or beats a sharded LRU
+  on hit rate.
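+
+Shard routing and the capacity arithmetic above can be sketched as follows. `ToyShardedMap` is a hypothetical type, not `ShardedHashMapStore`; it assumes `RandomState` as the `BuildHasher`, uses `std::sync::RwLock` in place of `parking_lot::RwLock`, and relies on `BuildHasher::hash_one` (Rust 1.71+):
+
+```rust
+use std::collections::hash_map::RandomState;
+use std::collections::HashMap;
+use std::hash::BuildHasher;
+use std::sync::RwLock;
+
+/// Hypothetical sharded map; each shard has its own lock.
+pub struct ToyShardedMap {
+    shards: Vec<RwLock<HashMap<u64, String>>>,
+    hasher: RandomState,
+}
+
+impl ToyShardedMap {
+    pub fn new(shard_count: usize) -> Self {
+        assert!(shard_count > 0, "need at least one shard");
+        Self {
+            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
+            hasher: RandomState::new(),
+        }
+    }
+
+    /// Routes a key to a shard by hashing it with the map's `BuildHasher`.
+    pub fn shard_index(&self, key: &u64) -> usize {
+        (self.hasher.hash_one(key) % self.shards.len() as u64) as usize
+    }
+
+    pub fn insert(&self, key: u64, value: String) -> Option<String> {
+        // Exactly one shard lock is taken per operation.
+        self.shards[self.shard_index(&key)].write().unwrap().insert(key, value)
+    }
+
+    pub fn get(&self, key: &u64) -> Option<String> {
+        self.shards[self.shard_index(key)].read().unwrap().get(key).cloned()
+    }
+}
+
+/// Even split of a global capacity across shards.
+pub fn per_shard_capacity(total: usize, shard_count: usize) -> usize {
+    total / shard_count
+}
+```
+
+With `per_shard_capacity(1_000_000, 16)` each shard holds 62,500 entries, and a hot key set that happens to hash into one shard cannot borrow room from the other fifteen.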
+
+## Concurrent policy coverage
+
+Of the 18 implemented policies, **3 ship with a `Concurrent*` wrapper
+today**: LRU, FIFO, S3-FIFO. The remaining 15 require external locking
+by the caller — typically `Arc<parking_lot::RwLock<CacheCore>>`. The
+relevant rustdoc on those policies (e.g. `LfuCache`, `HeapLfuCache`,
+`MfuCache`) calls this out.
+
+This is a coverage gap rather than a design choice. The pattern is
+mechanical: wrap the sequential core in `Arc<RwLock<…>>`, expose the
+`&self` API with `Arc<V>` returns, decide read-lock vs. write-lock per
+method, implement `Clone` via `Arc::clone`, and add
+`unsafe impl ConcurrentCache`. The work is bounded; what's missing is
+the discipline to do it consistently across all 18 policies.
+
+## Failure modes
+
+Three failure modes are worth naming:
+
+- **Poisoning.** `parking_lot` does **not** poison locks on panic.
+  A panic inside a critical section unwinds, releases the lock, and
+  leaves the inner core in whatever state the panic interrupted.
+  The single-threaded cores are designed to be panic-safe for
+  `Cache::insert` / `get` / `remove` — invariants are restored
+  before any potentially-panicking operation (allocation, user
+  hashing). This is a property of each core, not of the wrapper.
+- **Deadlock.** cachekit never holds two locks at once in the
+  current code. Sharded primitives acquire exactly one shard lock
+  per operation. Any future work that composes locks (e.g. a sharded
+  LRU that touches a shared recency list) must document its locking
+  order.
+- **Starvation.** `parking_lot::RwLock` uses a task-fair policy that
+  avoids both reader and writer starvation, so a steady stream of
+  readers cannot starve writers. Heavy `get`-dominated
+  workloads still serialize through the write lock, which is the
+  underlying constraint, not a fairness bug.
+
+## Future directions
+
+Tracked roughly in priority order:
+
+1. **Coverage parity.** `Concurrent*` wrappers for the remaining 15
+   policies (including LFU, Heap-LFU, MFU, LRU-K, 2Q, ARC, CAR, Clock,
+   Clock-PRO, NRU, SLRU, MRU, LIFO, Random). Mechanical work; the
+   pattern is fixed.
+2. **`ConcurrentExpiring<C>`.** TTL's concurrent wrapper, per
+   [`docs/design/ttl.md`](ttl.md) §4(e). Distinct from `Concurrent*`
+   policies because the expiry-check + remove must be atomic across
+   *both* the inner cache and the expiration index.
+3. **Sharded `Cache` wrappers.** A generic `Sharded<C: Cache<K, V>>`
+   that hashes keys to N independent inner caches. The design
+   question is how to model capacity: per-shard capacity (simple,
+   imperfect global behaviour) vs. global capacity with cross-shard
+   victim selection (correct, requires inter-shard locking).
+4. **Lock-free reads.** `peek` and `contains` paths that avoid the
+   `RwLock` entirely — `arc-swap` or seqlock-style techniques —
+   for caches whose recency state can tolerate eventual consistency.
+   Out of scope until benchmarks show the read lock is the bottleneck.
+5. **Loom testing.** Once concurrent coverage stabilises, model-check
+   the wrapper invariants under `loom`. Particularly valuable for the
+   atomic check-and-act sequences in TTL and sharded composition.
+
+## See also
+
+- [Design overview](design.md) — §3 frames concurrency at the
+  principles level
+- [TTL design](ttl.md) — applied case for `ConcurrentExpiring<C>`
+- [Cache trait hierarchy](trait-hierarchy.md) — read/mutate split and
+  object-safety rationale
+- [Stores](../stores/README.md) — `ConcurrentStoreRead` /
+  `ConcurrentStore` trait family
+- [`src/store/traits.rs`](../../src/store/traits.rs) — concurrent
+  store traits
+- [`src/traits.rs`](../../src/traits.rs) — `ConcurrentCache` marker
diff --git a/docs/design/design.md b/docs/design/design.md
index 92f84d6..dcc01e7 100644
--- a/docs/design/design.md
+++ b/docs/design/design.md
@@ -1,14 +1,23 @@
-Designing high-performance caches in Rust is a multi-disciplinary problem: data structures, memory layout, concurrency, workload modeling, and systems-level performance all matter. The points below reflect what moves the needle in practice across systems, services, and libraries.
+# Design Overview
 
-For interface and API decisions, the [Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html) is a useful companion for consistent, ergonomic design.
+This document collects the design principles that shape `cachekit`. Each
+section pairs a principle with the concrete artifact in the source tree
+that realizes it, so the prose stays grounded in the code rather than
+floating as advice.
+
+For a worked example that applies every principle below to one feature,
+see the [TTL design doc](ttl.md). For interface conventions, the
+[Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html)
+is the companion reference; module-level documentation follows the
+[doc style guide](style-guide.md).
 
 ## 1. Workload First, Policy Second
 
 Cache policy only matters relative to workload.
 
 Identify access patterns:
-- Hotset-heavy traffic: skewed keys, high churn.
-- Scan-heavy traffic: large working sets, weak locality.
+- Hot-set traffic: skewed keys, low churn on the hot set, high churn at the tail.
+- Scan-heavy traffic: large working sets, weak temporal locality.
 - Mixed traffic: bursts of hot data over large cold sets.
 
 Measure:
@@ -17,29 +26,43 @@ Measure:
 - Temporal vs spatial locality.
 
 Choose policies accordingly:
-- LRU: good for temporal locality, bad for scans.
-- LRU-K / 2Q (roadmap): better at filtering one-off accesses.
-- Clock / ARC (roadmap): lower overhead, more adaptive.
+- `LRU` / `Clock`: good for temporal locality, vulnerable to scans.
+- `LRU-K` / `2Q` / `SLRU`: better at filtering one-off accesses.
+- `ARC` / `CAR`: adaptive recency/frequency balance without manual tuning.
+- `S3-FIFO` / `Heap-LFU`: strong general-purpose defaults under scans.
+
+All of the above ship today; see [`docs/policies/`](../policies/README.md)
+for the implemented catalog and [`docs/policies/roadmap/`](../policies/roadmap/README.md)
+for planned policies (LIRS, TinyLFU, SIEVE, GDS/GDSF, etc.).
 
-Never design a "general purpose" cache first; design for the workload you expect.
+When picking a policy or tuning a cache, design for the workload you
+expect — not the average of all workloads.
 
 ## 2. Memory Layout Matters More Than Algorithms
 
 In a cache, memory layout often dominates policy.
 
 Prefer:
-- Contiguous storage (Vec, slabs, arenas).
+- Contiguous storage (`Vec`, slabs, arenas).
 - Index-based indirection over pointer chasing.
 
 Avoid:
-- Excessive Box, Arc, linked lists.
-- HashMap lookups in hot paths if avoidable.
+- Excessive `Box`, `Arc`, linked lists with heap-allocated nodes.
+- `HashMap` lookups in hot paths if avoidable.
 
 Techniques:
 - Store metadata (recency, freq, flags) in tightly packed structs.
 - Separate hot metadata from cold payloads.
 - Use slab allocators for fixed-size entries.
 
+cachekit realizes this through reusable building blocks under
+[`src/ds/`](../../src/ds): [`SlotArena`](../../src/ds/slot_arena.rs)
+hands out stable `Handle`s backed by a `Vec`, [`IntrusiveList`](../../src/ds/intrusive_list.rs)
+threads recency lists through those slots without per-node allocation,
+and [`ClockRing`](../../src/ds/clock_ring.rs) keeps Clock-style state in
+a single contiguous array. See [`docs/policy-ds/`](../policy-ds/README.md)
+for the full primitive catalog.
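+
+The index-based idea can be sketched in a few lines. `ToyArena` is a hypothetical type; cachekit's `SlotArena` API may differ, and this sketch returns bare `usize` indices rather than typed handles:
+
+```rust
+/// Hypothetical index-based arena backed by a single `Vec`.
+pub struct ToyArena<T> {
+    slots: Vec<Option<T>>,
+    free: Vec<usize>, // indices of vacated slots, reused before growing
+}
+
+impl<T> ToyArena<T> {
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self { slots: Vec::with_capacity(capacity), free: Vec::new() }
+    }
+
+    /// Returns a stable index; existing indices never move.
+    pub fn insert(&mut self, value: T) -> usize {
+        match self.free.pop() {
+            Some(idx) => {
+                self.slots[idx] = Some(value);
+                idx
+            }
+            None => {
+                self.slots.push(Some(value));
+                self.slots.len() - 1
+            }
+        }
+    }
+
+    pub fn get(&self, idx: usize) -> Option<&T> {
+        self.slots.get(idx).and_then(Option::as_ref)
+    }
+
+    /// Vacates the slot and records it on the free list for reuse.
+    pub fn remove(&mut self, idx: usize) -> Option<T> {
+        let value = self.slots.get_mut(idx)?.take();
+        if value.is_some() {
+            self.free.push(idx);
+        }
+        value
+    }
+}
+```
+
+One limitation of this sketch: a stale index can silently alias a reused slot. Production arenas commonly pair the index with a generation counter to detect that case.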
+
 Cache misses caused by your own data structure are as bad as upstream misses.
 
 ## 3. Concurrency Strategy Is Core Design, Not a Wrapper
@@ -48,35 +71,54 @@ Locking strategy shapes everything.
 
 Options:
 - Global lock: simple, often fast enough for small cores, dies under high contention.
-- Sharded caches: hash key -> shard, each shard independently locked.
+- Sharded caches: hash key → shard, each shard independently locked.
 - Lock-free or mostly-lock-free: hard in Rust, only worth it if contention dominates.
 
+cachekit ships the first option today via the `concurrency` feature:
+`Concurrent*` wrappers (e.g. `ConcurrentLruCache`, `ConcurrentSlotArena`,
+`ConcurrentClockRing`) place a `parking_lot::RwLock` around the
+single-threaded core. The wrappers deliberately do **not** implement
+`Cache<K, V>` directly when that would force returning `&V` across a
+lock boundary — they expose `Option<Arc<V>>` style APIs instead. See
+[`src/policy/lru.rs`](../../src/policy/lru.rs),
+[`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs), and
+[`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs).
+
 Rust-specific notes:
-- When `std` is available, prefer `parking_lot` locks over `std::sync` for lower overhead and better ergonomics.
-- Avoid Arc<Mutex<...>> in hot paths.
-- Consider per-thread caches with periodic merge.
-- Consider RCU-style read paths for read-heavy caches.
+- For `RwLock`, prefer `parking_lot` for fairness control and lower
+  uncontended overhead. For `Mutex`, the futex-based `std::sync::Mutex`
+  on Rust 1.85+ is competitive on Linux/macOS; `parking_lot::Mutex`
+  still wins on raw uncontended speed and offers nicer guard ergonomics.
+- Avoid `Arc<Mutex<…>>` in hot paths.
+
+Future directions worth exploring but **not currently implemented**:
+sharded caches (hash key → shard, per-shard lock), per-thread caches with
+periodic merge, and RCU-style read paths for read-heavy workloads.
 
 ## 4. Avoid Per-Operation Allocation
 
 Allocations kill throughput.
 
 Pre-allocate:
-- Entry pools.
-- Node arrays.
+- Entry pools — see [`SlotArena`](../../src/ds/slot_arena.rs) and the
+  free-list discipline in [`src/store/slab.rs`](../../src/store/slab.rs).
+- Node arrays — intrusive lists thread through arena slots rather than
+  allocating per-node (see [`src/ds/intrusive_list.rs`](../../src/ds/intrusive_list.rs)).
 
 Reuse:
-- Free lists.
-- Slabs.
+- Free lists (slab-backed).
+- Slabs sized once at construction time via `CacheBuilder::new(capacity)`.
 
 Use:
-- Vec with capacity management.
-- Custom allocators if necessary.
+- `Vec` with explicit capacity management.
+- `rustc-hash` for cheap key hashing in hot-path lookups.
 
 Avoid:
-- Creating new Arc, String, Vec per lookup.
+- Creating new `Arc`, `String`, `Vec` per lookup.
+- Hidden clones of `K` on the eviction path.
 
-If malloc shows up in your flamegraph, your cache is already slow.
+If `malloc` shows up in your flamegraph, your cache is already slow.
 
 ## 5. Eviction Must Be Predictable and Cheap
 
@@ -87,12 +129,17 @@ O(1) eviction is the goal.
 Avoid unbounded tree walks or scans in eviction paths.
 
 Maintain:
-- Direct pointers/indices to eviction candidates.
-- Eviction lists or clock hands.
+- Direct indices / `Handle`s to eviction candidates (see
+  [`src/store/handle.rs`](../../src/store/handle.rs) and the
+  [`Cache`](../../src/store/traits.rs) trait).
+- Eviction lists or clock hands (intrusive list head, `ClockRing` hand).
+- Lazy heaps where amortized O(log n) is acceptable
+  ([`LazyMinHeap`](../../src/ds/lazy_heap.rs); used by Heap-LFU and TTL).
 
 Be careful with:
 - Background eviction threads (synchronization overhead).
-- Lazy cleanup that grows unbounded.
+- Lazy cleanup that grows unbounded; bound it with rebuild thresholds
+  (e.g. `LazyMinHeap::with_auto_rebuild`).
 
 Eviction cost must be comparable to lookup cost, not orders of magnitude higher.
 
@@ -102,13 +149,21 @@ You cannot tune what you do not measure.
 
 Track at least:
 - Hit / miss rate.
-- Eviction count and reason.
+- Eviction count and reason (capacity vs. expiration).
 - Insert/update rate.
+
+cachekit exposes these through [`StoreMetrics`](../../src/store/traits.rs)
+and per-policy metric structs (e.g. `LruMetrics`), gated behind the
+`metrics` feature so non-instrumented builds pay nothing. The
+`expirations` counter on `Expiring<C>` follows the same pattern (see
+[`src/policy/expiring.rs`](../../src/policy/expiring.rs)).
+
+Roadmap counters:
 - Scan pollution rate.
-- Lock contention or wait time (roadmap).
+- Lock contention or wait time.
 
 Expose:
-- Lightweight counters in hot path.
+- Lightweight counters in the hot path.
 - Optional detailed metrics behind feature flags.
 
 Metrics should guide design decisions, not justify them afterward.
@@ -116,14 +171,24 @@ Metrics should guide design decisions, not justify them afterward.
 ## 7. Separate Policy From Storage
 
 Design in layers:
-- Storage layer: how entries live in memory, allocation, layout, indexing.
-- Policy layer: LRU, FIFO, LFU, LRU-K (roadmap: Clock/ARC/2Q, etc; see [Policy roadmap](../policies/roadmap/README.md)); only manipulates metadata and ordering.
-- Integration layer: ties application objects, payloads, or IDs into cache entries.
+- Storage layer: how entries live in memory, allocation, layout,
+  indexing — [`src/store/`](../../src/store).
+- Policy layer: LRU, FIFO, LFU, LRU-K, 2Q, ARC, CAR, Clock, Clock-PRO,
+  S3-FIFO, … — manipulates metadata and ordering only
+  ([`src/policy/`](../../src/policy)).
+- Capability layer: opt-in extension traits ([`RecencyTracking`](../../src/traits.rs),
+  `FrequencyTracking`, `HistoryTracking`, `ExpiringCache`) that policies
+  implement when the underlying signal exists. This is how `Expiring<C>`
+  composes over any policy without touching policy code.
+- Integration layer: ties application objects, payloads, or IDs into
+  cache entries via [`CacheBuilder`](../../src/builder.rs) and the
+  `DynCache` runtime dispatcher.
 
 Related docs:
 - [Policy overview](../policies/README.md)
 - [Policy roadmap](../policies/roadmap/README.md)
 - [Policy data structures](../policy-ds/README.md)
+- [Read-only traits](../guides/read-only-traits.md)
 
 This makes:
 - Benchmarking easier.
@@ -135,15 +200,16 @@ This makes:
 Ergonomics often cost performance.
 
 Avoid in critical loops:
-- Heavy generics causing code bloat.
+- Heavy generics causing code bloat across many monomorphizations.
 - Trait objects for hot dispatch.
 - Closures capturing state.
-- Iterator chains instead of simple loops.
+- Iterator chains where a plain `for` loop would do.
 
 Prefer:
 - Explicit loops.
-- Concrete types.
-- Monomorphized fast paths.
+- Concrete types and monomorphized fast paths.
+- Enum dispatch over `Box<dyn Trait>` when polymorphism is needed at the
+  edges — this is exactly the trade `DynCache` makes (see §13).
 
 You can wrap fast internals in nice APIs at the edges.
 
@@ -154,15 +220,17 @@ In scan-heavy workloads:
 Large sequential reads destroy LRU-style caches.
 
 Solutions:
-- Scan-resistant policies (LRU-K, 2Q/ARC are roadmap).
+- Scan-resistant policies: `LRU-K`, `2Q`, `SLRU`, `ARC`, `CAR`,
+  `Clock-PRO`, `S3-FIFO`, `Heap-LFU` — all implemented today.
 - Explicit "scan mode" hints from the caller or workload layer.
 - Bypass cache for known one-shot reads.
 
-If you ignore scans, your cache will look great in microbenchmarks and terrible in production.
+If you ignore scans, your cache will look great in microbenchmarks and
+terrible in production.
 
 ## 10. Benchmark Like a System, Not a Library
 
-Do not rely on random key benchmarks.
+Do not rely on uniform-random key benchmarks.
 
 Use:
 - Zipfian distributions.
@@ -176,22 +244,31 @@ Measure:
 - Memory overhead.
 - Eviction cost.
 
-A cache that is 5% faster on random keys but 50% worse under scans is a bad cache.
+cachekit's benchmark harness covers these dimensions; see
+[`docs/benchmarks/workloads.md`](../benchmarks/workloads.md) and the
+runners under [`benches/`](../../benches).
+
+A cache that is 5 % faster on uniform-random keys but 50 % worse under
+scans is a bad cache.
 
-## 11. Rust-Specific Pitfalls
+## 11. Rust Hot-Path Hazards Beyond Allocation
 
-Arc is expensive in hot paths.
+`Arc` is expensive in hot paths; minimize it and lift `Arc::clone` out
+of inner loops.
 
-Borrow checker can push you toward indirection—fight it with:
-- Index-based access.
-- Interior mutability only where unavoidable.
+The borrow checker can push you toward indirection — fight it with:
+- Index-based access (`Handle`s, slot indices) instead of `&mut` chains.
+- Interior mutability only where unavoidable; prefer `Cell<T>` over
+  `RefCell<T>` when `T: Copy`, and atomics when the value is shared
+  across threads.
 
 Beware of:
-- Hidden clones.
-- Trait object dispatch.
-- Over-generic designs.
+- Hidden clones, particularly of keys on the eviction path.
+- Trait object dispatch on read/insert.
+- Over-generic designs whose monomorphization cost dwarfs their benefit.
 
-Rust can be as fast as C, but only if you design like a systems programmer, not a library author.
+Rust can match C on hot paths, but only when systems-level discipline
+survives contact with the type system.
 
 ## 12. Design for Failure Modes
 
@@ -207,13 +284,79 @@ Add:
 
 A cache that collapses under stress is worse than no cache.
 
+## 13. Compile-Time and Runtime Composition
+
+cachekit's externally visible surface is shaped by two composition
+mechanisms that together let users pay only for what they use.
+
+**Per-policy feature flags.** Every policy is behind a Cargo feature
+(`policy-lru`, `policy-s3-fifo`, …), with `policy-all` for "everything"
+and a small default of `policy-s3-fifo`, `policy-lru`, `policy-fast-lru`,
+`policy-lru-k`, `policy-clock`. Optional capabilities are gated the
+same way: `metrics`, `concurrency`, `serde`, and `ttl`. Downstream
+crates can disable defaults and select the minimum surface they need;
+see [`Cargo.toml`](../../Cargo.toml).
+
+**Capability traits + runtime dispatch.** Extension traits
+([`RecencyTracking`](../../src/traits.rs), `FrequencyTracking`,
+`HistoryTracking`, `ExpiringCache`) keep optional behavior off the
+core `Cache<K, V>` trait. For ergonomic builder construction without
+forcing trait objects on the user, [`CacheBuilder`](../../src/builder.rs)
+returns a [`DynCache<K, V>`](../../src/builder.rs) that dispatches via
+an internal enum match rather than `Box<dyn Cache>`. When TTL is
+enabled, the builder returns a sibling `DynExpiringCache<K, V>` that
+threads the expiry check around each variant's `Cache` call — a worked
+example of capability composition. See [`docs/design/ttl.md`](ttl.md)
+for the full design and [`src/policy/expiring.rs`](../../src/policy/expiring.rs)
+for the decorator itself.
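+
+The enum-dispatch trade can be sketched with two toy policies. Every name here (`ToyCache`, `ToyFifo`, `ToyLifo`, `ToyDynCache`, `ToyPolicy`, `build`) is hypothetical and much smaller than the real `Cache<K, V>` / `DynCache<K, V>` surface; capacities are assumed to be non-zero:
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+/// Minimal stand-in for the core trait (two methods only).
+pub trait ToyCache {
+    fn insert(&mut self, key: u64, value: u64);
+    fn contains(&self, key: &u64) -> bool;
+}
+
+/// Evicts the oldest inserted key when over capacity.
+pub struct ToyFifo { cap: usize, order: VecDeque<u64>, map: HashMap<u64, u64> }
+
+/// Evicts the most recently inserted key when full.
+pub struct ToyLifo { cap: usize, stack: Vec<u64>, map: HashMap<u64, u64> }
+
+impl ToyCache for ToyFifo {
+    fn insert(&mut self, key: u64, value: u64) {
+        if self.map.insert(key, value).is_none() {
+            self.order.push_back(key);
+            if self.order.len() > self.cap {
+                if let Some(oldest) = self.order.pop_front() {
+                    self.map.remove(&oldest);
+                }
+            }
+        }
+    }
+    fn contains(&self, key: &u64) -> bool { self.map.contains_key(key) }
+}
+
+impl ToyCache for ToyLifo {
+    fn insert(&mut self, key: u64, value: u64) {
+        if self.map.contains_key(&key) {
+            self.map.insert(key, value); // update in place, no eviction
+            return;
+        }
+        if self.stack.len() == self.cap {
+            if let Some(newest) = self.stack.pop() {
+                self.map.remove(&newest);
+            }
+        }
+        self.stack.push(key);
+        self.map.insert(key, value);
+    }
+    fn contains(&self, key: &u64) -> bool { self.map.contains_key(key) }
+}
+
+/// Runtime-selected policy without `Box<dyn ToyCache>`.
+pub enum ToyDynCache { Fifo(ToyFifo), Lifo(ToyLifo) }
+
+impl ToyCache for ToyDynCache {
+    fn insert(&mut self, key: u64, value: u64) {
+        // A `match` on a closed set of variants instead of a vtable call.
+        match self {
+            Self::Fifo(c) => c.insert(key, value),
+            Self::Lifo(c) => c.insert(key, value),
+        }
+    }
+    fn contains(&self, key: &u64) -> bool {
+        match self {
+            Self::Fifo(c) => c.contains(key),
+            Self::Lifo(c) => c.contains(key),
+        }
+    }
+}
+
+pub enum ToyPolicy { Fifo, Lifo }
+
+/// Builder-style constructor choosing the variant at runtime.
+pub fn build(policy: ToyPolicy, cap: usize) -> ToyDynCache {
+    match policy {
+        ToyPolicy::Fifo => ToyDynCache::Fifo(ToyFifo { cap, order: VecDeque::new(), map: HashMap::new() }),
+        ToyPolicy::Lifo => ToyDynCache::Lifo(ToyLifo { cap, stack: Vec::new(), map: HashMap::new() }),
+    }
+}
+```
+
+The cost of this design is that adding a policy means adding a variant and one arm per method, which is the price of keeping dispatch static.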
+
 ## Bottom Line
 
-High-performance caches are not about clever algorithms—they are about:
+High-performance caches are not about clever algorithms — they are about:
 - Memory layout.
 - Allocation discipline.
 - Contention control.
 - Eviction predictability.
 - Workload realism.
 
-In Rust, your main enemy is not safety—it is abstraction overhead and accidental allocation. Design from the metal upward, then wrap it in something pleasant to use.
+In Rust, your main enemy is not safety — it is abstraction overhead and
+accidental allocation. Design from the metal upward, then wrap it in
+something pleasant to use.
+
+## See Also
+
+Design docs:
+- [Concurrency](concurrency.md) — `Concurrent*` wrappers, `RwLock`
+  discipline, sharded primitives, `ConcurrentCache` marker
+- [Cache trait hierarchy](trait-hierarchy.md) — `Cache<K, V>` kernel,
+  capability traits, read/mutate split, object safety
+- [Builder and runtime dispatch](builder-and-dyn-dispatch.md) —
+  `CachePolicy`, `DynCache`, enum-vs-`Box<dyn>` trade-off, adding new
+  policies
+- [Weighted eviction](weighted-eviction.md) — `WeightStore` dual
+  limits, weight function contract, GDS/GDSF pre-staging
+- [Metrics](metrics.md) — recorder / snapshot / exporter split,
+  `MetricsCell`, Prometheus exporter, feature gating
+- [Error model](error-model.md) — panic vs `Result` discipline,
+  four error types, debug-only invariant checks
+- [Benchmarking](benchmarking.md) — benchmark layers, monomorphic policy
+  registry, JSON artifact schema, reproducibility rules
+- [Hashing and key identity](hashing.md) — hasher choices, `KeyInterner`,
+  `ShardSelector`, HashDoS trade-offs
+- [Sharding](sharding.md) — current sharded primitives, routing,
+  capacity semantics, roadmap for sharded caches
+- [Serialization](serialization.md) — current `serde` surface, cache-state
+  persistence boundaries, TTL and hash-seed rules
+- [Non-goals](non-goals.md) — explicit boundaries for what cachekit does
+  not try to be
+- [TTL](ttl.md) — applied example of every principle above
+- [Doc style guide](style-guide.md)
+
+Reference docs:
+- [Policy overview](../policies/README.md) and [roadmap](../policies/roadmap/README.md)
+- [Policy data structures](../policy-ds/README.md)
+- [Stores](../stores/README.md)
+- [Read-only traits](../guides/read-only-traits.md)
+- [Choosing a policy](../guides/choosing-a-policy.md)
+- [Benchmarks overview](../benchmarks/overview.md) and [workloads](../benchmarks/workloads.md)
+- [Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html)
diff --git a/docs/design/error-model.md b/docs/design/error-model.md
new file mode 100644
index 0000000..4f73a8f
--- /dev/null
+++ b/docs/design/error-model.md
@@ -0,0 +1,341 @@
+# Error Model
+
+> Status: design rationale for cachekit's panic-vs-`Result` discipline,
+> the four error types in the public API, and the debug-only invariant
+> checks. Companion to [`design.md`](design.md) and [`src/error.rs`](../../src/error.rs).
+
+cachekit treats error handling as a design question, not an ergonomics
+question. The rule is:
+
+> **Panic on programming errors. Return `Result` for user-supplied
+> input. Reserve invariant checks for `debug_assertions`.**
+
+This document explains where each side of that rule applies, why the
+four shipped error types each exist as separate types, and what
+discipline a new error type needs to follow.
+
+## The three tiers
+
+cachekit divides every failure mode into one of three tiers, each with
+its own response:
+
+| Tier | Cause | Response | Example |
+|---|---|---|---|
+| 1. Programming error | Bug in the caller's code, statically detectable in principle | Panic | `LruK::with_k(10, 0)` (k = 0) |
+| 2. User-supplied input | Configuration arriving from outside the program | `Result<_, ErrorType>` | `S3FifoCache::try_with_ratios(_, 2.0, _)` |
+| 3. Invariant violation | Internal data-structure corruption (cannot reach in normal use) | `debug_assert` + `InvariantError` (test/debug only) | `pop_front` while queue length is zero |
+
+The tiers are not opinions — they map to specific Rust constructs and
+runtime behaviours. Mixing them (panicking on tier 2, returning
+`Result` from tier 3) produces APIs that are either ergonomically
+heavy or operationally unsafe.
+
+## Tier 1: panic on programming errors
+
+A "programming error" is a precondition violation the caller could
+have prevented with an `if` or a type. cachekit panics in this case
+rather than returning `Result`, because:
+
+- The bug is in **the caller's code**, not in untrusted input the
+  caller is forwarding.
+- The right fix is for the caller to fix their code, not to handle
+  an error path at the call site.
+- Forcing every call site to handle `Result<_, "you passed 0 for capacity">`
+  for a bug they could have prevented adds friction without
+  catching anything new.
+
+The shipped examples:
+
+- `CacheBuilder::build` panics on `capacity == 0`, `k == 0` for LRU-K,
+  and `probation_frac > 1.0` for 2Q. The validation is centralised in
+  `validate_policy` ([`src/builder.rs`](../../src/builder.rs)).
+- Direct constructors (`LruCore::new`, `S3FifoCache::new`) panic on
+  invalid arguments. The fallible counterparts (`try_with_ratios`,
+  `try_with_capacity`) exist for tier 2.
+- `assert!(*k > 0, "LruK: k must be greater than 0")` in
+  `CacheBuilder::validate_policy` is the canonical shape: a clear
+  message that identifies the parameter and the constraint.
+
+The cost is that a panicking call site terminates under the crate's
+default `panic = "abort"` release profile. This is intentional —
+cachekit's `panic = "abort"` is documented in the
+[`Cargo.toml`](../../Cargo.toml) release profile, and the rationale
+is that a panic in cache code under load is a bug worth surfacing
+through the supervisor / restart strategy, not unwinding.
+
+## Tier 2: `Result` for user-supplied input
+
+When the failure mode is "user passes us configuration we don't
+recognise as valid," return `Result`. The shipped error types each
+cover a specific surface:
+
+### `ConfigError` — invalid configuration parameters
+
+```rust,ignore
+pub struct ConfigError(String);
+```
+
+Defined in [`src/error.rs`](../../src/error.rs). Returned by fallible
+constructors that accept user-tunable knobs:
+
+- `S3FifoCache::try_with_ratios(capacity, small_ratio, ghost_ratio)`
+- Future `try_build` variants on `CacheBuilder`
+
+The contained `String` carries a human-readable description of which
+parameter failed validation. By convention messages are lowercase,
+unpunctuated, and identify the parameter: `"capacity must be greater
+than zero"`, `"small_ratio must be in 0.0..=1.0"`.
+
+`ConfigError`'s presence on a constructor signals that the parameter
+set can legitimately come from outside the program — a config file,
+a CLI flag, an HTTP request — and the caller should handle invalid
+input gracefully rather than crashing the process.
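+
+A minimal sketch of the tier 1 / tier 2 pairing, assuming a
+hypothetical `Ratios` type (not part of cachekit) and a local
+stand-in for `ConfigError`. The fallible constructor does the
+validation; the panicking constructor is a thin wrapper for callers
+who wrote the arguments themselves:
+
+```rust
+use std::fmt;
+
+/// Stand-in for the crate's `ConfigError`; the real type lives in `src/error.rs`.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct ConfigError(String);
+
+impl fmt::Display for ConfigError {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(&self.0)
+    }
+}
+
+impl std::error::Error for ConfigError {}
+
+/// Hypothetical configuration holder, used only for illustration.
+#[derive(Debug, PartialEq)]
+pub struct Ratios {
+    pub capacity: usize,
+    pub small_ratio: f64,
+}
+
+impl Ratios {
+    /// Tier 2: values may come from a config file, so report instead of panicking.
+    pub fn try_new(capacity: usize, small_ratio: f64) -> Result<Self, ConfigError> {
+        if capacity == 0 {
+            return Err(ConfigError("capacity must be greater than zero".into()));
+        }
+        // NaN is outside every range, so it is rejected here as well.
+        if !(0.0..=1.0).contains(&small_ratio) {
+            return Err(ConfigError("small_ratio must be in 0.0..=1.0".into()));
+        }
+        Ok(Self { capacity, small_ratio })
+    }
+
+    /// Tier 1: the caller wrote the arguments, so a bad value is a bug.
+    pub fn new(capacity: usize, small_ratio: f64) -> Self {
+        Self::try_new(capacity, small_ratio).unwrap_or_else(|e| panic!("Ratios: {}", e))
+    }
+}
+```
+
+Keeping the checks in one place means the two entry points cannot
+drift apart on what counts as valid.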
+
+### `StoreFull` — capacity-bound failure
+
+```rust,ignore
+pub struct StoreFull;
+```
+
+Zero-sized type defined in
+[`src/store/traits.rs`](../../src/store/traits.rs). Returned by
+`StoreMut::try_insert` and `ConcurrentStore::try_insert` when the
+store is at capacity and the insert would exceed it. The contract:
+
+- **`StoreFull` is not a panic.** A full store under capacity
+  pressure is the **expected** outcome of `try_insert`. The caller —
+  typically a policy layered on top — must respond by evicting and
+  retrying.
+- **The store does not evict on its own.** `StoreFull` is the
+  signal that says "you, policy, decide who to evict." This is the
+  core of the policy/storage separation rule from
+  [`design.md`](design.md) §7.
+- **The error carries no data.** The caller knows what they tried
+  to insert; `StoreFull` adds nothing useful by retaining it.
+
+`StoreFull` is **not** in `src/error.rs` despite being an error
+type. It lives alongside the trait that returns it because the
+two are co-evolving and the surface is small enough that the
+co-location aids readability.
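+
+The evict-and-retry contract can be sketched as follows. This is an
+illustration under stated assumptions: `BoundedStore` and
+`FifoPolicy` are hypothetical stand-ins, not the crate's `StoreMut`
+implementations, and the local `StoreFull` only mirrors the shape
+shown above.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+/// Stand-in for the crate's zero-sized `StoreFull`.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct StoreFull;
+
+/// Hypothetical fixed-capacity store: it refuses inserts but never evicts.
+pub struct BoundedStore {
+    map: HashMap<u64, String>,
+    capacity: usize,
+}
+
+impl BoundedStore {
+    pub fn new(capacity: usize) -> Self {
+        Self { map: HashMap::new(), capacity }
+    }
+
+    /// Updates always succeed; a new key fails once the store is at capacity.
+    pub fn try_insert(&mut self, key: u64, value: String) -> Result<(), StoreFull> {
+        if !self.map.contains_key(&key) && self.map.len() >= self.capacity {
+            return Err(StoreFull);
+        }
+        self.map.insert(key, value);
+        Ok(())
+    }
+
+    pub fn remove(&mut self, key: &u64) -> Option<String> {
+        self.map.remove(key)
+    }
+
+    pub fn contains(&self, key: &u64) -> bool {
+        self.map.contains_key(key)
+    }
+}
+
+/// Hypothetical FIFO policy layered on top: it owns the eviction decision.
+pub struct FifoPolicy {
+    store: BoundedStore,
+    order: VecDeque<u64>,
+}
+
+impl FifoPolicy {
+    pub fn new(capacity: usize) -> Self {
+        Self { store: BoundedStore::new(capacity), order: VecDeque::new() }
+    }
+
+    pub fn insert(&mut self, key: u64, value: String) {
+        let is_new = !self.store.contains(&key);
+        // On `StoreFull`, evict the oldest key and retry.
+        while let Err(StoreFull) = self.store.try_insert(key, value.clone()) {
+            match self.order.pop_front() {
+                Some(victim) => {
+                    self.store.remove(&victim);
+                }
+                None => return, // zero-capacity store: nothing left to evict
+            }
+        }
+        if is_new {
+            self.order.push_back(key);
+        }
+    }
+
+    pub fn contains(&self, key: &u64) -> bool {
+        self.store.contains(key)
+    }
+}
+```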
+
+### `LazyMinHeapError` — `ds`-layer fallible construction
+
+```rust,ignore
+pub enum LazyMinHeapError {
+    CapacityTooLarge { requested: usize, max: usize },
+    Allocation(std::collections::TryReserveError),
+}
+```
+
+Defined in [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs).
+Returned by `LazyMinHeap::try_with_capacity` when:
+
+- The requested capacity exceeds the internal `MAX_CAPACITY` bound,
+  or
+- The allocator cannot satisfy the reservation.
+
+The enum exposes both failure modes distinctly because a caller may
+want to retry on `Allocation` (transient memory pressure) but not on
+`CapacityTooLarge` (logic bug or genuinely-too-big request that
+won't recover).
+
+The pattern generalises: a future "fallible-construction" error type
+on any `ds` primitive that pre-allocates should distinguish "you
+asked for too much" from "we couldn't get what you asked for."
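+
+A sketch of how a caller might branch on the two failure modes. The
+`HeapError` enum, the `MAX_CAPACITY` value, and both functions are
+assumptions made for illustration; only the variant shape follows
+`LazyMinHeapError` above.
+
+```rust
+use std::collections::TryReserveError;
+
+/// Hypothetical bound, for illustration only; the crate's `MAX_CAPACITY` may differ.
+const MAX_CAPACITY: usize = 1 << 20;
+
+/// Mirrors the shape of `LazyMinHeapError`.
+#[derive(Debug)]
+pub enum HeapError {
+    CapacityTooLarge { requested: usize, max: usize },
+    Allocation(TryReserveError),
+}
+
+/// Hypothetical fallible constructor for a pre-allocating primitive.
+pub fn try_with_capacity(requested: usize) -> Result<Vec<u64>, HeapError> {
+    // "You asked for too much": checked before touching the allocator.
+    if requested > MAX_CAPACITY {
+        return Err(HeapError::CapacityTooLarge { requested, max: MAX_CAPACITY });
+    }
+    // "We couldn't get what you asked for": surfaced from the allocator.
+    let mut slots = Vec::new();
+    slots.try_reserve_exact(requested).map_err(HeapError::Allocation)?;
+    Ok(slots)
+}
+
+/// Caller-side rule: only allocator failures are worth retrying.
+pub fn should_retry(err: &HeapError) -> bool {
+    matches!(err, HeapError::Allocation(_))
+}
+```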
+
+### `std::collections::TryReserveError` — passthrough
+
+Some `try_new` constructors (`HashMapStore::try_new`,
+`ConcurrentHashMapStore::try_new`) return the standard
+`TryReserveError` directly rather than wrapping it. The reason: the
+only failure mode is allocator pressure, and `TryReserveError`
+already says exactly that. Wrapping it would add a layer for no
+information.
+
+The shape is: if cachekit has a distinct failure mode of its own
+(`CapacityTooLarge`, `StoreFull`), wrap or define a new type; if the
+only failure mode is "the allocator said no," return the standard
+type and let the caller's error-handling stack absorb it.
+
+## Tier 3: invariant checks (debug-only)
+
+```rust,ignore
+pub struct InvariantError(String);
+```
+
+Defined in [`src/error.rs`](../../src/error.rs). Returned by
+`check_invariants` methods on internal data structures:
+
+```rust,ignore
+impl<K, V> S3FifoCache<K, V> {
+    #[cfg(any(debug_assertions, test))]
+    pub fn check_invariants(&self) -> Result<(), InvariantError> {
+        if self.small.len() + self.main.len() != self.map.len() {
+            return Err(InvariantError::new("queue length mismatch"));
+        }
+        // …
+        Ok(())
+    }
+}
+```
+
+Three properties define the tier:
+
+- **Off the hot path.** `check_invariants` is called from tests,
+  fuzz harnesses, and `debug_assertions` paths. It is never called
+  from normal `insert` / `get` / `evict`.
+- **Internal-only.** The invariants are about data-structure
+  integrity: "the queue length matches the map length", "the heap
+  is in heap order", "the ghost list hasn't grown past its bound."
+  No caller program would meaningfully react to one of these
+  failing — the cache is corrupted, the right response is to
+  capture state and bail.
+- **Returns `Result` instead of panicking.** Counter-intuitive given the
+  tier-1 rule. The reason: `check_invariants` is called by
+  diagnostic code that wants to **report** the violation (in a test
+  failure message, a fuzz reproducer, a debug-mode assertion's
+  output) rather than crash. Returning `Result` lets the caller
+  format the failure; if they want to panic, they `unwrap()`.
+
+`InvariantError` carries the same `String`-message shape as
+`ConfigError`, by the same convention: lowercase, unpunctuated,
+identifying the specific invariant.
+
+## Why four error types, not one
+
+A single `CachekitError` enum could in principle subsume all four.
+cachekit doesn't ship one, deliberately. Three reasons:
+
+- **Each surface has different recovery semantics.** `StoreFull`
+  means "evict and retry"; `ConfigError` means "fix your config";
+  `LazyMinHeapError::Allocation` means "back off and retry";
+  `InvariantError` means "we have a bug, capture state." A unified
+  enum forces every caller to either match exhaustively (most of
+  which can't happen at their call site) or use a catch-all that
+  loses information.
+- **Each lives near the trait that uses it.** `StoreFull` lives in
+  `src/store/traits.rs`; `LazyMinHeapError` lives in
+  `src/ds/lazy_heap.rs`; `ConfigError` and `InvariantError` live
+  in `src/error.rs`. Co-location helps maintenance — adding a new
+  failure mode to one surface doesn't ripple through the others.
+- **Sum types compose poorly across abstractions.** A unified
+  enum would propagate every variant up through every layer that
+  touched it. The current shape lets a layer convert (or
+  re-wrap) only the errors it cares about.
+
+The cost is that downstream code wanting to catch "any cachekit
+error" has to enumerate all four. The mitigation is that no
+realistic downstream code wants that — each call site touches one
+surface at a time and handles that surface's error.
+
+## Operational contract: panic profile
+
+The crate's release profile sets `panic = "abort"`:
+
+```toml
+[profile.release]
+panic = "abort"
+```
+
+Two implications worth naming:
+
+- **A panic terminates the process.** No unwind, no destructors,
+  no observer recovery. A panicking weight function in
+  `ConcurrentWeightStore` (see
+  [`weighted-eviction.md`](weighted-eviction.md)) kills the
+  process; `parking_lot` locks never poison, and under
+  `panic = "abort"` the process is gone before any
+  observer can read half-updated state.
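+- **Callers who build with unwinding take on more contract.**
+  Cargo reads profiles only from the workspace root manifest, so a
+  downstream binary aborts on panic only if its own profile says
+  so; otherwise it unwinds, which is Rust's default. Callers
+  building with `panic = "unwind"` get unwind safety up
+  to the documented invariants. The
+  [`weighted-eviction.md`](weighted-eviction.md) clear-ordering
+  rule and the
+  [`concurrency.md`](concurrency.md#failure-modes) panic-safety
+  notes apply only to this mode.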
+
+The interplay matters for error model design: under `abort`, tier 1
+panics are terminal and need to be debugged at development time;
+under `unwind`, they are catchable but should still be treated as
+bugs because the cache may be in an unspecified-but-not-corrupt
+state.
+
+## What `Result` does **not** cover
+
+Three failure modes are deliberately not represented as `Result`:
+
+- **OOM in non-`try_*` constructors.** `LruCore::new(huge)` aborts
+  on allocator failure. Use `try_with_capacity` to get a `Result`
+  surface (where available).
+- **Logic errors in policy code.** Eviction picking the wrong
+  victim is a bug, not a return value. Detected (when detected) by
+  `check_invariants` or by the policy's tests.
+- **Concurrent contention.** `parking_lot::RwLock` doesn't poison,
+  doesn't time out by default, and doesn't return `Result`. A
+  contended cache blocks until it can proceed. Callers who need
+  timeouts wrap the cache themselves with a wider locking
+  discipline.
+
+## Adding a new error
+
+Checklist for a new failure mode:
+
+1. **Decide the tier.** Programming error, user-supplied input, or
+   internal invariant?
+2. **Pick or define the type.**
+   - Tier 1: use `assert!` / `debug_assert!` / `panic!`. No new
+     type needed.
+   - Tier 2: define a new type if the failure has data the caller
+     needs and no existing type fits. Otherwise reuse `ConfigError`
+     (with a clear message) or pass through `TryReserveError`.
+   - Tier 3: add a `check_invariants` method on the affected type
+     that returns `Result<(), InvariantError>`.
+3. **Co-locate.** Types specific to a trait live with the trait
+   (`StoreFull` in `src/store/traits.rs`). Types specific to a
+   primitive live with the primitive (`LazyMinHeapError`).
+   Cross-cutting types (`ConfigError`, `InvariantError`) live in
+   `src/error.rs`.
+4. **Implement `Display` and `Error`.** `?` interop with
+   `Box<dyn Error>` needs the `Error` impl, which in turn requires
+   `Display` and `Debug`. The convention is:
+   ```rust,ignore
+   impl fmt::Display for MyError { … }
+   impl std::error::Error for MyError {}
+   ```
+   `Display` writes the message; `Error` is empty unless the type
+   wraps another error (then `source` returns the inner error).
+5. **`Send + Sync + Clone`.** All existing error types satisfy this.
+   The convention is `#[derive(Debug, Clone, PartialEq, Eq, Hash)]`
+   for value types and matching impls for enums. Errors that flow
+   between threads must be `Send + Sync`; errors that get cloned
+   into snapshots / test fixtures must be `Clone`.
+
+## Compatibility with `?` and `anyhow`/`thiserror`
+
+The cachekit error types are intentionally **plain types, not
+`thiserror`-derived**, to avoid forcing a `thiserror` dependency on
+downstream users. They implement `std::error::Error` directly, so
+they work with `?`, `Box<dyn Error>`, and any error-aggregation
+crate (including `anyhow` and `thiserror::Error` in user code).
+
+A downstream `thiserror`-derived enum that includes a `#[from]
+cachekit::ConfigError` works. A downstream `anyhow::Result<_>` that
+absorbs cachekit errors via `?` works. The choice not to bundle
+either crate keeps the error layer dependency-free and gives
+downstream the standard `From` and `Display` shape they expect.
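+
+A minimal sketch of that interop using only the standard library.
+`QuotaError`, `parse_quota`, and `load` are hypothetical names for
+illustration; the point is that a plain type with hand-written
+`Display` and `Error` impls flows through `?` into `Box<dyn Error>`
+with no extra crate.
+
+```rust
+use std::error::Error;
+use std::fmt;
+
+/// Hypothetical tier-2 error written in the plain-type style described here.
+#[derive(Debug, Clone, PartialEq, Eq, Hash)]
+pub struct QuotaError(String);
+
+impl fmt::Display for QuotaError {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(&self.0)
+    }
+}
+
+// Empty impl: there is no inner error to expose through `source`.
+impl Error for QuotaError {}
+
+/// Compile-time check that a type is `Send + Sync + Clone`.
+pub fn assert_thread_safe<T: Send + Sync + Clone>() {}
+
+pub fn parse_quota(raw: &str) -> Result<u64, QuotaError> {
+    match raw.trim().parse::<u64>() {
+        Ok(0) | Err(_) => Err(QuotaError("quota must be a positive integer".into())),
+        Ok(n) => Ok(n),
+    }
+}
+
+/// `?` converts `QuotaError` into `Box<dyn Error>` via the standard `From` impl.
+pub fn load(raw: &str) -> Result<u64, Box<dyn Error>> {
+    let quota = parse_quota(raw)?;
+    Ok(quota * 2)
+}
+```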
+
+## See also
+
+- [Design overview](design.md) — §12 frames failure modes at the
+  principles level
+- [Concurrency](concurrency.md) — `parking_lot` non-poisoning,
+  atomic check-and-act, lock-acquisition failure modes
+- [Builder and runtime dispatch](builder-and-dyn-dispatch.md) —
+  panic-in-`build` validation, `try_build`-deliberately-absent
+  rationale
+- [Weighted eviction](weighted-eviction.md) — `StoreFull`'s role
+  and unwind-safety in `clear`
+- [`src/error.rs`](../../src/error.rs) — `ConfigError`,
+  `InvariantError`
+- [`src/store/traits.rs`](../../src/store/traits.rs) — `StoreFull`
+- [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs) —
+  `LazyMinHeapError`
diff --git a/docs/design/hashing.md b/docs/design/hashing.md
new file mode 100644
index 0000000..5af6b4b
--- /dev/null
+++ b/docs/design/hashing.md
@@ -0,0 +1,166 @@
+# Hashing and Key Identity
+
+> Status: design rationale for hasher choices, key interning, and hash-based
+> routing. Companion to [`concurrency.md`](concurrency.md), [`sharding.md`](sharding.md),
+> and the security notes in store/data-structure modules.
+
+cachekit uses hashing in three different roles:
+
+- Lookup indexes (`HashMapStore`, policy maps, ghost indexes).
+- Compact key identity (`KeyInterner`).
+- Shard routing (`ShardSelector`).
+
+Those roles have different threat models. Some code paths choose `FxHash` for
+speed on trusted keys; others default to `RandomState` or keyed SipHash because
+untrusted keys can create HashDoS or single-shard contention. This document
+explains those choices and when callers should override them.
+
+## The Decision Matrix
+
+| Component | Default hasher | Why | Caller override? |
+|---|---|---|---|
+| `HashMapStore` | `RandomState` | public store API, safer default | yes, `with_hasher` |
+| `ClockRing` | `RandomState` | can be keyed by user input | yes, with explicit trust acknowledgement |
+| `KeyInterner` | `FxBuildHasher` | hot internal mapping, trusted-key bias | yes, `with_hasher` |
+| `WeightStore` | `FxHashMap` | speed, large-value target | no generic hasher today |
+| Policy internals | mostly `FxHashMap` | hot metadata paths | generally no |
+| `ShardSelector` | keyed SipHash-1-3 | routing must resist shard pinning | seed or randomized constructor |
+
+The rule: **default to DoS-resistant hashing at public boundaries; use faster
+hashing inside policy metadata when keys are trusted or already admitted.**
+
+## `RandomState`: Safe Public Default
+
+`HashMapStore` and `ClockRing` default to
+`std::collections::hash_map::RandomState`. This is the right public default
+because callers often pass keys derived from request paths, tenant ids, URLs,
+or filenames. Randomized hashing prevents an attacker from precomputing many
+keys that collide in one bucket.
+
+The cost is per-hash overhead. For workloads with fully trusted keys (for
+example, dense integer ids generated by the process), callers can use
+`with_hasher` to opt into a faster hasher. That opt-in is intentionally explicit:
+the call site documents the threat-model decision.
+
+`ClockRing` goes further by using a `KeysAreTrusted` acknowledgement for faster
+non-randomized hashers. The extra marker makes the security trade visible in
+review rather than hidden in a type alias.
+
+## `FxHash`: Hot Internal Default
+
+Many policy internals use `rustc_hash::FxHashMap`:
+
+- LRU-family maps from key to node pointer / slot id.
+- LFU/MFU frequency maps.
+- 2Q / SLRU / Clock-PRO resident and ghost indexes.
+- `WeightStore`'s index.
+- `KeyInterner`'s default index.
+
+`FxHash` is fast and deterministic. It is also non-cryptographic and not
+HashDoS-resistant. The intended use is trusted, already-admitted keys where the
+hash map is not directly exposed as an unbounded public endpoint.
+
+The sharp edge is `WeightStore`: its target use case (variable-size objects
+like images, documents, blobs) often has user-derived keys. Its module docs call
+this out directly: pre-hash keys with a keyed hash or use `HashMapStore` if the
+key source is adversarial.
+
+## `KeyInterner`: Identity Compression, Not Security
+
+`KeyInterner` maps external keys to compact `u64` handles:
+
+```text
+index: HashMap<K, u64, S>     keys: Vec<K>
+"user:123" -> 0               keys[0] = "user:123"
+```
+
+The design goals:
+
+- Avoid repeated key cloning in hot paths.
+- Use compact handles in policy metadata and frequency maps.
+- Resolve a handle back to a key in O(1).
+
+Handles are **not capability tokens**. They are sequential integers. A handle
+from one interner can silently resolve to a different key in another interner,
+and handles are reused after `clear`. Callers that store handles externally
+must pair them with `generation()` and reject stale generations.
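+
+The stale-handle rule can be sketched with a simplified interner.
+`MiniInterner` and `StampedHandle` are hypothetical and much smaller
+than the real `KeyInterner`, whose API may differ; the sketch only
+shows why a handle stored externally needs its generation alongside it.
+
+```rust
+use std::collections::HashMap;
+
+/// Simplified stand-in for `KeyInterner`; the real API may differ.
+#[derive(Default)]
+pub struct MiniInterner {
+    index: HashMap<String, u64>,
+    keys: Vec<String>,
+    generation: u64,
+}
+
+/// A handle paired with the generation it was issued in.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct StampedHandle {
+    pub handle: u64,
+    pub generation: u64,
+}
+
+impl MiniInterner {
+    /// Returns the existing handle for `key`, or assigns the next sequential one.
+    pub fn intern(&mut self, key: &str) -> StampedHandle {
+        let existing = self.index.get(key).copied();
+        let handle = match existing {
+            Some(h) => h,
+            None => {
+                let h = self.keys.len() as u64;
+                self.keys.push(key.to_owned());
+                self.index.insert(key.to_owned(), h);
+                h
+            }
+        };
+        StampedHandle { handle, generation: self.generation }
+    }
+
+    /// Resolves only handles issued since the last `clear`.
+    pub fn resolve(&self, h: StampedHandle) -> Option<&str> {
+        if h.generation != self.generation {
+            return None; // stale: the integer may now name a different key
+        }
+        self.keys.get(h.handle as usize).map(String::as_str)
+    }
+
+    /// Drops every key and bumps the generation so old handles are rejected.
+    pub fn clear(&mut self) {
+        self.index.clear();
+        self.keys.clear();
+        self.generation += 1;
+    }
+}
+```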
+
+Security implications:
+
+- The default `FxBuildHasher` is for trusted input.
+- Use `with_hasher` / `with_capacity_and_hasher` with `RandomState` when keys
+  are derived from untrusted input.
+- `KeyInterner` is append-only until `clear`, so unique-key attacks can drive
+  memory growth. Use `try_intern` and your own admission bound for untrusted
+  keys.
+- `Debug` intentionally omits interned keys to avoid leaking URLs, user ids, or
+  auth material into logs.
+
+## `ShardSelector`: Hashing for Routing
+
+Shard routing has a different failure mode than lookup maps. A lookup hash
+collision slows one map; a routing collision pins the whole workload to one
+shard and defeats concurrency.
+
+`ShardSelector` therefore uses keyed SipHash-1-3:
+
+- `ShardSelector::randomized(shards)` draws key material from `RandomState`.
+  Use this for normal production sharding.
+- `ShardSelector::new(shards, seed)` is deterministic and reproducible. Treat
+  `seed` as secret key material if adversaries can influence keys.
+
+The selector reduces hash output to `[0, shards)` using fast range reduction
+rather than `%`, which avoids a division and spreads keys about as evenly as a
+modulo would. The shard count is clamped to `[1, MAX_SHARDS]` to prevent
+user-controlled configs from allocating an unbounded number of locks or vectors.
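+
+One common form of fast range reduction is the multiply-shift trick: widen the
+hash to 128 bits, multiply by the shard count, and keep the high 64 bits. The
+sketch below assumes that form and a hypothetical `MAX_SHARDS` value; neither
+is confirmed as what `ShardSelector` does internally.
+
+```rust
+/// Hypothetical upper bound; the crate's `MAX_SHARDS` value is not shown here.
+const MAX_SHARDS: usize = 1024;
+
+/// Clamps a requested shard count into `[1, MAX_SHARDS]`.
+pub fn clamp_shards(requested: usize) -> usize {
+    requested.clamp(1, MAX_SHARDS)
+}
+
+/// Maps a 64-bit hash into `[0, shards)` with a widening multiply and a shift
+/// instead of a division.
+pub fn reduce(hash: u64, shards: usize) -> usize {
+    let shards = clamp_shards(shards) as u128;
+    ((hash as u128 * shards) >> 64) as usize
+}
+```
+
+The result is the hash's position in the 64-bit range scaled down to the shard
+count, so it depends mostly on the high bits of the hash. That is fine for a
+keyed SipHash output, where all bits are well mixed.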
+
+## Custom Hasher Rules
+
+When adding a hasher parameter to a public type:
+
+1. Default to `RandomState` unless the type is clearly internal-only.
+2. Expose `with_hasher` and `try_with_hasher` if callers have legitimate
+   trusted-key fast paths.
+3. Document the threat model at the constructor, not only at module level.
+4. Never hide a non-randomized hasher behind a harmless-sounding `new`.
+5. If the hasher affects shard routing, prefer `ShardSelector` over ad hoc
+   hashing so the keyed-routing contract stays centralized.
+
+When using `FxHashMap` internally:
+
+1. Keep it behind the policy or data-structure boundary.
+2. Do not expose arbitrary insertions from untrusted users without a separate
+   capacity/admission guard.
+3. Mention the assumption in the module's security notes if keys may be user
+   controlled.
+
+## Serialization and Hash Seeds
+
+Do not serialize hash seeds or hasher state unless the type is explicitly a
+deterministic routing artifact. `ShardSelector::new(shards, seed)` is the one
+place where reproducible routing is part of the public contract. `RandomState`
+and policy-internal hash maps should be reconstructed on deserialization.
+
+Serializing raw hash-map order is also wrong. Hash-map iteration order changes
+with seeds and implementation details; serialized cache state should use stable
+semantic fields (keys, values, policy order) rather than map buckets.
+
+## Future Direction: Hasher Audit
+
+The codebase intentionally mixes `RandomState`, `FxHashMap`, and SipHash. That
+mix is valid only while every use site has a documented threat model. A useful
+future hardening pass:
+
+- List every public constructor that accepts a key type.
+- Classify whether keys are trusted, user-supplied, or mixed.
+- Ensure user-supplied defaults are randomized.
+- Add `KeysAreTrusted`-style acknowledgement to any public non-randomized path.
+
+## See Also
+
+- [Sharding](sharding.md) - shard routing and contention trade-offs
+- [Weighted eviction](weighted-eviction.md) - `WeightStore` HashDoS caveat
+- [`src/ds/interner.rs`](../../src/ds/interner.rs)
+- [`src/ds/shard.rs`](../../src/ds/shard.rs)
+- [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- [`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs)
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
new file mode 100644
index 0000000..88ace2b
--- /dev/null
+++ b/docs/design/metrics.md
@@ -0,0 +1,510 @@
+# Metrics
+
+> Status: design rationale for the metrics infrastructure under
+> [`src/metrics/`](../../src/metrics), gated by the `metrics` Cargo
+> feature. Companion to [`design.md`](design.md) §6.
+
+cachekit's metrics surface is bigger than "two counters behind a
+feature flag." It mirrors the cache trait hierarchy — recorder /
+snapshot / exporter — so each concern lives in the smallest trait
+that captures it, and policy code stays free of monitoring plumbing.
+This document explains the three-trait separation, the
+`&self`-vs-`&mut self` split, the `MetricsCell` interior-mutability
+escape hatch, the Prometheus exporter contract, and what guarantees
+counters do and do not provide.
+
+## Goals and non-goals
+
+The metrics module is shaped for:
+
+- **Lightweight in-process counters** that a policy can increment on
+  its hot path without measurable overhead when enabled.
+- **Zero overhead when disabled.** The entire `metrics` module is
+  gated by `#[cfg(feature = "metrics")]` and compiles away when the
+  feature is off.
+- **Decoupled consumption.** Tests, benchmarks, and production
+  monitoring should each consume metrics in the shape they need
+  without dragging recording concerns along.
+- **Per-policy specificity.** A Clock policy's `hand_advance` count
+  matters; a FIFO's `pop_oldest_empty_or_stale` count matters. The
+  trait surface preserves these signals rather than flattening to
+  one shape.
+
+It is **not** shaped for:
+
+- **High-cardinality labels.** Counters are flat scalars. Tag
+  dimensions (per-key, per-tenant) are out of scope.
+- **Histograms or sliding windows.** Counters and gauges only.
+  Latency distributions live in the user's monitoring stack via
+  external instrumentation.
+- **Audit-grade accounting.** Counters use `Relaxed` atomics
+  ([`src/store/weight.rs`](../../src/store/weight.rs)) and wrap on
+  overflow in release. Best-effort observability, not financial
+  ledger.
+
+## Three-trait separation
+
+```text
+                                ┌─────────────────────────────┐
+                                │     CoreMetricsRecorder     │
+                                │  record_get_hit, _miss,     │
+                                │  _insert_*, _evict_*,       │
+                                │  _clear                     │
+                                └──────────────┬──────────────┘
+                                               │ extends
+        ┌──────────┬───────────┬───────────────┼───────────┬────────────┐
+        ▼          ▼           ▼               ▼           ▼            ▼
+   FifoRec    LruRec       LfuRec          ArcRec      ClockRec    S3FifoRec
+                │                                                       …
+                ▼
+            LruKRec
+            (further extends LruRec)
+
+   Consumption (decoupled from recording):
+   ┌──────────────────────────────┐    ┌──────────────────────────────┐
+   │ MetricsSnapshotProvider<S>   │    │ MetricsExporter<S>           │
+   │ + MetricsReset               │    │ PrometheusTextExporter       │
+   │ (bench / test)               │    │ (production monitoring)      │
+   └──────────────────────────────┘    └──────────────────────────────┘
+```
+
+Three responsibilities, three trait families:
+
+- **Record.** Per-policy `*MetricsRecorder` traits live in
+  [`src/metrics/traits.rs`](../../src/metrics/traits.rs). Every
+  policy-specific recorder extends `CoreMetricsRecorder` and adds
+  policy-specific methods (`record_hand_advance` for Clock,
+  `record_b1_ghost_hit` for ARC, etc.). The policy itself calls
+  these methods on its hot path.
+- **Snapshot.** `MetricsSnapshotProvider<S>` returns a `Copy`
+  `*MetricsSnapshot` struct ([`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs)):
+  a point-in-time scalar copy of every counter. Snapshots are
+  `#[non_exhaustive]` for SemVer headroom and derive
+  `Serialize`/`Deserialize` behind the `serde` feature for
+  cross-process transport.
+- **Export.** `MetricsExporter<S>` consumes a snapshot and pushes it
+  to an external system. The shipped implementation,
+  `PrometheusTextExporter` ([`src/metrics/exporter.rs`](../../src/metrics/exporter.rs)),
+  writes Prometheus exposition format to any `W: Write + Send`.
+
+Splitting these three lets:
+
+- **Policy code stay minimal.** A policy needs only the recorder
+  trait. It does not import snapshots or exporters.
+- **Tests bypass production.** Bench harnesses use
+  `MetricsSnapshotProvider` + `MetricsReset` and never touch
+  `MetricsExporter`. Production code does the inverse.
+- **Exporters multiply without policy churn.** Adding a StatsD or
+  OpenTelemetry exporter is a new `impl MetricsExporter<S>` for the
+  snapshot types — no policy changes.
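+
+A reduced sketch of the three roles, assuming hypothetical `Mini*`
+names, simplified signatures, and made-up metric names. The real
+traits are generic over the snapshot type and carry many more
+methods; this only shows how recording, snapshotting, and exporting
+stay separate.
+
+```rust
+use std::io::{self, Write};
+
+/// Hypothetical snapshot: a flat, `Copy` block of scalars.
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+pub struct MiniSnapshot {
+    pub get_hits: u64,
+    pub get_misses: u64,
+}
+
+/// Record: the only trait policy code needs.
+pub trait MiniRecorder {
+    fn record_get_hit(&mut self);
+    fn record_get_miss(&mut self);
+}
+
+/// Snapshot: consumed by tests and benchmarks.
+pub trait MiniSnapshotProvider {
+    fn snapshot(&self) -> MiniSnapshot;
+}
+
+/// Export: consumed by production monitoring.
+pub trait MiniExporter {
+    fn export(&mut self, snapshot: &MiniSnapshot) -> io::Result<()>;
+}
+
+#[derive(Default)]
+pub struct MiniMetrics {
+    hits: u64,
+    misses: u64,
+}
+
+impl MiniRecorder for MiniMetrics {
+    fn record_get_hit(&mut self) {
+        self.hits = self.hits.wrapping_add(1);
+    }
+    fn record_get_miss(&mut self) {
+        self.misses = self.misses.wrapping_add(1);
+    }
+}
+
+impl MiniSnapshotProvider for MiniMetrics {
+    fn snapshot(&self) -> MiniSnapshot {
+        MiniSnapshot { get_hits: self.hits, get_misses: self.misses }
+    }
+}
+
+/// Writes Prometheus-style `name value` lines to any `Write` sink.
+pub struct TextExporter<W: Write>(pub W);
+
+impl<W: Write> MiniExporter for TextExporter<W> {
+    fn export(&mut self, s: &MiniSnapshot) -> io::Result<()> {
+        writeln!(self.0, "cache_get_hits {}", s.get_hits)?;
+        writeln!(self.0, "cache_get_misses {}", s.get_misses)
+    }
+}
+```
+
+The exporter depends only on the snapshot type, so a second exporter
+in this sketch would be another `impl MiniExporter` with no change
+to `MiniMetrics`.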
+
+## Per-policy recorder traits
+
+Every policy gets its own recorder trait extending
+`CoreMetricsRecorder`. The shipped set:
+
+| Trait | Adds counters for |
+|---|---|
+| `FifoMetricsRecorder` | scan steps, stale skips, `pop_oldest` calls |
+| `LruMetricsRecorder` | `pop_lru`, `peek_lru`, `touch`, `recency_rank` |
+| `LruKMetricsRecorder` | extends `LruMetricsRecorder` + K-distance counters |
+| `LfuMetricsRecorder` | `pop_lfu`, `peek_lfu`, frequency reads / mutates |
+| `MfuMetricsRecorder` | mirrors LFU for most-frequent eviction |
+| `ArcMetricsRecorder` | T1→T2 promotions, B1/B2 ghost hits, `p` movement |
+| `CarMetricsRecorder` | recent→frequent, ghost hits, hand sweeps |
+| `ClockMetricsRecorder` | hand advances, ref-bit resets |
+| `ClockProMetricsRecorder` | cold↔hot transitions, test entries |
+| `NruMetricsRecorder` | sweep steps, ref-bit resets |
+| `SlruMetricsRecorder` | probationary→protected, protected evictions |
+| `TwoQMetricsRecorder` | A1in→Am promotions, A1out ghost hits |
+| `S3FifoMetricsRecorder` | promotions, main reinserts, ghost hits |
+
+Two design principles drive the granularity:
+
+- **Each counter answers a tuning question.** "Are my LRU-K
+  promotions worth the metadata?" "Is my ARC ghost list catching
+  meaningful hits?" Generic `evictions: u64` cannot answer either.
+- **Counters live near their semantics.** `record_a1in_to_am_promotion`
+  belongs to 2Q because A1in/Am are 2Q concepts. Putting it on
+  `CoreMetricsRecorder` would force every other policy to either
+  implement a meaningless method or document a no-op.
+
+The trade is API surface: 14 recorder traits with ~5-10 methods
+each. The mitigation is that **users do not implement them**: the
+shipped `*Metrics` structs implement them, each policy exposes the
+results through inherent methods, and users read snapshots, not
+recorders.
+
+## The `&self`-vs-`&mut self` split
+
+Several `Cache<K, V>` methods take `&self`;
+[`trait-hierarchy.md`](trait-hierarchy.md#peek-vs-get--the-readmutate-split)
+explains why. The metrics system has to honour this — a `&self`
+read path cannot call a `&mut self` recorder. The shipped solution
+is a parallel `*MetricsReadRecorder` family for each policy whose
+read paths increment counters:
+
+| Mutable trait | Read-only counterpart |
+|---|---|
+| `FifoMetricsRecorder` | `FifoMetricsReadRecorder` |
+| `LruMetricsRecorder` | `LruMetricsReadRecorder` |
+| `LruKMetricsRecorder` | `LruKMetricsReadRecorder` |
+| `LfuMetricsRecorder` | `LfuMetricsReadRecorder` |
+| `MfuMetricsRecorder` | `MfuMetricsReadRecorder` |
+
+The read-only traits take `&self` on every method. They are
+implemented through interior mutability on the concrete metrics
+struct — specifically `MetricsCell`, the internal type that wraps
+`Cell<u64>` with an `unsafe impl Sync` (covered below).
+
+Two questions this design avoided:
+
+- **"Why not put `Cell<u64>` directly on the metrics struct?"**
+  Because `Cell<u64>` is `!Sync`, which propagates and prevents
+  every policy struct that embeds metrics from being `Sync`. The
+  thin `MetricsCell` wrapper makes the synchronisation discipline
+  explicit at one site instead of N.
+- **"Why not just `AtomicU64` for everything?"** Because counters
+  on `&mut self` paths (the majority — `insert`, `get`, `evict`)
+  do not need atomic semantics; the policy already holds exclusive
+  access. However, `MetricsCell` is only sound when `&self` metric
+  increments are protected by exclusive synchronization or are known
+  to be single-threaded. It is **not** a substitute for atomics under
+  shared `RwLock::read` access.
+
+## `MetricsCell`: interior mutability under external lock
+
+```rust,ignore
+#[repr(transparent)]
+#[derive(Debug, Default, Clone, PartialEq, Eq)]
+pub(crate) struct MetricsCell(Cell<u64>);
+
+unsafe impl Sync for MetricsCell {}
+unsafe impl Send for MetricsCell {}
+```
+
+This is the only `unsafe impl Sync` in the metrics surface, so its
+contract must be narrow:
+
+- **Exclusive external synchronization is required.** A shared
+  `RwLock::read` guard does **not** serialize readers, so it is not
+  sufficient protection for `Cell<u64>`. `MetricsCell` may be used
+  on single-threaded policy paths, or behind a write lock / mutex,
+  but not for counters mutated concurrently through read-locked
+  `&self` methods.
+- **Observation-only does not relax Rust's aliasing rules.** It is
+  acceptable for metrics to be approximate; it is not acceptable for
+  approximation to be implemented as unsynchronized `Cell` mutation.
+  Concurrent read-path counters must use `AtomicU64`, take an
+  exclusive lock, or be disabled for that path.
+- **`pub(crate)`.** The type does not escape the crate.
+  Down-stream code can read counters through the snapshot API but
+  cannot construct `MetricsCell` itself, which prevents misuse from
+  outside the codebase.
+
+The alternatives considered and rejected:
+
+- `Mutex<u64>` — cost dominates the counter increment.
+- `AtomicU64` — the correct choice for counters that can be
+  incremented concurrently through shared references; unnecessary
+  for single-threaded or exclusively locked counters.
+- `RefCell<u64>` — runtime borrow checking with panic on contention;
+  not desirable on a metrics increment path.
+
+`MetricsCell` is the smallest tool for single-threaded or exclusively
+locked metric counters. Any policy or wrapper that records metrics
+from a read-locked path must not rely on `MetricsCell` for soundness.
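+
+The contract above can be illustrated with a minimal sketch using only the
+standard library. `LocalCounter`, `SharedCounter`, and `hammer` are
+hypothetical names chosen for illustration, not cachekit types; the sketch
+assumes nothing about how `MetricsCell` is implemented beyond the snippet
+shown earlier.
+
+```rust
+use std::cell::Cell;
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::thread;
+
+/// Hypothetical counter for single-threaded or exclusively locked paths.
+/// It is deliberately left `!Sync`, so the compiler rejects shared use.
+#[derive(Debug, Default)]
+pub struct LocalCounter(Cell<u64>);
+
+impl LocalCounter {
+    pub fn incr(&self) {
+        // Non-atomic read-modify-write: sound only without concurrent callers.
+        self.0.set(self.0.get().wrapping_add(1));
+    }
+
+    pub fn get(&self) -> u64 {
+        self.0.get()
+    }
+}
+
+/// Hypothetical counter for `&self` paths reached under a shared read lock.
+#[derive(Debug, Default)]
+pub struct SharedCounter(AtomicU64);
+
+impl SharedCounter {
+    pub fn incr(&self) {
+        // Relaxed is enough for a statistic that orders no other memory.
+        self.0.fetch_add(1, Ordering::Relaxed);
+    }
+
+    pub fn get(&self) -> u64 {
+        self.0.load(Ordering::Relaxed)
+    }
+}
+
+/// Increments `counter` from `threads` scoped threads and returns the total.
+pub fn hammer(counter: &SharedCounter, threads: usize, per_thread: u64) -> u64 {
+    thread::scope(|s| {
+        for _ in 0..threads {
+            s.spawn(|| {
+                for _ in 0..per_thread {
+                    counter.incr();
+                }
+            });
+        }
+    });
+    // All spawned threads are joined when the scope ends.
+    counter.get()
+}
+```
+
+Passing a `&LocalCounter` to `hammer`-style code does not compile, which is
+the protection that an `unsafe impl Sync` gives up. That is why the
+`MetricsCell` contract has to be enforced by review instead of by the type
+system.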
+
+## Snapshots: cheap, copyable, optionally serializable
+
+Every snapshot struct in [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs)
+follows the same shape:
+
+```rust,ignore
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+#[non_exhaustive]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+pub struct LruMetricsSnapshot {
+    pub get_calls: u64,
+    pub get_hits: u64,
+    pub get_misses: u64,
+    pub insert_calls: u64,
+    pub insert_updates: u64,
+    pub insert_new: u64,
+    pub evict_calls: u64,
+    pub evicted_entries: u64,
+    pub pop_lru_calls: u64,
+    pub pop_lru_found: u64,
+    pub peek_lru_calls: u64,
+    pub peek_lru_found: u64,
+    pub touch_calls: u64,
+    pub touch_found: u64,
+    pub recency_rank_calls: u64,
+    pub recency_rank_found: u64,
+    pub recency_rank_scan_steps: u64,
+    pub cache_len: usize,
+    pub insertion_order_len: usize,
+    pub capacity: usize,
+}
+```
+
+Five intentional properties:
+
+- **`Copy`.** A snapshot is a flat block of `u64`s and `usize`s.
+  Copying is a `memcpy` and snapshots can flow through channels,
+  futures, and test assertions without ceremony.
+- **`Default`.** Equivalent to "no operations recorded." Useful for
+  test fixtures and explicit reset comparisons.
+- **`#[non_exhaustive]`.** Adding a new counter (e.g. when a
+  policy variant gains a new internal step) is a minor version
+  bump. Downstream code cannot build a snapshot with a struct
+  literal and must use `..` when destructuring one, so new fields
+  do not break it.
+- **`PartialEq + Eq`.** Snapshot equality is well-defined and
+  useful in tests. Two snapshots compare equal iff every counter
+  matches.
+- **Optionally `serde`.** Gated on `serde`, not unconditional, so
+  the metrics module doesn't drag serde into builds that don't
+  want it.
+
+Gauges (`cache_len`, `insertion_order_len`, `capacity`) live
+alongside counters in the same struct and are captured in the same
+snapshot. The Prometheus exporter writes the matching `# TYPE` line
+(`counter` or `gauge`) for each field, which scrapers rely on to
+interpret the values.
+
+## Recording is push, consumption is pull
+
+Two operating models coexist:
+
+- **Recording is push from the policy.** The policy calls
+  `m.record_get_hit()` directly. The recorder method has the
+  cheapest possible body (one `+= 1`). This is the hot-path
+  contract.
+- **Consumption is pull from the consumer.** Tests / benches /
+  exporters call `m.snapshot()` whenever they want a value, and
+  `MetricsReset::reset_metrics(&self)` when they want to clear.
+  Nothing about the policy timing depends on consumption.
+
+Specifically, the policy does **not** push to the exporter. There
+is no observer-pattern hook from the recorder to the exporter, no
+synchronous flush on every increment, and no async channel between
+them. The pull model lets benches consume at known checkpoints
+(once per iteration), and lets production scrapers poll on their
+own cadence (every 10 s, every minute, etc.).
+
+The cost of the pull model is that an exporter cannot react to a
+specific event (e.g. "evictions spiked above N"). cachekit users
+who need event-driven reactions instrument at the application
+layer, not the metrics layer.
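+
+A pull-based consumer typically works on the difference between two
+checkpoints. The following is a minimal sketch under stated assumptions:
+`Snap` is a hypothetical two-field stand-in for a real snapshot type, and
+`delta` and `hit_ratio` are illustrative helpers, not cachekit APIs.
+
+```rust
+/// Hypothetical two-field stand-in for a snapshot such as `LruMetricsSnapshot`.
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+pub struct Snap {
+    pub get_hits: u64,
+    pub get_misses: u64,
+}
+
+/// Counter movement between two checkpoints. `wrapping_sub` keeps the
+/// delta correct even if a counter wrapped once between them.
+pub fn delta(prev: Snap, cur: Snap) -> Snap {
+    Snap {
+        get_hits: cur.get_hits.wrapping_sub(prev.get_hits),
+        get_misses: cur.get_misses.wrapping_sub(prev.get_misses),
+    }
+}
+
+/// Hit ratio over an interval, or `None` when no lookups were recorded.
+pub fn hit_ratio(s: Snap) -> Option<f64> {
+    // Widen before adding so the sum itself cannot overflow.
+    let total = s.get_hits as u128 + s.get_misses as u128;
+    if total == 0 {
+        None
+    } else {
+        Some(s.get_hits as f64 / total as f64)
+    }
+}
+```
+
+Because snapshots are `Copy`, a bench can keep the previous checkpoint by
+value and compute the delta once per iteration without touching the policy.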
+
+## Prometheus text exporter
+
+The shipped exporter (`PrometheusTextExporter` in
+[`src/metrics/exporter.rs`](../../src/metrics/exporter.rs)) writes
+the Prometheus text exposition format to any `W: Write + Send`:
+
+```rust,ignore
+let exporter = PrometheusTextExporter::new("myapp_cache", io::stdout());
+let snapshot = lru_cache.snapshot();
+exporter.export(&snapshot);
+```
+
+Three design choices worth naming:
+
+- **Per-prefix instance.** The prefix (`myapp_cache`) is set at
+  construction, not per call. This keeps the call site simple and
+  enforces a single metric namespace per exporter instance.
+- **I/O errors are silently dropped.** A failing write does not
+  panic the cache or surface a `Result`. The contract is
+  "fire-and-forget monitoring" — a transient `EPIPE` to a metrics
+  socket must not interrupt cache operations. Callers who need
+  guaranteed delivery should wrap their writer in something with
+  retry semantics and accept the cost.
+- **The writer is `Mutex<W>`, not `RwLock<W>`.** Writing is
+  always exclusive; there's no read path, so `Mutex` is the right
+  primitive even though most of cachekit uses
+  `parking_lot::RwLock`. Note that this is `std::sync::Mutex`,
+  which tracks poisoning, and `export` panics if the mutex is
+  poisoned. The divergence from `parking_lot` is deliberate: the
+  exporter is on the cold path, where the std mutex's poisoning
+  behaviour is acceptable.
+
+Other exporters (StatsD, OpenTelemetry, custom) plug in by
+implementing `MetricsExporter<S>` for each snapshot type they
+care about. No changes elsewhere in the crate are required.
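+
+For readers unfamiliar with the text exposition format, the sketch below
+shows the shape of the output for one counter and one gauge. It is an
+illustration under stated assumptions, not the code in
+`src/metrics/exporter.rs`: `Kind`, `write_metric`, and `render` are
+hypothetical names, and the metric naming scheme (`{prefix}_{field}`) is
+assumed.
+
+```rust
+use std::io::Write;
+
+/// Prometheus metric kinds covered by this sketch.
+#[derive(Debug, Clone, Copy)]
+pub enum Kind {
+    Counter,
+    Gauge,
+}
+
+/// Writes one sample preceded by its `# TYPE` line. Write errors are
+/// discarded on purpose, mirroring the fire-and-forget contract.
+pub fn write_metric<W: Write>(w: &mut W, prefix: &str, name: &str, kind: Kind, value: u64) {
+    let ty = match kind {
+        Kind::Counter => "counter",
+        Kind::Gauge => "gauge",
+    };
+    let _ = writeln!(w, "# TYPE {prefix}_{name} {ty}");
+    let _ = writeln!(w, "{prefix}_{name} {value}");
+}
+
+/// Renders two fields of a hypothetical snapshot into a `String`.
+pub fn render(prefix: &str, get_hits: u64, cache_len: usize) -> String {
+    let mut buf = Vec::new();
+    write_metric(&mut buf, prefix, "get_hits", Kind::Counter, get_hits);
+    write_metric(&mut buf, prefix, "cache_len", Kind::Gauge, cache_len as u64);
+    String::from_utf8(buf).expect("formatted text is valid UTF-8")
+}
+```
+
+Writing into a `Vec<u8>` first, as `render` does, is also a convenient way
+to test an exporter without a socket.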
+
+## Feature gating: all-or-nothing at compile time
+
+The entire metrics subsystem is gated on the `metrics` Cargo
+feature:
+
+```rust
+// src/lib.rs
+#[cfg(feature = "metrics")]
+pub mod metrics;
+```
+
+Inside each policy, recorder calls are wrapped:
+
+```rust,ignore
+#[cfg(feature = "metrics")]
+self.metrics.record_get_hit();
+```
+
+When `metrics` is **off**:
+
+- The entire `metrics` module disappears from the build.
+- Every `record_*` call site becomes a no-op (the `#[cfg]` block
+  compiles away).
+- Snapshot types are not in the public API.
+- Build time drops; binary size drops; no runtime cost.
+
+When `metrics` is **on**:
+
+- Recording costs one `u64 += 1` per call (or one `Cell` get/set
+  pair for counters on `&self` paths). For a 17-policy `DynCache`
+  that records on every `get` / `insert`, the overhead is
+  sub-nanosecond per operation and shows up in benches as a small
+  constant offset, not as a workload-dependent slowdown.
+- The `metrics::snapshot` and `metrics::exporter` modules are in
+  the public API and exporting infrastructure is available.
+
+The trade-off is deliberate. No "low-cardinality always-on,
+detailed-on-demand" two-tier scheme exists — every counter is
+either always present (feature on) or absent (feature off). The
+discipline that keeps "always present" cheap is the recorder
+contract: methods do no work beyond incrementing a counter.
+
+## What about `StoreMetrics`?
+
+`StoreMetrics` ([`src/store/traits.rs`](../../src/store/traits.rs))
+is a **separate**, simpler structure that ships unconditionally
+(not behind `metrics`). It carries the universal counters every
+store-layer implementation tracks:
+
+```rust,ignore
+pub struct StoreMetrics {
+    pub hits: u64,
+    pub misses: u64,
+    pub inserts: u64,
+    pub updates: u64,
+    pub removes: u64,
+    pub evictions: u64,
+}
+```
+
+The two systems coexist:
+
+- `StoreMetrics` is the store-layer baseline. Always present, always
+  cheap, six counters.
+- `src/metrics/` (feature-gated) is the policy-layer detailed
+  metrics — recorder traits, snapshots, exporter, per-policy signals.
+
+A store typically backs `StoreMetrics` with `AtomicU64` counters
+(see `StoreCounters` in [`src/store/weight.rs`](../../src/store/weight.rs)),
+because stores are often behind concurrent wrappers and the
+increment paths can be `&self`. The split mirrors the
+sequential-vs-concurrent split at the trait level
+([`concurrency.md`](concurrency.md)).
+
+## Counter discipline
+
+Three rules every recorder method follows:
+
+1. **No allocation.** Counter increments are O(1) and allocation-free.
+2. **No fallible operations.** A recorder has no error path: `+=`
+   on a `u64` cannot fail short of overflow, and overflow is left
+   unhandled because a `u64` counter takes centuries to wrap even
+   at a billion events per second.
+3. **No conditional logic beyond the counter itself.** A recorder
+   method that branches on cache state belongs in the policy, not
+   in metrics.
+
+The corollary: a policy that wants a derived counter ("number of
+evictions where the victim's recency rank was > 10") computes the
+condition itself and calls one of two existing methods accordingly.
+Putting the branching inside the recorder would couple metrics to
+policy state.
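+
+The corollary can be sketched as follows. `EvictionMetrics`, its two
+recorder methods, and `on_evict` are hypothetical names used only for
+illustration; the threshold of 10 comes from the example in the text.
+
+```rust
+/// Hypothetical metrics struct with two plain counters.
+#[derive(Debug, Default)]
+pub struct EvictionMetrics {
+    pub shallow_evictions: u64,
+    pub deep_evictions: u64,
+}
+
+impl EvictionMetrics {
+    // Recorder bodies: one increment each, no branching, no allocation.
+    pub fn record_shallow_eviction(&mut self) {
+        self.shallow_evictions += 1;
+    }
+
+    pub fn record_deep_eviction(&mut self) {
+        self.deep_evictions += 1;
+    }
+}
+
+/// Policy-side hook: the policy owns the condition and picks the recorder.
+pub fn on_evict(metrics: &mut EvictionMetrics, victim_rank: usize) {
+    if victim_rank > 10 {
+        metrics.record_deep_eviction();
+    } else {
+        metrics.record_shallow_eviction();
+    }
+}
+```
+
+The recorder never sees `victim_rank`, so changing the threshold or the
+ranking scheme is a policy change that leaves the metrics types untouched.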
+
+## Adding a new metric
+
+Checklist for adding a per-policy counter:
+
+1. **Add the field.** Plain `u64` if it's updated on `&mut self`
+   paths; `MetricsCell` if it's updated on `&self` paths that are
+   single-threaded or exclusively locked; `AtomicU64` if the `&self`
+   path can run under a shared read lock. Place it in the
+   corresponding `*Metrics` struct under
+   [`src/metrics/metrics_impl.rs`](../../src/metrics/metrics_impl.rs).
+2. **Add the recorder method.** On the relevant `*MetricsRecorder`
+   trait (or its `*ReadRecorder` counterpart for `&self`).
+3. **Implement on the policy's metrics struct.** One-line
+   `+= 1` body.
+4. **Wire the call site in the policy.** Wrap with
+   `#[cfg(feature = "metrics")]`.
+5. **Add the field to the snapshot.** In
+   [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs). The
+   snapshot's `From<&*Metrics>` (or equivalent) needs the new
+   field.
+6. **Update the exporter.** Add a `write_counter` /
+   `write_gauge` call in `PrometheusTextExporter::export` for the
+   new field.
+
+Six locations is a lot of friction for a new counter. The friction
+is intentional — adding a counter is rarely the right answer to a
+debugging question, and the friction encourages reuse of existing
+counters where possible.
+
+## Adding a new metric **type** (gauge vs counter, histogram)
+
+Histograms and sliding windows are deliberately out of scope. Adding
+either is a wider design change:
+
+- The recorder traits assume `u64 += 1` semantics. A histogram
+  needs `observe(value)` semantics and an aggregation strategy.
+- The snapshot types assume flat `Copy` structs of `u64` and `usize`
+  fields. A histogram snapshot needs bucket arrays.
+- The Prometheus exporter writes counters and gauges only.
+
+If histograms become needed (the most likely use case is latency
+distribution per policy), the design has space: introduce a
+`HistogramRecorder` trait alongside `CoreMetricsRecorder` and a
+matching `HistogramSnapshot`. The existing exporter stays counter-
+and-gauge-only; a new `PrometheusHistogramExporter` handles the
+new shape. The current omission is a coverage decision, not a
+foundation problem.
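+
+To make the shape of that extension concrete, here is one possible sketch.
+Nothing in it exists in cachekit today: the trait and snapshot names follow
+the ones proposed above, while `BUCKETS`, `LatencyHistogram`, and the
+fixed-bucket layout are assumptions made for illustration.
+
+```rust
+/// Assumed inclusive upper bounds, e.g. latency in nanoseconds.
+pub const BUCKETS: [u64; 4] = [100, 1_000, 10_000, u64::MAX];
+
+/// A fixed-size bucket array keeps the snapshot `Copy`.
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+pub struct HistogramSnapshot {
+    pub counts: [u64; 4],
+    pub sum: u64,
+}
+
+/// Hypothetical recorder trait with `observe(value)` semantics.
+pub trait HistogramRecorder {
+    fn observe(&mut self, value: u64);
+}
+
+#[derive(Debug, Default)]
+pub struct LatencyHistogram {
+    counts: [u64; 4],
+    sum: u64,
+}
+
+impl HistogramRecorder for LatencyHistogram {
+    fn observe(&mut self, value: u64) {
+        // The last bound is `u64::MAX`, so a bucket is always found;
+        // the fallback only guards against a future edit to `BUCKETS`.
+        let idx = BUCKETS
+            .iter()
+            .position(|&upper| value <= upper)
+            .unwrap_or(BUCKETS.len() - 1);
+        self.counts[idx] += 1;
+        self.sum = self.sum.wrapping_add(value);
+    }
+}
+
+impl LatencyHistogram {
+    pub fn snapshot(&self) -> HistogramSnapshot {
+        HistogramSnapshot { counts: self.counts, sum: self.sum }
+    }
+}
+```
+
+Note that `observe` does a bucket search, so it is more expensive than a
+counter increment. That cost difference is one reason to keep histogram
+recording on a separate trait.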
+
+## Guarantees and non-guarantees
+
+What the metrics system guarantees:
+
+- **Read-your-writes on a single thread.** Every event recorded on
+  a thread is visible to the next `snapshot()` taken on that thread.
+- **Snapshot atomicity per counter.** A snapshot reads each
+  counter as a single load; no torn `u64` reads on 64-bit
+  platforms.
+- **No cache correctness impact.** Recording never blocks, panics,
+  or alters cache state. The one panic in the subsystem is
+  `PrometheusTextExporter::export` on a poisoned mutex, which is
+  outside the cache's own operations.
+
+What it does **not** guarantee:
+
+- **Cross-counter snapshot consistency.** A snapshot reads counters
+  sequentially. A reader can observe `hits = 100, misses = 99`
+  while a concurrent writer is mid-update; the next snapshot may
+  show `hits = 100, misses = 101`. There is no "snapshot epoch."
+- **Concurrent `MetricsCell` recording.** `MetricsCell` must not be
+  incremented from multiple read-locked callers. Shared read locks do
+  not serialize readers, so those paths must use atomics or acquire an
+  exclusive lock before recording. Metrics may be best-effort, but
+  the implementation still has to be data-race-free.
+- **Wrap-safe arithmetic in release.** Release profile sets
+  `overflow-checks = false`. Counters wrap silently. At one billion
+  events per second, `u64` wraps in ~585 years — practically a
+  non-issue, formally not a guarantee.
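+
+The wrap estimate is simple arithmetic and can be checked with a short
+sketch. `years_until_wrap` is a hypothetical helper, and the year length
+(365.25 days) is an assumption of the sketch.
+
+```rust
+/// Seconds in a 365.25-day year.
+const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
+
+/// Years until a `u64` counter wraps at a constant event rate.
+pub fn years_until_wrap(events_per_sec: f64) -> f64 {
+    (u64::MAX as f64) / events_per_sec / SECONDS_PER_YEAR
+}
+```
+
+At `1e9` events per second the result falls between 584 and 585 years,
+which matches the figure quoted above.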
+
+## See also
+
+- [Design overview](design.md) — §6 frames metrics at the
+  principles level
+- [Cache trait hierarchy](trait-hierarchy.md) — `&self` / `&mut self`
+  split that drives the read-vs-mutate recorder fork
+- [Concurrency](concurrency.md) — read/write lock model that
+  constrains where `MetricsCell` may be used
+- [Error model](error-model.md) — panic discipline shared by the
+  exporter's poisoning behaviour
+- [`src/metrics/`](../../src/metrics) — the canonical implementation
+- [`src/store/traits.rs`](../../src/store/traits.rs) —
+  `StoreMetrics`, the unconditional store-layer counterpart
diff --git a/docs/design/non-goals.md b/docs/design/non-goals.md
new file mode 100644
index 0000000..369e28c
--- /dev/null
+++ b/docs/design/non-goals.md
@@ -0,0 +1,166 @@
+# Non-Goals
+
+> Status: explicit boundaries for cachekit's design. Companion to
+> [`design.md`](design.md), which states what the crate optimizes for.
+
+Good design needs negative space. This document records what cachekit is **not**
+trying to be, so future features can be judged against the same boundaries.
+
+## Not a Distributed Cache
+
+cachekit is an in-process cache library. It does not provide:
+
+- network protocols;
+- replication;
+- cluster membership;
+- consistent hashing across processes;
+- cross-node invalidation;
+- persistence guarantees.
+
+Use Redis, Memcached, or a database/cache service when those are the problem.
+cachekit can still be useful inside a node in front of those systems.
+
+## Not a Full Application Cache Framework
+
+cachekit does not manage:
+
+- request coalescing / singleflight;
+- background refresh;
+- cache stampede suppression;
+- application-specific invalidation rules;
+- loader functions or read-through APIs as the primary abstraction.
+
+The library provides cache primitives and policies. Application frameworks can
+compose them into higher-level behaviours.
+
+## Not Async-Native Today
+
+`AsyncCacheFuture` exists as a placeholder, but the shipped policies are
+synchronous. Async-native traits are not currently implemented.
+
+The reason is not that async is unimportant. It is that async cache APIs need
+owned values, cancellation semantics, loader lifetime rules, and executor
+integration. Adding `async fn get_or_insert_with` to the core trait would break
+object safety and pull async choices into every policy.
+
+Future async support should be a separate layer, not a mutation of
+`Cache<K, V>`.
+
+## Not `no_std`
+
+cachekit uses:
+
+- `std::collections`;
+- `std::sync::Arc`;
+- `std::time` in planned TTL work;
+- `parking_lot` for concurrent wrappers;
+- benchmark and metrics tooling built around `std`.
+
+`no_std` would require a different allocator story, different synchronization
+surface, and feature-gated alternatives for large parts of the crate. It is not
+a current target.
+
+## Not Lock-Free
+
+The concurrency design is explicit and lock-based:
+
+- `Concurrent*` wrappers use `parking_lot::RwLock`;
+- sharded structures use one lock per shard;
+- future lock-free reads are a research direction, not current design.
+
+Lock-free structures would need a separate memory reclamation strategy,
+different value ownership rules, and a much larger unsafe surface. The current
+crate favours predictable, reviewable lock boundaries.
+
+## Not a HashDoS Firewall
+
+Some public surfaces use DoS-resistant hashing by default (`HashMapStore`,
+`ClockRing`, `ShardSelector::randomized`). Other hot internal surfaces use
+`FxHashMap` for speed.
+
+cachekit documents those choices, but it is not a general-purpose security
+boundary. Callers with adversarial keys must choose safe constructors, bound
+admission, and avoid exposing interned handles or `total_weight` across trust
+boundaries.
+
+## Not a Serialization Format for Live Caches
+
+The `serde` feature supports metrics snapshots and `StoreMetrics`, not live
+cache state. Serializing a policy means deciding what to do with recency lists,
+ghost history, clock hands, hash seeds, `Arc<V>` identity, and TTL deadlines.
+
+Until a policy has an explicit restore contract, do not derive serde for it.
+
+## Not a General Metrics Platform
+
+The metrics layer provides counters, gauges, snapshots, reset, and a Prometheus
+text exporter. It does not provide:
+
+- high-cardinality labels;
+- histograms;
+- sampling;
+- streaming events;
+- tracing spans;
+- alerting.
+
+Use your monitoring stack for those. cachekit exposes enough counters to make
+policy tuning possible without making the cache own observability.
+
+## Not a Policy Research Playground at the Cost of Hot Paths
+
+New policies are welcome, but they must fit the crate's constraints:
+
+- no per-operation allocation in hot paths;
+- predictable eviction cost;
+- feature-gated implementation;
+- docs and benchmarks;
+- clear workload motivation.
+
+A clever algorithm that needs tree walks, heap allocation on every access, or
+opaque trait-object dispatch in the hot loop belongs in a research branch until
+benchmarks justify it.
+
+## Not a Replacement for Workload Analysis
+
+cachekit ships many policies, but it cannot choose your workload for you.
+`CachePolicy::Lru` and `CachePolicy::S3Fifo` are defaults, not guarantees. Users
+still need to measure reuse distance, scan rate, write ratio, object sizes, and
+tail latency under representative traffic.
+
+The benchmark suite provides workload generators to help, but it cannot infer
+production behaviour automatically.
+
+## Not a Stability Promise for Internal Layout
+
+Public traits and documented constructors follow SemVer. Internal layout does
+not:
+
+- slot ids;
+- intrusive-list node fields;
+- heap tombstone representation;
+- ghost-list internals;
+- metric recorder implementation details;
+- `DynCache`'s private `CacheInner` enum.
+
+If downstream code depends on private layout, it is outside the compatibility
+contract.
+
+## How To Use This Doc
+
+When proposing a feature, ask:
+
+1. Does it violate one of these non-goals?
+2. If yes, is it a new layer that keeps the core intact?
+3. Can it be feature-gated so users who do not need it pay nothing?
+4. Does it preserve hot-path constraints?
+5. Does it belong in cachekit, or in an application/framework crate above it?
+
+If the answer is unclear, write a design doc before implementation.
+
+## See Also
+
+- [Design overview](design.md)
+- [Concurrency](concurrency.md)
+- [Serialization](serialization.md)
+- [Metrics](metrics.md)
+- [Benchmarking](benchmarking.md)
diff --git a/docs/design/serialization.md b/docs/design/serialization.md
new file mode 100644
index 0000000..a6fc80f
--- /dev/null
+++ b/docs/design/serialization.md
@@ -0,0 +1,195 @@
+# Serialization
+
+> Status: design rationale for the current `serde` feature and the boundaries
+> around future cache-state persistence. Companion to [`metrics.md`](metrics.md),
+> [`ttl.md`](ttl.md), and [`builder-and-dyn-dispatch.md`](builder-and-dyn-dispatch.md).
+
+cachekit has a narrow serialization surface today. The `serde` feature derives
+`Serialize` / `Deserialize` for metrics snapshots and `StoreMetrics`; it does
+**not** serialize cache contents, policy metadata, hash-map state, locks, or
+builder dispatchers.
+
+That boundary is intentional. Metrics are stable observations. Cache state is
+live data with policy invariants, hash seeds, pointer-like handles, and optional
+time semantics.
+
+## Current Surface
+
+With `features = ["serde"]`, these public value types derive serde:
+
+- `StoreMetrics` in [`src/store/traits.rs`](../../src/store/traits.rs).
+- Every metrics snapshot in [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs).
+
+Properties:
+
+- They are flat value types (`u64`, `usize`, optional nested stats).
+- They are `#[non_exhaustive]`, so new fields are SemVer-compatible at the Rust
+  API level but still require schema discipline for serialized consumers.
+- They carry observations, not live handles into cache internals.
+
+No policy type implements serde today. No store type serializes entries today.
+
+## Why Metrics Are Safe To Serialize
+
+Metrics snapshots are point-in-time copies:
+
+```rust,ignore
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+#[non_exhaustive]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+pub struct LruMetricsSnapshot {
+    pub get_calls: u64,
+    pub get_hits: u64,
+    // ...
+}
+```
+
+Serializing a snapshot cannot corrupt a cache on restore because there is no
+restore into a running policy. At most, a downstream dashboard sees old or
+partial counters. That matches the metrics contract: best-effort observability.
+
+## Why Cache State Is Not Serialized
+
+Serializing a cache is not just "serialize a map." A policy may contain:
+
+- Intrusive list pointers or slot ids.
+- Ghost-list history.
+- Clock hand position and reference bits.
+- ARC/CAR adaptive target parameters.
+- Lazy heap tombstones.
+- Hash seeds and randomized map order.
+- `Arc<V>` sharing state.
+- TTL deadlines based on monotonic time.
+
+Restoring only keys and values discards policy warm state. Restoring every
+internal field exposes private representation and risks accepting corrupted
+state from disk.
+
+The default position: **do not serialize policy internals until there is a
+specific restore contract for that policy.**
+
+## Two Possible Future Modes
+
+If cache-state serialization lands later, it should choose one of two modes per
+type.
+
+### Data-only restore
+
+Serialize only entries (`K`, `V`) plus capacity/config. On restore, rebuild the
+policy as if entries were inserted in serialized order.
+
+Pros:
+
+- Simple and robust.
+- No private invariants exposed.
+- Cross-version friendly.
+
+Cons:
+
+- Loses recency/frequency/ghost history.
+- Warm cache may behave cold after restore.
+- Restore order becomes a semantic choice.
+
+### Warm-state restore
+
+Serialize policy metadata too: list order, frequency counters, clock hand,
+ghost lists, ARC target, etc.
+
+Pros:
+
+- Better post-restore hit rate.
+- Useful for long-lived caches that restart often.
+
+Cons:
+
+- Representation becomes part of the serialization contract.
+- Every restore must validate invariants.
+- Version migration becomes policy-specific.
+
+Warm-state restore should be opt-in per policy, not a blanket derive.
+
+## TTL and Time
+
+TTL is the hardest serialization case because monotonic ticks are not portable
+across process restarts. The TTL design doc recommends serializing **relative
+remaining duration**, not raw `Instant`-derived ticks.
+
+Rules for future TTL serialization:
+
+- Never serialize raw monotonic `Tick` as if it were wall time.
+- Capture remaining duration at serialization time.
+- Restore by adding remaining duration to the new process clock.
+- Expired-at-serialization entries should either be omitted or restored as
+  expired and immediately purged. Prefer omission for data-only restore.
+- Wall-clock deadlines require a separate API and explicit drift semantics.
+
+This keeps `Clock` pluggable and avoids replaying meaningless old monotonic
+values.
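+
+The rules above can be sketched with `std::time::Instant` standing in for
+the crate's pluggable `Clock` and `Tick` types, which this document does not
+define. `remaining_at_save` and `deadline_at_restore` are hypothetical
+helpers, not cachekit APIs.
+
+```rust
+use std::time::{Duration, Instant};
+
+/// Remaining lifetime to persist, or `None` if the entry has already
+/// expired and should be omitted from the serialized form.
+pub fn remaining_at_save(deadline: Instant, now: Instant) -> Option<Duration> {
+    deadline
+        .checked_duration_since(now)
+        .filter(|left| !left.is_zero())
+}
+
+/// Rebuilds a deadline against the restoring process's own clock.
+/// Returns `None` if the addition would overflow the platform's `Instant`.
+pub fn deadline_at_restore(remaining: Duration, now: Instant) -> Option<Instant> {
+    now.checked_add(remaining)
+}
+```
+
+Only the `Duration` crosses the process boundary. The time an artifact
+spends on disk is not subtracted in this sketch; accounting for it would
+need a wall-clock timestamp and the drift semantics mentioned above.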
+
+## Hash Seeds and Map Order
+
+Do not serialize:
+
+- `RandomState` seeds.
+- `ShardSelector::randomized` key material.
+- Hash-map bucket order.
+- Internal `FxHashMap` iteration order.
+
+Serialize semantic data only: keys, values, capacity, policy config, and, if
+warm restore is explicitly chosen, policy metadata in a stable schema.
+
+`ShardSelector::new(shards, seed)` is the exception because deterministic
+routing is its public contract. If a type exposes deterministic sharding as
+part of serialized config, the seed is config data and must be treated as
+secret if keys are attacker-controlled.
+
+## `Arc<V>` and Sharing
+
+Several policies and stores use `Arc<V>`. Serialization should treat `Arc<V>`
+as `V`, not as identity-preserving shared ownership:
+
+- Do not attempt to preserve `Arc::ptr_eq` relationships.
+- Do not serialize refcounts.
+- Do not serialize weak references.
+
+If multiple keys point at the same `Arc<V>`, data-only serialization will
+duplicate the value unless the caller provides a higher-level interning scheme.
+That is acceptable; cachekit should not infer value identity.
+
+## Schema Discipline
+
+For serialized artifacts controlled by cachekit (benchmark JSON, metrics
+snapshots), use explicit schema rules:
+
+- Additive optional fields are minor schema changes.
+- Removing or renaming required fields is a major schema change.
+- Stable identifiers should be constants, not string literals.
+- Include enough metadata for interpretation: version, feature set where
+  relevant, timestamp, and config.
+
+For serde-derived Rust structs, `#[non_exhaustive]` is not enough for external
+JSON compatibility. A downstream JSON consumer still sees fields. If stable
+wire compatibility matters, introduce an explicit versioned artifact type
+rather than serializing internal structs directly.
+
+## What Not To Derive
+
+Do not add `#[derive(Serialize, Deserialize)]` to a policy type just because it
+compiles. Check:
+
+- Does the serialized form expose private pointers, slot ids, or tombstones?
+- Can deserialization validate every invariant?
+- What happens if the target version has different metadata layout?
+- Are hash seeds or time ticks being persisted accidentally?
+- Does restoring this type produce a live, safe cache or only a bag of entries?
+
+If the answer is not clear, add a separate DTO (`SerializableLruCache`) and a
+fallible `try_from` restore path.
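+
+A minimal sketch of that DTO pattern follows. It leaves out the serde
+derives so that it compiles with the standard library alone, and everything
+except the `SerializableLruCache` name is an assumption made for
+illustration: `ToyLru` stands in for a real policy type, and the three
+checks are examples of invariants a restore path might validate.
+
+```rust
+use std::collections::HashSet;
+use std::convert::TryFrom;
+
+/// Data-only DTO: semantic data only, no policy internals.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct SerializableLruCache {
+    pub capacity: usize,
+    /// Entries ordered from least to most recently used.
+    pub entries: Vec<(String, u64)>,
+}
+
+#[derive(Debug, PartialEq, Eq)]
+pub enum RestoreError {
+    ZeroCapacity,
+    OverCapacity { len: usize, capacity: usize },
+    DuplicateKey(String),
+}
+
+/// Toy cache standing in for a real policy type.
+#[derive(Debug)]
+pub struct ToyLru {
+    capacity: usize,
+    order: Vec<(String, u64)>,
+}
+
+impl ToyLru {
+    pub fn len(&self) -> usize {
+        self.order.len()
+    }
+
+    pub fn capacity(&self) -> usize {
+        self.capacity
+    }
+}
+
+impl TryFrom<SerializableLruCache> for ToyLru {
+    type Error = RestoreError;
+
+    fn try_from(dto: SerializableLruCache) -> Result<Self, Self::Error> {
+        if dto.capacity == 0 {
+            return Err(RestoreError::ZeroCapacity);
+        }
+        if dto.entries.len() > dto.capacity {
+            return Err(RestoreError::OverCapacity {
+                len: dto.entries.len(),
+                capacity: dto.capacity,
+            });
+        }
+        {
+            // Reject duplicate keys before any state is built.
+            let mut seen = HashSet::new();
+            for (key, _) in &dto.entries {
+                if !seen.insert(key.as_str()) {
+                    return Err(RestoreError::DuplicateKey(key.clone()));
+                }
+            }
+        }
+        Ok(ToyLru { capacity: dto.capacity, order: dto.entries })
+    }
+}
+```
+
+Keeping validation in `try_from` means a corrupted artifact is rejected with
+an error value before it can become a live cache.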
+
+## See Also
+
+- [Metrics](metrics.md) - current serde-supported snapshot types
+- [TTL design](ttl.md) - relative TTL serialization recommendation
+- [Hashing and key identity](hashing.md) - hash seeds and map order
+- [Error model](error-model.md) - fallible restore should use `Result`
+- [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs)
+- [`bench-support/src/json_results.rs`](../../bench-support/src/json_results.rs)
diff --git a/docs/design/sharding.md b/docs/design/sharding.md
new file mode 100644
index 0000000..ef5c3ea
--- /dev/null
+++ b/docs/design/sharding.md
@@ -0,0 +1,157 @@
+# Sharding
+
+> Status: design rationale for sharded data structures that exist today and
+> roadmap notes for sharded cache policies. Companion to
+> [`concurrency.md`](concurrency.md) and [`hashing.md`](hashing.md).
+
+Sharding reduces contention by splitting one shared structure into N independent
+substructures, each with its own lock and capacity accounting. cachekit already
+uses this pattern at the data-structure and store layers. It does **not** yet
+ship a generic `ShardedCache<C>` or sharded policy wrapper.
+
+## Current Sharded Primitives
+
+| Type | Layer | Purpose |
+|---|---|---|
+| `ShardedHashMapStore<K, V, S>` | store | N locked hash maps with global size counter |
+| `ShardedSlotArena<T>` | data structure | N arenas addressed by `ShardedSlotId` |
+| `ShardedFrequencyBuckets<K>` | data structure | N frequency bucket sets for concurrent LFU-style metadata |
+| `ShardSelector` | helper | keyed hash routing from key to shard |
+
+The sharded primitives are building blocks, not full cache policies. A future
+`ShardedLruCache` would have to compose a sharded key index, per-shard recency
+metadata, and global capacity semantics. That composition is where the hard
+policy questions live.
+
+## Why Shard?
+
+A single `RwLock` wrapper is simple and often fast enough. It fails when:
+
+- many threads mutate policy metadata (`get` on LRU, LFU, Clock);
+- read paths still need atomics or lock acquisition;
+- one hot lock dominates profile samples;
+- cores spend more time waiting than doing cache work.
+
+Sharding turns one contended lock into N less-contended locks. The cost is that
+each shard is now a smaller cache with less global knowledge.
+
+## Shard Routing
+
+All routing should go through [`ShardSelector`](../../src/ds/shard.rs):
+
+```rust,ignore
+let selector = ShardSelector::randomized(16);
+let shard = selector.shard_for_key(&key);
+```
+
+Routing requirements:
+
+- Deterministic within a selector: same key maps to same shard.
+- Uniform: no systematic bias toward lower shards.
+- Keyed: adversaries should not be able to craft keys that all land on shard 0.
+- Bounded: shard count is clamped to `[1, MAX_SHARDS]`.
+
+Use `ShardSelector::randomized` unless reproducibility is required. If using
+`ShardSelector::new(shards, seed)`, treat `seed` as secret when keys are
+user-controlled.
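+
+The routing requirements can be illustrated with a standard-library sketch.
+This is not the `ShardSelector` implementation: `Router` is a hypothetical
+type, the `MAX_SHARDS` value of 1024 is assumed, and `RandomState` is used
+here only as a readily available keyed hasher.
+
+```rust
+use std::collections::hash_map::RandomState;
+use std::hash::{BuildHasher, Hash};
+
+/// Assumed upper bound; the real `MAX_SHARDS` value is not given here.
+pub const MAX_SHARDS: usize = 1024;
+
+/// Hypothetical keyed router from key to shard index.
+pub struct Router {
+    state: RandomState,
+    shards: usize,
+}
+
+impl Router {
+    /// Per-instance random key material; shard count clamped to the bounds.
+    pub fn randomized(shards: usize) -> Self {
+        Router {
+            state: RandomState::new(),
+            shards: shards.clamp(1, MAX_SHARDS),
+        }
+    }
+
+    pub fn shard_count(&self) -> usize {
+        self.shards
+    }
+
+    /// Same key, same router, same shard. The modulo introduces a bias
+    /// that is negligible when the shard count is tiny relative to 2^64.
+    pub fn shard_for_key<K: Hash + ?Sized>(&self, key: &K) -> usize {
+        (self.state.hash_one(key) % self.shards as u64) as usize
+    }
+}
+```
+
+Two `Router` instances built with `randomized` will generally disagree on
+where a key lands, which is the property that makes shard targeting hard
+for an adversary and the reason reproducible runs need an explicit seed.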
+
+## Capacity Semantics
+
+Two capacity models are possible:
+
+| Model | Behaviour | Pros | Cons |
+|---|---|---|---|
+| Per-shard capacity | total capacity split across shards | simple, one lock per op | hit rate fragmentation |
+| Global capacity | one shared capacity budget | better utilization | cross-shard locking or global victim selection |
+
+The primitives today mostly follow **per-shard local state with global gauges**:
+each shard owns its data; aggregate `len` is tracked separately where needed.
+This keeps operations single-lock. It also means a full shard can evict even if
+another shard has spare room.
+
+That is acceptable for stores and metadata primitives. For a full cache policy,
+it is a hit-rate trade-off and must be documented at the policy level.
+
+## Locking Discipline
+
+Current sharded operations acquire at most **one shard lock**. This is the most
+important invariant:
+
+- No deadlock cycles.
+- Lock hold time stays bounded by one shard operation.
+- Callers do not need a global lock ordering table.
+
+Any future operation that touches two shards must define an ordering rule, for
+example "lock lower shard index first." Avoid two-shard operations unless the
+hit-rate improvement justifies the concurrency risk.
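+
+As an illustration of the "lower shard index first" rule, here is a minimal
+sketch of a two-shard operation. `move_one` is hypothetical, no such
+operation exists in cachekit, and `std::sync::Mutex` is used instead of
+`parking_lot` so the sketch has no dependencies.
+
+```rust
+use std::sync::Mutex;
+
+/// Moves the last element of shard `from` onto shard `to`.
+/// Returns `false` if the shards are the same or `from` is empty.
+/// Panics if an index is out of bounds or a lock is poisoned.
+pub fn move_one(shards: &[Mutex<Vec<u32>>], from: usize, to: usize) -> bool {
+    if from == to {
+        // Locking the same mutex twice would deadlock.
+        return false;
+    }
+    // Always acquire the lower index first, whatever the direction.
+    let (lo, hi) = (from.min(to), from.max(to));
+    let lo_guard = shards[lo].lock().unwrap();
+    let hi_guard = shards[hi].lock().unwrap();
+    let (mut src, mut dst) = if from < to {
+        (lo_guard, hi_guard)
+    } else {
+        (hi_guard, lo_guard)
+    };
+    match src.pop() {
+        Some(value) => {
+            dst.push(value);
+            true
+        }
+        None => false,
+    }
+}
+```
+
+Two threads calling `move_one(s, 0, 1)` and `move_one(s, 1, 0)` both lock
+shard 0 before shard 1, so neither can hold one lock while waiting on the
+other in the opposite order.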
+
+## `ShardedSlotId`
+
+`ShardedSlotArena<T>` cannot use a plain `SlotId`. A slot id must identify both
+the shard and the local slot:
+
+```text
+ShardedSlotId = (shard_index, local_slot_id)
+```
+
+This is why sharding lives at the data-structure layer instead of being hidden
+behind a generic wrapper. Once a policy stores handles, the handle type is part
+of the policy's metadata layout.
+
+## Global Metrics
+
+Sharded types should expose aggregate metrics but record locally when possible.
+The rule:
+
+- Per-operation counters can be local or atomic.
+- Gauges like total `len` need either an atomic aggregate or a shard scan.
+- Snapshot consistency is best-effort; do not lock every shard just to make a
+  metrics snapshot globally atomic.
+
+This matches the metrics design: observability must not dominate the hot path.
+
+## Roadmap: `ShardedCache<C>`
+
+A generic sharded cache wrapper would look roughly like:
+
+```rust,ignore
+pub struct ShardedCache<C, K> {
+    shards: Vec<RwLock<C>>,
+    selector: ShardSelector,
+    capacity_per_shard: usize,
+    _key: PhantomData<K>,
+}
+```
+
+Open questions:
+
+- Does `C` have to be constructible by `CacheFactory`, or does the builder own
+  all construction?
+- Is capacity split evenly, weighted by shard traffic, or global?
+- Do policies expose per-shard metrics only, or aggregate metrics too?
+- How does `DynCache` integrate: `DynCache::Sharded(Box<...>)` or a sibling
+  `DynShardedCache`?
+- Should shard count be caller-specified, CPU-count-derived, or both?
+
+The conservative first version should use per-shard capacity and one-lock
+operations. Global victim selection should wait for benchmark evidence.
+
+## When Not To Shard
+
+- Cache fits on one lock without contention.
+- Hit rate matters more than write throughput.
+- Workload has a small hot set: all hot keys may still map to one shard.
+- Cache capacity is small: per-shard fragmentation dominates.
+- You need globally strict eviction order (true global LRU, ARC target `p`).
+
+Sharding is a concurrency optimization, not a policy upgrade.
+
+## See Also
+
+- [Concurrency](concurrency.md)
+- [Hashing and key identity](hashing.md)
+- [Metrics](metrics.md)
+- [`src/ds/shard.rs`](../../src/ds/shard.rs)
+- [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- [`src/ds/frequency_buckets.rs`](../../src/ds/frequency_buckets.rs)
diff --git a/docs/design/trait-hierarchy.md b/docs/design/trait-hierarchy.md
new file mode 100644
index 0000000..2eee6ea
--- /dev/null
+++ b/docs/design/trait-hierarchy.md
@@ -0,0 +1,415 @@
+# Cache Trait Hierarchy
+
+> Status: design rationale for the trait surface in
+> [`src/traits.rs`](../../src/traits.rs). Companion to the cross-cutting
+> principles in [`docs/design/design.md`](design.md) §7 and the concurrency
+> rationale in [`docs/design/concurrency.md`](concurrency.md).
+
+cachekit exposes its policies through a small, layered trait hierarchy.
+One kernel trait (`Cache<K, V>`) covers what every policy must do;
+optional capability traits expose signals that some policies have and
+others don't. This document explains why the surface is shaped this
+way, what each trait promises, and how to add new capabilities without
+breaking the kernel.
+
+## Goals
+
+The trait surface optimizes for four things, roughly in order:
+
+1. **Code written against the kernel survives a policy swap.** Users
+   writing `fn warm<C: Cache<K, V>>(c: &mut C, …)` can pick any of
+   the 18 implemented concrete policies without changing call sites.
+2. **Optional behaviour is visible only when present.** A policy that
+   doesn't track frequency should not have a `frequency()` method that
+   returns garbage or panics. Capability traits exist so this remains
+   true.
+3. **The kernel stays object-safe.** `Box<dyn Cache<K, V>>` is needed
+   for runtime dispatch (the `DynCache` enum is the chosen alternative,
+   but object safety keeps the door open and keeps the trait usable in
+   trait objects elsewhere).
+4. **The read/mutate split is explicit.** `peek` and `contains` are
+   side-effect-free `&self` methods; `get` is `&mut self` because it
+   updates policy state. This drops out of point 3 but is worth naming
+   on its own because it shapes the concurrent surface
+   ([`docs/design/concurrency.md`](concurrency.md)).
+
+## Map of the hierarchy
+
+```text
+                            ┌───────────────────────┐
+                            │      Cache<K, V>      │   object-safe kernel
+                            │  contains, len,       │
+                            │  capacity, peek, get, │
+                            │  insert, remove,      │
+                            │  clear, is_empty      │
+                            └───────────┬───────────┘
+                                        │ extends
+            ┌───────────────┬───────────┼───────────┬──────────────────┐
+            ▼               ▼           ▼           ▼                  ▼
+    EvictingCache   VictimInspect   RecencyTrack   FrequencyTrack   HistoryTrack
+    evict_one()     peek_victim()   touch,         frequency()      access_count,
+                                    recency_rank                    k_distance,
+                                                                    access_history,
+                                                                    k_value
+
+   ConcurrentCache  CacheFactory + CacheConfig    AsyncCacheFuture  (utility traits,
+   (unsafe marker)  (constructor abstraction)     (Phase 2)         not extensions)
+```
+
+All capability traits in the upper row extend `Cache<K, V>`. They
+compose by being implemented additively — `LrukCache` implements
+`Cache`, `RecencyTracking`, `FrequencyTracking`, **and**
+`HistoryTracking` because it tracks all three signals.
+
+## Layer 1 — `Cache<K, V>`
+
+The kernel trait. Every policy implements it. The full signature lives
+in [`src/traits.rs`](../../src/traits.rs); the design decisions worth
+naming are:
+
+```rust
+pub trait Cache<K, V> {
+    fn contains(&self, key: &K) -> bool;
+    fn len(&self) -> usize;
+    fn is_empty(&self) -> bool { self.len() == 0 }
+    fn capacity(&self) -> usize;
+
+    fn peek(&self, key: &K) -> Option<&V>;
+    fn get(&mut self, key: &K) -> Option<&V>;
+    fn insert(&mut self, key: K, value: V) -> Option<V>;
+    fn remove(&mut self, key: &K) -> Option<V>;
+    fn clear(&mut self);
+}
+```
+
+### Object safety
+
+The signature deliberately avoids every feature that would break
+object safety:
+
+- No generic methods.
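+- No `Self` in argument or return position outside the receiver.
+- No `Self: Sized` supertrait bound. (A `where Self: Sized` clause on an
+  individual method is allowed and only hides that method from
+  `dyn Cache`; the kernel has no such methods.)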
+- No `impl Trait` returns.
+
+This costs ergonomics — batch operations like `insert_many`,
+`get_or_insert_with(closure)`, and `extend(iter)` stay as inherent
+methods on each policy rather than landing on `Cache<K, V>` itself.
+That trade is intentional: keeping the trait object-safe means
+`DynCache<K, V>` is *able* to dispatch through it (even though the
+shipped `DynCache` is an enum dispatcher rather than a trait object —
+see [`design.md`](design.md) §13). It also keeps `Box<dyn Cache<K, V>>`
+available for users writing test harnesses, factories, or registries
+that need true type erasure.
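+
+As a small illustration of the type-erasure point, the sketch below
+restates the kernel signature locally so that it compiles on its own,
+then puts a toy implementation behind `Box<dyn Cache<K, V>>`. `VecCache`
+and `total_len` are hypothetical names, not cachekit types:
+
+```rust
+/// Local copy of the kernel signature shown above, repeated only so the
+/// snippet is self-contained.
+pub trait Cache<K, V> {
+    fn contains(&self, key: &K) -> bool;
+    fn len(&self) -> usize;
+    fn is_empty(&self) -> bool { self.len() == 0 }
+    fn capacity(&self) -> usize;
+    fn peek(&self, key: &K) -> Option<&V>;
+    fn get(&mut self, key: &K) -> Option<&V>;
+    fn insert(&mut self, key: K, value: V) -> Option<V>;
+    fn remove(&mut self, key: &K) -> Option<V>;
+    fn clear(&mut self);
+}
+
+/// Hypothetical FIFO-style cache backed by a `Vec`; illustration only.
+pub struct VecCache<K, V> {
+    items: Vec<(K, V)>,
+    cap: usize,
+}
+
+impl<K: PartialEq, V> VecCache<K, V> {
+    pub fn new(cap: usize) -> Self {
+        assert!(cap > 0, "capacity must be non-zero");
+        Self { items: Vec::new(), cap }
+    }
+}
+
+impl<K: PartialEq, V> Cache<K, V> for VecCache<K, V> {
+    fn contains(&self, key: &K) -> bool { self.items.iter().any(|(k, _)| k == key) }
+    fn len(&self) -> usize { self.items.len() }
+    fn capacity(&self) -> usize { self.cap }
+    fn peek(&self, key: &K) -> Option<&V> {
+        self.items.iter().find(|(k, _)| k == key).map(|(_, v)| v)
+    }
+    // FIFO keeps no per-access state, so `get` has nothing extra to update.
+    fn get(&mut self, key: &K) -> Option<&V> { self.peek(key) }
+    fn insert(&mut self, key: K, value: V) -> Option<V> {
+        let old = self.remove(&key);
+        if old.is_none() && self.items.len() >= self.cap {
+            self.items.remove(0); // evict the oldest entry to make room
+        }
+        self.items.push((key, value));
+        old
+    }
+    fn remove(&mut self, key: &K) -> Option<V> {
+        let pos = self.items.iter().position(|(k, _)| k == key)?;
+        Some(self.items.remove(pos).1)
+    }
+    fn clear(&mut self) { self.items.clear() }
+}
+
+/// A registry of runtime-chosen caches; this only type-checks because the
+/// kernel trait can be used as a trait object.
+pub fn total_len(caches: &[Box<dyn Cache<u32, String>>]) -> usize {
+    caches.iter().map(|c| c.len()).sum()
+}
+```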
+
+### `peek` vs `get` — the read/mutate split
+
+This is the most consequential design decision in the kernel:
+
+- **`peek(&self, …) -> Option<&V>`** does not update recency, frequency,
+  reference bits, segment placement, or any policy state. It is the
+  honest read.
+- **`get(&mut self, …) -> Option<&V>`** is the policy-tracked read.
+  An LRU `get` moves the entry to MRU; an LFU `get` bumps the
+  frequency counter; a Clock `get` sets the reference bit.
+
+Three things fall out of the split:
+
+1. **`peek` is usable behind a read lock.** Concurrent wrappers
+   ([`docs/design/concurrency.md`](concurrency.md)) implement their
+   `peek` with `RwLock::read`, allowing multiple readers to proceed
+   in parallel. `get` requires `RwLock::write` because it mutates.
+2. **`peek` is testable as a pure function.** Hit-rate measurements,
+   invariant assertions, and debug prints can use `peek` without
+   perturbing the policy.
+3. **`len` / `contains` / `capacity` are also `&self`.** They live
+   alongside `peek` in the read-locked surface of concurrent wrappers,
+   for the same reason.
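+
+The split can be seen in a few lines with a toy LRU. This is a sketch
+under simplifying assumptions (a `Vec` ordered from LRU to MRU, linear
+scans); `ToyLru` and its `victim` helper are hypothetical and do not
+mirror the real `LruCache` internals:
+
+```rust
+/// Hypothetical LRU used only to show the split; `order[0]` is the next victim.
+pub struct ToyLru<K, V> {
+    order: Vec<(K, V)>,
+}
+
+impl<K: PartialEq, V> ToyLru<K, V> {
+    pub fn new() -> Self {
+        Self { order: Vec::new() }
+    }
+
+    pub fn insert(&mut self, key: K, value: V) {
+        self.order.retain(|(k, _)| *k != key);
+        self.order.push((key, value)); // newest entry sits at the MRU end
+    }
+
+    /// Honest read: takes `&self`, so it cannot reorder anything.
+    pub fn peek(&self, key: &K) -> Option<&V> {
+        self.order.iter().find(|(k, _)| k == key).map(|(_, v)| v)
+    }
+
+    /// Policy-tracked read: moves the entry to the MRU end before returning it.
+    pub fn get(&mut self, key: &K) -> Option<&V> {
+        let pos = self.order.iter().position(|(k, _)| k == key)?;
+        let entry = self.order.remove(pos);
+        self.order.push(entry);
+        self.order.last().map(|(_, v)| v)
+    }
+
+    /// Key that would be evicted next (the LRU end).
+    pub fn victim(&self) -> Option<&K> {
+        self.order.first().map(|(k, _)| k)
+    }
+}
+```
+
+Calling `peek` on the oldest key leaves `victim()` unchanged, while
+calling `get` on it makes the next-oldest key the victim.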
+
+`contains` is its own method — not `peek(key).is_some()` — because
+some policies (S3-FIFO, ARC, CAR with ghost lists) can answer
+"is this key resident?" cheaper than they can return a value reference.
+
+### `&V` return positions
+
+Returning `&V` rather than `V`-by-value or `Arc<V>` is the right
+choice **for the sequential trait**. Callers who need ownership can
+clone; callers who don't pay nothing. The cost shows up in concurrent
+wrappers, which cannot return `&V` across a lock boundary — that's
+why `Concurrent*` types deviate from `Cache<K, V>` (covered in detail
+in [`concurrency.md`](concurrency.md)).
+
+### Default methods
+
+Only `is_empty` has a default. Adding more defaults — even ones that
+seem obviously implementable in terms of other methods — would push
+performance regressions onto policies that have cheaper specialised
+implementations. A hashmap-backed `contains` is faster than a
+hypothetical default of `peek(…).is_some()` because it skips fetching
+the value, and that difference matters on hot lookup paths.
+
+## Layer 2 — Capability traits
+
+Each capability trait extends `Cache<K, V>` and exposes a signal that
+**some but not all** policies have. The rule is:
+
+> Implement the capability trait only when the policy genuinely
+> exposes that signal. Do not stub out the methods with sentinel
+> returns.
+
+### `EvictingCache<K, V>: Cache<K, V>`
+
+```rust
+fn evict_one(&mut self) -> Option<(K, V)>;
+```
+
+Forces a single eviction by policy. Returns the evicted entry or
+`None` if the cache is empty. Useful for benchmarks ("evict 1 % of
+the cache and measure"), background cleanup, and capacity-on-demand
+patterns. Implemented by FIFO, LIFO, LRU, FastLRU, Heap-LFU, S3-FIFO,
+Clock, Clock-PRO, LRU-K, MFU, MRU, plus the LFU variants.
+
+Policies that **do not** implement it: ARC, CAR, NRU, Random, SLRU,
+2Q. The rustdoc on `EvictingCache` lists this set explicitly. The
+reason is policy-specific:
+
+- ARC / CAR evict via adaptive choice across two queues; "evict one
+  by policy" is ambiguous without an insertion that drives the
+  adaptation.
+- NRU sweeps reference bits; an isolated `evict_one` may scan the
+  whole cache.
+- Random has no order; users who want random eviction should call
+  `remove(random_key)` themselves.
+- SLRU / 2Q's victim depends on which segment is over-quota,
+  which only happens under capacity pressure.
+
+`evict_one` is marked `#[must_use]` because dropping the evicted
+entry on the floor is rarely what callers want.
+
+### `VictimInspectable<K, V>: Cache<K, V>`
+
+```rust
+fn peek_victim(&self) -> Option<(&K, &V)>;
+```
+
+Read-only access to the entry that would be evicted next. Only
+implemented by policies whose victim is cheap and stable to identify
+without mutating state — FIFO, LIFO, LRU, FastLRU. Clock-family
+policies don't implement it because identifying the victim requires
+advancing the hand (a mutation). LFU-family policies don't implement
+it because the heap top can be a stale entry that hasn't been popped
+yet ([`LazyMinHeap`](../../src/ds/lazy_heap.rs)).
+
+The signature is deliberately `&self`-only. Anything that would force
+`&mut self` (lazy heap rebuild, clock-hand advance, ARC adaptation)
+disqualifies the policy from implementing it.
+
+### `RecencyTracking<K, V>: Cache<K, V>`
+
+```rust
+fn touch(&mut self, key: &K) -> bool;
+fn recency_rank(&self, key: &K) -> Option<usize>;
+```
+
+For policies that order entries by access recency: LRU, FastLRU,
+LRU-K. `touch` is `get` without the value lookup — useful when you
+want to refresh recency for a key whose value you already have.
+`recency_rank` returns 0 for the MRU entry and `len() - 1` for the
+LRU. Ranks stay stable across `peek`/`contains`/`len` calls but are
+invalidated by any `&mut` call.
+
+### `FrequencyTracking<K, V>: Cache<K, V>`
+
+```rust
+fn frequency(&self, key: &K) -> Option<u64>;
+```
+
+For policies that track access frequency: LFU, Heap-LFU, MFU, LRU-K.
+The `u64` return is intentional even though some policies use smaller
+counters internally (LFU uses small saturating counters under
+[`FrequencyBuckets`](../../src/ds/frequency_buckets.rs)) — exposing
+`u64` keeps the trait stable across counter-width changes.
+
+### `HistoryTracking<K, V>: Cache<K, V>`
+
+```rust
+fn access_count(&self, key: &K) -> Option<usize>;
+fn k_distance(&self, key: &K) -> Option<u64>;
+fn access_history(&self, key: &K) -> Option<Vec<u64>>;
+fn k_value(&self) -> usize;
+```
+
+LRU-K style access-history inspection. Currently implemented only by
+`LrukCache`. The `access_history` return is a `Vec<u64>` because the
+history is bounded by K and callers typically inspect it as a unit;
+exposing the underlying [`FixedHistory`](../../src/ds/fixed_history.rs)
+would couple consumers to an internal type.
+
+`k_value()` is a trait method, rather than something only the concrete
+constructor knows about, because LRU-K's K is policy-configured and
+consumers writing generic code over `HistoryTracking` need to read it
+without knowing the concrete type.
+
+## Why capability traits, not feature flags?
+
+cachekit could expose recency / frequency / history through methods
+on `Cache<K, V>` itself, gated by Cargo features. It doesn't, for
+three reasons:
+
+- **Compile-time gating doesn't match the actual gating signal.**
+  Whether a method is meaningful depends on the **policy**, not on
+  the **build**. A `policy-all` build still has policies that can't
+  answer `frequency()`.
+- **Method-level defaults that return `None` are a footgun.** Code
+  that calls `cache.frequency(&k)` on an LRU cache would silently
+  return `None` and pass through review.
+- **Trait bounds carry information.**
+  `fn warm<C: FrequencyTracking<K, V>>()` documents at the type-system
+  level that the function only makes sense for frequency-tracking
+  caches.
+
+The trade is one extra `use` statement at call sites — `use
+cachekit::traits::{Cache, RecencyTracking};` — which is a small price
+for the correctness gain.
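+
+The third point can be sketched with cut-down stand-ins for the real
+traits. Only the methods the example needs are declared, and `ToyLfu`,
+`ToyFifo`, and `hottest` are hypothetical names introduced for
+illustration:
+
+```rust
+use std::collections::HashMap;
+
+// Cut-down stand-ins for the real traits: only what this sketch needs.
+pub trait Cache<K, V> {
+    fn len(&self) -> usize;
+}
+pub trait FrequencyTracking<K, V>: Cache<K, V> {
+    fn frequency(&self, key: &K) -> Option<u64>;
+}
+
+/// Hypothetical frequency-counting cache (values omitted for brevity).
+pub struct ToyLfu {
+    counts: HashMap<String, u64>,
+}
+/// Hypothetical cache with no frequency signal; it implements only `Cache`.
+pub struct ToyFifo {
+    keys: Vec<String>,
+}
+
+impl Cache<String, ()> for ToyLfu {
+    fn len(&self) -> usize { self.counts.len() }
+}
+impl FrequencyTracking<String, ()> for ToyLfu {
+    fn frequency(&self, key: &String) -> Option<u64> { self.counts.get(key).copied() }
+}
+impl Cache<String, ()> for ToyFifo {
+    fn len(&self) -> usize { self.keys.len() }
+}
+
+/// The bound states the requirement; no runtime "unsupported" branch exists.
+pub fn hottest<'a, C>(cache: &C, keys: &'a [String]) -> Option<&'a String>
+where
+    C: FrequencyTracking<String, ()>,
+{
+    keys.iter()
+        .filter_map(|k| cache.frequency(k).map(|f| (f, k)))
+        .max_by_key(|(f, _)| *f)
+        .map(|(_, k)| k)
+}
+
+// `hottest(&ToyFifo { keys: vec![] }, &[])` is rejected at compile time
+// because `ToyFifo` does not implement `FrequencyTracking`.
+```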
+
+## Utility traits
+
+Three traits live alongside the hierarchy but are not extensions of
+`Cache<K, V>`.
+
+### `unsafe trait ConcurrentCache: Send + Sync`
+
+Marker trait, no methods. Implementing it asserts that the type
+handles internal synchronization safely. Covered in detail in
+[`concurrency.md`](concurrency.md#concurrentcache-marker-trait-not-capability-trait).
+
+### `CacheFactory<K, V>` and `CacheConfig`
+
+```rust
+pub trait CacheFactory<K, V> {
+    type Cache: Cache<K, V>;
+    fn new(capacity: usize) -> Self::Cache;
+    fn with_config(config: CacheConfig) -> Self::Cache;
+}
+```
+
+Constructor abstraction for generic code that needs to build caches
+without naming the concrete type. `CacheConfig` is a `#[non_exhaustive]`
+struct with builder-style `with_*` methods, mirroring the wider
+`CacheBuilder` shape in [`src/builder.rs`](../../src/builder.rs).
+
+In practice most code constructs caches directly (`LruCache::new(…)`)
+or through `CacheBuilder`. `CacheFactory` mostly exists for test
+harnesses and benchmark runners that want to parameterise across
+policies; the trait's `Cache` associated type makes that ergonomic.
+
+### `AsyncCacheFuture<K, V>: Send + Sync`
+
+Phase 2 placeholder. The methods (`supports_async_get`,
+`supports_async_insert`) default to `false` and no policy overrides
+them. The trait exists so that async-native policies can be added in
+the future without breaking the existing surface.
+
+## Read/mutate split rationale (recapitulated)
+
+Worth stating once more in one place: the methods on `Cache<K, V>`
+split cleanly into two groups:
+
+| `&self` (read-locked-safe) | `&mut self` (write-locked) |
+|----------------------------|----------------------------|
+| `contains`, `len`, `is_empty`, `capacity` | `get`, `insert`, `remove`, `clear` |
+| `peek`                     |                            |
+| (capability) `peek_victim`, `recency_rank`, `frequency`, `access_count`, `k_distance`, `access_history`, `k_value` | (capability) `evict_one`, `touch` |
+
+This is the contract the concurrent wrappers rely on. Adding a new
+`Cache` method that mutates state through `&self` (interior mutability)
+would break the lock-granularity story; adding one that takes `&mut
+self` but doesn't logically mutate would prevent the read-lock fast
+path in `Concurrent*` wrappers.
+
+## Object safety vs. ergonomic methods
+
+Some operations naturally belong on `Cache<K, V>` but would break
+object safety. They live as inherent methods on each policy instead:
+
+- `extend<I: IntoIterator<Item = (K, V)>>(&mut self, iter: I)`
+- `get_or_insert_with<F: FnOnce() -> V>(&mut self, key: K, f: F) -> &V`
+- `insert_many(&mut self, items: impl IntoIterator<Item = (K, V)>)`
+  with buffer reuse
+
+The rule: anything taking a generic closure, generic iterator, or
+returning `impl Trait` is an inherent method, not a trait method.
+The trait stays object-safe; the policy types stay ergonomic.
+
+## Adding a new capability trait
+
+Checklist for new capability traits:
+
+1. **The signal must exist in the implementing policy's metadata.**
+   No defaults that return `None`/`0`/`false` for "doesn't apply."
+2. **Bound on `Cache<K, V>`.** Capability traits compose with the
+   kernel; they don't replace it.
+3. **Object safety is optional for capability traits** but
+   recommended. Trait objects of capability traits show up rarely;
+   ergonomic generic methods are fine.
+4. **Name follows the noun-of-the-signal pattern.** `RecencyTracking`,
+   `FrequencyTracking`, `HistoryTracking`. New ones should follow
+   suit: `WeightTracking`, `CostTracking`, `AdmissionTracking`.
+5. **Re-export from `prelude`.** Capability traits live in the same
+   `use cachekit::prelude::*;` namespace as the kernel.
+6. **Document the implementing-policy set.** The rustdoc on
+   `EvictingCache` lists policies that opt out; new traits should
+   do the same for the smaller set that opts in.
+
+## Future capability traits
+
+Sketched in priority order:
+
+- **`ExpiringCache<K, V>: Cache<K, V>`** — TTL surface, per
+  [`docs/design/ttl.md`](ttl.md) §4(a). Signature:
+
+  ```rust
+  fn insert_with_ttl(&mut self, key: K, value: V, ttl: Duration) -> Option<V>;
+  fn ttl_status(&self, key: &K) -> TtlStatus;
+  fn set_ttl(&mut self, key: &K, ttl: Duration) -> bool;
+  fn purge_expired(&mut self) -> usize;
+  ```
+
+  Implemented by the `Expiring<C>` decorator over any `Cache<K, V>`.
+
+- **`WeightTracking<K, V>: Cache<K, V>`** — surface for weight-aware
+  caches built on [`WeightStore`](../../src/store/weight.rs). Likely
+  signature:
+
+  ```rust
+  fn weight(&self, key: &K) -> Option<usize>;
+  fn total_weight(&self) -> usize;
+  fn weight_capacity(&self) -> usize;
+  ```
+
+  Needed before GDS/GDSF (roadmap policies) can be expressed
+  generically.
+
+- **`AdmissionTracking<K, V>: Cache<K, V>`** — exposes ghost-list /
+  admission-history state for ARC, CAR, S3-FIFO, Clock-PRO,
+  TinyLFU. Specifically: was this key ever resident, and if so when
+  did it leave? Useful for adaptive workloads where the caller
+  wants to know whether a miss is a one-hit-wonder or a returning
+  member of the working set.
+
+Each trait above is intentionally not added until a second policy
+implements it. The `RecencyTracking` / `FrequencyTracking` /
+`HistoryTracking` naming established the convention; adding
+`WeightTracking` only when GDS lands keeps the surface honest.
+
+## See also
+
+- [Design overview](design.md) — §7 frames the layering at the
+  principles level, §13 covers `DynCache` runtime dispatch
+- [Concurrency](concurrency.md) — read/mutate split + `ConcurrentCache`
+- [TTL design](ttl.md) — applied example: `ExpiringCache` as a new
+  capability trait
+- [Read-only traits](../guides/read-only-traits.md) — user-facing
+  guidance on the `peek` / `get` split
+- [`src/traits.rs`](../../src/traits.rs) — the canonical definitions
+- [`src/store/traits.rs`](../../src/store/traits.rs) — parallel
+  trait family at the store layer (sequential + concurrent)
diff --git a/docs/design/weighted-eviction.md b/docs/design/weighted-eviction.md
new file mode 100644
index 0000000..7a8dbac
--- /dev/null
+++ b/docs/design/weighted-eviction.md
@@ -0,0 +1,388 @@
+# Weighted Eviction
+
+> Status: design rationale for [`WeightStore`](../../src/store/weight.rs)
+> and [`ConcurrentWeightStore`](../../src/store/weight.rs). Companion to
+> [`design.md`](design.md), [`concurrency.md`](concurrency.md), and the
+> [`stores`](../stores/README.md) reference.
+
+Entry-count caps are the wrong tool when entries vary in size. A cache
+sized "max 1 000 entries" that holds a mix of 100-byte thumbnails and
+10 MB blobs will either overshoot its memory budget by orders of
+magnitude (when blobs dominate) or waste capacity (when thumbnails do).
+`WeightStore` exists to give callers a second, byte-denominated budget
+alongside the entry count.
+
+This document explains the dual-limit model, the contract on the
+user-supplied weight function, where weight integrates with eviction
+policies today (it does not), and how it pre-stages GDS/GDSF on the
+roadmap.
+
+## The problem
+
+A typical entry-count cache:
+
+- Fails to bound memory when value sizes differ by orders of magnitude.
+- Cannot answer "how many bytes am I caching?" without iterating.
+- Treats a 1 KB and a 1 MB entry as equal eviction candidates, which
+  is wrong when memory pressure is the binding constraint.
+
+The complementary failure mode — a pure byte-budgeted cache — has its
+own problems:
+
+- Highly variable entry counts make per-entry metadata budgeting hard.
+- A pathological "one giant entry fills the cache" case is the byte
+  version of the "millions of one-byte entries fills the cache"
+  problem in entry-count caches.
+- Some policies (LFU bucket arrays, S3-FIFO ratios) are sized by entry
+  count and need a stable upper bound.
+
+`WeightStore` therefore enforces **both** an entry-count cap and a
+weight cap — whichever is hit first triggers `StoreFull`. The user
+picks the units of "weight" via a closure.
+
+## Dual-limit model
+
+```text
+try_insert(key, value):
+  │
+  ├─► Existing key (update)
+  │     │
+  │     ├── new_weight    = weight_fn(&value)
+  │     ├── next_total    = total_weight - old_weight + new_weight
+  │     │
+  │     └── next_total > capacity_weight? ──► Err(StoreFull)
+  │                                       └──► Ok(Some(old_value))
+  │
+  └─► New key (insert)
+        │
+        ├── len() >= capacity_entries?         ──► Err(StoreFull)
+        ├── new_weight = weight_fn(&value)
+        ├── total_weight + new_weight > capacity_weight? ──► Err(StoreFull)
+        │
+        └── Ok(None)
+```
+
+Three properties worth naming:
+
+- **Pre-checked, not retroactive.** `try_insert` returns
+  `Err(StoreFull)` rather than silently evicting; the **store** is
+  full, so the caller (or the policy layered above it) decides what
+  to evict.
+- **Updates can fail too.** Replacing a 1 MB value with a 2 MB value
+  needs 1 MB of net new budget, so on a cache with only 0.5 MB of
+  remaining headroom it returns `StoreFull`: the update is rejected
+  and the original entry stays resident. This is the only sensible
+  behaviour when an update can push the store past its budget.
+- **Atomic weight bookkeeping.** `total_weight` is the live sum of
+  every resident entry's weight. Every successful `try_insert` /
+  `remove` / `clear` updates it; reads (`get`, `peek`) do not. The
+  invariant `total_weight == sum(entries.weight)` is debug-asserted.
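+
+The flow above can be sketched as a cut-down store. This is an
+illustration under simplifying assumptions (string keys, no metrics,
+no `remove`); `ToyWeightStore` is a hypothetical name and the real
+`WeightStore` may order its checks differently:
+
+```rust
+use std::collections::HashMap;
+use std::sync::Arc;
+
+#[derive(Debug, PartialEq)]
+pub struct StoreFull;
+
+/// Hypothetical dual-limit store that follows the flow diagram above.
+pub struct ToyWeightStore<V, F: Fn(&V) -> usize> {
+    entries: HashMap<String, (Arc<V>, usize)>, // value plus its recorded weight
+    total_weight: usize,
+    capacity_entries: usize,
+    capacity_weight: usize,
+    weight_fn: F,
+}
+
+impl<V, F: Fn(&V) -> usize> ToyWeightStore<V, F> {
+    pub fn new(capacity_entries: usize, capacity_weight: usize, weight_fn: F) -> Self {
+        Self {
+            entries: HashMap::new(),
+            total_weight: 0,
+            capacity_entries,
+            capacity_weight,
+            weight_fn,
+        }
+    }
+
+    pub fn total_weight(&self) -> usize {
+        self.total_weight
+    }
+
+    pub fn try_insert(&mut self, key: String, value: Arc<V>) -> Result<Option<Arc<V>>, StoreFull> {
+        let old_weight = match self.entries.get(&key) {
+            Some((_, w)) => *w, // update path: the entry cap is not consulted
+            None if self.entries.len() >= self.capacity_entries => return Err(StoreFull),
+            None => 0, // fresh insert
+        };
+        let new_weight = (self.weight_fn)(&*value);
+        // Pre-check: compute the would-be total before touching any state.
+        let next_total = (self.total_weight - old_weight)
+            .checked_add(new_weight)
+            .ok_or(StoreFull)?;
+        if next_total > self.capacity_weight {
+            return Err(StoreFull); // rejected; the original entry stays resident
+        }
+        self.total_weight = next_total;
+        Ok(self.entries.insert(key, (value, new_weight)).map(|(old, _)| old))
+    }
+}
+```
+
+Because the check runs before any mutation, a rejected update leaves
+both the old value and `total_weight` exactly as they were.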
+
+## The weight function: contract and hazards
+
+```rust,ignore
+F: Fn(&V) -> usize
+```
+
+The user supplies a closure. Three pieces of the contract matter:
+
+- **Cheap.** Ideally O(1). The function is called on every insert and
+  every update. A weight function that traverses the value to compute
+  bytes (`|tree: &BTreeMap<K, V>| tree.iter().map(…).sum()`) makes
+  insert latency proportional to value size.
+- **Deterministic.** The same value must yield the same weight every
+  time. A non-deterministic weight breaks `total_weight` accounting —
+  the store remembers `old_weight` from the *previous* insert, so a
+  changed weight on update leaks `(new_actual - old_recorded)` bytes
+  of budget per update.
+- **Non-panicking.** The function is invoked while a write lock is
+  held in [`ConcurrentWeightStore`](../../src/store/weight.rs). Under
+  `panic = "unwind"`, a panicking weight function unwinds out of the
+  call site and the insert never completes. The lock itself is
+  `parking_lot`'s non-poisoning variant, so nothing is marked
+  poisoned: the guard is released during unwinding and later callers
+  proceed. Under the crate's default `panic = "abort"` release
+  profile a panic terminates the process.
+
+Common shapes:
+
+```rust,ignore
+|v: &Vec<u8>|     v.len()
+|s: &String|      s.len()
+|img: &Image|     img.width * img.height * 4
+|_: &T|           1                 // entry-count only
+|v: &Cow<[u8]>|   v.len()           // works for borrowed/owned
+```
+
+The "weight = 1" specialization deserves a note: it makes
+`WeightStore` behave exactly like a count-only store, at the cost of
+an `Arc<V>` round-trip and per-entry weight slot. Use
+`HashMapStore` for that case unless you specifically want the
+`ConcurrentWeightStore` API.
+
+## Precomputation: weight stored per entry
+
+Each entry holds its weight in a small wrapper:
+
+```rust,ignore
+struct WeightEntry<V> {
+    value: Arc<V>,
+    weight: usize,
+}
+```
+
+Weight is computed **once** at insert/update time and stored alongside
+the value. Three consequences:
+
+- Reads (`get`, `peek`, `contains`, `len`, `total_weight`) never
+  invoke the weight function. They cannot — they only have a
+  reference to the stored entry, and the stored entry already knows
+  its weight.
+- `remove` updates `total_weight` by subtracting the stored weight,
+  with no recompute.
+- Memory overhead per entry is `sizeof(usize)` + `sizeof(Arc<V>)`:
+  one word for the weight and one pointer, plus the reference counts
+  in the `Arc` heap allocation. Acceptable for variable-size caches
+  where the value itself dominates the per-entry footprint.
+
+The alternative — recomputing weight on every read for the sake of
+"freshness" — would only matter if the weight function were
+non-deterministic, which the contract forbids.
+
+## `Arc<V>` everywhere
+
+`WeightStore` stores `Arc<V>` even in the single-threaded variant:
+
+```rust,ignore
+pub fn try_insert(&mut self, key: K, value: Arc<V>) -> Result<Option<Arc<V>>, StoreFull>
+pub fn get(&mut self, key: &K) -> Option<Arc<V>>
+pub fn peek(&self, key: &K) -> Option<Arc<V>>
+```
+
+This is a deliberate divergence from `StoreCore` / `StoreMut` (which
+return `V` directly). Three reasons:
+
+- **Cheap shared ownership.** Large `V`s (images, blobs) are the
+  target use case. Returning `Arc<V>` lets callers hold or share the
+  value without forcing `V: Clone`.
+- **Surface alignment with `ConcurrentWeightStore`.** The concurrent
+  variant must return `Arc<V>` (the `&V`-across-lock problem from
+  [`concurrency.md`](concurrency.md)). Keeping the single-threaded
+  variant on the same shape lets callers swap between them by
+  changing one type without re-plumbing returns.
+- **`V: !Clone` is supported.** Callers who don't want to require
+  `Clone` on their value type get the `Arc<V>` round-trip "for free."
+
+The cost is that `WeightStore` does **not** implement `StoreCore` /
+`StoreMut`. It is a sibling, not a subtype, of the entry-count stores
+([`HashMapStore`](../../src/store/hashmap.rs),
+[`SlabStore`](../../src/store/slab.rs)), and code generic over those
+traits cannot accept a `WeightStore` without adaptation. This is the
+single sharpest API edge in the store layer, called out explicitly in
+the module documentation.
+
+## Why weight is at the **store** layer, not the policy layer
+
+The 18 implemented policies in `src/policy/` are all weight-unaware.
+They count entries and evict by entry. `WeightStore` is below them in
+the layering:
+
+```text
+   ┌─────────────────────────────┐
+   │   policy (weight-unaware)   │   evicts by recency/frequency/etc
+   └──────────────┬──────────────┘
+                  │ Cache<K, V> uses store underneath
+   ┌──────────────▼──────────────┐
+   │  WeightStore (dual limits)  │   refuses inserts past weight cap
+   └─────────────────────────────┘
+```
+
+This separation has two consequences worth understanding:
+
+- **The policy decides who to evict; the store decides whether the
+  result fits.** A policy operating over a `WeightStore` evicts its
+  policy-chosen victim, then attempts the insert. If the insert
+  still doesn't fit (one large value cannot be made room for by
+  evicting a single small victim), the policy must evict again or
+  surface `StoreFull` to the caller.
+- **No policy in the tree today consumes `WeightStore` directly.**
+  `WeightStore` is reachable only through its own concrete API, not
+  through the `Cache<K, V>` trait or `DynCache`. Users who want a
+  weight-aware cache today build one themselves on top of
+  `WeightStore` plus a chosen eviction strategy.
+
+The reason for this layering is forward compatibility. Weight-aware
+**policies** (GDS, GDSF, LFU-DA, see roadmap) need this store as
+their substrate. Coupling weight directly into a policy locks the
+weight model to that policy; keeping it at the store layer keeps the
+substrate reusable.
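+
+The evict-then-retry loop from the first consequence can be sketched
+as follows. This is an illustration only: `ToyWeightedFifo` is a
+hypothetical type that folds a FIFO "policy" and the two store-side
+checks into one struct, takes precomputed weights, and does not handle
+duplicate keys:
+
+```rust
+use std::collections::VecDeque;
+
+/// Hypothetical weight-capped FIFO: the queue plays the policy, the two
+/// capacity checks in `fits` play the store.
+pub struct ToyWeightedFifo {
+    queue: VecDeque<(String, usize)>, // (key, recorded weight), oldest first
+    total_weight: usize,
+    capacity_entries: usize,
+    capacity_weight: usize,
+}
+
+impl ToyWeightedFifo {
+    pub fn new(capacity_entries: usize, capacity_weight: usize) -> Self {
+        Self { queue: VecDeque::new(), total_weight: 0, capacity_entries, capacity_weight }
+    }
+
+    pub fn total_weight(&self) -> usize {
+        self.total_weight
+    }
+
+    /// Store-side question: would this entry fit right now?
+    fn fits(&self, weight: usize) -> bool {
+        self.queue.len() < self.capacity_entries
+            && self.total_weight.saturating_add(weight) <= self.capacity_weight
+    }
+
+    /// Evicts policy-chosen victims until the new entry fits. Returns the
+    /// evicted keys, or gives the key back if it cannot fit even when empty.
+    pub fn insert(&mut self, key: String, weight: usize) -> Result<Vec<String>, String> {
+        if weight > self.capacity_weight || self.capacity_entries == 0 {
+            return Err(key); // no amount of eviction helps: surface "store full"
+        }
+        let mut evicted = Vec::new();
+        while !self.fits(weight) {
+            // Policy-side choice: the oldest entry is the victim.
+            let (victim, w) = self
+                .queue
+                .pop_front()
+                .expect("an empty store always fits an entry within capacity");
+            self.total_weight -= w;
+            evicted.push(victim);
+        }
+        self.total_weight += weight;
+        self.queue.push_back((key, weight));
+        Ok(evicted)
+    }
+}
+```
+
+A single insert may evict several small entries, which is the case a
+count-only policy never has to consider.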
+
+## Concurrent variant
+
+`ConcurrentWeightStore<K, V, F>` follows the wrapper pattern from
+[`concurrency.md`](concurrency.md):
+
+```rust,ignore
+pub struct ConcurrentWeightStore<K, V, F> {
+    inner: Arc<RwLock<WeightStore<K, V, F>>>,
+}
+```
+
+The lock is a `parking_lot::RwLock`. `peek` / `contains` / `len` /
+`total_weight` take the read lock; `try_insert` / `remove` / `clear`
+take the write lock; metrics counters live in `AtomicU64` so the
+read-locked paths can still increment them without escalating.
+
+The weight function runs **inside the write lock** on every insert
+and update. A slow `F` therefore stalls every reader and writer in
+the cache — a DoS amplification vector when caching user-supplied
+values. The mitigation is the cheapness contract; the rustdoc on
+`ConcurrentWeightStore::try_insert` says so.
+
+`ConcurrentWeightStore` implements `ConcurrentStoreRead<K, V>` and
+`ConcurrentStore<K, V>`. Unlike the single-threaded variant — which
+deliberately does not implement `StoreCore`/`StoreMut` — the
+concurrent variant *does* fit the concurrent trait family because
+both already use `Arc<V>` returns. The asymmetry is awkward but
+honest: the trait family is shaped around the constraints the
+concurrent path imposes, and the single-threaded variant happens to
+borrow that shape rather than the sequential one.
+
+## Lock-poisoning and total-weight integrity
+
+Under `panic = "abort"` (the crate's release default) lock poisoning
+is moot — the process exits. Under `panic = "unwind"`, the order of
+operations in `clear()` matters:
+
+```rust,ignore
+fn clear(&mut self) {
+    self.total_weight = 0;          // (1) reset first
+    self.entries.clear();           // (2) then drop entries (may panic)
+}
+```
+
+If (2) panics during entry drop, `total_weight = 0` and `len() == 0`
+remain consistent post-panic. Individual values may leak through the
+unwinding drop but the store's accounting cannot be corrupted into
+"says it has 1 GB resident when actually empty" — which would
+silently reject all future inserts. The module documentation calls
+this out so callers who override `panic = "abort"` know what they
+get.
+
+## Failure mode: weight cap, not entry cap
+
+When the weight budget is hit but the entry count is not:
+
+- `try_insert` returns `StoreFull` for any new key whose value would
+  push `total_weight` past `capacity_weight`.
+- `len() < capacity_entries` — the entry budget has headroom that
+  cannot be used.
+- `total_weight` sits at or just below `capacity_weight` (how close
+  depends on insert sizes).
+
+The reverse — entry cap hit, weight cap not — produces `StoreFull`
+on any new insert regardless of weight, including tiny values.
+
+Both are correct. The store does not silently demote either limit;
+the caller's intent is "neither budget shall be exceeded," and the
+store enforces it literally.
+
+## Capacity tuning
+
+The dual limits give callers two knobs:
+
+| Setting | Effect |
+|---|---|
+| `capacity_entries` finite, `capacity_weight = usize::MAX` | Behaves like an entry-count store; weight is observable but unconstrained |
+| `capacity_entries = usize::MAX`, `capacity_weight` finite | Behaves like a pure byte-budget store; entry count is observable but unconstrained |
+| Both finite | Hard dual limit |
+
+The first row is rarely what callers want (use `HashMapStore`
+instead — no per-entry weight slot). The second is a legitimate
+configuration for callers who genuinely want bytes-only accounting
+and accept the per-entry overhead. The third is the design intent.
+
+## Security considerations
+
+The module rustdoc is unusually long on security; the points worth
+naming at the design-doc level:
+
+- **Hasher.** `WeightStore`'s key index uses `FxHashMap`, which is
+  **not** HashDoS-resistant. Callers caching variable-size values
+  keyed by request paths, tenant IDs, or filenames — i.e. exactly
+  the use case `WeightStore` targets — should pre-hash keys with a
+  keyed hash (`siphasher` with a per-process key) or migrate to
+  `HashMapStore`'s `RandomState`-backed default.
+- **Side channel.** `total_weight` is publicly readable. Callers
+  with access to the counter can infer the size of other tenants'
+  cached entries from before/after differentials. Avoid exposing
+  `total_weight` across trust boundaries when caching tenant-keyed
+  variable-size records.
+- **Sensitive values.** Dropped `V`s are not zeroized. Wrap `V` in
+  `zeroize::Zeroizing` (or equivalent) when caching credentials.
+- **Counters.** Metrics use `Relaxed` ordering and wrap on overflow
+  in release builds. Treat them as best-effort observability, not audit-grade.
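+
+The pre-hashing advice in the **Hasher** point can be sketched with
+the standard library alone. This example swaps in
+`std::collections::hash_map::RandomState` (randomly keyed per
+instance) in place of the `siphasher` crate so that it stays
+dependency-free; `KeyedPreHasher` and `prehash` are hypothetical
+names, not cachekit API. A 64-bit pre-hash can collide, so a caller
+using it as the store key would likely keep the original key next
+to the value and compare it on lookup.
+
+```rust
+use std::collections::hash_map::RandomState;
+use std::hash::{BuildHasher, Hash};
+
+/// Hypothetical helper: holds one randomly keyed hasher state for the
+/// lifetime of the process.
+struct KeyedPreHasher {
+    state: RandomState,
+}
+
+impl KeyedPreHasher {
+    fn new() -> Self {
+        // `RandomState::new()` picks random keys, so an attacker cannot
+        // precompute inputs that collide under this instance.
+        Self { state: RandomState::new() }
+    }
+
+    /// Maps an attacker-influenced key (path, tenant ID, filename) to a
+    /// keyed 64-bit digest that can then be fed to a fast, unkeyed map.
+    fn prehash<K: Hash + ?Sized>(&self, key: &K) -> u64 {
+        self.state.hash_one(key)
+    }
+}
+```
+
+The same `KeyedPreHasher` instance must be used for inserts and
+lookups; digests from two different instances are not comparable.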
+
+## Pre-staging GDS/GDSF
+
+GreedyDual-Size (GDS) and its frequency-aware variant GDSF evict by
+**cost ÷ size** rather than recency or frequency alone. Both
+require:
+
+- A per-entry size (`WeightStore` already stores it).
+- A per-entry cost (caller-supplied at insert time).
+- An eviction priority queue ordered by `cost / size + age`.
+
+`WeightStore` provides the size half today. The cost half and the
+priority-queue substrate ([`LazyMinHeap`](../../src/ds/lazy_heap.rs)
+is a natural fit) are the missing pieces. When GDS lands, the
+expected shape is:
+
+```rust,ignore
+pub struct GdsCache<K, V, F> {
+    store: WeightStore<K, V, F>,
+    queue: LazyMinHeap<K, GdsPriority>,
+    aging: AgingCounter,
+}
+```
+
+The trait surface would be `Cache<K, V>` plus a future
+`WeightTracking<K, V>` capability trait (sketched in
+[`trait-hierarchy.md`](trait-hierarchy.md#future-capability-traits)),
+giving generic code the ability to consult `weight(key)` and
+`total_weight()` regardless of which policy is doing the evicting.
+
+The non-trivial design question, when GDS lands, is whether the
+priority queue stores cost / size at insert time (cheap, can become
+stale if the value's "true" cost diverges from insert-time cost) or
+recomputes on demand (more expensive, but always current). The
+current expectation is "store at insert time, document the
+staleness window" — matching the precomputed-weight discipline this
+store already follows.
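+
+The "store at insert time" option can be sketched as follows. This
+is an illustration under assumptions, not the planned `GdsCache`:
+`GdsMeta`, `gds_priority`, and `pick_victim` are hypothetical names,
+and a linear scan stands in for the priority queue. It follows the
+classic GreedyDual-Size rule: an entry's priority is the aging value
+at insert plus `cost / size`, and the aging value advances to the
+evicted entry's priority.
+
+```rust
+/// Hypothetical per-entry metadata; the priority is frozen at insert.
+#[derive(Debug, Clone, Copy)]
+struct GdsMeta {
+    priority: f64,
+}
+
+/// Priority computed once, at insert time: `aging + cost / size`.
+/// A zero size is clamped to 1 to avoid dividing by zero.
+fn gds_priority(aging: f64, cost: f64, size: usize) -> f64 {
+    aging + cost / size.max(1) as f64
+}
+
+/// Returns the index of the lowest-priority entry and the new aging
+/// value (the victim's priority), or `None` if there are no entries.
+fn pick_victim(entries: &[GdsMeta]) -> Option<(usize, f64)> {
+    entries
+        .iter()
+        .enumerate()
+        .min_by(|a, b| a.1.priority.total_cmp(&b.1.priority))
+        .map(|(index, meta)| (index, meta.priority))
+}
+```
+
+Because the priority is never recomputed, an entry whose real cost
+changes after insert keeps its original rank until it is evicted or
+reinserted; that is the staleness window mentioned above.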
+
+## When not to use `WeightStore`
+
+- **Uniform value sizes.** Use `HashMapStore` or `SlabStore`. The
+  weight slot is overhead with no benefit.
+- **Hot-path latency dominates.** The weight function runs on every
+  insert. If `F` is non-trivial, insert latency is `F`-dominated.
+- **You need a policy.** `WeightStore` is a store; policies sit
+  above it. A bare `WeightStore` evicts nothing on its own — it
+  surfaces `StoreFull` and the caller decides what to remove. Use
+  this directly only when the caller knows the eviction strategy
+  better than any built-in policy would.
+
+## See also
+
+- [Design overview](design.md) — §2 (memory layout) and §5
+  (eviction) frame the trade-offs at the principles level
+- [Concurrency](concurrency.md) — `ConcurrentWeightStore` follows
+  the standard wrapper pattern documented there
+- [Cache trait hierarchy](trait-hierarchy.md) — future
+  `WeightTracking` capability trait sketched in
+  "Future capability traits"
+- [Stores](../stores/README.md) and [`weight.md`](../stores/weight.md)
+  — reference docs for the runtime behaviour
+- [Error model](error-model.md) — `StoreFull` semantics
+- [`src/store/weight.rs`](../../src/store/weight.rs) — the canonical
+  implementation
+- [Roadmap: GDS](../policies/roadmap/gds.md) and
+  [GDSF](../policies/roadmap/gdsf.md) — the planned consumers
diff --git a/docs/index.md b/docs/index.md
index 4a893c7..cbc9c14 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,6 +12,18 @@ Key features:
 - [Quickstart](getting-started/quickstart.md) — Install and build your first cache
 - [Integration guide](getting-started/integration.md) — CacheBuilder API, policy selection, thread safety
 - [Design overview](design/design.md) — Architectural decisions and performance principles
+- [Cache trait hierarchy](design/trait-hierarchy.md) — Kernel trait, capability traits, read/mutate split
+- [Concurrency](design/concurrency.md) — `Concurrent*` wrappers, lock discipline, sharded primitives
+- [Builder and runtime dispatch](design/builder-and-dyn-dispatch.md) — `CachePolicy`, `DynCache`, enum dispatch
+- [Weighted eviction](design/weighted-eviction.md) — `WeightStore`, dual limits, GDS/GDSF pre-staging
+- [Metrics](design/metrics.md) — Recorder / snapshot / exporter split, Prometheus integration
+- [Error model](design/error-model.md) — Panic vs `Result` discipline, four error types
+- [Benchmarking design](design/benchmarking.md) — Benchmark layers, policy registry, JSON artifacts
+- [Hashing and key identity](design/hashing.md) — Hasher choices, key interning, shard routing
+- [Sharding](design/sharding.md) — Sharded primitives, routing, capacity semantics
+- [Serialization](design/serialization.md) — `serde` surface and cache-state persistence boundaries
+- [Non-goals](design/non-goals.md) — Explicit boundaries and out-of-scope features
+- [TTL design](design/ttl.md) — Worked example of every principle in one feature
 - [API surface](guides/api-surface.md) — Module map and entrypoints
 
 ## Policies