From afe12b41c7df6b86d4ef3f4e82e76128f713c7ab Mon Sep 17 00:00:00 2001
From: Thomas Korrison <thomas_korrison@hotmail.com>
Date: Wed, 13 May 2026 21:22:14 +0100
Subject: [PATCH 1/3] docs: expand design documentation with new sections on
 concurrency, metrics, error model, and weighted eviction

- Added detailed documentation on concurrency strategies, outlining the design rationale for concurrent cache types and their usage.
- Introduced a metrics section to explain the metrics infrastructure, including recording, snapshotting, and exporting metrics for observability.
- Documented the error model, clarifying the panic vs. `Result` discipline and the handling of different error types.
- Included a comprehensive overview of weighted eviction strategies, detailing the implementation of `WeightStore` and `ConcurrentWeightStore`.

These additions enhance the overall documentation, providing clearer guidance on design principles and usage patterns for developers.
---
 docs/design/builder-and-dyn-dispatch.md | 436 ++++++++++++++++++++
 docs/design/concurrency.md              | 344 ++++++++++++++++
 docs/design/design.md                   | 237 ++++++++---
 docs/design/error-model.md              | 341 ++++++++++++++++
 docs/design/metrics.md                  | 507 ++++++++++++++++++++++++
 docs/design/trait-hierarchy.md          | 415 +++++++++++++++++++
 docs/design/weighted-eviction.md        | 388 ++++++++++++++++++
 docs/index.md                           |   7 +
 8 files changed, 2623 insertions(+), 52 deletions(-)
 create mode 100644 docs/design/builder-and-dyn-dispatch.md
 create mode 100644 docs/design/concurrency.md
 create mode 100644 docs/design/error-model.md
 create mode 100644 docs/design/metrics.md
 create mode 100644 docs/design/trait-hierarchy.md
 create mode 100644 docs/design/weighted-eviction.md
diff --git a/docs/design/builder-and-dyn-dispatch.md b/docs/design/builder-and-dyn-dispatch.md
new file mode 100644
index 0000000..5f222f7
--- /dev/null
+++ b/docs/design/builder-and-dyn-dispatch.md
@@ -0,0 +1,436 @@
+# Builder and Runtime Dispatch
+
+> Status: design rationale for [`CacheBuilder`](../../src/builder.rs),
+> [`CachePolicy`](../../src/builder.rs), and [`DynCache<K, V>`](../../src/builder.rs).
+> Companion to [`design.md`](design.md) §13, [`trait-hierarchy.md`](trait-hierarchy.md),
+> and [`concurrency.md`](concurrency.md).
+
+cachekit ships 17 implemented eviction policies. Most application code
+wants to pick one of them — possibly at runtime, based on configuration
+— without writing 17 monomorphized call sites. This document explains
+why that runtime choice is delivered through an enum dispatcher rather
+than a `Box<dyn Cache>`, what the user-visible cost is, and how to
+extend the surface when a new policy lands.
+
+## The problem
+
+A user with a `policy: String` configuration value wants to write:
+
+```rust,ignore
+let mut cache = build_cache_from_config(config);
+cache.insert(key, value);
+cache.get(&key);
+```
+
+without enumerating the 17 policies at every call site. The cache type
+must therefore be **uniform across policies** — the concrete type the
+caller holds cannot depend on which policy was chosen.
+
+Two Rust mechanisms give a uniform type:
+
+1. **Trait objects** — `Box<dyn Cache<K, V>>`, with dispatch through a
+   vtable per method call.
+2. **Enum dispatch** — a closed sum of every policy, with dispatch
+   through a `match` per method call.
+
+cachekit picks mechanism 2. The rest of this document explains why and
+what it costs.
+
+## Enum dispatch vs `Box<dyn Cache>`
+
+`Cache<K, V>` is deliberately object-safe (see
+[`trait-hierarchy.md`](trait-hierarchy.md#object-safety)) precisely so
+`Box<dyn Cache<K, V>>` *can* be used; cachekit consumers can still take
+that route in their own code. But the **library-provided** runtime
+dispatcher is an enum. The table compares the two on six properties:
+
+| Property | `Box<dyn Cache<K, V>>` | `DynCache<K, V>` (enum) |
+|---|---|---|
+| Dispatch cost per call | Indirect call via vtable | Branch-predicted `match` |
+| Devirtualization | No (opaque) | Yes (compiler sees the arm) |
+| Inlining of policy body | No | Yes when the arm is statically reachable |
+| Heap allocation per cache | One `Box` per cache | None (enum lives inline) |
+| Closed vs open extension | Open (any `impl Cache`) | Closed (`#[non_exhaustive]` enum) |
+| API stability for new policies | Adding a method is a breaking change | Adding a variant is a non-breaking change with `#[non_exhaustive]` |
+
+The dominant terms are dispatch cost and devirtualization. A `match` on
+an enum tag is a compare-and-branch or jump table that predicts well in
+tight loops; the optimizer can often hoist it out entirely when the tag
+is invariant across an inner loop. A vtable call can be devirtualized
+only when the optimizer sees the concrete type, which a boxed trait
+object usually hides, so the policy body stays behind an indirection.
+
+The cost is in extensibility. `Box<dyn Cache>` accepts any
+out-of-tree policy that implements `Cache<K, V>`; `DynCache` does not.
+Users with their own policy implementations still use them directly —
+`MyCache::new(…)` returns a concrete `MyCache<K, V>` and works with any
+code generic over `Cache<K, V>`. The enum is only the **library-provided
+dispatcher**, not a general substrate.
+
+## `CachePolicy` — config-carrying tag
+
+`CachePolicy` ([`src/builder.rs`](../../src/builder.rs)) is the
+user-facing enum that selects a policy. It is **separate** from the
+internal `CacheInner` enum, and it carries per-policy configuration:
+
+```rust,ignore
+#[non_exhaustive]
+#[derive(Debug, Clone, Copy, PartialEq)]
+pub enum CachePolicy {
+    Fifo,
+    Lru,
+    FastLru,
+    LruK { k: usize },
+    Lfu { bucket_hint: Option<usize> },
+    HeapLfu,
+    TwoQ { probation_frac: f64 },
+    S3Fifo { small_ratio: f64, ghost_ratio: f64 },
+    Arc,
+    Lifo,
+    Mfu,
+    Mru,
+    Random,
+    Slru { probationary_frac: f64 },
+    Clock,
+    ClockPro,
+    Nru,
+}
+```
+
+Three design decisions are worth naming:
+
+- **`#[non_exhaustive]`.** Adding a new variant (e.g. when LIRS lands
+  off the roadmap) is a **minor** version bump rather than a major
+  one. Downstream `match` statements over `CachePolicy` must include a
+  `_ =>` arm, which is the standard `non_exhaustive` discipline.
+- **Config carried inline.** `LruK { k }` rather than `LruK` + separate
+  `set_k`. The variant is the place where the parameter is
+  type-checked, and `CachePolicy` stays `Copy` because every payload
+  is `Copy`. This makes `let policy: CachePolicy = config.into();`
+  trivial and lets callers pass `CachePolicy` by value without
+  ceremony.
+- **Tag separated from implementation.** `CachePolicy::Lru` is a
+  user-facing intent; `CacheInner::Lru(LruCore<K, V>)` is the
+  internal storage. Keeping them separate means the internal type can
+  change (e.g. swap `LruCore` for a new implementation) without
+  touching the public enum.
+
+## `DynCache<K, V>` — uniform runtime type
+
+The public dispatcher:
+
+```rust,ignore
+pub struct DynCache<K, V>
+where
+    K: Copy + Eq + Hash + Ord,
+    V: Clone + Debug,
+{
+    inner: CacheInner<K, V>,
+}
+
+enum CacheInner<K, V> /* same bounds */ {
+    #[cfg(feature = "policy-fifo")]    Fifo(FifoCache<K, V>),
+    #[cfg(feature = "policy-lru")]     Lru(LruCore<K, V>),
+    #[cfg(feature = "policy-fast-lru")] FastLru(FastLru<K, V>),
+    #[cfg(feature = "policy-lru-k")]   LruK(LrukCache<K, V>),
+    #[cfg(feature = "policy-lfu")]     Lfu(LfuCache<K, V>),
+    #[cfg(feature = "policy-heap-lfu")] HeapLfu(HeapLfuCache<K, V>),
+    #[cfg(feature = "policy-two-q")]   TwoQ(TwoQCore<K, V>),
+    #[cfg(feature = "policy-s3-fifo")] S3Fifo(S3FifoCache<K, V>),
+    #[cfg(feature = "policy-arc")]     Arc(ArcCore<K, V>),
+    #[cfg(feature = "policy-lifo")]    Lifo(LifoCore<K, V>),
+    #[cfg(feature = "policy-mfu")]     Mfu(MfuCore<K, V>),
+    #[cfg(feature = "policy-mru")]     Mru(MruCore<K, V>),
+    #[cfg(feature = "policy-random")]  Random(RandomCore<K, V>),
+    #[cfg(feature = "policy-slru")]    Slru(SlruCore<K, V>),
+    #[cfg(feature = "policy-clock")]   Clock(ClockCache<K, V>),
+    #[cfg(feature = "policy-clock-pro")] ClockPro(ClockProCache<K, V>),
+    #[cfg(feature = "policy-nru")]     Nru(NruCache<K, V>),
+}
+```
+
+`CacheInner` is **private**. Users only see `DynCache`. Two consequences:
+
+- Internal policy structs (`LruCore`, `S3FifoCache`, …) do not leak
+  into the public type system through the dispatcher. They can be
+  refactored without breaking SemVer.
+- Pattern-matching on the variant from outside the crate is
+  impossible, which forces feature requests through method additions
+  rather than match-arm proliferation in user code.
+
+## Type bounds: heavier than `Cache<K, V>`
+
+`Cache<K, V>` requires only what each individual policy implementation
+needs (typically `K: Eq + Hash`, sometimes `K: Copy`). `DynCache`
+requires the **union** of all policies' bounds:
+
+```rust,ignore
+K: Copy + Eq + Hash + Ord
+V: Clone + Debug
+```
+
+Each bound exists because at least one variant needs it:
+
+- `K: Copy` — many policies rely on cheap key copies in eviction paths.
+- `K: Eq + Hash` — every hashmap-backed lookup.
+- `K: Ord` — `HeapLfuCache` orders keys in a min-heap.
+- `V: Clone` — variants that store `Arc<V>` internally (LRU, LFU,
+  HeapLFU) fall back to `(*arc).clone()` when `Arc::try_unwrap` fails
+  on `insert` / `remove` (see below).
+- `V: Debug` — `DynCache: Debug` delegates to the variant's `Debug`.
+
+This is the **library-provided dispatcher tax**. Users who do not want
+to pay `K: Ord` can call `LruCore::new(…)` directly and bypass
+`DynCache`; the tax only applies when crossing the runtime-dispatch
+boundary. The tax is documented in the `DynCache` doc comment so
+users picking the dispatcher route know what to expect.
+
+If a future policy adds a heavier bound (e.g. `K: Serialize` for a
+persistent-cache policy), it forces every `DynCache` user to satisfy
+that bound. The mitigation, when that happens, is a separate
+dispatcher type (`DynPersistentCache<K, V>`) rather than tightening
+the existing `DynCache` bounds — preserving SemVer for users who
+don't need persistence.
+
+## The `Arc<V>` round-trip
+
+Three policies — `LruCore`, `LfuCache`, `HeapLfuCache` — internally
+store `Arc<V>` rather than `V`. The rationale lives in those modules
+(zero-copy sharing between `peek` and `get`, predictable eviction-time
+move, alignment with the concurrent wrappers' `Arc<V>` returns). At
+the `DynCache` boundary this creates a small impedance:
+
+```rust,ignore
+CacheInner::Lru(lru) => {
+    let arc_value = Arc::new(value);
+    lru.insert(key, arc_value)
+        .map(|arc| Arc::try_unwrap(arc).unwrap_or_else(|arc| (*arc).clone()))
+},
+```
+
+`insert` wraps the value in `Arc` for the policy and tries to unwrap
+the returned `Arc<V>` on the way out. `Arc::try_unwrap` moves the value
+out without copying when the refcount is 1 (the common case for a
+sequential `DynCache`); the closure falls back to `(*arc).clone()` only
+when another reference is still alive, which happens on iteration paths
+where the policy held a secondary reference. The fallback is the reason
+`V: Clone` is required on `DynCache`.
+
+The cost is one `Arc::new` per insert and one branch (`try_unwrap`) per
+return on Arc-storing variants. It does not affect FIFO, LIFO, MFU,
+MRU, 2Q, S3-FIFO, ARC, Clock, Clock-PRO, NRU, Random, SLRU, LRU-K,
+or FastLru, which store `V` directly. Users sensitive to this round
+trip should pick a `V`-storing policy or use the concrete type
+directly.
+
+## Feature gating discipline
+
+Every `CachePolicy` variant, every `CacheInner` variant, every match
+arm in every `DynCache` method, every `CacheBuilder::build` arm, and
+every `validate_policy` arm is gated by `#[cfg(feature = "policy-X")]`.
+The discipline:
+
+- A user building with `default-features = false, features = ["policy-lru"]`
+  gets a `CachePolicy` enum with **one variant** and a `DynCache` whose
+  private `CacheInner` enum has **one variant**. Match exhaustiveness
+  still holds because every arm vanishes with its variant.
+- The internal `match` in each `DynCache` method is **always
+  exhaustive** at the active feature set, because every arm and every
+  variant share the same set of `cfg` predicates.
+- `policy-all` is a convenience feature that turns on every
+  `policy-*` feature at once. The default is a curated subset
+  (`policy-s3-fifo`, `policy-lru`, `policy-fast-lru`, `policy-lru-k`,
+  `policy-clock`) chosen to cover the most-recommended workloads from
+  [`docs/policies/README.md`](../policies/README.md).
+
+The cost is that adding a new policy involves edits in *six*
+synchronized locations (see [Adding a new policy](#adding-a-new-policy)).
+The benefit is that a "policy-lru-only" build is genuinely small —
+none of the other 16 policies appear in the resulting binary.
+
+## Validation: panic vs `Result`
+
+`CacheBuilder::build` panics on invalid configuration:
+
+```rust,ignore
+assert!(self.capacity > 0, "cache capacity must be greater than 0");
+// …
+match policy {
+    CachePolicy::LruK { k } => assert!(*k > 0, "LruK: k must be greater than 0"),
+    CachePolicy::TwoQ { probation_frac } =>
+        check_frac("TwoQ: probation_frac", *probation_frac),
+    // …
+}
+```
+
+This is consistent with cachekit's broader error model
+([`src/error.rs`](../../src/error.rs)): panics for **programming
+errors** (programmer hands the builder a `k = 0`, which has no sensible
+behavior), `Result<_, ConfigError>` reserved for **user-supplied
+configuration** that arrives through deserialization or external
+input.
+
+Callers that need to validate untrusted configuration before calling
+`build` should branch on the `CachePolicy` variant and inspect the
+payload themselves, or use the per-policy fallible constructors
+(`S3FifoCache::try_with_ratios`, future `LrukCache::try_with_k`)
+directly. The builder deliberately does not provide a `try_build` —
+adding one would split the API surface for marginal gain when the
+panic path already catches the bug at the call site.
+
+## `Send + Sync` is conditional
+
+`DynCache<K, V>: Send + Sync` is **not** unconditional. The
+`FastLru` policy uses `NonNull<Node>` for single-threaded performance
+and is therefore `!Send + !Sync`. The compile-time check in
+[`src/builder.rs`](../../src/builder.rs) encodes the `Send` half of this:
+
+```rust,ignore
+#[cfg(all(feature = "policy-lru", not(feature = "policy-fast-lru")))]
+const _: () = {
+    fn assert_send<T: Send>() {}
+    fn check() { assert_send::<DynCache<u64, String>>(); }
+};
+```
+
+In words: `DynCache<K, V>` is `Send + Sync` whenever no
+`!Send`-or-`!Sync` policy variant is enabled. With the default feature
+set (which includes `policy-fast-lru`), `DynCache` is **not**
+`Send + Sync`. Users who want a sendable `DynCache` should disable
+`policy-fast-lru` and use `policy-lru` for the LRU path.
+
+This is a known sharp edge. The alternative — making `FastLru: Send`
+via an unsafe impl — would invalidate `FastLru`'s entire design
+premise (raw-pointer recency list without atomics). The current
+trade prioritises `FastLru`'s single-threaded speed over `DynCache`'s
+universal sendability, on the grounds that callers wanting concurrent
+access should use a `Concurrent*` wrapper directly (see
+[`concurrency.md`](concurrency.md)), not `DynCache`.
+
+## Maintenance cost
+
+The dispatcher's runtime cost is small. The **maintenance** cost is
+real:
+
+- **17 inner variants** × **~10 `DynCache` methods** = **~170 match
+  arms** that must stay in sync.
+- A `Debug` impl, a `default()` (where applicable), and a
+  `validate_policy` arm per variant.
+- A `Cargo.toml` feature flag per variant.
+- A documentation entry per variant in `docs/policies/`.
+
+The mitigations in place:
+
+1. **A single regression test** (`test_all_policies_basic_ops` in
+   [`src/builder.rs`](../../src/builder.rs)) loops over every enabled
+   policy and exercises `insert` / `get` / `contains` / `len` /
+   update / `clear`. Adding a variant immediately surfaces if any arm
+   was missed.
+2. **Compile-time exhaustiveness** in the inner `match`. Forgetting an
+   arm is a build error, not a runtime bug.
+3. **`#[non_exhaustive]` on `CachePolicy`** keeps downstream code
+   from depending on the full set of variants.
+
+Even with those, the line count of `src/builder.rs` (~1300) is
+disproportionate to its semantic content. A `macro_rules!` to generate
+per-method dispatchers has been considered and rejected — the
+explicit `match` is grep-friendly, readable in source review, and
+each arm sometimes diverges from the boilerplate (the `Arc<V>` round
+trip is the visible case; future TTL integration is another). Macros
+would compress the file but obscure the points where the dispatcher
+intervenes.
+
+## Adding a new policy
+
+Checklist for landing a new policy, ordered to minimise compile-time
+churn:
+
+1. Implement the policy core: `MyPolicyCache<K, V>` with a `Cache<K, V>`
+   impl. Add `MyPolicyCache::new(capacity: usize)` and any config
+   constructors. Land this with its own tests.
+2. Add a `policy-my-policy` feature in [`Cargo.toml`](../../Cargo.toml).
+   Add it to `policy-all`. Decide whether it joins `default = […]`.
+3. Add the `CachePolicy::MyPolicy { … }` variant, gated by the new
+   feature. Include any config fields as inline payload.
+4. Add the `CacheInner::MyPolicy(MyPolicyCache<K, V>)` variant under
+   the same `cfg`.
+5. Add a match arm in every `DynCache` method (`insert`, `get`, `peek`,
+   `contains`, `len`, `capacity`, `remove`, `clear`, `Debug` impl).
+6. Add a `CachePolicy::MyPolicy { … } => CacheInner::MyPolicy(…)` arm
+   in `CacheBuilder::build`. Add validation in `validate_policy` if
+   the variant has constraints (frac in 0..=1, non-zero K, etc.).
+7. Add the variant to `all_enabled_policies()` in the test module so
+   the regression sweep covers it.
+8. Document the policy in `docs/policies/my-policy.md`; if it's a
+   roadmap policy graduating to implementation, move the doc from
+   `docs/policies/roadmap/` per the rule in
+   [`docs/policies/roadmap/README.md`](../policies/roadmap/README.md).
+9. Update [`docs/policies/README.md`](../policies/README.md) and
+   [`docs/guides/choosing-a-policy.md`](../guides/choosing-a-policy.md).
+
+The work is mechanical. A CR template that lists these nine steps as
+checkboxes would reduce the chance of missed updates further.
+
+## Future: `DynExpiringCache<K, V>`
+
+When the `ttl` feature lands ([`ttl.md`](ttl.md) §4(c)), TTL **does
+not** modify `DynCache`. Instead, `with_default_ttl` on the builder
+returns a sibling type:
+
+```rust,ignore
+let mut cache = CacheBuilder::new(1024)
+    .with_default_ttl(Duration::from_secs(60))
+    .build::<u64, String>(CachePolicy::Lru);
+// `cache: DynExpiringCache<u64, String>`, not DynCache.
+```
+
+`DynExpiringCache<K, V>` mirrors `DynCache`'s match-arm boilerplate
+one level out: each method threads the expiry check through the
+inner policy's `Cache` call. The key design choice — argued in detail
+in [`ttl.md`](ttl.md) §1, §4(c) — is that `DynExpiringCache` is a
+**distinct type**, not `impl Cache for DynCache` plus a wrapper.
+Distinctness makes `Expiring<Expiring<DynCache>>` structurally
+unrepresentable, which prevents the "two clocks, two indexes"
+double-wrapping bug at the type level.
+
+The duplication is real: a parallel ~170 arms for the expiring
+variant. It is bounded (one type per cross-cutting capability) and
+the trade favours type-level safety over deduplication.
+
+## When not to use `DynCache`
+
+`DynCache` is the right tool when:
+
+- The policy is chosen at runtime from configuration.
+- The caller wants a single concrete type that can hold any policy.
+- The dispatch cost is amortised over enough work that the `match`
+  doesn't dominate.
+
+It is the wrong tool when:
+
+- The policy is known at compile time. Use the concrete type
+  (`LruCache::new(…)`, `S3FifoCache::new(…)`) and let monomorphization
+  do its work.
+- The hottest inner loop is `get`-bound and devirtualization matters
+  beyond what enum dispatch provides. Concrete types still win for
+  raw throughput on benchmarks (see
+  [`benches/comparison.rs`](../../benches/comparison.rs)).
+- The caller needs `Send + Sync` and the build includes
+  `policy-fast-lru`. See [`Send + Sync`](#send--sync-is-conditional)
+  above; use the relevant `Concurrent*` wrapper instead.
+- A user wants to plug in their own policy. `DynCache` is closed;
+  generic code over `Cache<K, V>` is open.
+
+## See also
+
+- [Design overview](design.md) — §13 frames compile-time and runtime
+  composition at the principles level
+- [Cache trait hierarchy](trait-hierarchy.md) — kernel trait and
+  capability traits
+- [Concurrency](concurrency.md) — `Send + Sync` interaction, why
+  `Concurrent*` is a separate path
+- [TTL design](ttl.md) — `DynExpiringCache` as a worked extension of
+  the dispatcher pattern
+- [Error model](../../src/error.rs) — `ConfigError` vs panic discipline
+- [`src/builder.rs`](../../src/builder.rs) — the canonical
+  implementation
diff --git a/docs/design/concurrency.md b/docs/design/concurrency.md
new file mode 100644
index 0000000..920b7ad
--- /dev/null
+++ b/docs/design/concurrency.md
@@ -0,0 +1,344 @@
+# Concurrency
+
+> Status: design rationale for the concurrent surface that ships today
+> behind the `concurrency` feature flag. Companion to the cross-cutting
+> principles in [`docs/design/design.md`](design.md) §3 and the trait
+> rationale in [`docs/design/trait-hierarchy.md`](trait-hierarchy.md).
+
+cachekit's default surface is single-threaded. Concurrency is opt-in,
+delivered through a parallel set of types and traits gated by the
+`concurrency` Cargo feature. This document explains why the concurrent
+surface looks the way it does, what invariants the wrappers promise,
+and where the gaps are.
+
+## Non-goals
+
+- **`no_std`.** Concurrency relies on `parking_lot`, `std::sync::Arc`,
+  and `std::sync::atomic`. No `loom`/`no_std` support is planned.
+- **Lock-free policies.** Mostly-lock-free or strictly lock-free
+  policies are out of scope today; see [Future directions](#future-directions).
+- **Async-native traits.** `AsyncCacheFuture` is a Phase 2 placeholder
+  ([`src/traits.rs`](../../src/traits.rs)); no policy implements it
+  meaningfully yet.
+
+## The dominant pattern: sequential core, concurrent wrapper
+
+Every concurrent type in cachekit follows the same shape:
+
+```text
+ConcurrentX<K, V> { inner: Arc<RwLock<X<K, V>>> }
+```
+
+where `X` is the single-threaded core (`LruCore`, `FifoCache`,
+`S3FifoCache`, `SlotArena`, `IntrusiveList`, `ClockRing`,
+`HashMapStore`, `SlabStore`, `WeightStore`, `HandleStore`,
+`FrequencyBuckets`). The wrapper:
+
+1. holds the core behind an `Arc<RwLock<…>>`,
+2. presents owned/`Arc<V>` returns instead of borrowed `&V`,
+3. is `Clone` via `Arc::clone` so callers can hand copies to threads,
+4. is `Send + Sync` because the inner core's `Send + Sync` impls
+   auto-derive through `Arc<RwLock<…>>`.
+
+The pattern is verbose but consistent and was chosen for three reasons:
+
+- **No `&mut self` in the public API.** Sharing requires interior
+  mutability; an `RwLock` is the cheapest tool that exposes both shared
+  reads and exclusive writes through `&self`.
+- **The sequential core stays unaware of locking.** Policy code under
+  [`src/policy/`](../../src/policy) is single-threaded and easier to
+  reason about. The locking discipline lives in one place per type.
+- **The lock is replaceable.** Swapping `parking_lot::RwLock` for a
+  different primitive (`std::sync::RwLock`, a sharded lock, a seqlock)
+  is a local change because the inner core has no opinion on it.
+
+The 13 concurrent and sharded wrappers shipped today live in 11 modules:
+
+- `ConcurrentLruCache` — [`src/policy/lru.rs`](../../src/policy/lru.rs)
+- `ConcurrentFifoCache` — [`src/policy/fifo.rs`](../../src/policy/fifo.rs)
+- `ConcurrentS3FifoCache` — [`src/policy/s3_fifo.rs`](../../src/policy/s3_fifo.rs)
+- `ConcurrentHashMapStore`, `ShardedHashMapStore` — [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- `ConcurrentSlabStore` — [`src/store/slab.rs`](../../src/store/slab.rs)
+- `ConcurrentWeightStore` — [`src/store/weight.rs`](../../src/store/weight.rs)
+- `ConcurrentHandleStore` — [`src/store/handle.rs`](../../src/store/handle.rs)
+- `ConcurrentSlotArena`, `ShardedSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- `ConcurrentIntrusiveList` — [`src/ds/intrusive_list.rs`](../../src/ds/intrusive_list.rs)
+- `ConcurrentClockRing` — [`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs)
+- `ShardedFrequencyBuckets` — [`src/ds/frequency_buckets.rs`](../../src/ds/frequency_buckets.rs)
+
+## Why `Concurrent*` does not implement `Cache<K, V>`
+
+`Cache<K, V>` is the sequential trait. Its method signatures encode
+sequential ownership:
+
+```rust,ignore
+fn peek(&self, key: &K) -> Option<&V>;
+fn get(&mut self, key: &K) -> Option<&V>;
+fn insert(&mut self, key: K, value: V) -> Option<V>;
+```
+
+All three are unimplementable on `Arc<RwLock<…>>`, for two reasons:
+
+- **`peek` and `get` return `&V`.** A borrowed reference cannot
+  outlive the `RwLockReadGuard`/`RwLockWriteGuard` it was extracted
+  from. There is no safe lifetime that ties `&V` to `&self` rather
+  than to the (anonymous) guard. Returning `&V` would force the
+  caller to hold the lock across the borrow, which blocks writers (and,
+  for `get`, every other reader) for as long as the reference lives.
+- **`get` and `insert` take `&mut self`.** With shared ownership through
+  `Arc<RwLock<…>>` the wrapper only ever holds `&self`. Forcing
+  `&mut self` would require `Arc::make_mut` or external locking,
+  defeating the point of the inner lock.
+
+The concurrent wrappers therefore expose their own concrete API:
+
+```rust,ignore
+pub fn get(&self, key: &K) -> Option<Arc<V>>;
+pub fn peek(&self, key: &K) -> Option<Arc<V>>;
+pub fn insert(&self, key: K, value: V) -> Option<Arc<V>>;
+pub fn insert_arc(&self, key: K, value: Arc<V>) -> Option<Arc<V>>;
+pub fn remove(&self, key: &K) -> Option<Arc<V>>;
+```
+
+Returning `Arc<V>` is the contract. It costs one atomic refcount bump
+on hit, which is cheap relative to the lock acquisition itself, and it
+lets callers hold the value past lock release, send it across threads,
+or stash it in another structure without lifetime gymnastics.
+
+For uniformity across the store layer there is a parallel trait family
+that **does** model the `&self` + `Arc<V>` shape:
+
+| Sequential ([`src/store/traits.rs`](../../src/store/traits.rs)) | Concurrent ([`src/store/traits.rs`](../../src/store/traits.rs)) |
+|---|---|
+| `StoreRead` (`&mut self`, `&V`) | `ConcurrentStoreRead` (`&self`, `Arc<V>`) |
+| `StoreMut` (`&mut self`) | `ConcurrentStore` (`&self`) |
+| `StoreFactory` | `ConcurrentStoreFactory` |
+
+The policy layer does not yet have a counterpart family — see
+[Future directions](#future-directions).
+
+## Lock primitive choice
+
+Every concurrent wrapper uses **`parking_lot::RwLock`**. Two things
+drove this:
+
+- **Reader / writer split matches the access pattern.** `peek` /
+  `contains` / `len` only need shared access. `get` (which mutates
+  recency or frequency state) and `insert` / `remove` need exclusive
+  access. `Mutex` would serialize all of these.
+- **Fairness and uncontended speed.** `parking_lot::RwLock` is small
+  (one `AtomicUsize` on 64-bit), uncontended-fast, and tunable via
+  fairness traits. The `RwLock<HashMap<K, Arc<V>>>` and
+  `RwLock<SlotArena<T>>` shapes throughout the codebase rely on this.
+
+`Mutex` is intentionally absent from the wrappers. The few `Mutex`
+references in the source tree are in doctests and rustdoc prose
+describing how a user would wrap a non-concurrent cache themselves —
+they are not on any hot path.
+
+The `parking_lot` choice is **not** absolute. On Rust 1.85+ the
+futex-based `std::sync::Mutex` is competitive for the uncontended
+single-writer case on Linux/macOS, and revisiting this is reasonable
+if `parking_lot` ever becomes a build burden. The `RwLock` advantage
+is more durable: `std::sync::RwLock` still has writer-starvation
+hazards on some platforms that `parking_lot` avoids by default.
+
+## The `get` / `peek` lock-level asymmetry
+
+`peek` and `get` both look up by key, but they differ in what they
+mutate:
+
+- **`peek`** is side-effect-free. The wrapper takes a **read lock**
+  and clones the `Arc<V>`. Multiple readers proceed in parallel.
+- **`get`** updates policy state (LRU recency, LFU frequency, Clock
+  reference bit, …). The wrapper takes a **write lock**. Only one
+  thread proceeds.
+
+This asymmetry is the single most important reason `peek` and `get`
+are distinct methods at all (see
+[`trait-hierarchy.md`](trait-hierarchy.md) for the rationale at the
+trait level). Without `peek`, every read would serialize through the
+write lock. With `peek`, read-heavy workloads (buffer pools, immutable
+metadata caches) can proceed in parallel under the shared read lock.
+
+The cost is that callers must choose, and choosing `get` on a
+read-heavy workload silently kills scalability. The rustdoc on each
+wrapper's `peek` and `get` says so explicitly; benchmarks under
+[`benches/`](../../benches) compare the two.
+
+## Atomic check-and-act
+
+Compound operations must stay inside a single lock acquisition. The
+rule is **check, decide, mutate, release** — all under the same write
+lock. Splitting the steps across two acquisitions allows a concurrent
+writer to invalidate the decision between them.
+
+The pattern shows up in three places worth naming:
+
+- **Insert-on-full.** Capacity check + eviction + insert must be one
+  critical section. `WeightStore::try_insert` and the policy `insert`
+  methods both follow this.
+- **Replace-and-return.** `insert` returns the previous value if one
+  existed. The "did this key exist?" check and the replace must
+  happen under the same write lock; otherwise two concurrent inserts
+  can both observe "key absent" and both return `None`.
+- **Future: expiry + remove.** TTL (see
+  [`docs/design/ttl.md`](ttl.md) §4(e)) requires the expiry check and
+  the removal to be one atomic operation under a write lock. A
+  read-locked fast path that observes `expires_at <= now` and
+  escalates to a write lock is safe **only** if the write-locked
+  path re-checks the deadline before acting, because a concurrent
+  `set_ttl` may have renewed the entry in between.
+
+The atomicity rule is a wrapper-level discipline, not a trait-level
+one. The single-threaded core can't enforce it because it doesn't
+know about locks.
+
+## Cloning the wrapper
+
+Every `Concurrent*` type implements `Clone` via `Arc::clone(&self.inner)`.
+Cloning the wrapper is cheap (one atomic increment) and produces a
+second handle to the **same** underlying cache. This is the intended
+way to share a cache across threads:
+
+```rust,ignore
+let cache = ConcurrentLruCache::<u64, String>::new(1_000);
+let cache2 = cache.clone();
+std::thread::spawn(move || {
+    cache2.insert(1, "hello".into());
+});
+cache.get(&1);
+```
+
+There is no separate `Arc<ConcurrentLruCache<…>>` wrapping needed;
+the inner `Arc` is the sharing primitive. Callers who want
+`Arc<dyn ConcurrentCache>` for type erasure are still free to wrap, but
+in practice the concrete clone is what's used in the codebase.
+
+## `ConcurrentCache`: marker trait, not capability trait
+
+`ConcurrentCache` lives in [`src/traits.rs`](../../src/traits.rs) and
+is declared `unsafe trait ConcurrentCache: Send + Sync {}`. It has
+no methods. Its job is to **promise**, at the type-system level,
+that "this type is safe to share across threads in the cache sense":
+its cache operations synchronize internally. Those operations are
+inherent methods, because concrete `Concurrent*` types do not
+implement `Cache<K, V>`.
+
+The `unsafe` is load-bearing. Implementing `ConcurrentCache`
+incorrectly cannot be caught by the type system; it's an
+implementer-side soundness claim, which is why only the wrappers
+implement it (`ConcurrentFifoCache`, `ConcurrentS3FifoCache` today;
+`ConcurrentLruCache` is a candidate but does not yet have the impl).
+
+Users writing generic code that requires a thread-safe cache should
+bound on `ConcurrentCache`; its `Send + Sync` supertraits come along, so
+spelling them out is optional. They should **not** bound on
+`Cache<K, V> + Send + Sync` and expect that to suffice: that bound is
+satisfied by single-threaded caches that rely on external locking.
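+
+A minimal sketch of how such a bound reads in generic code. The trait
+below is a local stand-in with the same declaration shape as the marker;
+`TinyCache` and `run_on_threads` are hypothetical names invented for
+this example.
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, Mutex};
+use std::thread;
+
+/// Stand-in with the same shape as the marker described above.
+///
+/// # Safety
+/// Implementers promise that all operations synchronize internally.
+pub unsafe trait ConcurrentCache: Send + Sync {}
+
+/// Hypothetical cloneable handle; every method takes the internal lock.
+#[derive(Clone, Default)]
+pub struct TinyCache {
+    inner: Arc<Mutex<HashMap<u64, String>>>,
+}
+
+impl TinyCache {
+    pub fn insert(&self, key: u64, value: String) {
+        self.inner.lock().unwrap().insert(key, value);
+    }
+
+    pub fn len(&self) -> usize {
+        self.inner.lock().unwrap().len()
+    }
+}
+
+// SAFETY: all access to the map goes through `inner`'s mutex.
+unsafe impl ConcurrentCache for TinyCache {}
+
+/// Runs `work` once per thread against clones of the same handle.
+/// `Send + Sync` arrive via the supertraits; `Clone + 'static` is what
+/// `thread::spawn` additionally needs here.
+pub fn run_on_threads<C>(cache: &C, threads: u64, work: fn(&C, u64))
+where
+    C: ConcurrentCache + Clone + 'static,
+{
+    let handles: Vec<_> = (0..threads)
+        .map(|i| {
+            let handle = cache.clone();
+            thread::spawn(move || work(&handle, i))
+        })
+        .collect();
+    for h in handles {
+        h.join().unwrap();
+    }
+}
+```
+
+A plain single-threaded cache type would not compile as `C` here
+unless someone wrote an `unsafe impl` for it, which is exactly the
+explicit soundness claim the marker exists to force.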
+
+## Sharded primitives
+
+For data structures where a single `RwLock` becomes the bottleneck,
+cachekit ships sharded variants:
+
+- **`ShardedHashMapStore<K, V, S>`** — N independent shards, each its
+  own `RwLock<HashMap<…>>`. Shard selected by hashing the key with
+  the store's `BuildHasher`.
+- **`ShardedSlotArena<T>`** — N independent arenas with sharded
+  `SlotId`s. Same shape, applied to slab-style storage.
+- **`ShardedFrequencyBuckets<K>`** — N independent frequency-bucket
+  shards for LFU-family policies that want concurrent frequency
+  updates.
+
+Sharding lives at the **data-structure** layer, not the policy layer,
+because the shard count, hash function, and shard-aware key type
+(`ShardedSlotId`) all need to be visible to the policy that uses the
+primitive. A `ShardedLruCache` does not yet exist as a single type;
+it would be built by composing a `ShardedHashMapStore` with sharded
+recency lists, and that composition is roadmap.
+
+When sharding is **not** what you want:
+
+- A single concurrent wrapper is simpler and faster for caches that
+  fit on one or two cores' worth of contention.
+- Sharding fragments the working set across shards. A 1 M-entry
+  cache split 16 ways is 16 caches of ~62.5 K entries each, and spare
+  room in one shard cannot rescue hot items evicted from another.
+- Per-shard eviction is correct for capacity bookkeeping (each shard
+  tracks its own capacity) but **not** globally optimal: at equal total
+  capacity, a single-shard LRU generally matches or beats a sharded one.
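+
+Shard selection by key hash, as described for the sharded store, can be
+sketched like this. `ShardedMap` is a hypothetical simplification, not
+the actual `ShardedHashMapStore`; it uses `std::sync::RwLock` and the
+standard `RandomState` hasher to stay dependency-free.
+
+```rust
+use std::collections::hash_map::RandomState;
+use std::collections::HashMap;
+use std::hash::{BuildHasher, Hash};
+use std::sync::RwLock;
+
+/// Hypothetical sharded map: N independent `RwLock<HashMap>` shards.
+pub struct ShardedMap<K, V, S = RandomState> {
+    shards: Vec<RwLock<HashMap<K, V>>>,
+    hasher: S,
+}
+
+impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
+    pub fn new(shard_count: usize) -> Self {
+        assert!(shard_count > 0, "shard_count must be greater than 0");
+        Self {
+            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
+            hasher: RandomState::new(),
+        }
+    }
+}
+
+impl<K: Hash + Eq, V: Clone, S: BuildHasher> ShardedMap<K, V, S> {
+    /// The same key always maps to the same shard for a given hasher.
+    pub fn shard_index(&self, key: &K) -> usize {
+        (self.hasher.hash_one(key) % self.shards.len() as u64) as usize
+    }
+
+    /// Takes exactly one shard lock per operation.
+    pub fn insert(&self, key: K, value: V) -> Option<V> {
+        let idx = self.shard_index(&key);
+        self.shards[idx].write().unwrap().insert(key, value)
+    }
+
+    pub fn get(&self, key: &K) -> Option<V> {
+        let idx = self.shard_index(key);
+        self.shards[idx].read().unwrap().get(key).cloned()
+    }
+
+    /// Per-shard entry counts; useful for spotting skew.
+    pub fn shard_lens(&self) -> Vec<usize> {
+        self.shards.iter().map(|s| s.read().unwrap().len()).collect()
+    }
+}
+```
+
+Inspecting `shard_lens()` after a bulk insert shows the fragmentation
+point directly: each shard holds only its slice of the keys.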
+
+## Concurrent policy coverage
+
+Of the 17 implemented policies, **3 ship with a `Concurrent*` wrapper
+today**: LRU, FIFO, S3-FIFO. The remaining 14 require external locking
+by the caller — typically `Arc<parking_lot::RwLock<CacheCore>>`. The
+relevant rustdoc on those policies (e.g. `LfuCache`, `HeapLfuCache`,
+`MfuCache`) calls this out.
+
+This is a coverage gap rather than a design choice. The pattern is
+mechanical: wrap the sequential core in `Arc<RwLock<…>>`, expose the
+`&self` API with `Arc<V>` returns, decide read-lock vs. write-lock per
+method, implement `Clone` via `Arc::clone`, and implement
+`unsafe impl ConcurrentCache`. The work is bounded; what's missing is
+the discipline to do it consistently across all 17 policies.
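+
+The mechanical pattern can be sketched end to end on a toy policy.
+`FifoCore` and `ConcurrentFifo` are hypothetical types written for this
+example (the shipped wrappers use `parking_lot::RwLock`; this sketch
+uses `std::sync::RwLock` and omits the `unsafe impl ConcurrentCache`
+step).
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::hash::Hash;
+use std::sync::{Arc, RwLock};
+
+/// Hypothetical single-threaded FIFO core; stands in for any sequential policy.
+pub struct FifoCore<K, V> {
+    map: HashMap<K, Arc<V>>,
+    order: VecDeque<K>,
+    capacity: usize,
+}
+
+impl<K: Hash + Eq + Clone, V> FifoCore<K, V> {
+    pub fn new(capacity: usize) -> Self {
+        assert!(capacity > 0, "capacity must be greater than 0");
+        Self { map: HashMap::new(), order: VecDeque::new(), capacity }
+    }
+
+    pub fn insert(&mut self, key: K, value: V) {
+        // Only a brand-new key extends the queue and can trigger eviction.
+        if self.map.insert(key.clone(), Arc::new(value)).is_none() {
+            self.order.push_back(key);
+            if self.order.len() > self.capacity {
+                if let Some(oldest) = self.order.pop_front() {
+                    self.map.remove(&oldest);
+                }
+            }
+        }
+    }
+
+    pub fn get(&self, key: &K) -> Option<Arc<V>> {
+        self.map.get(key).cloned()
+    }
+}
+
+/// Wrapper: `&self` API, `Arc<V>` returns, one lock around the core.
+pub struct ConcurrentFifo<K, V> {
+    inner: Arc<RwLock<FifoCore<K, V>>>,
+}
+
+impl<K, V> Clone for ConcurrentFifo<K, V> {
+    fn clone(&self) -> Self {
+        Self { inner: Arc::clone(&self.inner) }
+    }
+}
+
+impl<K: Hash + Eq + Clone, V> ConcurrentFifo<K, V> {
+    pub fn new(capacity: usize) -> Self {
+        Self { inner: Arc::new(RwLock::new(FifoCore::new(capacity))) }
+    }
+
+    /// Mutates the core, so it takes the write lock.
+    pub fn insert(&self, key: K, value: V) {
+        self.inner.write().unwrap().insert(key, value);
+    }
+
+    /// FIFO lookups do not reorder anything, so the read lock suffices.
+    pub fn get(&self, key: &K) -> Option<Arc<V>> {
+        self.inner.read().unwrap().get(key)
+    }
+}
+```
+
+The read-vs-write decision is the only policy-specific step: a policy
+whose lookup updates recency or frequency state would need the write
+lock in `get` as well.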
+
+## Failure modes
+
+Three failure modes are worth naming:
+
+- **Poisoning.** `parking_lot` does **not** poison locks on panic.
+  A panic inside a critical section unwinds, releases the lock, and
+  leaves the inner core in whatever state the panic interrupted.
+  The single-threaded cores are designed to be panic-safe for
+  `Cache::insert` / `get` / `remove` — invariants are restored
+  before any potentially-panicking operation (allocation, user
+  hashing). This is a property of each core, not of the wrapper.
+- **Deadlock.** cachekit never holds two locks at once in the
+  current code. Sharded primitives acquire exactly one shard lock
+  per operation. Any future work that composes locks (e.g. a sharded
+  LRU that touches a shared recency list) must document its locking
+  order.
+- **Starvation.** `parking_lot::RwLock` is task-fair: new readers queue
+  behind a waiting writer, so readers cannot starve writers. Workloads
+  dominated by a `get` that updates recency still serialize through the
+  write lock; that is the underlying constraint, not a fairness bug.
+
+## Future directions
+
+Tracked roughly in priority order:
+
+1. **Coverage parity.** `Concurrent*` wrappers for the remaining 14
+   policies (LFU, Heap-LFU, MFU, LRU-K, 2Q, ARC, CAR, Clock,
+   Clock-PRO, NRU, SLRU, MRU, LIFO, Random). Mechanical work; the
+   pattern is fixed.
+2. **`ConcurrentExpiring<C>`.** TTL's concurrent wrapper, per
+   [`docs/design/ttl.md`](ttl.md) §4(e). Distinct from `Concurrent*`
+   policies because the expiry-check + remove must be atomic across
+   *both* the inner cache and the expiration index.
+3. **Sharded `Cache` wrappers.** A generic `Sharded<C: Cache<K, V>>`
+   that hashes keys to N independent inner caches. The design
+   question is how to model capacity: per-shard capacity (simple,
+   imperfect global behaviour) vs. global capacity with cross-shard
+   victim selection (correct, requires inter-shard locking).
+4. **Lock-free reads.** `peek` and `contains` paths that avoid the
+   `RwLock` entirely — `arc-swap` or seqlock-style techniques —
+   for caches whose recency state can tolerate eventual consistency.
+   Out of scope until benchmarks show the read lock is the bottleneck.
+5. **Loom testing.** Once concurrent coverage stabilises, model-check
+   the wrapper invariants under `loom`. Particularly valuable for the
+   atomic check-and-act sequences in TTL and sharded composition.
+
+## See also
+
+- [Design overview](design.md) — §3 frames concurrency at the
+  principles level
+- [TTL design](ttl.md) — applied case for `ConcurrentExpiring<C>`
+- [Cache trait hierarchy](trait-hierarchy.md) — read/mutate split and
+  object-safety rationale
+- [Stores](../stores/README.md) — `ConcurrentStoreRead` /
+  `ConcurrentStore` trait family
+- [`src/store/traits.rs`](../../src/store/traits.rs) — concurrent
+  store traits
+- [`src/traits.rs`](../../src/traits.rs) — `ConcurrentCache` marker
diff --git a/docs/design/design.md b/docs/design/design.md
index 92f84d6..7f388bc 100644
--- a/docs/design/design.md
+++ b/docs/design/design.md
@@ -1,14 +1,23 @@
-Designing high-performance caches in Rust is a multi-disciplinary problem: data structures, memory layout, concurrency, workload modeling, and systems-level performance all matter. The points below reflect what moves the needle in practice across systems, services, and libraries.
+# Design Overview
 
-For interface and API decisions, the [Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html) is a useful companion for consistent, ergonomic design.
+This document collects the design principles that shape `cachekit`. Each
+section pairs a principle with the concrete artifact in the source tree
+that realizes it, so the prose stays grounded in the code rather than
+floating as advice.
+
+For a worked example that applies every principle below to one feature,
+see the [TTL design doc](ttl.md). For interface conventions, the
+[Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html)
+is the companion reference; module-level documentation follows the
+[doc style guide](style-guide.md).
 
 ## 1. Workload First, Policy Second
 
 Cache policy only matters relative to workload.
 
 Identify access patterns:
-- Hotset-heavy traffic: skewed keys, high churn.
-- Scan-heavy traffic: large working sets, weak locality.
+- Hot-set traffic: skewed keys, low churn on the hot set, high churn at the tail.
+- Scan-heavy traffic: large working sets, weak temporal locality.
 - Mixed traffic: bursts of hot data over large cold sets.
 
 Measure:
@@ -17,29 +26,43 @@ Measure:
 - Temporal vs spatial locality.
 
 Choose policies accordingly:
-- LRU: good for temporal locality, bad for scans.
-- LRU-K / 2Q (roadmap): better at filtering one-off accesses.
-- Clock / ARC (roadmap): lower overhead, more adaptive.
+- `LRU` / `Clock`: good for temporal locality, vulnerable to scans.
+- `LRU-K` / `2Q` / `SLRU`: better at filtering one-off accesses.
+- `ARC` / `CAR`: adaptive recency/frequency balance without manual tuning.
+- `S3-FIFO` / `Heap-LFU`: strong general-purpose defaults under scans.
+
+All of the above ship today; see [`docs/policies/`](../policies/README.md)
+for the implemented catalog and [`docs/policies/roadmap/`](../policies/roadmap/README.md)
+for planned policies (LIRS, TinyLFU, SIEVE, GDS/GDSF, etc.).
 
-Never design a "general purpose" cache first; design for the workload you expect.
+When picking a policy or tuning a cache, design for the workload you
+expect — not the average of all workloads.
 
 ## 2. Memory Layout Matters More Than Algorithms
 
 In a cache, memory layout often dominates policy.
 
 Prefer:
-- Contiguous storage (Vec, slabs, arenas).
+- Contiguous storage (`Vec`, slabs, arenas).
 - Index-based indirection over pointer chasing.
 
 Avoid:
-- Excessive Box, Arc, linked lists.
-- HashMap lookups in hot paths if avoidable.
+- Excessive `Box`, `Arc`, linked lists with heap-allocated nodes.
+- `HashMap` lookups in hot paths if avoidable.
 
 Techniques:
 - Store metadata (recency, freq, flags) in tightly packed structs.
 - Separate hot metadata from cold payloads.
 - Use slab allocators for fixed-size entries.
 
+cachekit realizes this through reusable building blocks under
+[`src/ds/`](../../src/ds): [`SlotArena`](../../src/ds/slot_arena.rs)
+hands out stable `Handle`s backed by a `Vec`, [`IntrusiveList`](../../src/ds/intrusive_list.rs)
+threads recency lists through those slots without per-node allocation,
+and [`ClockRing`](../../src/ds/clock_ring.rs) keeps Clock-style state in
+a single contiguous array. See [`docs/policy-ds/`](../policy-ds/README.md)
+for the full primitive catalog.
+
 Cache misses caused by your own data structure are as bad as upstream misses.
 
 ## 3. Concurrency Strategy Is Core Design, Not a Wrapper
@@ -48,35 +71,54 @@ Locking strategy shapes everything.
 
 Options:
 - Global lock: simple, often fast enough for small cores, dies under high contention.
-- Sharded caches: hash key -> shard, each shard independently locked.
+- Sharded caches: hash key → shard, each shard independently locked.
 - Lock-free or mostly-lock-free: hard in Rust, only worth it if contention dominates.
 
+cachekit ships the first option today via the `concurrency` feature:
+`Concurrent*` wrappers (e.g. `ConcurrentLruCache`, `ConcurrentSlotArena`,
+`ConcurrentClockRing`) place a `parking_lot::RwLock` around the
+single-threaded core. The wrappers deliberately do **not** implement
+`Cache<K, V>` directly when that would force returning `&V` across a
+lock boundary — they expose `Option<Arc<V>>` style APIs instead. See
+[`src/policy/lru.rs`](../../src/policy/lru.rs),
+[`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs), and
+[`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs).
+
 Rust-specific notes:
-- When `std` is available, prefer `parking_lot` locks over `std::sync` for lower overhead and better ergonomics.
-- Avoid Arc<Mutex<...>> in hot paths.
-- Consider per-thread caches with periodic merge.
-- Consider RCU-style read paths for read-heavy caches.
+- For `RwLock`, prefer `parking_lot` for fairness control and lower
+  uncontended overhead. For `Mutex`, the futex-based `std::sync::Mutex`
+  on Rust 1.85+ is competitive on Linux/macOS; `parking_lot::Mutex`
+  still wins on raw uncontended speed and offers nicer guard ergonomics.
+- Avoid `Arc<Mutex<…>>` in hot paths.
+
+Future directions worth exploring but **not currently implemented**:
+sharded caches (hash key → shard, per-shard lock), per-thread caches with
+periodic merge, and RCU-style read paths for read-heavy workloads.
 
 ## 4. Avoid Per-Operation Allocation
 
 Allocations kill throughput.
 
 Pre-allocate:
-- Entry pools.
-- Node arrays.
+- Entry pools — see [`SlotArena`](../../src/ds/slot_arena.rs) and the
+  free-list discipline in [`src/store/slab.rs`](../../src/store/slab.rs).
+- Node arrays — intrusive lists thread through arena slots rather than
+  allocating per-node (see [`src/ds/intrusive_list.rs`](../../src/ds/intrusive_list.rs)).
 
 Reuse:
-- Free lists.
-- Slabs.
+- Free lists (slab-backed).
+- Slabs sized once at construction time via `CacheBuilder::new(capacity)`.
 
 Use:
-- Vec with capacity management.
-- Custom allocators if necessary.
+- `Vec` with explicit capacity management.
+- The Fx hasher from the `rustc-hash` crate for cheap key hashing in
+  hot-path lookups.
 
 Avoid:
-- Creating new Arc, String, Vec per lookup.
+- Creating new `Arc`, `String`, `Vec` per lookup.
+- Hidden clones of `K` on the eviction path.
 
-If malloc shows up in your flamegraph, your cache is already slow.
+If `malloc` shows up in your flamegraph, your cache is already slow.
 
 ## 5. Eviction Must Be Predictable and Cheap
 
@@ -87,12 +129,17 @@ O(1) eviction is the goal.
 Avoid unbounded tree walks or scans in eviction paths.
 
 Maintain:
-- Direct pointers/indices to eviction candidates.
-- Eviction lists or clock hands.
+- Direct indices / `Handle`s to eviction candidates (see
+  [`src/store/handle.rs`](../../src/store/handle.rs) and the
+  [`Cache`](../../src/traits.rs) trait).
+- Eviction lists or clock hands (intrusive list head, `ClockRing` hand).
+- Lazy heaps where amortized O(log n) is acceptable
+  ([`LazyMinHeap`](../../src/ds/lazy_heap.rs); used by Heap-LFU and TTL).
 
 Be careful with:
 - Background eviction threads (synchronization overhead).
-- Lazy cleanup that grows unbounded.
+- Lazy cleanup that grows unbounded; bound it with rebuild thresholds
+  (e.g. `LazyMinHeap::with_auto_rebuild`).
 
 Eviction cost must be comparable to lookup cost, not orders of magnitude higher.
 
@@ -102,13 +149,21 @@ You cannot tune what you do not measure.
 
 Track at least:
 - Hit / miss rate.
-- Eviction count and reason.
+- Eviction count and reason (capacity vs. expiration).
 - Insert/update rate.
+
+cachekit exposes these through [`StoreMetrics`](../../src/store/traits.rs)
+and per-policy metric structs (e.g. `LruMetrics`), gated behind the
+`metrics` feature so non-instrumented builds pay nothing. The
+`expirations` counter on `Expiring<C>` follows the same pattern (see
+[`src/policy/expiring.rs`](../../src/policy/expiring.rs)).
+
+Roadmap counters:
 - Scan pollution rate.
-- Lock contention or wait time (roadmap).
+- Lock contention or wait time.
 
 Expose:
-- Lightweight counters in hot path.
+- Lightweight counters in the hot path.
 - Optional detailed metrics behind feature flags.
 
 Metrics should guide design decisions, not justify them afterward.
@@ -116,14 +171,24 @@ Metrics should guide design decisions, not justify them afterward.
 ## 7. Separate Policy From Storage
 
 Design in layers:
-- Storage layer: how entries live in memory, allocation, layout, indexing.
-- Policy layer: LRU, FIFO, LFU, LRU-K (roadmap: Clock/ARC/2Q, etc; see [Policy roadmap](../policies/roadmap/README.md)); only manipulates metadata and ordering.
-- Integration layer: ties application objects, payloads, or IDs into cache entries.
+- Storage layer: how entries live in memory, allocation, layout,
+  indexing — [`src/store/`](../../src/store).
+- Policy layer: LRU, FIFO, LFU, LRU-K, 2Q, ARC, CAR, Clock, Clock-PRO,
+  S3-FIFO, … — manipulates metadata and ordering only
+  ([`src/policy/`](../../src/policy)).
+- Capability layer: opt-in extension traits ([`RecencyTracking`](../../src/traits.rs),
+  `FrequencyTracking`, `HistoryTracking`, `ExpiringCache`) that policies
+  implement when the underlying signal exists. This is how `Expiring<C>`
+  composes over any policy without touching policy code.
+- Integration layer: ties application objects, payloads, or IDs into
+  cache entries via [`CacheBuilder`](../../src/builder.rs) and the
+  `DynCache` runtime dispatcher.
 
 Related docs:
 - [Policy overview](../policies/README.md)
 - [Policy roadmap](../policies/roadmap/README.md)
 - [Policy data structures](../policy-ds/README.md)
+- [Read-only traits](../guides/read-only-traits.md)
 
 This makes:
 - Benchmarking easier.
@@ -135,15 +200,16 @@ This makes:
 Ergonomics often cost performance.
 
 Avoid in critical loops:
-- Heavy generics causing code bloat.
+- Heavy generics causing code bloat across many monomorphizations.
 - Trait objects for hot dispatch.
 - Closures capturing state.
-- Iterator chains instead of simple loops.
+- Iterator chains where a plain `for` loop would do.
 
 Prefer:
 - Explicit loops.
-- Concrete types.
-- Monomorphized fast paths.
+- Concrete types and monomorphized fast paths.
+- Enum dispatch over `Box<dyn Trait>` when polymorphism is needed at the
+  edges — this is exactly the trade `DynCache` makes (see §13).
 
 You can wrap fast internals in nice APIs at the edges.
 
@@ -154,15 +220,17 @@ In scan-heavy workloads:
 Large sequential reads destroy LRU-style caches.
 
 Solutions:
-- Scan-resistant policies (LRU-K, 2Q/ARC are roadmap).
+- Scan-resistant policies: `LRU-K`, `2Q`, `SLRU`, `ARC`, `CAR`,
+  `Clock-PRO`, `S3-FIFO`, `Heap-LFU` — all implemented today.
 - Explicit "scan mode" hints from the caller or workload layer.
 - Bypass cache for known one-shot reads.
 
-If you ignore scans, your cache will look great in microbenchmarks and terrible in production.
+If you ignore scans, your cache will look great in microbenchmarks and
+terrible in production.
 
 ## 10. Benchmark Like a System, Not a Library
 
-Do not rely on random key benchmarks.
+Do not rely on uniform-random key benchmarks.
 
 Use:
 - Zipfian distributions.
@@ -176,22 +244,31 @@ Measure:
 - Memory overhead.
 - Eviction cost.
 
-A cache that is 5% faster on random keys but 50% worse under scans is a bad cache.
+cachekit's benchmark harness covers these dimensions; see
+[`docs/benchmarks/workloads.md`](../benchmarks/workloads.md) and the
+runners under [`benches/`](../../benches).
+
+A cache that is 5 % faster on uniform-random keys but 50 % worse under
+scans is a bad cache.
 
-## 11. Rust-Specific Pitfalls
+## 11. Rust Hot-Path Hazards Beyond Allocation
 
-Arc is expensive in hot paths.
+`Arc` is expensive in hot paths; minimize it and lift `Arc::clone` out
+of inner loops.
 
-Borrow checker can push you toward indirection—fight it with:
-- Index-based access.
-- Interior mutability only where unavoidable.
+The borrow checker can push you toward indirection — fight it with:
+- Index-based access (`Handle`s, slot indices) instead of `&mut` chains.
+- Interior mutability only where unavoidable; prefer `Cell<T>` over
+  `RefCell<T>` when `T: Copy`, and atomics when the value lives behind
+  a shared reference.
 
 Beware of:
-- Hidden clones.
-- Trait object dispatch.
-- Over-generic designs.
+- Hidden clones, particularly of keys on the eviction path.
+- Trait object dispatch on read/insert.
+- Over-generic designs whose monomorphization cost dwarfs their benefit.
 
-Rust can be as fast as C, but only if you design like a systems programmer, not a library author.
+Rust can match C on hot paths, but only when systems-level discipline
+survives contact with the type system.
 
 ## 12. Design for Failure Modes
 
@@ -207,13 +284,69 @@ Add:
 
 A cache that collapses under stress is worse than no cache.
 
+## 13. Compile-Time and Runtime Composition
+
+cachekit's externally visible surface is shaped by two composition
+mechanisms that together let users pay only for what they use.
+
+**Per-policy feature flags.** Every policy is behind a Cargo feature
+(`policy-lru`, `policy-s3-fifo`, …), with `policy-all` for "everything"
+and a small default of `policy-s3-fifo`, `policy-lru`, `policy-fast-lru`,
+`policy-lru-k`, `policy-clock`. Optional capabilities are gated the
+same way: `metrics`, `concurrency`, `serde`, and `ttl`. Downstream
+crates can disable defaults and select the minimum surface they need;
+see [`Cargo.toml`](../../Cargo.toml).
+
+**Capability traits + runtime dispatch.** Extension traits
+([`RecencyTracking`](../../src/traits.rs), `FrequencyTracking`,
+`HistoryTracking`, `ExpiringCache`) keep optional behavior off the
+core `Cache<K, V>` trait. For ergonomic builder construction without
+forcing trait objects on the user, [`CacheBuilder`](../../src/builder.rs)
+returns a [`DynCache<K, V>`](../../src/builder.rs) that dispatches via
+an internal enum match rather than `Box<dyn Cache>`. When TTL is
+enabled, the builder returns a sibling `DynExpiringCache<K, V>` that
+threads the expiry check around each variant's `Cache` call — a worked
+example of capability composition. See [`docs/design/ttl.md`](ttl.md)
+for the full design and [`src/policy/expiring.rs`](../../src/policy/expiring.rs)
+for the decorator itself.
+
 ## Bottom Line
 
-High-performance caches are not about clever algorithms—they are about:
+High-performance caches are not about clever algorithms — they are about:
 - Memory layout.
 - Allocation discipline.
 - Contention control.
 - Eviction predictability.
 - Workload realism.
 
-In Rust, your main enemy is not safety—it is abstraction overhead and accidental allocation. Design from the metal upward, then wrap it in something pleasant to use.
+In Rust, your main enemy is not safety — it is abstraction overhead and
+accidental allocation. Design from the metal upward, then wrap it in
+something pleasant to use.
+
+## See Also
+
+Design docs:
+- [Concurrency](concurrency.md) — `Concurrent*` wrappers, `RwLock`
+  discipline, sharded primitives, `ConcurrentCache` marker
+- [Cache trait hierarchy](trait-hierarchy.md) — `Cache<K, V>` kernel,
+  capability traits, read/mutate split, object safety
+- [Builder and runtime dispatch](builder-and-dyn-dispatch.md) —
+  `CachePolicy`, `DynCache`, enum-vs-`Box<dyn>` trade-off, adding new
+  policies
+- [Weighted eviction](weighted-eviction.md) — `WeightStore` dual
+  limits, weight function contract, GDS/GDSF pre-staging
+- [Metrics](metrics.md) — recorder / snapshot / exporter split,
+  `MetricsCell`, Prometheus exporter, feature gating
+- [Error model](error-model.md) — panic vs `Result` discipline,
+  four error types, debug-only invariant checks
+- [TTL](ttl.md) — applied example of every principle above
+- [Doc style guide](style-guide.md)
+
+Reference docs:
+- [Policy overview](../policies/README.md) and [roadmap](../policies/roadmap/README.md)
+- [Policy data structures](../policy-ds/README.md)
+- [Stores](../stores/README.md)
+- [Read-only traits](../guides/read-only-traits.md)
+- [Choosing a policy](../guides/choosing-a-policy.md)
+- [Benchmarks overview](../benchmarks/overview.md) and [workloads](../benchmarks/workloads.md)
+- [Rust API Guidelines checklist](https://rust-lang.github.io/api-guidelines/checklist.html)
diff --git a/docs/design/error-model.md b/docs/design/error-model.md
new file mode 100644
index 0000000..4f73a8f
--- /dev/null
+++ b/docs/design/error-model.md
@@ -0,0 +1,341 @@
+# Error Model
+
+> Status: design rationale for cachekit's panic-vs-`Result` discipline,
+> the four error types in the public API, and the debug-only invariant
+> checks. Companion to [`design.md`](design.md) and [`src/error.rs`](../../src/error.rs).
+
+cachekit treats error handling as a design question, not an ergonomics
+question. The rule is:
+
+> **Panic on programming errors. Return `Result` for user-supplied
+> input. Reserve invariant checks for `debug_assertions`.**
+
+This document explains where each side of that rule applies, why the
+four shipped error types each exist as separate types, and what
+discipline a new error type needs to follow.
+
+## The three tiers
+
+cachekit divides every failure mode into one of three tiers, each with
+its own response:
+
+| Tier | Cause | Response | Example |
+|---|---|---|---|
+| 1. Programming error | Bug in the caller's code, statically detectable in principle | Panic | `LruK::with_k(10, 0)` (k = 0) |
+| 2. User-supplied input | Configuration arriving from outside the program | `Result<_, ErrorType>` | `S3FifoCache::try_with_ratios(_, 2.0, _)` |
+| 3. Invariant violation | Internal data-structure corruption (cannot reach in normal use) | `debug_assert` + `InvariantError` (test/debug only) | `pop_front` while queue length is zero |
+
+The tiers are not opinions — they map to specific Rust constructs and
+runtime behaviours. Mixing them (panicking on tier 2, returning
+`Result` from tier 3) produces APIs that are either ergonomically
+heavy or operationally unsafe.
+
+## Tier 1: panic on programming errors
+
+A "programming error" is a precondition violation the caller could
+have prevented with an `if` or a type. cachekit panics in this case
+rather than returning `Result`, because:
+
+- The bug is in **the caller's code**, not in untrusted input the
+  caller is forwarding.
+- The right fix is for the caller to fix their code, not to handle
+  an error path at the call site.
+- Forcing every call site to handle `Result<_, "you passed 0 for capacity">`
+  for a bug they could have prevented adds friction without
+  catching anything new.
+
+The shipped examples:
+
+- `CacheBuilder::build` panics on `capacity == 0`, `k == 0` for LRU-K,
+  and `probation_frac > 1.0` for 2Q. The validation is centralised in
+  `validate_policy` ([`src/builder.rs`](../../src/builder.rs)).
+- Direct constructors (`LruCore::new`, `S3FifoCache::new`) panic on
+  invalid arguments. The fallible counterparts (`try_with_ratios`,
+  `try_with_capacity`) exist for tier 2.
+- `assert!(*k > 0, "LruK: k must be greater than 0")` in
+  `CacheBuilder::validate_policy` is the canonical shape: a clear
+  message that identifies the parameter and the constraint.
+
+The cost is that a panicking call site terminates the process under the
+crate's `panic = "abort"` release profile. This is intentional: the
+setting lives in the [`Cargo.toml`](../../Cargo.toml) release profile,
+and the rationale is that a panic in cache code under load is a bug
+worth surfacing through the supervisor / restart strategy rather than
+something to unwind past.
+
+## Tier 2: `Result` for user-supplied input
+
+When the failure mode is "user passes us configuration we don't
+recognise as valid," return `Result`. The shipped error types each
+cover a specific surface:
+
+### `ConfigError` — invalid configuration parameters
+
+```rust,ignore
+pub struct ConfigError(String);
+```
+
+Defined in [`src/error.rs`](../../src/error.rs). Returned by fallible
+constructors that accept user-tunable knobs:
+
+- `S3FifoCache::try_with_ratios(capacity, small_ratio, ghost_ratio)`
+- Future `try_build` variants on `CacheBuilder`
+
+The contained `String` carries a human-readable description of which
+parameter failed validation. By convention messages are lowercase,
+unpunctuated, and identify the parameter: `"capacity must be greater
+than zero"`, `"small_ratio must be in 0.0..=1.0"`.
+
+`ConfigError`'s presence on a constructor signals that the parameter
+set can legitimately come from outside the program — a config file,
+a CLI flag, an HTTP request — and the caller should handle invalid
+input gracefully rather than crashing the process.
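+
+The pairing of a panicking constructor with a fallible counterpart can
+be sketched as below. `QueueConfig` is a hypothetical type, and this
+`ConfigError` is a local stand-in with the same newtype shape; the
+messages follow the convention described above.
+
+```rust
+use std::fmt;
+
+/// Local stand-in mirroring the `ConfigError(String)` shape.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct ConfigError(String);
+
+impl fmt::Display for ConfigError {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(&self.0)
+    }
+}
+
+impl std::error::Error for ConfigError {}
+
+/// Hypothetical configuration with two user-tunable knobs.
+#[derive(Debug)]
+pub struct QueueConfig {
+    capacity: usize,
+    small_ratio: f64,
+}
+
+impl QueueConfig {
+    /// Tier 2: parameters may come from a config file or CLI flag.
+    pub fn try_new(capacity: usize, small_ratio: f64) -> Result<Self, ConfigError> {
+        if capacity == 0 {
+            return Err(ConfigError("capacity must be greater than zero".into()));
+        }
+        // `contains` is false for NaN, so NaN is rejected here too.
+        if !(0.0..=1.0).contains(&small_ratio) {
+            return Err(ConfigError("small_ratio must be in 0.0..=1.0".into()));
+        }
+        Ok(Self { capacity, small_ratio })
+    }
+
+    /// Tier 1: the caller asserts the values are valid; a violation is a bug.
+    pub fn new(capacity: usize, small_ratio: f64) -> Self {
+        match Self::try_new(capacity, small_ratio) {
+            Ok(config) => config,
+            Err(e) => panic!("QueueConfig: {e}"),
+        }
+    }
+
+    pub fn capacity(&self) -> usize {
+        self.capacity
+    }
+
+    pub fn small_ratio(&self) -> f64 {
+        self.small_ratio
+    }
+}
+```
+
+Routing `new` through `try_new` keeps the validation in one place, so
+the two entry points cannot drift apart.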
+
+### `StoreFull` — capacity-bound failure
+
+```rust,ignore
+pub struct StoreFull;
+```
+
+Zero-sized type defined in
+[`src/store/traits.rs`](../../src/store/traits.rs). Returned by
+`StoreMut::try_insert` and `ConcurrentStore::try_insert` when the
+store is at capacity and the insert would exceed it. The contract:
+
+- **`StoreFull` is not a panic.** A full store under capacity
+  pressure is the **expected** outcome of `try_insert`. The caller —
+  typically a policy layered on top — must respond by evicting and
+  retrying.
+- **The store does not evict on its own.** `StoreFull` is the
+  signal that says "you, policy, decide who to evict." This is the
+  core of the policy/storage separation rule from
+  [`design.md`](design.md) §7.
+- **The error carries no data.** The caller knows what they tried
+  to insert; `StoreFull` adds nothing useful by retaining it.
+
+`StoreFull` is **not** in `src/error.rs` despite being an error
+type. It lives alongside the trait that returns it because the
+two are co-evolving and the surface is small enough that the
+co-location aids readability.
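+
+The evict-and-retry contract can be sketched with a toy store and a
+toy policy. `BoundedStore` and `FifoPolicy` are hypothetical, and the
+`StoreFull` here is a local stand-in. Values are `Arc<String>` (an
+assumption of this sketch) so the policy can hand the store a cheap
+clone on each attempt, since the error returns nothing.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::sync::Arc;
+
+/// Local stand-in for the zero-sized error.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct StoreFull;
+
+/// Hypothetical capacity-bound store that never evicts on its own.
+pub struct BoundedStore {
+    map: HashMap<u64, Arc<String>>,
+    capacity: usize,
+}
+
+impl BoundedStore {
+    pub fn new(capacity: usize) -> Self {
+        Self { map: HashMap::new(), capacity }
+    }
+
+    /// Updating an existing key is always allowed; a new key needs room.
+    pub fn try_insert(&mut self, key: u64, value: Arc<String>)
+        -> Result<Option<Arc<String>>, StoreFull>
+    {
+        if !self.map.contains_key(&key) && self.map.len() >= self.capacity {
+            return Err(StoreFull);
+        }
+        Ok(self.map.insert(key, value))
+    }
+
+    pub fn remove(&mut self, key: u64) -> Option<Arc<String>> {
+        self.map.remove(&key)
+    }
+}
+
+/// Hypothetical FIFO policy: it, not the store, decides who is evicted.
+pub struct FifoPolicy {
+    store: BoundedStore,
+    order: VecDeque<u64>,
+}
+
+impl FifoPolicy {
+    pub fn new(capacity: usize) -> Self {
+        assert!(capacity > 0, "capacity must be greater than zero");
+        Self { store: BoundedStore::new(capacity), order: VecDeque::new() }
+    }
+
+    pub fn insert(&mut self, key: u64, value: Arc<String>) {
+        loop {
+            match self.store.try_insert(key, Arc::clone(&value)) {
+                Ok(previous) => {
+                    if previous.is_none() {
+                        self.order.push_back(key);
+                    }
+                    return;
+                }
+                Err(StoreFull) => {
+                    // A full store with capacity > 0 always has a queued victim.
+                    let victim = self.order.pop_front().expect("victim exists");
+                    self.store.remove(victim);
+                }
+            }
+        }
+    }
+
+    pub fn get(&self, key: u64) -> Option<Arc<String>> {
+        self.store.map.get(&key).cloned()
+    }
+
+    pub fn len(&self) -> usize {
+        self.store.map.len()
+    }
+}
+```
+
+The loop terminates after at most one eviction per call, because
+removing a victim always frees the slot the retry needs.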
+
+### `LazyMinHeapError` — `ds`-layer fallible construction
+
+```rust,ignore
+pub enum LazyMinHeapError {
+    CapacityTooLarge { requested: usize, max: usize },
+    Allocation(std::collections::TryReserveError),
+}
+```
+
+Defined in [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs).
+Returned by `LazyMinHeap::try_with_capacity` when:
+
+- The requested capacity exceeds the internal `MAX_CAPACITY` bound,
+  or
+- The allocator cannot satisfy the reservation.
+
+The enum exposes both failure modes distinctly because a caller may
+want to retry on `Allocation` (transient memory pressure) but not on
+`CapacityTooLarge` (logic bug or genuinely-too-big request that
+won't recover).
+
+The pattern generalises: a future "fallible-construction" error type
+on any `ds` primitive that pre-allocates should distinguish "you
+asked for too much" from "we couldn't get what you asked for."
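+
+A caller-side sketch of that distinction, under stated assumptions:
+`HeapBuildError` mirrors the two-variant shape above, while
+`MAX_CAPACITY`'s value, `try_with_capacity`, and `build_with_retry` are
+hypothetical stand-ins rather than the `LazyMinHeap` API.
+
+```rust
+use std::collections::TryReserveError;
+use std::thread;
+use std::time::Duration;
+
+/// Hypothetical bound; the real limit is internal to the primitive.
+pub const MAX_CAPACITY: usize = 1 << 20;
+
+#[derive(Debug)]
+pub enum HeapBuildError {
+    CapacityTooLarge { requested: usize, max: usize },
+    Allocation(TryReserveError),
+}
+
+/// Fallible pre-allocation with both failure modes kept distinct.
+pub fn try_with_capacity(requested: usize) -> Result<Vec<u64>, HeapBuildError> {
+    if requested > MAX_CAPACITY {
+        return Err(HeapBuildError::CapacityTooLarge { requested, max: MAX_CAPACITY });
+    }
+    let mut buf: Vec<u64> = Vec::new();
+    buf.try_reserve_exact(requested).map_err(HeapBuildError::Allocation)?;
+    Ok(buf)
+}
+
+/// Retries only allocator failures; an oversized request fails fast.
+pub fn build_with_retry(requested: usize, max_attempts: u32) -> Result<Vec<u64>, HeapBuildError> {
+    let mut attempt = 0;
+    loop {
+        attempt += 1;
+        match try_with_capacity(requested) {
+            Err(HeapBuildError::Allocation(_)) if attempt < max_attempts => {
+                // Back off a little longer on each attempt.
+                thread::sleep(Duration::from_millis(10 * u64::from(attempt)));
+            }
+            other => return other,
+        }
+    }
+}
+```
+
+A single opaque error would force this caller to either retry
+requests that can never succeed or give up on ones that might.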
+
+### `std::collections::TryReserveError` — passthrough
+
+Some `try_new` constructors (`HashMapStore::try_new`,
+`ConcurrentHashMapStore::try_new`) return the standard
+`TryReserveError` directly rather than wrapping it. The reason: the
+only failure mode is allocator pressure, and `TryReserveError`
+already says exactly that. Wrapping it would add a layer for no
+information.
+
+The shape is: if cachekit has a distinct failure mode of its own
+(`CapacityTooLarge`, `StoreFull`), wrap or define a new type; if the
+only failure mode is "the allocator said no," return the standard
+type and let the caller's error-handling stack absorb it.
+
+## Tier 3: invariant checks (debug-only)
+
+```rust,ignore
+pub struct InvariantError(String);
+```
+
+Defined in [`src/error.rs`](../../src/error.rs). Returned by
+`check_invariants` methods on internal data structures:
+
+```rust,ignore
+impl<K, V> S3FifoCache<K, V> {
+    #[cfg(any(debug_assertions, test))]
+    pub fn check_invariants(&self) -> Result<(), InvariantError> {
+        if self.small.len() + self.main.len() != self.map.len() {
+            return Err(InvariantError::new("queue length mismatch"));
+        }
+        // …
+        Ok(())
+    }
+}
+```
+
+Three properties define the tier:
+
+- **Off the hot path.** `check_invariants` is called from tests,
+  fuzz harnesses, and `debug_assertions` paths. It is never called
+  from normal `insert` / `get` / `evict`.
+- **Internal-only.** The invariants are about data-structure
+  integrity: "the queue length matches the map length", "the heap
+  is in heap order", "the ghost list hasn't grown past its bound."
+  No caller program would meaningfully react to one of these
+  failing — the cache is corrupted, the right response is to
+  capture state and bail.
+- **Returns `Result` instead of panicking.** Counter-intuitive given the
+  tier-1 rule. The reason: `check_invariants` is called by
+  diagnostic code that wants to **report** the violation (in a test
+  failure message, a fuzz reproducer, a debug-mode assertion's
+  output) rather than crash. Returning `Result` lets the caller
+  format the failure; if they want to panic, they `unwrap()`.
+
+`InvariantError` carries the same `String`-message shape as
+`ConfigError`, by the same convention: lowercase, unpunctuated,
+identifying the specific invariant.
+
+## Why four error types, not one
+
+A single `CachekitError` enum could in principle subsume all four.
+cachekit doesn't ship one, deliberately. Three reasons:
+
+- **Each surface has different recovery semantics.** `StoreFull`
+  means "evict and retry"; `ConfigError` means "fix your config";
+  `LazyMinHeapError::Allocation` means "back off and retry";
+  `InvariantError` means "we have a bug, capture state." A unified
+  enum forces every caller to either match exhaustively (on variants
+  that mostly can't occur at their call site) or use a catch-all that
+  loses information.
+- **Each lives near the trait that uses it.** `StoreFull` lives in
+  `src/store/traits.rs`; `LazyMinHeapError` lives in
+  `src/ds/lazy_heap.rs`; `ConfigError` and `InvariantError` live
+  in `src/error.rs`. Co-location helps maintenance — adding a new
+  failure mode to one surface doesn't ripple through the others.
+- **Sum types compose poorly across abstractions.** A unified
+  enum would propagate every variant up through every layer that
+  touched it. The current shape lets a layer convert (or
+  re-wrap) only the errors it cares about.
+
+The cost is that downstream code wanting to catch "any cachekit
+error" has to enumerate all four. The mitigation is that no
+realistic downstream code wants that — each call site touches one
+surface at a time and handles that surface's error.
+
+## Operational contract: panic profile
+
+The crate's release profile sets `panic = "abort"`:
+
+```toml
+[profile.release]
+panic = "abort"
+```
+
+Two implications worth naming:
+
+- **A panic terminates the process.** No unwind, no destructors,
+  no observer recovery. A panicking weight function in
+  `ConcurrentWeightStore` (see
+  [`weighted-eviction.md`](weighted-eviction.md)) kills the
+  process; any lock-poisoning concern is moot under
+  `panic = "abort"` because the process is gone before any
+  observer can read poisoned state.
+- **Callers who override the profile take on more contract.**
+  Callers building with `panic = "unwind"` get unwind safety up
+  to the documented invariants. The
+  [`weighted-eviction.md`](weighted-eviction.md) clear-ordering
+  rule and the
+  [`concurrency.md`](concurrency.md#failure-modes) panic-safety
+  notes apply only to this mode.
+
+The interplay matters for error model design: under `abort`, tier 1
+panics are terminal and need to be debugged at development time;
+under `unwind`, they are catchable but should still be treated as
+bugs because the cache may be in an unspecified-but-not-corrupt
+state.
+
+## What `Result` does **not** cover
+
+Three failure modes are deliberately not represented as `Result`:
+
+- **OOM in non-`try_*` constructors.** `LruCore::new(huge)` aborts
+  on allocator failure. Use `try_with_capacity` to get a `Result`
+  surface (where available).
+- **Logic errors in policy code.** Eviction picking the wrong
+  victim is a bug, not a return value. Detected (when detected) by
+  `check_invariants` or by the policy's tests.
+- **Concurrent contention.** `parking_lot::RwLock` doesn't poison,
+  doesn't time out by default, and doesn't return `Result`. A
+  contended cache blocks until it can proceed. Callers who need
+  timeouts wrap the cache themselves with a wider locking
+  discipline.
+
+## Adding a new error
+
+Checklist for a new failure mode:
+
+1. **Decide the tier.** Programming error, user-supplied input, or
+   internal invariant?
+2. **Pick or define the type.**
+   - Tier 1: use `assert!` / `debug_assert!` / `panic!`. No new
+     type needed.
+   - Tier 2: define a new type if the failure has data the caller
+     needs and no existing type fits. Otherwise reuse `ConfigError`
+     (with a clear message) or pass through `TryReserveError`.
+   - Tier 3: add a `check_invariants` method on the affected type
+     that returns `Result<(), InvariantError>`.
+3. **Co-locate.** Types specific to a trait live with the trait
+   (`StoreFull` in `src/store/traits.rs`). Types specific to a
+   primitive live with the primitive (`LazyMinHeapError`).
+   Cross-cutting types (`ConfigError`, `InvariantError`) live in
+   `src/error.rs`.
+4. **Implement `Display` and `Error`.** Both are required for
+   `?` interop with `Box<dyn Error>`. The convention is:
+   ```rust,ignore
+   impl fmt::Display for MyError { … }
+   impl std::error::Error for MyError {}
+   ```
+   `Display` writes the message; `Error` is empty unless the type
+   wraps another error (then `source` returns the inner error).
+5. **`Send + Sync + Clone`.** All existing error types satisfy this.
+   The convention is `#[derive(Debug, Clone, PartialEq, Eq, Hash)]`
+   for value types and matching impls for enums. Errors that flow
+   between threads must be `Send + Sync`; errors that get cloned
+   into snapshots / test fixtures must be `Clone`.
+
+## Compatibility with `?` and `anyhow`/`thiserror`
+
+The cachekit error types are intentionally **plain types, not
+`thiserror`-derived**, to avoid forcing a `thiserror` dependency on
+downstream users. They implement `std::error::Error` directly, so
+they work with `?`, `Box<dyn Error>`, and any error-aggregation
+crate (including `anyhow` and `thiserror::Error` in user code).
+
+A downstream `thiserror`-derived enum that includes a `#[from]
+cachekit::ConfigError` works. A downstream `anyhow::Result<_>` that
+absorbs cachekit errors via `?` works. The choice not to bundle
+either crate keeps the error layer dependency-free and gives
+downstream the standard `From` and `Display` shape they expect.
+
+## See also
+
+- [Design overview](design.md) — §12 frames failure modes at the
+  principles level
+- [Concurrency](concurrency.md) — `parking_lot` non-poisoning,
+  atomic check-and-act, lock-acquisition failure modes
+- [Builder and runtime dispatch](builder-and-dyn-dispatch.md) —
+  panic-in-`build` validation, `try_build`-deliberately-absent
+  rationale
+- [Weighted eviction](weighted-eviction.md) — `StoreFull`'s role
+  and unwind-safety in `clear`
+- [`src/error.rs`](../../src/error.rs) — `ConfigError`,
+  `InvariantError`
+- [`src/store/traits.rs`](../../src/store/traits.rs) — `StoreFull`
+- [`src/ds/lazy_heap.rs`](../../src/ds/lazy_heap.rs) —
+  `LazyMinHeapError`
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
new file mode 100644
index 0000000..3a296d7
--- /dev/null
+++ b/docs/design/metrics.md
@@ -0,0 +1,507 @@
+# Metrics
+
+> Status: design rationale for the metrics infrastructure under
+> [`src/metrics/`](../../src/metrics), gated by the `metrics` Cargo
+> feature. Companion to [`design.md`](design.md) §6.
+
+cachekit's metrics surface is bigger than "two counters behind a
+feature flag." It mirrors the cache trait hierarchy — recorder /
+snapshot / exporter — so each concern lives in the smallest trait
+that captures it, and policy code stays free of monitoring plumbing.
+This document explains the three-trait separation, the
+`&self`-vs-`&mut self` split, the `MetricsCell` interior-mutability
+escape hatch, the Prometheus exporter contract, and what guarantees
+counters do and do not provide.
+
+## Goals and non-goals
+
+The metrics module is shaped for:
+
+- **Lightweight in-process counters** that a policy can increment on
+  its hot path without measurable overhead when enabled.
+- **Zero overhead when disabled.** The entire `metrics` module sits
+  behind `#[cfg(feature = "metrics")]` and compiles away when off.
+- **Decoupled consumption.** Tests, benchmarks, and production
+  monitoring should each consume metrics in the shape they need
+  without dragging recording concerns along.
+- **Per-policy specificity.** A Clock policy's `hand_advance` count
+  matters; a FIFO's `pop_oldest_empty_or_stale` count matters. The
+  trait surface preserves these signals rather than flattening to
+  one shape.
+
+It is **not** shaped for:
+
+- **High-cardinality labels.** Counters are flat scalars. Tag
+  dimensions (per-key, per-tenant) are out of scope.
+- **Histograms or sliding windows.** Counters and gauges only.
+  Latency distributions live in the user's monitoring stack via
+  external instrumentation.
+- **Audit-grade accounting.** Counters use `Relaxed` atomics
+  ([`src/store/weight.rs`](../../src/store/weight.rs)) and wrap on
+  overflow in release. Best-effort observability, not financial
+  ledger.
+
+## Three-trait separation
+
+```text
+                                ┌─────────────────────────────┐
+                                │     CoreMetricsRecorder     │
+                                │  record_get_hit, _miss,     │
+                                │  _insert_*, _evict_*,       │
+                                │  _clear                     │
+                                └──────────────┬──────────────┘
+                                               │ extends
+        ┌──────────┬───────────┬───────────────┼───────────┬────────────┐
+        ▼          ▼           ▼               ▼           ▼            ▼
+   FifoRec    LruRec       LfuRec          ArcRec      ClockRec    S3FifoRec
+                │                                                       …
+                ▼
+            LruKRec
+            (further extends LruRec)
+
+   Consumption (decoupled from recording):
+   ┌──────────────────────────────┐    ┌──────────────────────────────┐
+   │ MetricsSnapshotProvider<S>   │    │ MetricsExporter<S>           │
+   │ + MetricsReset               │    │ PrometheusTextExporter       │
+   │ (bench / test)               │    │ (production monitoring)      │
+   └──────────────────────────────┘    └──────────────────────────────┘
+```
+
+Three responsibilities, three trait families:
+
+- **Record.** Per-policy `*MetricsRecorder` traits live in
+  [`src/metrics/traits.rs`](../../src/metrics/traits.rs). Every
+  policy-specific recorder extends `CoreMetricsRecorder` and adds
+  policy-specific methods (`record_hand_advance` for Clock,
+  `record_b1_ghost_hit` for ARC, etc.). The policy itself calls
+  these methods on its hot path.
+- **Snapshot.** `MetricsSnapshotProvider<S>` returns a `Copy`
+  `*MetricsSnapshot` struct ([`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs))
+  — a point-in-time scalar copy of every counter. Snapshots are
+  `#[non_exhaustive]` for SemVer headroom and derive `serde` traits
+  behind the `serde` feature for cross-process transport.
+- **Export.** `MetricsExporter<S>` consumes a snapshot and pushes it
+  to an external system. The shipped implementation,
+  `PrometheusTextExporter` ([`src/metrics/exporter.rs`](../../src/metrics/exporter.rs)),
+  writes Prometheus exposition format to any `W: Write + Send`.
+
+Splitting these three lets:
+
+- **Policy code stay minimal.** A policy needs only the recorder
+  trait. It does not import snapshots or exporters.
+- **Tests bypass production.** Bench harnesses use
+  `MetricsSnapshotProvider` + `MetricsReset` and never touch
+  `MetricsExporter`. Production code does the inverse.
+- **Exporters multiply without policy churn.** Adding a StatsD or
+  OpenTelemetry exporter is a new `impl MetricsExporter<S>` for the
+  snapshot types — no policy changes.
+
+## Per-policy recorder traits
+
+Every policy gets its own recorder trait extending
+`CoreMetricsRecorder`. The shipped set:
+
+| Trait | Adds counters for |
+|---|---|
+| `FifoMetricsRecorder` | scan steps, stale skips, `pop_oldest` calls |
+| `LruMetricsRecorder` | `pop_lru`, `peek_lru`, `touch`, `recency_rank` |
+| `LruKMetricsRecorder` | extends `LruMetricsRecorder` + K-distance counters |
+| `LfuMetricsRecorder` | `pop_lfu`, `peek_lfu`, frequency reads / mutates |
+| `MfuMetricsRecorder` | mirrors LFU for most-frequent eviction |
+| `ArcMetricsRecorder` | T1→T2 promotions, B1/B2 ghost hits, `p` movement |
+| `CarMetricsRecorder` | recent→frequent, ghost hits, hand sweeps |
+| `ClockMetricsRecorder` | hand advances, ref-bit resets |
+| `ClockProMetricsRecorder` | cold↔hot transitions, test entries |
+| `NruMetricsRecorder` | sweep steps, ref-bit resets |
+| `SlruMetricsRecorder` | probationary→protected, protected evictions |
+| `TwoQMetricsRecorder` | A1in→Am promotions, A1out ghost hits |
+| `S3FifoMetricsRecorder` | promotions, main reinserts, ghost hits |
+
+Two design principles drive the granularity:
+
+- **Each counter answers a tuning question.** "Are my LRU-K
+  promotions worth the metadata?" "Is my ARC ghost list catching
+  meaningful hits?" Generic `evictions: u64` cannot answer either.
+- **Counters live near their semantics.** `record_a1in_to_am_promotion`
+  belongs to 2Q because A1in/Am are 2Q concepts. Putting it on
+  `CoreMetricsRecorder` would force every other policy to either
+  implement a meaningless method or document a no-op.
+
+The trade is API surface: 14 recorder traits with ~5-10 methods
+each. The mitigation is that **users do not implement them**: the
+shipped `*Metrics` structs do, policies expose them through inherent
+methods, and users read snapshots, not recorders.
+
+## The `&self`-vs-`&mut self` split
+
+Several `Cache<K, V>` methods take `&self`;
+[`trait-hierarchy.md`](trait-hierarchy.md#peek-vs-get--the-readmutate-split)
+explains why. The metrics system has to honour this — a `&self`
+read path cannot call a `&mut self` recorder. The shipped solution
+is a parallel `*MetricsReadRecorder` family for each policy whose
+read paths increment counters:
+
+| Mutable trait | Read-only counterpart |
+|---|---|
+| `FifoMetricsRecorder` | `FifoMetricsReadRecorder` |
+| `LruMetricsRecorder` | `LruMetricsReadRecorder` |
+| `LruKMetricsRecorder` | `LruKMetricsReadRecorder` |
+| `LfuMetricsRecorder` | `LfuMetricsReadRecorder` |
+| `MfuMetricsRecorder` | `MfuMetricsReadRecorder` |
+
+The read-only traits take `&self` on every method. They are
+implemented through interior mutability on the concrete metrics
+struct — specifically `MetricsCell`, the internal type that wraps
+`Cell<u64>` with an `unsafe impl Sync` (covered below).
+
+Two questions this design avoided:
+
+- **"Why not put `Cell<u64>` directly on the metrics struct?"**
+  Because `Cell<u64>` is `!Sync`, which propagates and prevents
+  every policy struct that embeds metrics from being `Sync`. The
+  thin `MetricsCell` wrapper makes the synchronisation discipline
+  explicit at one site instead of N.
+- **"Why not just `AtomicU64` for everything?"** Because counters
+  on `&mut self` paths (the majority — `insert`, `get`, `evict`)
+  do not need atomic semantics; the policy already holds exclusive
+  access. Using `AtomicU64` everywhere would impose memory-fence
+  cost on the hot path for no concurrency benefit. The split
+  reserves atomic-ish behaviour (`MetricsCell` + external lock) for
+  read paths only.
+
+## `MetricsCell`: interior mutability under external lock
+
+```rust,ignore
+#[repr(transparent)]
+#[derive(Debug, Default, Clone, PartialEq, Eq)]
+pub(crate) struct MetricsCell(Cell<u64>);
+
+unsafe impl Sync for MetricsCell {}
+unsafe impl Send for MetricsCell {}
+```
+
+This is the only `unsafe impl Sync` in the metrics surface. The
+contract:
+
+- **External synchronization is required.** `MetricsCell` lives
+  inside a policy struct that is itself behind an `RwLock` in any
+  concurrent wrapper (see [`concurrency.md`](concurrency.md)). A
+  held read lock excludes `&mut self` writers but does not order
+  concurrent `&self` readers; the cell is updated under that lock.
+- **The cell is observation-only.** Lost increments are
+  acceptable; the worst-case outcome is undercounting a metric,
+  which is a precision issue, not a correctness one. Cache hits and
+  evictions still behave correctly.
+- **`pub(crate)`.** The type does not escape the crate.
+  Downstream code can read counters through the snapshot API but
+  cannot construct `MetricsCell` itself, which prevents misuse from
+  outside the codebase.
+
+The alternatives considered and rejected:
+
+- `Mutex<u64>` — cost dominates the counter increment.
+- `AtomicU64` — works, but imposes fence cost where no concurrency
+  exists for the increment itself.
+- `RefCell<u64>` — runtime borrow checking with panic on contention;
+  not desirable on a metrics increment path.
+
+`MetricsCell` is the smallest tool that says "we know about the
+sync requirement; trust the external lock; pay no per-increment
+cost beyond a `Cell::set`."
+
+## Snapshots: cheap, copyable, optionally serializable
+
+Every snapshot struct in [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs)
+follows the same shape:
+
+```rust,ignore
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+#[non_exhaustive]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+pub struct LruMetricsSnapshot {
+    pub get_calls: u64,
+    pub get_hits: u64,
+    pub get_misses: u64,
+    pub insert_calls: u64,
+    pub insert_updates: u64,
+    pub insert_new: u64,
+    pub evict_calls: u64,
+    pub evicted_entries: u64,
+    pub pop_lru_calls: u64,
+    pub pop_lru_found: u64,
+    pub peek_lru_calls: u64,
+    pub peek_lru_found: u64,
+    pub touch_calls: u64,
+    pub touch_found: u64,
+    pub recency_rank_calls: u64,
+    pub recency_rank_found: u64,
+    pub recency_rank_scan_steps: u64,
+    pub cache_len: usize,
+    pub insertion_order_len: usize,
+    pub capacity: usize,
+}
+```
+
+Five intentional properties:
+
+- **`Copy`.** A snapshot is a flat block of `u64`s and `usize`s.
+  Copying is a `memcpy` and snapshots can flow through channels,
+  futures, and test assertions without ceremony.
+- **`Default`.** Equivalent to "no operations recorded." Useful for
+  test fixtures and explicit reset comparisons.
+- **`#[non_exhaustive]`.** Adding a new counter (e.g. when a
+  policy variant gains a new internal step) is a minor version
+  bump. Downstream code matching on the struct must accept new
+  fields gracefully — the standard `non_exhaustive` discipline.
+- **`PartialEq + Eq`.** Snapshot equality is well-defined and
+  useful in tests. Two snapshots compare equal iff every counter
+  matches.
+- **Optionally `serde`.** Gated on `serde`, not unconditional, so
+  the metrics module doesn't drag serde into builds that don't
+  want it.
+
+Gauges (`cache_len`, `insertion_order_len`, `capacity`) live
+alongside counters and snapshot together. The Prometheus exporter
+writes the right `# TYPE` line for each, which matters for the
+scraper.
+
+## Recording is push, consumption is pull
+
+Two operating models coexist:
+
+- **Recording is push from the policy.** The policy calls
+  `m.record_get_hit()` directly. The recorder method has the
+  cheapest possible body (one `+= 1`). This is the hot-path
+  contract.
+- **Consumption is pull from the consumer.** Tests / benches /
+  exporters call `m.snapshot()` whenever they want a value, and
+  `MetricsReset::reset_metrics(&self)` when they want to clear.
+  Nothing about the policy timing depends on consumption.
+
+Specifically, the policy does **not** push to the exporter. There
+is no observer-pattern hook from the recorder to the exporter, no
+synchronous flush on every increment, and no async channel between
+them. The pull model lets benches consume at known checkpoints
+(once per iteration), and lets production scrapers poll on their
+own cadence (every 10 s, every minute, etc.).
+
+The cost of the pull model is that an exporter cannot react to a
+specific event (e.g. "evictions spiked above N"). cachekit users
+who need event-driven reactions instrument at the application
+layer, not the metrics layer.
+
+## Prometheus text exporter
+
+The shipped exporter (`PrometheusTextExporter` in
+[`src/metrics/exporter.rs`](../../src/metrics/exporter.rs)) writes
+the Prometheus text exposition format to any `W: Write + Send`:
+
+```rust,ignore
+let exporter = PrometheusTextExporter::new("myapp_cache", io::stdout());
+let snapshot = lru_cache.snapshot();
+exporter.export(&snapshot);
+```
+
+Three design choices worth naming:
+
+- **Per-prefix instance.** The prefix (`myapp_cache`) is set at
+  construction, not per call. This keeps the call site simple and
+  enforces a single metric namespace per exporter instance.
+- **I/O errors are silently dropped.** A failing write does not
+  panic the cache or surface a `Result`. The contract is
+  "fire-and-forget monitoring" — a transient `EPIPE` to a metrics
+  socket must not interrupt cache operations. Callers who need
+  guaranteed delivery should wrap their writer in something with
+  retry semantics and accept the cost.
+- **The writer is `Mutex<W>`, not `RwLock<W>`.** Writing is
+  always exclusive; there's no read path. Using `Mutex` here is
+  the right primitive even though most of cachekit uses
+  `parking_lot::RwLock`. (Note: this is `std::sync::Mutex`,
+  poisoning-aware. `export` panics on poisoning. This is a
+  deliberate divergence from `parking_lot` — the exporter is on
+  the cold path and the std mutex's poisoning behaviour is fine
+  there.)
+
+Other exporters (StatsD, OpenTelemetry, custom) plug in by
+implementing `MetricsExporter<S>` for each snapshot type they
+care about. No changes elsewhere in the crate are required.
+
+## Feature gating: all-or-nothing at compile time
+
+The entire metrics subsystem is gated on the `metrics` Cargo
+feature:
+
+```rust
+// src/lib.rs
+#[cfg(feature = "metrics")]
+pub mod metrics;
+```
+
+Inside each policy, recorder calls are wrapped:
+
+```rust,ignore
+#[cfg(feature = "metrics")]
+self.metrics.record_get_hit();
+```
+
+When `metrics` is **off**:
+
+- The entire `metrics` module disappears from the build.
+- Every `record_*` call site becomes a no-op (the `#[cfg]` block
+  compiles away).
+- Snapshot types are not in the public API.
+- Build time drops; binary size drops; no runtime cost.
+
+When `metrics` is **on**:
+
+- Recording costs one `u64 += 1` per call (or one `Cell::set` for
+  read-only counters). For a 17-policy `DynCache` that records on
+  every `get` / `insert`, the overhead is sub-nanosecond per call
+  and shows up in benches as a flat, constant-size regression.
+- The `metrics::snapshot` and `metrics::exporter` modules are in
+  the public API and exporting infrastructure is available.
+
+The trade-off is deliberate. No "low-cardinality always-on,
+detailed-on-demand" two-tier scheme exists — every counter is
+either always present (feature on) or absent (feature off). The
+discipline that keeps "always present" cheap is the recorder
+contract: methods do no work beyond incrementing a counter.
+
+## What about `StoreMetrics`?
+
+`StoreMetrics` ([`src/store/traits.rs`](../../src/store/traits.rs))
+is a **separate**, simpler structure that ships unconditionally
+(not behind `metrics`). It carries the universal counters every
+store-layer implementation tracks:
+
+```rust,ignore
+pub struct StoreMetrics {
+    pub hits: u64,
+    pub misses: u64,
+    pub inserts: u64,
+    pub updates: u64,
+    pub removes: u64,
+    pub evictions: u64,
+}
+```
+
+The two systems coexist:
+
+- `StoreMetrics` is the store-layer baseline. Always present, always
+  cheap, six counters.
+- `src/metrics/` (feature-gated) is the policy-layer detailed
+  metrics — recorder traits, snapshots, exporter, per-policy signals.
+
+A store typically backs `StoreMetrics` with `AtomicU64` counters
+(see `StoreCounters` in [`src/store/weight.rs`](../../src/store/weight.rs)),
+because stores are often behind concurrent wrappers and the
+increment paths can be `&self`. The split mirrors the
+sequential-vs-concurrent split at the trait level
+([`concurrency.md`](concurrency.md)).
+
+## Counter discipline
+
+Three rules every recorder method follows:
+
+1. **No allocation.** Counter increments are O(1) and allocation-free.
+2. **No fallible operations.** A counter must not be in a position
+   where it can fail: `+=` cannot fail in release builds, and `u64`
+   wrap-around is acceptable (centuries away at a billion/sec).
+3. **No conditional logic beyond the counter itself.** A recorder
+   method that branches on cache state belongs in the policy, not
+   in metrics.
+
+The corollary: a policy that wants a derived counter ("number of
+evictions where the victim's recency rank was > 10") computes the
+condition itself and calls one of two existing methods accordingly.
+Putting the branching inside the recorder would couple metrics to
+policy state.
+
+## Adding a new metric
+
+Checklist for adding a per-policy counter:
+
+1. **Add the field.** Plain `u64` if it's updated on `&mut self`
+   paths; `MetricsCell` if it's updated on `&self` paths. Place it
+   in the corresponding `*Metrics` struct under
+   [`src/metrics/metrics_impl.rs`](../../src/metrics/metrics_impl.rs).
+2. **Add the recorder method.** On the relevant `*MetricsRecorder`
+   trait (or its `*ReadRecorder` counterpart for `&self`).
+3. **Implement on the policy's metrics struct.** One-line
+   `+= 1` body.
+4. **Wire the call site in the policy.** Wrap with
+   `#[cfg(feature = "metrics")]`.
+5. **Add the field to the snapshot.** In
+   [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs). The
+   snapshot's `From<&*Metrics>` (or equivalent) needs the new
+   field.
+6. **Update the exporter.** Add a `write_counter` /
+   `write_gauge` call in `PrometheusTextExporter::export` for the
+   new field.
+
+Six locations is a lot of friction for a new counter. The friction
+is intentional — adding a counter is rarely the right answer to a
+debugging question, and the friction encourages reuse of existing
+counters where possible.
+
+## Adding a new metric **type** (gauge vs counter, histogram)
+
+Histograms and sliding windows are deliberately out of scope. Adding
+either is a wider design change:
+
+- The recorder traits assume `&mut u64 += 1` semantics. A histogram
+  needs `observe(value)` semantics and an aggregation strategy.
+- The snapshot types assume `Copy` and `u64` fields. A histogram
+  snapshot needs bucket arrays.
+- The Prometheus exporter writes counters and gauges only.
+
+If histograms become needed (the most likely use case is latency
+distribution per policy), the design has space: introduce a
+`HistogramRecorder` trait alongside `CoreMetricsRecorder` and a
+matching `HistogramSnapshot`. The existing exporter stays counter-
+and-gauge-only; a new `PrometheusHistogramExporter` handles the
+new shape. The current omission is a coverage decision, not a
+foundation problem.
+
+## Guarantees and non-guarantees
+
+What the metrics system guarantees:
+
+- **Eventual consistency in single-threaded use.** Every recorded
+  event eventually appears in `snapshot()` for the same thread.
+- **Snapshot atomicity per counter.** A snapshot reads each
+  counter as a single load; no torn `u64` reads on 64-bit
+  platforms.
+- **No cache correctness impact.** Metrics never block, panic
+  (except `PrometheusTextExporter` on poisoned mutex), or alter
+  cache state.
+
+What it does **not** guarantee:
+
+- **Cross-counter snapshot consistency.** A snapshot reads counters
+  sequentially. A reader can observe `hits = 100, misses = 99`
+  while a concurrent writer is mid-update; the next snapshot may
+  show `hits = 100, misses = 101`. There is no "snapshot epoch."
+- **Lossless recording under contention.** `MetricsCell`
+  increments under a held read lock are safe; multiple read locks
+  are not serialized against each other. Concurrent `&self`
+  recorder calls on the same `MetricsCell` can lose increments.
+  This is the "best-effort observability" caveat.
+- **Wrap-safe arithmetic in release.** Release profile sets
+  `overflow-checks = false`. Counters wrap silently. At one billion
+  events per second, `u64` wraps in ~585 years — practically a
+  non-issue, formally not a guarantee.
+
+## See also
+
+- [Design overview](design.md) — §6 frames metrics at the
+  principles level
+- [Cache trait hierarchy](trait-hierarchy.md) — `&self` / `&mut self`
+  split that drives the read-vs-mutate recorder fork
+- [Concurrency](concurrency.md) — read/write lock model behind
+  `MetricsCell`'s soundness
+- [Error model](error-model.md) — panic discipline shared by the
+  exporter's poisoning behaviour
+- [`src/metrics/`](../../src/metrics) — the canonical implementation
+- [`src/store/traits.rs`](../../src/store/traits.rs) —
+  `StoreMetrics`, the unconditional store-layer counterpart
diff --git a/docs/design/trait-hierarchy.md b/docs/design/trait-hierarchy.md
new file mode 100644
index 0000000..70a786b
--- /dev/null
+++ b/docs/design/trait-hierarchy.md
@@ -0,0 +1,415 @@
+# Cache Trait Hierarchy
+
+> Status: design rationale for the trait surface in
+> [`src/traits.rs`](../../src/traits.rs). Companion to the cross-cutting
+> principles in [`docs/design/design.md`](design.md) §7 and the concurrency
+> rationale in [`docs/design/concurrency.md`](concurrency.md).
+
+cachekit exposes its policies through a small, layered trait hierarchy.
+One kernel trait (`Cache<K, V>`) covers what every policy must do;
+optional capability traits expose signals that some policies have and
+others don't. This document explains why the surface is shaped this
+way, what each trait promises, and how to add new capabilities without
+breaking the kernel.
+
+## Goals
+
+The trait surface optimizes for four things, roughly in order:
+
+1. **Code written against the kernel survives a policy swap.** Users
+   writing `fn warm<C: Cache<K, V>>(c: &mut C, …)` can pick any of
+   the 17 implemented policies without changing call sites.
+2. **Optional behaviour is visible only when present.** A policy that
+   doesn't track frequency should not have a `frequency()` method that
+   returns garbage or panics. Capability traits exist so this remains
+   true.
+3. **The kernel stays object-safe.** `Box<dyn Cache<K, V>>` must stay
+   possible for runtime dispatch (the shipped `DynCache` is an enum,
+   but object safety keeps the door open and keeps the trait usable in
+   trait objects elsewhere).
+4. **The read/mutate split is explicit.** `peek` and `contains` are
+   side-effect-free `&self` methods; `get` is `&mut self` because it
+   updates policy state. This drops out of point 3 but is worth naming
+   on its own because it shapes the concurrent surface
+   ([`docs/design/concurrency.md`](concurrency.md)).
+
+## Map of the hierarchy
+
+```text
+                            ┌───────────────────────┐
+                            │      Cache<K, V>      │   object-safe kernel
+                            │  contains, len,       │
+                            │  capacity, peek, get, │
+                            │  insert, remove,      │
+                            │  clear, is_empty      │
+                            └───────────┬───────────┘
+                                        │ extends
+            ┌───────────────┬───────────┼───────────┬──────────────────┐
+            ▼               ▼           ▼           ▼                  ▼
+    EvictingCache   VictimInspect   RecencyTrack   FrequencyTrack   HistoryTrack
+    evict_one()     peek_victim()   touch,         frequency()      access_count,
+                                    recency_rank                    k_distance,
+                                                                    access_history,
+                                                                    k_value
+
+   ConcurrentCache  CacheFactory + CacheConfig    AsyncCacheFuture  (utility traits,
+   (unsafe marker)  (constructor abstraction)     (Phase 2)         not extensions)
+```
+
+All capability traits in the upper row extend `Cache<K, V>`. They
+compose by being implemented additively — `LrukCache` implements
+`Cache`, `RecencyTracking`, `FrequencyTracking`, **and**
+`HistoryTracking` because it tracks all three signals.
+
+## Layer 1 — `Cache<K, V>`
+
+The kernel trait. Every policy implements it. The full signature lives
+in [`src/traits.rs`](../../src/traits.rs); the design decisions worth
+naming are:
+
+```rust
+pub trait Cache<K, V> {
+    fn contains(&self, key: &K) -> bool;
+    fn len(&self) -> usize;
+    fn is_empty(&self) -> bool { self.len() == 0 }
+    fn capacity(&self) -> usize;
+
+    fn peek(&self, key: &K) -> Option<&V>;
+    fn get(&mut self, key: &K) -> Option<&V>;
+    fn insert(&mut self, key: K, value: V) -> Option<V>;
+    fn remove(&mut self, key: &K) -> Option<V>;
+    fn clear(&mut self);
+}
+```
+
+### Object safety
+
+The signature deliberately avoids every feature that would break
+object safety:
+
+- No generic methods.
+- No `Self` in argument or return position; it appears only in receivers.
+- No `Self: Sized` supertrait bound on the trait itself.
+- No `impl Trait` returns.
+
+This costs ergonomics — batch operations like `insert_many`,
+`get_or_insert_with(closure)`, and `extend(iter)` stay as inherent
+methods on each policy rather than landing on `Cache<K, V>` itself.
+That trade is intentional: keeping the trait object-safe means
+`DynCache<K, V>` is *able* to dispatch through it (even though the
+shipped `DynCache` is an enum dispatcher rather than a trait object —
+see [`design.md`](design.md) §13). It also keeps `Box<dyn Cache<K, V>>`
+available for users writing test harnesses, factories, or registries
+that need true type erasure.
+
+### `peek` vs `get` — the read/mutate split
+
+This is the most consequential design decision in the kernel:
+
+- **`peek(&self, …) -> Option<&V>`** does not update recency, frequency,
+  reference bits, segment placement, or any policy state. It is the
+  honest read.
+- **`get(&mut self, …) -> Option<&V>`** is the policy-tracked read.
+  An LRU `get` moves the entry to MRU; an LFU `get` bumps the
+  frequency counter; a Clock `get` sets the reference bit.
+
+Three things fall out of the split:
+
+1. **`peek` is usable behind a read lock.** Concurrent wrappers
+   ([`docs/design/concurrency.md`](concurrency.md)) implement their
+   `peek` with `RwLock::read`, allowing multiple readers to proceed
+   in parallel. `get` requires `RwLock::write` because it mutates.
+2. **`peek` is testable as a pure function.** Hit-rate measurements,
+   invariant assertions, and debug prints can use `peek` without
+   perturbing the policy.
+3. **`len` / `contains` / `capacity` are also `&self`.** They live
+   alongside `peek` in the read-locked surface of concurrent wrappers,
+   for the same reason.
+
+`contains` is its own method — not `peek(key).is_some()` — because
+some policies (S3-FIFO, ARC, CAR with ghost lists) can answer
+"is this key resident?" cheaper than they can return a value reference.
+
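+The split can be seen in a minimal sketch. `ToyLru` below is a
+hypothetical stand-in written for this document, not cachekit's
+`LruCache`; it assumes keys are inserted once and ignores capacity.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+
+/// Toy LRU for illustration only; not cachekit's `LruCache`.
+struct ToyLru {
+    map: HashMap<u32, String>,
+    order: VecDeque<u32>, // front = MRU, back = LRU (next victim)
+}
+
+impl ToyLru {
+    /// Honest read: `&self`, leaves the recency order untouched.
+    fn peek(&self, key: &u32) -> Option<&String> {
+        self.map.get(key)
+    }
+
+    /// Policy-tracked read: `&mut self`, promotes the key to MRU.
+    fn get(&mut self, key: &u32) -> Option<&String> {
+        if let Some(pos) = self.order.iter().position(|k| k == key) {
+            self.order.remove(pos);
+            self.order.push_front(*key);
+        }
+        self.map.get(key)
+    }
+
+    /// Inserts a key assumed to be new (no capacity handling here).
+    fn insert(&mut self, key: u32, value: String) {
+        self.map.insert(key, value);
+        self.order.push_front(key);
+    }
+
+    fn victim(&self) -> Option<u32> {
+        self.order.back().copied()
+    }
+}
+
+fn main() {
+    let mut cache = ToyLru { map: HashMap::new(), order: VecDeque::new() };
+    cache.insert(1, "a".to_string());
+    cache.insert(2, "b".to_string());
+
+    // `peek` does not change which entry is next in line for eviction.
+    assert_eq!(cache.peek(&1).map(String::as_str), Some("a"));
+    assert_eq!(cache.victim(), Some(1));
+
+    // `get` promotes key 1, so key 2 becomes the victim.
+    assert_eq!(cache.get(&1).map(String::as_str), Some("a"));
+    assert_eq!(cache.victim(), Some(2));
+}
+```
+
+Because `peek` only needs `&self`, a wrapper could serve it under a
+read lock, while `get` has to take the write lock to reorder entries.
+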
+### `&V` return positions
+
+Returning `&V` rather than `V`-by-value or `Arc<V>` is the right
+choice **for the sequential trait**. Callers who need ownership can
+clone; callers who don't pay nothing. The cost shows up in concurrent
+wrappers, which cannot return `&V` across a lock boundary — that's
+why `Concurrent*` types deviate from `Cache<K, V>` (covered in detail
+in [`concurrency.md`](concurrency.md)).
+
+### Default methods
+
+Only `is_empty` has a default. Adding more defaults — even ones that
+seem obviously implementable in terms of other methods — would push
+performance regressions onto policies that have cheaper specialised
+implementations. A hashmap-backed `contains` is faster than a
+defaulted `peek(…).is_some()` would be because it skips fetching the
+value, and that difference matters on hot lookup paths.
+
+## Layer 2 — Capability traits
+
+Each capability trait extends `Cache<K, V>` and exposes a signal that
+**some but not all** policies have. The rule is:
+
+> Implement the capability trait only when the policy genuinely
+> exposes that signal. Do not stub out the methods with sentinel
+> returns.
+
+### `EvictingCache<K, V>: Cache<K, V>`
+
+```rust
+fn evict_one(&mut self) -> Option<(K, V)>;
+```
+
+Forces a single eviction by policy. Returns the evicted entry or
+`None` if the cache is empty. Useful for benchmarks ("evict 1 % of
+the cache and measure"), background cleanup, and capacity-on-demand
+patterns. Implemented by FIFO, LIFO, LRU, FastLRU, Heap-LFU, S3-FIFO,
+Clock, Clock-PRO, LRU-K, MFU, MRU, plus the LFU variants.
+
+Policies that **do not** implement it: ARC, CAR, NRU, Random, SLRU,
+2Q. The rustdoc on `EvictingCache` lists this set explicitly. The
+reason is policy-specific:
+
+- ARC / CAR evict via adaptive choice across two queues; "evict one
+  by policy" is ambiguous without an insertion that drives the
+  adaptation.
+- NRU sweeps reference bits; an isolated `evict_one` may scan the
+  whole cache.
+- Random has no order; users who want random eviction should call
+  `remove(random_key)` themselves.
+- SLRU / 2Q's victim depends on which segment is over-quota,
+  which only happens under capacity pressure.
+
+`evict_one` is marked `#[must_use]` because dropping the evicted
+entry on the floor is rarely what callers want.
+
+### `VictimInspectable<K, V>: Cache<K, V>`
+
+```rust
+fn peek_victim(&self) -> Option<(&K, &V)>;
+```
+
+Read-only access to the entry that would be evicted next. Only
+implemented by policies whose victim is cheap and stable to identify
+without mutating state — FIFO, LIFO, LRU, FastLRU. Clock-family
+policies don't implement it because identifying the victim requires
+advancing the hand (a mutation). LFU-family policies don't implement
+it because the heap top can be a stale entry that hasn't been popped
+yet ([`LazyMinHeap`](../../src/ds/lazy_heap.rs)).
+
+The signature is deliberately `&self`-only. Anything that would force
+`&mut self` (lazy heap rebuild, clock-hand advance, ARC adaptation)
+disqualifies the policy from implementing it.
+
+### `RecencyTracking<K, V>: Cache<K, V>`
+
+```rust
+fn touch(&mut self, key: &K) -> bool;
+fn recency_rank(&self, key: &K) -> Option<usize>;
+```
+
+For policies that order entries by access recency: LRU, FastLRU,
+LRU-K. `touch` is `get` without the value lookup — useful when you
+want to refresh recency for a key whose value you already have.
+`recency_rank` returns 0 for the MRU entry and `len() - 1` for the
+LRU. A rank stays valid across `peek`/`contains`/`len` calls but is
+invalidated by any `&mut` call.
+
+### `FrequencyTracking<K, V>: Cache<K, V>`
+
+```rust
+fn frequency(&self, key: &K) -> Option<u64>;
+```
+
+For policies that track access frequency: LFU, Heap-LFU, MFU, LRU-K.
+The `u64` return is intentional even though some policies use smaller
+counters internally (LFU uses small saturating counters under
+[`FrequencyBuckets`](../../src/ds/frequency_buckets.rs)) — exposing
+`u64` keeps the trait stable across counter-width changes.
+
+### `HistoryTracking<K, V>: Cache<K, V>`
+
+```rust
+fn access_count(&self, key: &K) -> Option<usize>;
+fn k_distance(&self, key: &K) -> Option<u64>;
+fn access_history(&self, key: &K) -> Option<Vec<u64>>;
+fn k_value(&self) -> usize;
+```
+
+LRU-K style access-history inspection. Currently implemented only by
+`LrukCache`. The `access_history` return is a `Vec<u64>` because the
+history is bounded by K and callers typically inspect it as a unit;
+exposing the underlying [`FixedHistory`](../../src/ds/fixed_history.rs)
+would couple consumers to an internal type.
+
+`k_value()` is a trait method, rather than something callers are
+expected to remember from the constructor call, because LRU-K's K is
+policy-configured and consumers writing generic code over
+`HistoryTracking` need to read it without knowing the concrete type.
+
+## Why capability traits, not feature flags?
+
+cachekit could expose recency / frequency / history through methods
+on `Cache<K, V>` itself, gated by Cargo features. It doesn't, for
+three reasons:
+
+- **Compile-time gating doesn't match the actual gating signal.**
+  Whether a method is meaningful depends on the **policy**, not on
+  the **build**. A `policy-all` build still has policies that can't
+  answer `frequency()`.
+- **Method-level defaults that return `None` are a footgun.** Code
+  that calls `cache.frequency(&k)` on an LRU cache would silently
+  return `None` and pass through review.
+- **Trait bounds carry information.** `fn warm<C: FrequencyTracking>()`
+  documents at the type-system level that the function only makes
+  sense for frequency-tracking caches.
+
+The trade is one extra `use` statement at call sites — `use
+cachekit::traits::{Cache, RecencyTracking};` — which is a small price
+for the correctness gain.
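+
+A minimal sketch of the third point, under stated assumptions: the
+two traits are cut down to one method each, and `ToyLfu` and
+`hottest` are hypothetical names invented for this example.
+
+```rust
+use std::collections::HashMap;
+
+// Stand-ins for the kernel trait and one capability trait, cut down to
+// one method each. The real definitions live in `src/traits.rs`.
+trait Cache<K, V> {
+    fn len(&self) -> usize;
+}
+trait FrequencyTracking<K, V>: Cache<K, V> {
+    fn frequency(&self, key: &K) -> Option<u64>;
+}
+
+/// Toy frequency-tracking cache (hypothetical; stores counts only).
+struct ToyLfu {
+    counts: HashMap<u32, u64>,
+}
+impl Cache<u32, String> for ToyLfu {
+    fn len(&self) -> usize {
+        self.counts.len()
+    }
+}
+impl FrequencyTracking<u32, String> for ToyLfu {
+    fn frequency(&self, key: &u32) -> Option<u64> {
+        self.counts.get(key).copied()
+    }
+}
+
+/// Hypothetical helper: the bound says "frequency-tracking caches only".
+fn hottest<C: FrequencyTracking<u32, String>>(cache: &C, keys: &[u32]) -> Option<u32> {
+    keys.iter()
+        .copied()
+        .filter(|k| cache.frequency(k).is_some())
+        .max_by_key(|k| cache.frequency(k))
+}
+
+fn main() {
+    let lfu = ToyLfu { counts: HashMap::from([(1, 3), (2, 7)]) };
+    assert_eq!(lfu.len(), 2);
+    // Key 9 is not resident, so it is skipped rather than treated as zero.
+    assert_eq!(hottest(&lfu, &[1, 2, 9]), Some(2));
+}
+```
+
+A type that implements only `Cache` is rejected by `hottest` at
+compile time, which is the failure mode the `None`-returning default
+would have hidden.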
+
+## Utility traits
+
+Three traits live alongside the hierarchy but are not extensions of
+`Cache<K, V>`.
+
+### `unsafe trait ConcurrentCache: Send + Sync`
+
+Marker trait, no methods. Implementing it asserts that the type
+handles internal synchronization safely. Covered in detail in
+[`concurrency.md`](concurrency.md#concurrentcache-marker-trait-not-capability-trait).
+
+### `CacheFactory<K, V>` and `CacheConfig`
+
+```rust
+pub trait CacheFactory<K, V> {
+    type Cache: Cache<K, V>;
+    fn new(capacity: usize) -> Self::Cache;
+    fn with_config(config: CacheConfig) -> Self::Cache;
+}
+```
+
+Constructor abstraction for generic code that needs to build caches
+without naming the concrete type. `CacheConfig` is a `#[non_exhaustive]`
+struct with builder-style `with_*` methods, mirroring the wider
+`CacheBuilder` shape in [`src/builder.rs`](../../src/builder.rs).
+
+In practice most code constructs caches directly (`LruCache::new(…)`)
+or through `CacheBuilder`. `CacheFactory` mostly exists for test
+harnesses and benchmark runners that want to parameterise across
+policies; the trait's `Cache` associated type makes that ergonomic.
+
+### `AsyncCacheFuture<K, V>: Send + Sync`
+
+Phase 2 placeholder. The methods (`supports_async_get`,
+`supports_async_insert`) default to `false` and no policy overrides
+them. The trait exists so that async-native policies can be added in
+the future without breaking the existing surface.
+
+## Read/mutate split rationale (recapitulated)
+
+Stated once more in one place, the methods on `Cache<K, V>` split
+cleanly into two groups:
+
+| `&self` (read-locked-safe) | `&mut self` (write-locked) |
+|----------------------------|----------------------------|
+| `contains`, `len`, `is_empty`, `capacity` | `get`, `insert`, `remove`, `clear` |
+| `peek`                     |                            |
+| (capability) `peek_victim`, `recency_rank`, `frequency`, `access_count`, `k_distance`, `access_history`, `k_value` | (capability) `evict_one`, `touch` |
+
+This is the contract the concurrent wrappers rely on. Adding a new
+`Cache` method that mutates state through `&self` (interior mutability)
+would break the lock-granularity story; adding one that takes `&mut
+self` but doesn't logically mutate would prevent the read-lock fast
+path in `Concurrent*` wrappers.
+
+## Object safety vs. ergonomic methods
+
+Some operations naturally belong on `Cache<K, V>` but would break
+object safety. They live as inherent methods on each policy instead:
+
+- `extend<I: IntoIterator<Item = (K, V)>>(&mut self, iter: I)`
+- `get_or_insert_with<F: FnOnce() -> V>(&mut self, key: K, f: F) -> &V`
+- `insert_many(&mut self, items: impl IntoIterator<Item = (K, V)>)`
+  with buffer reuse
+
+The rule: anything taking a generic closure, generic iterator, or
+returning `impl Trait` is an inherent method, not a trait method.
+The trait stays object-safe; the policy types stay ergonomic.
+
+## Adding a new capability trait
+
+Checklist for new capability traits:
+
+1. **The signal must exist in the implementing policy's metadata.**
+   No defaults that return `None`/`0`/`false` for "doesn't apply."
+2. **Bound on `Cache<K, V>`.** Capability traits compose with the
+   kernel; they don't replace it.
+3. **Object safety is optional for capability traits** but
+   recommended. Trait objects of capability traits show up rarely;
+   ergonomic generic methods are fine.
+4. **Name follows the noun-of-the-signal pattern.** `RecencyTracking`,
+   `FrequencyTracking`, `HistoryTracking`. New ones should follow
+   suit: `WeightTracking`, `CostTracking`, `AdmissionTracking`.
+5. **Re-export from `prelude`.** Capability traits live in the same
+   `use cachekit::prelude::*;` namespace as the kernel.
+6. **Document the implementing-policy set.** The rustdoc on
+   `EvictingCache` lists policies that opt out; new traits should
+   do the same for the smaller set that opts in.
+
+## Future capability traits
+
+Sketched in priority order:
+
+- **`ExpiringCache<K, V>: Cache<K, V>`** — TTL surface, per
+  [`docs/design/ttl.md`](ttl.md) §4(a). Signature:
+
+  ```rust
+  fn insert_with_ttl(&mut self, key: K, value: V, ttl: Duration) -> Option<V>;
+  fn ttl_status(&self, key: &K) -> TtlStatus;
+  fn set_ttl(&mut self, key: &K, ttl: Duration) -> bool;
+  fn purge_expired(&mut self) -> usize;
+  ```
+
+  Implemented by the `Expiring<C>` decorator over any `Cache<K, V>`.
+
+- **`WeightTracking<K, V>: Cache<K, V>`** — surface for weight-aware
+  caches built on [`WeightStore`](../../src/store/weight.rs). Likely
+  signature:
+
+  ```rust
+  fn weight(&self, key: &K) -> Option<usize>;
+  fn total_weight(&self) -> usize;
+  fn weight_capacity(&self) -> usize;
+  ```
+
+  Needed before GDS/GDSF (roadmap policies) can be expressed
+  generically.
+
+- **`AdmissionTracking<K, V>: Cache<K, V>`** — exposes ghost-list /
+  admission-history state for ARC, CAR, S3-FIFO, Clock-PRO,
+  TinyLFU. Specifically: was this key ever resident, and if so when
+  did it leave? Useful for adaptive workloads where the caller
+  wants to know whether a miss is a one-hit-wonder or a returning
+  member of the working set.
+
+Each of these traits is intentionally held back until a second
+policy would implement it. The `RecencyTracking` /
+`FrequencyTracking` / `HistoryTracking` naming established the
+convention; adding `WeightTracking` only when GDS lands keeps the
+surface honest.
+
+## See also
+
+- [Design overview](design.md) — §7 frames the layering at the
+  principles level, §13 covers `DynCache` runtime dispatch
+- [Concurrency](concurrency.md) — read/mutate split + `ConcurrentCache`
+- [TTL design](ttl.md) — applied example: `ExpiringCache` as a new
+  capability trait
+- [Read-only traits](../guides/read-only-traits.md) — user-facing
+  guidance on the `peek` / `get` split
+- [`src/traits.rs`](../../src/traits.rs) — the canonical definitions
+- [`src/store/traits.rs`](../../src/store/traits.rs) — parallel
+  trait family at the store layer (sequential + concurrent)
diff --git a/docs/design/weighted-eviction.md b/docs/design/weighted-eviction.md
new file mode 100644
index 0000000..7ce0910
--- /dev/null
+++ b/docs/design/weighted-eviction.md
@@ -0,0 +1,388 @@
+# Weighted Eviction
+
+> Status: design rationale for [`WeightStore`](../../src/store/weight.rs)
+> and [`ConcurrentWeightStore`](../../src/store/weight.rs). Companion to
+> [`design.md`](design.md), [`concurrency.md`](concurrency.md), and the
+> [`stores`](../stores/README.md) reference.
+
+Entry-count caps are the wrong tool when entries vary in size. A cache
+sized "max 1 000 entries" that holds a mix of 100-byte thumbnails and
+10 MB blobs will either overshoot its memory budget by orders of
+magnitude (when blobs dominate) or waste capacity (when thumbnails do).
+`WeightStore` exists to give callers a second, byte-denominated budget
+alongside the entry count.
+
+This document explains the dual-limit model, the contract on the
+user-supplied weight function, where weight integrates with eviction
+policies today (it does not), and how it pre-stages GDS/GDSF on the
+roadmap.
+
+## The problem
+
+A typical entry-count cache:
+
+- Fails to bound memory when value sizes differ by orders of magnitude.
+- Cannot answer "how many bytes am I caching?" without iterating.
+- Treats a 1 KB and a 1 MB entry as equal eviction candidates, which
+  is wrong when memory pressure is the binding constraint.
+
+The complementary failure mode — a pure byte-budgeted cache — has its
+own problems:
+
+- Highly variable entry counts make per-entry metadata budgeting hard.
+- A pathological "one giant entry fills the cache" case is the byte
+  version of the "millions of one-byte entries fills the cache"
+  problem in entry-count caches.
+- Some policies (LFU bucket arrays, S3-FIFO ratios) are sized by entry
+  count and need a stable upper bound.
+
+`WeightStore` therefore enforces **both** an entry-count cap and a
+weight cap — whichever is hit first triggers `StoreFull`. The user
+picks the units of "weight" via a closure.
+
+## Dual-limit model
+
+```text
+try_insert(key, value):
+  │
+  ├─► Existing key (update)
+  │     │
+  │     ├── new_weight    = weight_fn(&value)
+  │     ├── next_total    = total_weight - old_weight + new_weight
+  │     │
+  │     └── next_total > capacity_weight? ──► Err(StoreFull)
+  │                                       └──► Ok(Some(old_value))
+  │
+  └─► New key (insert)
+        │
+        ├── len() >= capacity_entries?         ──► Err(StoreFull)
+        ├── new_weight = weight_fn(&value)
+        ├── total_weight + new_weight > capacity_weight? ──► Err(StoreFull)
+        │
+        └── Ok(None)
+```
+
+Three properties worth naming:
+
+- **Pre-checked, not retroactive.** `try_insert` returns
+  `Err(StoreFull)` rather than silently evicting; the **store** is
+  full, so the caller (or the policy layered above it) decides what
+  to evict.
+- **Updates can fail too.** Replacing a 1 MB value with a 2 MB value
+  on a cache with 0.5 MB of remaining headroom returns `StoreFull`:
+  the net 1 MB increase does not fit, so the update is rejected and
+  the original entry stays resident. This is the only sensible
+  behaviour when an update can push the store past its budget.
+- **Atomic weight bookkeeping.** `total_weight` is the live sum of
+  every resident entry's weight. Every successful `try_insert` /
+  `remove` / `clear` updates it; reads (`get`, `peek`) do not. The
+  invariant `total_weight == sum(entries.weight)` is debug-asserted.
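+
+The flow above can be sketched as a toy store. This is an
+illustration under assumptions, not the code in `src/store/weight.rs`:
+`ToyWeightStore` is a hypothetical type, weight is simply the value's
+byte length, and values are stored inline rather than behind `Arc`.
+
+```rust
+use std::collections::HashMap;
+
+#[derive(Debug, PartialEq)]
+struct StoreFull;
+
+/// Toy dual-limit store; illustrative only, not cachekit's `WeightStore`.
+/// Weight is the value's byte length, recorded at insert time.
+struct ToyWeightStore {
+    entries: HashMap<u32, (Vec<u8>, usize)>,
+    total_weight: usize,
+    capacity_entries: usize,
+    capacity_weight: usize,
+}
+
+impl ToyWeightStore {
+    fn try_insert(&mut self, key: u32, value: Vec<u8>) -> Result<Option<Vec<u8>>, StoreFull> {
+        let new_weight = value.len(); // stands in for `weight_fn(&value)`
+        let next_total = match self.entries.get(&key) {
+            // Update: swap the recorded old weight for the new one.
+            Some((_, old_weight)) => self.total_weight - old_weight + new_weight,
+            // Insert: the entry cap is checked before the weight cap.
+            None if self.entries.len() >= self.capacity_entries => return Err(StoreFull),
+            None => self.total_weight + new_weight,
+        };
+        if next_total > self.capacity_weight {
+            return Err(StoreFull); // rejected before any state changes
+        }
+        self.total_weight = next_total;
+        Ok(self.entries.insert(key, (value, new_weight)).map(|(old, _)| old))
+    }
+}
+
+fn main() {
+    let mut store = ToyWeightStore {
+        entries: HashMap::new(),
+        total_weight: 0,
+        capacity_entries: 2,
+        capacity_weight: 10,
+    };
+    assert_eq!(store.try_insert(1, vec![0; 4]), Ok(None));
+    assert_eq!(store.try_insert(2, vec![0; 4]), Ok(None));
+    assert_eq!(store.try_insert(3, vec![0; 1]), Err(StoreFull)); // entry cap
+    assert_eq!(store.try_insert(1, vec![0; 7]), Err(StoreFull)); // 8 - 4 + 7 > 10
+    assert_eq!(store.total_weight, 8); // the failed update left state intact
+    assert_eq!(store.try_insert(1, vec![0; 6]), Ok(Some(vec![0; 4]))); // 8 - 4 + 6 == 10
+}
+```
+
+Both limits are checked before any field is written, so a rejected
+call leaves `total_weight` and the entries exactly as they were.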
+
+## The weight function: contract and hazards
+
+```rust,ignore
+F: Fn(&V) -> usize
+```
+
+The user supplies a closure. Three pieces of the contract matter:
+
+- **Cheap.** Ideally O(1). The function is called on every insert and
+  every update. A weight function that traverses the value to compute
+  bytes (`|tree: &BTreeMap<K, V>| tree.iter().map(…).sum()`) makes
+  insert latency proportional to value size.
+- **Deterministic.** The same value must yield the same weight every
+  time. A non-deterministic weight breaks `total_weight` accounting —
+  the store remembers `old_weight` from the *previous* insert, so a
+  changed weight on update leaks `(new_actual - old_recorded)` bytes
+  of budget per update.
+- **Non-panicking.** The function is invoked while a write lock is
+  held in [`ConcurrentWeightStore`](../../src/store/weight.rs).
+  Under `panic = "unwind"`, a panicking weight function unwinds out
+  of the insert. The lock is `parking_lot`'s non-poisoning variant,
+  so it is released and stays usable; what is lost is the call
+  itself, which never completes the insert. Under the crate's
+  default `panic = "abort"` release profile the panic terminates the
+  process.
+
+Common shapes:
+
+```rust,ignore
+|v: &Vec<u8>|     v.len()
+|s: &String|      s.len()
+|img: &Image|     img.width * img.height * 4
+|_: &T|           1                 // entry-count only
+|v: &Cow<[u8]>|   v.len()           // works for borrowed/owned
+```
+
+The "weight = 1" specialization deserves a note: it makes
+`WeightStore` behave exactly like a count-only store, at the cost of
+an `Arc<V>` round-trip and per-entry weight slot. Use
+`HashMapStore` for that case unless you specifically want the
+`ConcurrentWeightStore` API.
+
+## Precomputation: weight stored per entry
+
+Each entry holds its weight in a small wrapper:
+
+```rust,ignore
+struct WeightEntry<V> {
+    value: Arc<V>,
+    weight: usize,
+}
+```
+
+Weight is computed **once** at insert/update time and stored alongside
+the value. Three consequences:
+
+- Reads (`get`, `peek`, `contains`, `len`, `total_weight`) never
+  invoke the weight function. They cannot — they only have a
+  reference to the stored entry, and the stored entry already knows
+  its weight.
+- `remove` updates `total_weight` by subtracting the stored weight,
+  with no recompute.
+- Memory overhead per entry is `size_of::<usize>()` +
+  `size_of::<Arc<V>>()` inline (one word for the weight, one for the
+  `Arc` pointer), plus the two reference counts in the `Arc`'s heap
+  allocation. Acceptable for variable-size caches where the value
+  itself dominates the per-entry footprint.
+
+The alternative — recomputing weight on every read for the sake of
+"freshness" — would only matter if the weight function were
+non-deterministic, which the contract forbids.
+
+## `Arc<V>` everywhere
+
+`WeightStore` stores `Arc<V>` even in the single-threaded variant:
+
+```rust,ignore
+pub fn try_insert(&mut self, key: K, value: Arc<V>) -> Result<Option<Arc<V>>, StoreFull>
+pub fn get(&mut self, key: &K) -> Option<Arc<V>>
+pub fn peek(&self, key: &K) -> Option<Arc<V>>
+```
+
+This is a deliberate divergence from `StoreCore` / `StoreMut` (which
+return `V` directly). Three reasons:
+
+- **Cheap shared ownership.** Large `V`s (images, blobs) are the
+  target use case. Returning `Arc<V>` lets callers hold or share the
+  value without forcing `V: Clone`.
+- **Surface alignment with `ConcurrentWeightStore`.** The concurrent
+  variant must return `Arc<V>` (the `&V`-across-lock problem from
+  [`concurrency.md`](concurrency.md)). Keeping the single-threaded
+  variant on the same shape lets callers swap between them by
+  changing one type without re-plumbing returns.
+- **`V: !Clone` is supported.** Callers who don't want to require
+  `Clone` on their value type get the `Arc<V>` round-trip "for free."
+
+The cost is that `WeightStore` does **not** implement `StoreCore` /
+`StoreMut`. It is a sibling, not a subtype, of the entry-count stores
+([`HashMapStore`](../../src/store/hashmap.rs),
+[`SlabStore`](../../src/store/slab.rs)), and code generic over those
+traits cannot accept a `WeightStore` without adaptation. This is the
+single sharpest API edge in the store layer, called out explicitly in
+the module documentation.
+
+## Why weight is at the **store** layer, not the policy layer
+
+The 17 implemented policies in `src/policy/` are all weight-unaware.
+They count entries and evict by entry. `WeightStore` is below them in
+the layering:
+
+```text
+   ┌─────────────────────────────┐
+   │   policy (weight-unaware)   │   evicts by recency/frequency/etc
+   └──────────────┬──────────────┘
+                  │ Cache<K, V> uses store underneath
+   ┌──────────────▼──────────────┐
+   │  WeightStore (dual limits)  │   refuses inserts past weight cap
+   └─────────────────────────────┘
+```
+
+This separation has two consequences worth understanding:
+
+- **The policy decides who to evict; the store decides whether the
+  result fits.** A policy operating over a `WeightStore` evicts its
+  policy-chosen victim, then attempts the insert. If the insert
+  still doesn't fit (evicting one small victim may not free enough
+  weight for one large value), the policy must evict again or
+  surface `StoreFull` to the caller.
+- **No policy in the tree today consumes `WeightStore` directly.**
+  `WeightStore` is reachable only through its own concrete API, not
+  through the `Cache<K, V>` trait or `DynCache`. Users who want a
+  weight-aware cache today build one themselves on top of
+  `WeightStore` plus a chosen eviction strategy.
+
+The reason for this layering is forward compatibility. Weight-aware
+**policies** (GDS, GDSF, LFU-DA, see roadmap) need this store as
+their substrate. Coupling weight directly into a policy locks the
+weight model to that policy; keeping it at the store layer keeps the
+substrate reusable.
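+
+The evict-then-retry division of labour can be sketched as follows.
+Everything here is hypothetical: `ToyStore` stands in for
+`WeightStore` (weight is byte length, distinct new keys only), and
+`ToyFifoCache` is a FIFO policy invented for the example.
+
+```rust
+use std::collections::{HashMap, VecDeque};
+use std::sync::Arc;
+
+#[derive(Debug)]
+struct StoreFull;
+
+/// Toy weight-capped store (weight = byte length); a hypothetical
+/// stand-in for `WeightStore`, handling distinct new keys only.
+struct ToyStore {
+    map: HashMap<u32, Arc<Vec<u8>>>,
+    total: usize,
+    cap: usize,
+}
+
+impl ToyStore {
+    fn try_insert(&mut self, key: u32, value: Arc<Vec<u8>>) -> Result<(), StoreFull> {
+        if self.total + value.len() > self.cap {
+            return Err(StoreFull);
+        }
+        self.total += value.len();
+        self.map.insert(key, value);
+        Ok(())
+    }
+
+    fn remove(&mut self, key: &u32) {
+        if let Some(v) = self.map.remove(key) {
+            self.total -= v.len();
+        }
+    }
+}
+
+/// FIFO "policy" over the store: the policy picks victims, the store
+/// decides whether the result fits.
+struct ToyFifoCache {
+    store: ToyStore,
+    order: VecDeque<u32>, // front = oldest
+}
+
+impl ToyFifoCache {
+    fn insert(&mut self, key: u32, value: Arc<Vec<u8>>) -> Result<(), StoreFull> {
+        loop {
+            match self.store.try_insert(key, Arc::clone(&value)) {
+                Ok(()) => {
+                    self.order.push_back(key);
+                    return Ok(());
+                }
+                Err(StoreFull) => match self.order.pop_front() {
+                    Some(victim) => self.store.remove(&victim), // evict again
+                    None => return Err(StoreFull), // nothing left to evict
+                },
+            }
+        }
+    }
+}
+
+fn main() {
+    let store = ToyStore { map: HashMap::new(), total: 0, cap: 10 };
+    let mut cache = ToyFifoCache { store, order: VecDeque::new() };
+    cache.insert(1, Arc::new(vec![0; 4])).unwrap();
+    cache.insert(2, Arc::new(vec![0; 4])).unwrap();
+    // 6 bytes do not fit next to 8: key 1 is evicted, then the insert succeeds.
+    cache.insert(3, Arc::new(vec![0; 6])).unwrap();
+    assert!(!cache.store.map.contains_key(&1));
+    assert_eq!(cache.store.total, 10);
+    // 11 bytes can never fit: every victim goes before `StoreFull` surfaces.
+    assert!(cache.insert(4, Arc::new(vec![0; 11])).is_err());
+    assert_eq!(cache.store.total, 0);
+}
+```
+
+The last case shows why a real policy would likely compare the new
+weight against the weight cap before evicting anything: this naive
+loop empties the cache for a value that could never fit.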
+
+## Concurrent variant
+
+`ConcurrentWeightStore<K, V, F>` follows the wrapper pattern from
+[`concurrency.md`](concurrency.md):
+
+```rust,ignore
+pub struct ConcurrentWeightStore<K, V, F> {
+    inner: Arc<RwLock<WeightStore<K, V, F>>>,
+}
+```
+
+The lock is a `parking_lot::RwLock`. `peek` / `contains` / `len` /
+`total_weight` take the read lock; `try_insert` / `remove` / `clear`
+take the write lock. Metrics counters live in `AtomicU64` so the
+read-locked paths can still increment them without escalating.
+
+The weight function runs **inside the write lock** on every insert
+and update. A slow `F` therefore stalls every reader and writer in
+the cache — a DoS amplification vector when caching user-supplied
+values. The mitigation is the cheapness contract; the rustdoc on
+`ConcurrentWeightStore::try_insert` says so.
+
+`ConcurrentWeightStore` implements `ConcurrentStoreRead<K, V>` and
+`ConcurrentStore<K, V>`. Unlike the single-threaded variant — which
+deliberately does not implement `StoreCore`/`StoreMut` — the
+concurrent variant *does* fit the concurrent trait family because
+both already use `Arc<V>` returns. The asymmetry is awkward but
+honest: the trait family is shaped around the constraints the
+concurrent path imposes, and the single-threaded variant happens to
+borrow that shape rather than the sequential one.
+
+## Lock-poisoning and total-weight integrity
+
+Under `panic = "abort"` (the crate's release default) lock poisoning
+is moot — the process exits. Under `panic = "unwind"`, the order of
+operations in `clear()` matters:
+
+```rust,ignore
+fn clear(&mut self) {
+    self.total_weight = 0;          // (1) reset first
+    self.entries.clear();           // (2) then drop entries (may panic)
+}
+```
+
+If (2) panics during entry drop, `total_weight == 0` and `len() == 0`
+remain consistent post-panic. Individual values may leak through the
+unwinding drop but the store's accounting cannot be corrupted into
+"says it has 1 GB resident when actually empty" — which would
+silently reject all future inserts. The module documentation calls
+this out so callers who override `panic = "abort"` know what they
+get.
+
+## Failure mode: weight cap, not entry cap
+
+When the weight budget is hit but the entry count is not:
+
+- `try_insert` returns `StoreFull` for any new key whose value would
+  push `total_weight` past `capacity_weight`.
+- `len() < capacity_entries` — the entry budget has headroom that
+  cannot be used.
+- `total_weight` sits at or just below `capacity_weight` (how close
+  depends on insert sizes).
+
+The reverse — entry cap hit, weight cap not — produces `StoreFull`
+on any new insert regardless of weight, including tiny values.
+
+Both are correct. The store does not silently demote either limit;
+the caller's intent is "neither budget shall be exceeded," and the
+store enforces it literally.
+
+## Capacity tuning
+
+The dual limits give callers two knobs:
+
+| Setting | Effect |
+|---|---|
+| `capacity_entries` finite, `capacity_weight = usize::MAX` | Behaves like an entry-count store; weight is observable but unconstrained |
+| `capacity_entries = usize::MAX`, `capacity_weight` finite | Behaves like a pure byte-budget store; entry count is observable but unconstrained |
+| Both finite | Hard dual limit |
+
+The first row is rarely what callers want (use `HashMapStore`
+instead — no per-entry weight slot). The second is a legitimate
+configuration for callers who genuinely want bytes-only accounting
+and accept the per-entry overhead. The third is the design intent.
+
+## Security considerations
+
+The module rustdoc is unusually long on security; the points worth
+naming at the design-doc level:
+
+- **Hasher.** `WeightStore`'s key index uses `FxHashMap`, which is
+  **not** HashDoS-resistant. Callers caching variable-size values
+  keyed by request paths, tenant IDs, or filenames — i.e. exactly
+  the use case `WeightStore` targets — should pre-hash keys with a
+  keyed hash (`siphasher` with a per-process key) or migrate to
+  `HashMapStore`'s `RandomState`-backed default.
+- **Side channel.** `total_weight` is publicly readable. Callers
+  with access to the counter can infer the size of other tenants'
+  cached entries from before/after differentials. Avoid exposing
+  `total_weight` across trust boundaries when caching tenant-keyed
+  variable-size records.
+- **Sensitive values.** Dropped `V`s are not zeroized. Wrap `V` in
+  `zeroize::Zeroizing` (or equivalent) when caching credentials.
+- **Counters.** Metrics use `Relaxed` ordering and wrap on overflow
+  in release. Best-effort observability, not audit-grade.
+
+## Pre-staging GDS/GDSF
+
+GreedyDual-Size (GDS) and its frequency-aware variant GDSF evict by
+**cost ÷ size** rather than recency or frequency alone. Both
+require:
+
+- A per-entry size (`WeightStore` already stores it).
+- A per-entry cost (caller-supplied at insert time).
+- An eviction priority queue ordered by `cost / size + age`.
+
+`WeightStore` provides the size half today. The cost half and the
+priority-queue substrate ([`LazyMinHeap`](../../src/ds/lazy_heap.rs)
+is a natural fit) are the missing pieces. When GDS lands, the
+expected shape is:
+
+```rust,ignore
+pub struct GdsCache<K, V, F> {
+    store: WeightStore<K, V, F>,
+    queue: LazyMinHeap<K, GdsPriority>,
+    aging: AgingCounter,
+}
+```
+
+The trait surface would be `Cache<K, V>` plus a future
+`WeightTracking<K, V>` capability trait (sketched in
+[`trait-hierarchy.md`](trait-hierarchy.md#future-capability-traits)),
+giving generic code the ability to consult `weight(key)` and
+`total_weight()` regardless of which policy is doing the evicting.
+
+The non-trivial design question, when GDS lands, is whether the
+priority queue stores cost / size at insert time (cheap, can become
+stale if the value's "true" cost diverges from insert-time cost) or
+recomputes on demand (more expensive, but always current). The
+current expectation is "store at insert time, document the
+staleness window" — matching the precomputed-weight discipline this
+store already follows.
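+
+A sketch of the "store at insert time" option, under loud
+assumptions: `ToyGds` is hypothetical, a plain `Vec` replaces the
+heap, and the aging rule (age advances to the evicted priority) is
+taken from the classic GreedyDual-Size formulation rather than from
+any cachekit code.
+
+```rust
+/// Toy GreedyDual-Size bookkeeping; hypothetical, not cachekit code.
+/// Each entry is `(key, priority)`, with the priority fixed at insert.
+struct ToyGds {
+    age: f64,
+    entries: Vec<(u32, f64)>,
+}
+
+impl ToyGds {
+    /// Priority = current age + cost / size (assumes `size > 0`).
+    fn insert(&mut self, key: u32, cost: f64, size: usize) {
+        let priority = self.age + cost / size as f64;
+        self.entries.push((key, priority));
+    }
+
+    /// Evicts the lowest-priority entry and advances the age to its
+    /// priority, as in the classic GreedyDual-Size formulation.
+    fn evict_one(&mut self) -> Option<u32> {
+        let idx = self
+            .entries
+            .iter()
+            .enumerate()
+            .min_by(|(_, a), (_, b)| a.1.total_cmp(&b.1))
+            .map(|(i, _)| i)?;
+        let (key, priority) = self.entries.swap_remove(idx);
+        self.age = priority;
+        Some(key)
+    }
+}
+
+fn main() {
+    let mut gds = ToyGds { age: 0.0, entries: Vec::new() };
+    gds.insert(1, 1.0, 4); // priority 0.25: large, cheap per byte
+    gds.insert(2, 1.0, 1); // priority 1.0: small, same cost
+    assert_eq!(gds.evict_one(), Some(1));
+    assert_eq!(gds.age, 0.25);
+    gds.insert(3, 1.0, 4); // priority 0.25 + 0.25 = 0.5 after aging
+    assert_eq!(gds.evict_one(), Some(3));
+    assert_eq!(gds.age, 0.5);
+}
+```
+
+Priorities are never recomputed after insert here, which is the
+staleness window the paragraph above refers to.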
+
+## When not to use `WeightStore`
+
+- **Uniform value sizes.** Use `HashMapStore` or `SlabStore`. The
+  weight slot is overhead with no benefit.
+- **Hot-path latency dominates.** The weight function runs on every
+  insert. If `F` is non-trivial, insert latency is `F`-dominated.
+- **You need a policy.** `WeightStore` is a store; policies sit
+  above it. A bare `WeightStore` evicts nothing on its own — it
+  surfaces `StoreFull` and the caller decides what to remove. Use
+  this directly only when the caller knows the eviction strategy
+  better than any built-in policy would.
+
+## See also
+
+- [Design overview](design.md) — §2 (memory layout) and §5
+  (eviction) frame the trade-offs at the principles level
+- [Concurrency](concurrency.md) — `ConcurrentWeightStore` follows
+  the standard wrapper pattern documented there
+- [Cache trait hierarchy](trait-hierarchy.md) — future
+  `WeightTracking` capability trait sketched in
+  "Future capability traits"
+- [Stores](../stores/README.md) and [`weight.md`](../stores/weight.md)
+  — reference docs for the runtime behaviour
+- [Error model](error-model.md) — `StoreFull` semantics
+- [`src/store/weight.rs`](../../src/store/weight.rs) — the canonical
+  implementation
+- [Roadmap: GDS](../policies/roadmap/gds.md) and
+  [GDSF](../policies/roadmap/gdsf.md) — the planned consumers
diff --git a/docs/index.md b/docs/index.md
index 4a893c7..87d50a8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,6 +12,13 @@ Key features:
 - [Quickstart](getting-started/quickstart.md) — Install and build your first cache
 - [Integration guide](getting-started/integration.md) — CacheBuilder API, policy selection, thread safety
 - [Design overview](design/design.md) — Architectural decisions and performance principles
+- [Cache trait hierarchy](design/trait-hierarchy.md) — Kernel trait, capability traits, read/mutate split
+- [Concurrency](design/concurrency.md) — `Concurrent*` wrappers, lock discipline, sharded primitives
+- [Builder and runtime dispatch](design/builder-and-dyn-dispatch.md) — `CachePolicy`, `DynCache`, enum dispatch
+- [Weighted eviction](design/weighted-eviction.md) — `WeightStore`, dual limits, GDS/GDSF pre-staging
+- [Metrics](design/metrics.md) — Recorder / snapshot / exporter split, Prometheus integration
+- [Error model](design/error-model.md) — Panic vs `Result` discipline, four error types
+- [TTL design](design/ttl.md) — Worked example of every principle in one feature
 - [API surface](guides/api-surface.md) — Module map and entrypoints
 
 ## Policies

From 808aef578aac464e34577efe3f6bbac51d244629 Mon Sep 17 00:00:00 2001
From: Thomas Korrison <thomas_korrison@hotmail.com>
Date: Wed, 13 May 2026 21:34:42 +0100
Subject: [PATCH 2/3] docs: expand design documentation with new sections on
 benchmarking, hashing, sharding, serialization, and non-goals

- Added a comprehensive section on benchmarking design, detailing the benchmark layers, goals, and artifact schema to enhance performance evaluation.
- Introduced documentation on hashing and key identity, explaining hasher choices, key interning, and shard routing strategies.
- Documented sharding design, outlining current sharded primitives, routing requirements, and capacity semantics for improved concurrency.
- Included a section on serialization, clarifying the current serialization surface and future considerations for cache-state persistence.
- Added a non-goals document to define explicit boundaries for cachekit's design, ensuring clarity on what the library does not aim to achieve.

These additions significantly enhance the documentation, providing clearer guidance on design principles and usage patterns for developers.
---
 docs/design/benchmarking.md  | 222 +++++++++++++++++++++++++++++++++++
 docs/design/design.md        |  10 ++
 docs/design/hashing.md       | 166 ++++++++++++++++++++++++++
 docs/design/non-goals.md     | 166 ++++++++++++++++++++++++++
 docs/design/serialization.md | 195 ++++++++++++++++++++++++++++++
 docs/design/sharding.md      | 157 +++++++++++++++++++++++++
 docs/index.md                |   5 +
 7 files changed, 921 insertions(+)
 create mode 100644 docs/design/benchmarking.md
 create mode 100644 docs/design/hashing.md
 create mode 100644 docs/design/non-goals.md
 create mode 100644 docs/design/serialization.md
 create mode 100644 docs/design/sharding.md

diff --git a/docs/design/benchmarking.md b/docs/design/benchmarking.md
new file mode 100644
index 0000000..9ca2aba
--- /dev/null
+++ b/docs/design/benchmarking.md
@@ -0,0 +1,222 @@
+# Benchmarking
+
+> Status: design rationale for the benchmark suite under [`benches/`](../../benches)
+> and shared benchmark support under [`bench-support/`](../../bench-support).
+> Companion to [`design.md`](design.md) §10 and the benchmark reference docs.
+
+cachekit benchmarks are designed to answer cache questions, not just produce
+fast-looking numbers. A cache policy can be excellent on uniform keys and weak
+under scans, or fast on micro-operations and poor at preserving hit rate. The
+benchmark suite therefore separates micro-operation cost, policy effectiveness,
+trace-shaped workloads, reporting, and machine-readable artifacts.
+
+## Goals
+
+- Compare policies under workload shapes that resemble real cache traffic.
+- Keep measured loops free of allocator noise and dynamic dispatch.
+- Produce both human-readable reports and stable JSON artifacts.
+- Preserve enough metadata to reproduce a run: git commit, branch, dirty bit,
+  rustc version, host triple, CPU model, capacity, universe, operations, seed.
+- Make adding a policy or workload a registry edit, not a benchmark rewrite.
+
+## Benchmark Layers
+
+The benchmark suite has four layers:
+
+| Layer | Files | Purpose |
+|---|---|---|
+| Criterion measurements | `benches/workloads.rs`, `benches/ops.rs`, `benches/comparison.rs`, `benches/policy/*.rs` | statistically sampled latency and throughput |
+| Console reports | `benches/reports.rs` | fast, readable tables without Criterion overhead |
+| JSON artifact runner | `benches/runner.rs` | structured output for docs, charts, CI, historical comparison |
+| Shared support crate | `bench-support/` | policy registry, workloads, metrics, JSON schema, doc renderer |
+
+This split is deliberate. Criterion is good for micro-benchmark statistics; the
+artifact runner is good for automation; console reports are good while tuning a
+policy locally. No single binary is forced to serve every audience.
+
+## Monomorphic Policy Registry
+
+Benchmarks iterate policies through `for_each_policy!` in
+[`bench-support/src/registry.rs`](../../bench-support/src/registry.rs):
+
+```rust,ignore
+for_each_policy! {
+    with |policy_id, display_name, make_cache| {
+        let mut cache = make_cache(CAPACITY);
+        // measured workload...
+    }
+}
+```
+
+The macro expands to one block per concrete policy type. This avoids dynamic
+dispatch in the measured loop while keeping policy iteration centralized.
+`POLICIES` in the same module provides presentation metadata (stable id,
+display name, chart color) for renderers and reports.
+
+The trade-off is that adding a policy touches the macro and metadata table. A
+test (`policies_metadata_matches_macro`) keeps the two from drifting. This is
+the same explicit-boilerplate-over-magic choice as `DynCache`: more arms in
+source, fewer surprises in hot code.
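+
+As a rough illustration of the expansion described above, the sketch below uses
+a hypothetical `for_each_policy_sketch!` macro and two toy cache types
+(`ToyFifo`, `ToyLifo`). None of these names come from `bench-support`, and the
+real macro may be shaped differently; the point is only that the body is
+stamped out once per concrete type, so every call inside it is statically
+dispatched.
+
+```rust
+use std::collections::VecDeque;
+
+/// Hypothetical stand-in for a concrete policy type (not a cachekit type).
+pub struct ToyFifo { cap: usize, items: VecDeque<u64> }
+/// Second hypothetical policy: evicts the newest entry instead of the oldest.
+pub struct ToyLifo { cap: usize, items: Vec<u64> }
+
+impl ToyFifo {
+    pub fn new(cap: usize) -> Self { Self { cap, items: VecDeque::new() } }
+    pub fn insert(&mut self, key: u64) {
+        if self.items.len() >= self.cap { self.items.pop_front(); }
+        self.items.push_back(key);
+    }
+    pub fn len(&self) -> usize { self.items.len() }
+}
+
+impl ToyLifo {
+    pub fn new(cap: usize) -> Self { Self { cap, items: Vec::new() } }
+    pub fn insert(&mut self, key: u64) {
+        if self.items.len() >= self.cap { self.items.pop(); }
+        self.items.push(key);
+    }
+    pub fn len(&self) -> usize { self.items.len() }
+}
+
+/// Expands `$body` once per concrete type; no trait objects are involved.
+macro_rules! for_each_policy_sketch {
+    (with |$id:ident, $make:ident| $body:block) => {{
+        {
+            let $id = "toy_fifo";
+            let $make = |cap: usize| ToyFifo::new(cap);
+            $body
+        }
+        {
+            let $id = "toy_lifo";
+            let $make = |cap: usize| ToyLifo::new(cap);
+            $body
+        }
+    }};
+}
+
+/// Runs the same insert-only workload against every registered toy policy.
+pub fn run_all(capacity: usize, ops: u64) -> Vec<(&'static str, usize)> {
+    let mut out = Vec::new();
+    for_each_policy_sketch! {
+        with |policy_id, make_cache| {
+            let mut cache = make_cache(capacity);
+            for key in 0..ops { cache.insert(key); }
+            out.push((policy_id, cache.len()));
+        }
+    }
+    out
+}
+```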
+
+## Workload Registry
+
+Workload definitions live in `bench-support/src/registry.rs`; generators live in
+[`bench-support/src/workload.rs`](../../bench-support/src/workload.rs). The
+current standard workloads cover:
+
+- Uniform random keys for raw overhead baselines.
+- Hot-set access for explicit skew.
+- Sequential scan for scan-pollution stress.
+- Zipfian and scrambled Zipfian for power-law access.
+- Latest / recency-biased access.
+- Shifting hotspots and flash crowds for adaptation.
+- Composite scan-resistance mixes.
+
+[`docs/benchmarks/workloads.md`](../benchmarks/workloads.md) is the catalog. It
+also contains a large roadmap of workloads that should not be confused with
+implemented cases. New workloads should land first in the support crate, then in
+the docs, then in reports.
+
+## Value Construction Discipline
+
+`benches/runner.rs` pre-allocates one `Arc<u64>` per key in the universe and
+passes a closure that returns `Arc::clone`:
+
+```rust,ignore
+fn preallocate_values() -> Vec<Arc<u64>> {
+    (0..UNIVERSE).map(Arc::new).collect()
+}
+```
+
+The rule is: **do not allocate values inside the measured operation loop**.
+Allocating on every miss makes the benchmark measure the allocator and value
+constructor, not the policy. A cheap `Arc::clone` isolates hit/miss behaviour,
+eviction order, and policy metadata overhead.
+
+This is especially important because policies store values differently:
+`FastLru` stores `V` directly, while LRU / LFU / Heap-LFU use `Arc<V>` in some
+paths. Pre-allocation keeps those representation differences from dominating
+the benchmark.
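+
+A minimal sketch of that discipline, assuming a hypothetical `replay` helper
+and an `on_miss` callback standing in for the cache's insert call (neither is
+taken from `benches/runner.rs`): values are built once before the loop, and the
+loop itself only bumps a reference count.
+
+```rust
+use std::sync::Arc;
+
+/// Builds one shared value per key up front, outside the measured region.
+pub fn preallocate_values(universe: u64) -> Vec<Arc<u64>> {
+    (0..universe).map(Arc::new).collect()
+}
+
+/// Replays `keys`, handing the callback a cheap `Arc::clone` per operation.
+/// `on_miss` is a hypothetical hook standing in for `cache.insert`.
+pub fn replay<F>(keys: &[u64], values: &[Arc<u64>], mut on_miss: F)
+where
+    F: FnMut(u64, Arc<u64>),
+{
+    for &key in keys {
+        // Only a reference-count increment happens here; no heap allocation.
+        on_miss(key, Arc::clone(&values[key as usize]));
+    }
+}
+```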
+
+## Artifact Schema
+
+`bench-support/src/json_results.rs` defines the stable JSON schema for results:
+
+- `SCHEMA_VERSION` follows semantic-versioning-style rules.
+- Major bumps remove or rename required fields.
+- Minor bumps add optional fields.
+- Renderers accept any artifact with a matching major version.
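+
+The compatibility rule in the list above can be sketched as follows. The
+`SchemaVersion` type and its `major.minor` text form are assumptions made for
+illustration; the actual representation of `SCHEMA_VERSION` in
+`json_results.rs` may differ.
+
+```rust
+/// Hypothetical parsed form of a `major.minor` schema version.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct SchemaVersion {
+    pub major: u32,
+    pub minor: u32,
+}
+
+impl SchemaVersion {
+    /// Parses text such as "2.1"; returns `None` on any malformed input.
+    pub fn parse(text: &str) -> Option<Self> {
+        let (major, minor) = text.split_once('.')?;
+        Some(Self { major: major.parse().ok()?, minor: minor.parse().ok()? })
+    }
+
+    /// A renderer built for `self` accepts any artifact with the same major.
+    /// Minor differences only add optional fields, so they are ignored.
+    pub fn accepts(self, artifact: SchemaVersion) -> bool {
+        self.major == artifact.major
+    }
+}
+```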
+
+Each `BenchmarkArtifact` contains:
+
+- `metadata`: timestamp, git commit, branch, dirty bit, rustc, host, CPU,
+  benchmark config.
+- `results`: rows keyed by policy, workload, and `case_id`.
+- `metrics`: optional typed sections for hit rate, throughput, latency,
+  eviction, scan resistance, adaptation speed.
+
+The schema is presentation-neutral. Markdown tables and charts are rendered
+later by `bench-support/src/bin/render_docs.rs`, so measurement and presentation
+can evolve independently.
+
+## Case IDs
+
+Use `case_id::*` constants from `json_results.rs` instead of string literals:
+
+- `hit_rate`
+- `comprehensive`
+- `scan_resistance`
+- `adaptation`
+
+This catches typos at compile time and prevents a result section from silently
+disappearing from rendered docs. Adding a new case means adding a constant,
+teaching the runner to populate it, and teaching the renderer how to display it.
+
+## What Each Benchmark Answers
+
+| Benchmark | Question |
+|---|---|
+| `ops.rs` | What is the raw cost of `get` / `insert` / policy-specific operations? |
+| `workloads.rs` | Which policies preserve hit rate under standard workloads? |
+| `comparison.rs` | How does cachekit compare with external crates (`lru`, `quick_cache`)? |
+| `policy/*.rs` | What is the cost of each policy's unique operations? |
+| `reports.rs` | What should a human inspect while tuning? |
+| `runner.rs` | What should CI and docs consume? |
+
+Do not overload one benchmark to answer all questions. If you need policy
+micro-cost, use `ops.rs`; if you need hit rate under scans, use `workloads.rs`
+or `runner.rs`.
+
+## Reproducibility Rules
+
+- Seed every workload. The default seed is 42 unless a benchmark is explicitly
+  sweeping seeds.
+- Record the git dirty bit. Dirty runs are useful locally but should not be
+  published as release baselines without a note.
+- Keep capacity, universe, and operation count visible in the artifact.
+- Prefer `ScrambledZipfian` over raw `Zipfian` for cross-policy comparison when
+  hardware prefetch could bias hot-key locality.
+- Do not compare results across machines without CPU metadata. Tail latency and
+  pointer-heavy policy cost are machine-sensitive.
+
+## CI and Documentation Flow
+
+The docs pipeline runs the benchmark suite, writes
+`target/benchmarks/<run-id>/results.json`, and renders
+`docs/benchmarks/latest/` plus charts. Release-tag snapshots live under
+`docs/benchmarks/vX.Y.Z/`.
+
+Manual workflow:
+
+```bash
+cargo bench --bench runner
+./scripts/update_benchmark_docs.sh
+```
+
+The script is the high-level path for refreshing published benchmark docs. Use
+individual benches (`cargo bench --bench ops`, `cargo bench --bench reports -- scan`)
+while developing a policy.
+
+## Adding a Policy to Benchmarks
+
+1. Add the policy to `for_each_policy!` with a concrete constructor.
+2. Add matching `PolicyMeta` in `POLICIES`.
+3. Run the registry drift test.
+4. Run `cargo bench --bench reports -- hit_rate` for a quick sanity check.
+5. Run `cargo bench --bench runner` before publishing docs.
+
+Keep constructors comparable. If one policy needs `Arc<u64>` and another stores
+`u64`, choose the value shape that preserves fairness and document the exception
+in the registry comment.
+
+## Adding a Workload
+
+1. Implement the generator in `bench-support/src/workload.rs`.
+2. Add a `WorkloadCase` in the registry with stable id and display name.
+3. Add docs in [`docs/benchmarks/workloads.md`](../benchmarks/workloads.md).
+4. Add renderer support if the workload needs a custom section.
+5. Run at least one policy family expected to behave differently (for example,
+   LRU vs S3-FIFO for scan-heavy workloads).
+
+Do not add a workload just because it is mathematically interesting. It should
+answer a policy-selection question.
+
+## Non-goals
+
+- Benchmarks are not formal proofs of policy optimality.
+- Benchmark output is not a stable interface. The JSON schema is versioned,
+  but Criterion names and report formatting can change.
+- Benchmarks do not hide hardware effects. They record enough metadata for the
+  reader to judge them.
+- Benchmarks do not replace fuzzing or invariant tests; they measure behaviour
+  under selected workloads.
+
+## See Also
+
+- [Design overview](design.md) - §10 frames benchmarking at the principles level
+- [Metrics](metrics.md) - recorder / snapshot / exporter split
+- [Benchmark docs](../benchmarks/README.md)
+- [Workload catalog](../benchmarks/workloads.md)
+- [`bench-support/src/registry.rs`](../../bench-support/src/registry.rs)
+- [`bench-support/src/json_results.rs`](../../bench-support/src/json_results.rs)
+- [`benches/runner.rs`](../../benches/runner.rs)
diff --git a/docs/design/design.md b/docs/design/design.md
index 7f388bc..dcc01e7 100644
--- a/docs/design/design.md
+++ b/docs/design/design.md
@@ -339,6 +339,16 @@ Design docs:
   `MetricsCell`, Prometheus exporter, feature gating
 - [Error model](error-model.md) — panic vs `Result` discipline,
   four error types, debug-only invariant checks
+- [Benchmarking](benchmarking.md) — benchmark layers, monomorphic policy
+  registry, JSON artifact schema, reproducibility rules
+- [Hashing and key identity](hashing.md) — hasher choices, `KeyInterner`,
+  `ShardSelector`, HashDoS trade-offs
+- [Sharding](sharding.md) — current sharded primitives, routing,
+  capacity semantics, roadmap for sharded caches
+- [Serialization](serialization.md) — current `serde` surface, cache-state
+  persistence boundaries, TTL and hash-seed rules
+- [Non-goals](non-goals.md) — explicit boundaries for what cachekit does
+  not try to be
 - [TTL](ttl.md) — applied example of every principle above
 - [Doc style guide](style-guide.md)
 
diff --git a/docs/design/hashing.md b/docs/design/hashing.md
new file mode 100644
index 0000000..5af6b4b
--- /dev/null
+++ b/docs/design/hashing.md
@@ -0,0 +1,166 @@
+# Hashing and Key Identity
+
+> Status: design rationale for hasher choices, key interning, and hash-based
+> routing. Companion to [`concurrency.md`](concurrency.md), [`sharding.md`](sharding.md),
+> and the security notes in store/data-structure modules.
+
+cachekit uses hashing in three different roles:
+
+- Lookup indexes (`HashMapStore`, policy maps, ghost indexes).
+- Compact key identity (`KeyInterner`).
+- Shard routing (`ShardSelector`).
+
+Those roles have different threat models. Some code paths choose `FxHash` for
+speed on trusted keys; others default to `RandomState` or keyed SipHash because
+untrusted keys can create HashDoS or single-shard contention. This document
+explains those choices and when callers should override them.
+
+## The Decision Matrix
+
+| Component | Default hasher | Why | Caller override? |
+|---|---|---|---|
+| `HashMapStore` | `RandomState` | public store API, safer default | yes, `with_hasher` |
+| `ClockRing` | `RandomState` | can be keyed by user input | yes, with explicit trust acknowledgement |
+| `KeyInterner` | `FxBuildHasher` | hot internal mapping, trusted-key bias | yes, `with_hasher` |
+| `WeightStore` | `FxHashMap` | speed, large-value target | no generic hasher today |
+| Policy internals | mostly `FxHashMap` | hot metadata paths | generally no |
+| `ShardSelector` | keyed SipHash-1-3 | routing must resist shard pinning | seed or randomized constructor |
+
+The rule: **default to DoS-resistant hashing at public boundaries; use faster
+hashing inside policy metadata when keys are trusted or already admitted.**
+
+## `RandomState`: Safe Public Default
+
+`HashMapStore` and `ClockRing` default to
+`std::collections::hash_map::RandomState`. This is the right public default
+because callers often pass keys derived from request paths, tenant ids, URLs,
+or filenames. Randomized hashing prevents an attacker from precomputing many
+keys that collide in one bucket.
+
+The cost is per-hash overhead. For workloads with fully trusted keys (for
+example, dense integer ids generated by the process), callers can use
+`with_hasher` to opt into a faster hasher. That opt-in is intentionally explicit:
+the call site documents the threat-model decision.
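+
+The same opt-in shape can be seen with the standard library's `HashMap`, used
+here as a stand-in because `HashMapStore`'s exact constructor signature is not
+shown in this document. `TrustedKeyState` is a hypothetical alias; it is
+deterministic but not actually faster than `RandomState`, so a real fast path
+would plug in something like `FxBuildHasher` instead.
+
+```rust
+use std::collections::hash_map::{DefaultHasher, RandomState};
+use std::collections::HashMap;
+use std::hash::BuildHasherDefault;
+
+/// Deterministic (unseeded) hasher state: acceptable for trusted keys only.
+pub type TrustedKeyState = BuildHasherDefault<DefaultHasher>;
+
+/// Default path: per-map random seed, suitable for user-derived keys.
+pub fn untrusted_index() -> HashMap<String, u64, RandomState> {
+    HashMap::new()
+}
+
+/// Explicit opt-in: the call site names the non-randomized hasher, which
+/// makes the threat-model decision visible in review.
+pub fn trusted_index() -> HashMap<u64, u64, TrustedKeyState> {
+    HashMap::with_hasher(TrustedKeyState::default())
+}
+```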
+
+`ClockRing` goes further by using a `KeysAreTrusted` acknowledgement for faster
+non-randomized hashers. The extra marker makes the security trade visible in
+review rather than hidden in a type alias.
+
+## `FxHash`: Hot Internal Default
+
+Many policy internals use `rustc_hash::FxHashMap`:
+
+- LRU-family maps from key to node pointer / slot id.
+- LFU/MFU frequency maps.
+- 2Q / SLRU / Clock-PRO resident and ghost indexes.
+- `WeightStore`'s index.
+- `KeyInterner`'s default index.
+
+`FxHash` is fast and deterministic. It is also non-cryptographic and not
+HashDoS-resistant. The intended use is trusted, already-admitted keys where the
+hash map is not directly exposed as an unbounded public endpoint.
+
+The sharp edge is `WeightStore`: its target use case (variable-size objects
+like images, documents, blobs) often has user-derived keys. Its module docs call
+this out directly: pre-hash keys with a keyed hash or use `HashMapStore` if the
+key source is adversarial.
+
+## `KeyInterner`: Identity Compression, Not Security
+
+`KeyInterner` maps external keys to compact `u64` handles:
+
+```text
+index: HashMap<K, u64, S>     keys: Vec<K>
+"user:123" -> 0               keys[0] = "user:123"
+```
+
+The design goals:
+
+- Avoid repeated key cloning in hot paths.
+- Use compact handles in policy metadata and frequency maps.
+- Resolve a handle back to a key in O(1).
+
+Handles are **not capability tokens**. They are sequential integers. A handle
+from one interner can silently resolve to a different key in another interner,
+and handles are reused after `clear`. Callers that store handles externally
+must pair them with `generation()` and reject stale generations.
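+
+A toy model of the handle-reuse hazard, under the assumption that `clear`
+bumps the generation counter. `ToyInterner`, `StoredHandle`, and
+`resolve_checked` are hypothetical names for illustration and do not mirror the
+real `KeyInterner` API.
+
+```rust
+use std::collections::HashMap;
+
+/// Toy interner; field and method names are illustrative only.
+#[derive(Default)]
+pub struct ToyInterner {
+    index: HashMap<String, u64>,
+    keys: Vec<String>,
+    generation: u64,
+}
+
+impl ToyInterner {
+    /// Returns the existing handle, or assigns the next sequential one.
+    pub fn intern(&mut self, key: &str) -> u64 {
+        if let Some(&handle) = self.index.get(key) {
+            return handle;
+        }
+        let handle = self.keys.len() as u64;
+        self.keys.push(key.to_owned());
+        self.index.insert(key.to_owned(), handle);
+        handle
+    }
+
+    /// O(1) handle-to-key lookup.
+    pub fn resolve(&self, handle: u64) -> Option<&str> {
+        self.keys.get(handle as usize).map(String::as_str)
+    }
+
+    pub fn generation(&self) -> u64 {
+        self.generation
+    }
+
+    /// Drops all keys. Handles restart at 0, so the generation is bumped
+    /// (an assumption of this sketch).
+    pub fn clear(&mut self) {
+        self.index.clear();
+        self.keys.clear();
+        self.generation += 1;
+    }
+}
+
+/// A handle stored outside the interner, tagged with its generation.
+pub struct StoredHandle {
+    pub handle: u64,
+    pub generation: u64,
+}
+
+/// Resolves only if the handle was issued in the current generation.
+pub fn resolve_checked<'a>(interner: &'a ToyInterner, stored: &StoredHandle) -> Option<&'a str> {
+    if stored.generation != interner.generation() {
+        return None; // stale: the same integer may now name a different key
+    }
+    interner.resolve(stored.handle)
+}
+```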
+
+Security implications:
+
+- The default `FxBuildHasher` is for trusted input.
+- Use `with_hasher` / `with_capacity_and_hasher` with `RandomState` when keys
+  are derived from untrusted input.
+- `KeyInterner` is append-only until `clear`, so unique-key attacks can drive
+  memory growth. Use `try_intern` and your own admission bound for untrusted
+  keys.
+- `Debug` intentionally omits interned keys to avoid leaking URLs, user ids, or
+  auth material into logs.
+
+## `ShardSelector`: Hashing for Routing
+
+Shard routing has a different failure mode than lookup maps. A lookup hash
+collision slows one map; a routing collision pins the whole workload to one
+shard and defeats concurrency.
+
+`ShardSelector` therefore uses keyed SipHash-1-3:
+
+- `ShardSelector::randomized(shards)` draws key material from `RandomState`.
+  Use this for normal production sharding.
+- `ShardSelector::new(shards, seed)` is deterministic and reproducible. Treat
+  `seed` as secret key material if adversaries can influence keys.
+
+The selector reduces hash output to `[0, shards)` using fast range reduction
+rather than `%`, which keeps routing cheap and the distribution close to
+uniform. The shard count is clamped to `[1, MAX_SHARDS]` to prevent
+user-controlled configs from allocating an unbounded number of locks or
+vectors.
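+
+One common form of fast range reduction is the multiply-shift trick sketched
+below. Whether `ShardSelector` uses exactly this formula is an assumption, and
+`MAX_SHARDS_SKETCH` is a made-up cap, since the real `MAX_SHARDS` value is not
+stated here.
+
+```rust
+/// Hypothetical cap standing in for the real `MAX_SHARDS`.
+pub const MAX_SHARDS_SKETCH: usize = 1024;
+
+/// Clamps a configured shard count into `[1, MAX_SHARDS_SKETCH]`.
+pub fn clamp_shards(requested: usize) -> usize {
+    requested.clamp(1, MAX_SHARDS_SKETCH)
+}
+
+/// Maps a 64-bit hash to `[0, shards)` by taking the high 64 bits of the
+/// 128-bit product `hash * shards`, which avoids an integer division.
+pub fn reduce(hash: u64, shards: usize) -> usize {
+    ((hash as u128 * shards as u128) >> 64) as usize
+}
+```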
+
+## Custom Hasher Rules
+
+When adding a hasher parameter to a public type:
+
+1. Default to `RandomState` unless the type is clearly internal-only.
+2. Expose `with_hasher` and `try_with_hasher` if callers have legitimate
+   trusted-key fast paths.
+3. Document the threat model at the constructor, not only at module level.
+4. Never hide a non-randomized hasher behind a harmless-sounding `new`.
+5. If the hasher affects shard routing, prefer `ShardSelector` over ad hoc
+   hashing so the keyed-routing contract stays centralized.
+
+When using `FxHashMap` internally:
+
+1. Keep it behind the policy or data-structure boundary.
+2. Do not expose arbitrary insertions from untrusted users without a separate
+   capacity/admission guard.
+3. Mention the assumption in the module's security notes if keys may be user
+   controlled.
+
+## Serialization and Hash Seeds
+
+Do not serialize hash seeds or hasher state unless the type is explicitly a
+deterministic routing artifact. `ShardSelector::new(shards, seed)` is the one
+place where reproducible routing is part of the public contract. `RandomState`
+and policy-internal hash maps should be reconstructed on deserialization.
+
+Serializing raw hash-map order is also wrong. Hash-map iteration order changes
+with seeds and implementation details; serialized cache state should use stable
+semantic fields (keys, values, policy order) rather than map buckets.
+
+## Future Direction: Hasher Audit
+
+The codebase intentionally mixes `RandomState`, `FxHashMap`, and SipHash. That
+mix is valid only while every use site has a documented threat model. A useful
+future hardening pass:
+
+- List every public constructor that accepts a key type.
+- Classify whether keys are trusted, user-supplied, or mixed.
+- Ensure user-supplied defaults are randomized.
+- Add `KeysAreTrusted`-style acknowledgement to any public non-randomized path.
+
+## See Also
+
+- [Sharding](sharding.md) - shard routing and contention trade-offs
+- [Weighted eviction](weighted-eviction.md) - `WeightStore` HashDoS caveat
+- [`src/ds/interner.rs`](../../src/ds/interner.rs)
+- [`src/ds/shard.rs`](../../src/ds/shard.rs)
+- [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- [`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs)
diff --git a/docs/design/non-goals.md b/docs/design/non-goals.md
new file mode 100644
index 0000000..369e28c
--- /dev/null
+++ b/docs/design/non-goals.md
@@ -0,0 +1,166 @@
+# Non-Goals
+
+> Status: explicit boundaries for cachekit's design. Companion to
+> [`design.md`](design.md), which states what the crate optimizes for.
+
+Good design needs negative space. This document records what cachekit is **not**
+trying to be, so future features can be judged against the same boundaries.
+
+## Not a Distributed Cache
+
+cachekit is an in-process cache library. It does not provide:
+
+- network protocols;
+- replication;
+- cluster membership;
+- consistent hashing across processes;
+- cross-node invalidation;
+- persistence guarantees.
+
+Use Redis, Memcached, or a database/cache service when those are the problem.
+cachekit can still be useful inside a node in front of those systems.
+
+## Not a Full Application Cache Framework
+
+cachekit does not manage:
+
+- request coalescing / singleflight;
+- background refresh;
+- cache stampede suppression;
+- application-specific invalidation rules;
+- loader functions or read-through APIs as the primary abstraction.
+
+The library provides cache primitives and policies. Application frameworks can
+compose them into higher-level behaviours.
+
+## Not Async-Native Today
+
+`AsyncCacheFuture` exists as a placeholder, but the shipped policies are
+synchronous. Async-native traits are not currently implemented.
+
+The reason is not that async is unimportant. It is that async cache APIs need
+owned values, cancellation semantics, loader lifetime rules, and executor
+integration. Adding `async fn get_or_insert_with` to the core trait would break
+object safety and pull async choices into every policy.
+
+Future async support should be a separate layer, not a mutation of
+`Cache<K, V>`.
+
+## Not `no_std`
+
+cachekit uses:
+
+- `std::collections`;
+- `std::sync::Arc`;
+- `std::time` in planned TTL work;
+- `parking_lot` for concurrent wrappers;
+- benchmark and metrics tooling built around `std`.
+
+`no_std` would require a different allocator story, different synchronization
+surface, and feature-gated alternatives for large parts of the crate. It is not
+a current target.
+
+## Not Lock-Free
+
+The concurrency design is explicit and lock-based:
+
+- `Concurrent*` wrappers use `parking_lot::RwLock`;
+- sharded structures use one lock per shard;
+- future lock-free reads are a research direction, not current design.
+
+Lock-free structures would need a separate memory reclamation strategy,
+different value ownership rules, and a much larger unsafe surface. The current
+crate favours predictable, reviewable lock boundaries.
+
+## Not a HashDoS Firewall
+
+Some public surfaces use DoS-resistant hashing by default (`HashMapStore`,
+`ClockRing`, `ShardSelector::randomized`). Other hot internal surfaces use
+`FxHashMap` for speed.
+
+cachekit documents those choices, but it is not a general-purpose security
+boundary. Callers with adversarial keys must choose safe constructors, bound
+admission, and avoid exposing interned handles or `total_weight` across trust
+boundaries.
+
+## Not a Serialization Format for Live Caches
+
+The `serde` feature supports metrics snapshots and `StoreMetrics`, not live
+cache state. Serializing a policy means deciding what to do with recency lists,
+ghost history, clock hands, hash seeds, `Arc<V>` identity, and TTL deadlines.
+
+Until a policy has an explicit restore contract, do not derive serde for it.
+
+## Not a General Metrics Platform
+
+The metrics layer provides counters, gauges, snapshots, reset, and a Prometheus
+text exporter. It does not provide:
+
+- high-cardinality labels;
+- histograms;
+- sampling;
+- streaming events;
+- tracing spans;
+- alerting.
+
+Use your monitoring stack for those. cachekit exposes enough counters to make
+policy tuning possible without making the cache own observability.
+
+## Not a Policy Research Playground at the Cost of Hot Paths
+
+New policies are welcome, but they must fit the crate's constraints:
+
+- no per-operation allocation in hot paths;
+- predictable eviction cost;
+- feature-gated implementation;
+- docs and benchmarks;
+- clear workload motivation.
+
+A clever algorithm that needs tree walks, heap allocation on every access, or
+opaque trait-object dispatch in the hot loop belongs in a research branch until
+benchmarks justify it.
+
+## Not a Replacement for Workload Analysis
+
+cachekit ships many policies, but it cannot analyze your workload for you.
+`CachePolicy::Lru` and `CachePolicy::S3Fifo` are reasonable defaults, not
+guarantees. Users still need to measure reuse distance, scan rate, write ratio,
+object sizes, and tail latency under representative traffic.
+
+The benchmark suite provides workload generators to help, but it cannot infer
+production behaviour automatically.
+
+## Not a Stability Promise for Internal Layout
+
+Public traits and documented constructors follow SemVer. Internal layout does
+not:
+
+- slot ids;
+- intrusive-list node fields;
+- heap tombstone representation;
+- ghost-list internals;
+- metric recorder implementation details;
+- `DynCache`'s private `CacheInner` enum.
+
+If downstream code depends on private layout, it is outside the compatibility
+contract.
+
+## How To Use This Doc
+
+When proposing a feature, ask:
+
+1. Does it violate one of these non-goals?
+2. If yes, is it a new layer that keeps the core intact?
+3. Can it be feature-gated so users who do not need it pay nothing?
+4. Does it preserve hot-path constraints?
+5. Does it belong in cachekit, or in an application/framework crate above it?
+
+If any answer is unclear, write a design doc before implementation.
+
+## See Also
+
+- [Design overview](design.md)
+- [Concurrency](concurrency.md)
+- [Serialization](serialization.md)
+- [Metrics](metrics.md)
+- [Benchmarking](benchmarking.md)
diff --git a/docs/design/serialization.md b/docs/design/serialization.md
new file mode 100644
index 0000000..a6fc80f
--- /dev/null
+++ b/docs/design/serialization.md
@@ -0,0 +1,195 @@
+# Serialization
+
+> Status: design rationale for the current `serde` feature and the boundaries
+> around future cache-state persistence. Companion to [`metrics.md`](metrics.md),
+> [`ttl.md`](ttl.md), and [`builder-and-dyn-dispatch.md`](builder-and-dyn-dispatch.md).
+
+cachekit has a narrow serialization surface today. The `serde` feature derives
+`Serialize` / `Deserialize` for metrics snapshots and `StoreMetrics`; it does
+**not** serialize cache contents, policy metadata, hash-map state, locks, or
+builder dispatchers.
+
+That boundary is intentional. Metrics are stable observations. Cache state is
+live data with policy invariants, hash seeds, pointer-like handles, and optional
+time semantics.
+
+## Current Surface
+
+With `features = ["serde"]`, these public value types derive serde:
+
+- `StoreMetrics` in [`src/store/traits.rs`](../../src/store/traits.rs).
+- Every metrics snapshot in [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs).
+
+Properties:
+
+- They are flat value types (`u64`, `usize`, optional nested stats).
+- They are `#[non_exhaustive]`, so new fields are SemVer-compatible at the Rust
+  API level but still require schema discipline for serialized consumers.
+- They carry observations, not live handles into cache internals.
+
+No policy type implements serde today. No store type serializes entries today.
+
+## Why Metrics Are Safe To Serialize
+
+Metrics snapshots are point-in-time copies:
+
+```rust,ignore
+#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+pub struct LruMetricsSnapshot {
+    pub get_calls: u64,
+    pub get_hits: u64,
+    // ...
+}
+```
+
+Serializing a snapshot cannot corrupt a cache on restore because there is no
+restore into a running policy. At most, a downstream dashboard sees old or
+partial counters. That matches the metrics contract: best-effort observability.
+
+## Why Cache State Is Not Serialized
+
+Serializing a cache is not just "serialize a map." A policy may contain:
+
+- Intrusive list pointers or slot ids.
+- Ghost-list history.
+- Clock hand position and reference bits.
+- ARC/CAR adaptive target parameters.
+- Lazy heap tombstones.
+- Hash seeds and randomized map order.
+- `Arc<V>` sharing state.
+- TTL deadlines based on monotonic time.
+
+Restoring only keys and values discards policy warm state. Restoring every
+internal field exposes private representation and risks accepting corrupted
+state from disk.
+
+The default position: **do not serialize policy internals until there is a
+specific restore contract for that policy.**
+
+## Two Possible Future Modes
+
+If cache-state serialization lands later, it should choose one of two modes per
+type.
+
+### Data-only restore
+
+Serialize only entries (`K`, `V`) plus capacity/config. On restore, rebuild the
+policy as if entries were inserted in serialized order.
+
+Pros:
+
+- Simple and robust.
+- No private invariants exposed.
+- Cross-version friendly.
+
+Cons:
+
+- Loses recency/frequency/ghost history.
+- Warm cache may behave cold after restore.
+- Restore order becomes a semantic choice.
+
+### Warm-state restore
+
+Serialize policy metadata too: list order, frequency counters, clock hand,
+ghost lists, ARC target, etc.
+
+Pros:
+
+- Better post-restore hit rate.
+- Useful for long-lived caches that restart often.
+
+Cons:
+
+- Representation becomes part of the serialization contract.
+- Every restore must validate invariants.
+- Version migration becomes policy-specific.
+
+Warm-state restore should be opt-in per policy, not a blanket derive.
+
+## TTL and Time
+
+TTL is the hardest serialization case because monotonic ticks are not portable
+across process restarts. The TTL design doc recommends serializing **relative
+remaining duration**, not raw `Instant`-derived ticks.
+
+Rules for future TTL serialization:
+
+- Never serialize raw monotonic `Tick` as if it were wall time.
+- Capture remaining duration at serialization time.
+- Restore by adding remaining duration to the new process clock.
+- Expired-at-serialization entries should either be omitted or restored as
+  expired and immediately purged. Prefer omission for data-only restore.
+- Wall-clock deadlines require a separate API and explicit drift semantics.
+
+This keeps `Clock` pluggable and avoids replaying meaningless old monotonic
+values.
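+
+The capture-and-restore rules above can be sketched with `std::time` alone.
+The function names `remaining_at` and `restore_deadline` are hypothetical, and
+the sketch uses `Instant` directly rather than the crate's `Clock` / `Tick`
+abstractions.
+
+```rust
+use std::time::{Duration, Instant};
+
+/// Computes what gets written out: the time left, not a raw monotonic value.
+/// Returns `None` for entries that are already expired, so they are omitted.
+pub fn remaining_at(now: Instant, deadline: Instant) -> Option<Duration> {
+    // `checked_duration_since` yields `None` when `now` is past the deadline.
+    deadline
+        .checked_duration_since(now)
+        .filter(|left| !left.is_zero())
+}
+
+/// Rebuilds a deadline against the new process's clock on restore.
+/// Returns `None` if the sum is not representable as an `Instant`.
+pub fn restore_deadline(new_now: Instant, remaining: Duration) -> Option<Instant> {
+    new_now.checked_add(remaining)
+}
+```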
+
+## Hash Seeds and Map Order
+
+Do not serialize:
+
+- `RandomState` seeds.
+- `ShardSelector::randomized` key material.
+- Hash-map bucket order.
+- Internal `FxHashMap` iteration order.
+
+Serialize semantic data only: keys, values, capacity, policy config, and, if
+warm restore is explicitly chosen, policy metadata in a stable schema.
+
+`ShardSelector::new(shards, seed)` is the exception because deterministic
+routing is its public contract. If a type exposes deterministic sharding as
+part of serialized config, the seed is config data and must be treated as
+secret if keys are attacker-controlled.
+
+## `Arc<V>` and Sharing
+
+Several policies and stores use `Arc<V>`. Serialization should treat `Arc<V>`
+as `V`, not as identity-preserving shared ownership:
+
+- Do not attempt to preserve `Arc::ptr_eq` relationships.
+- Do not serialize refcounts.
+- Do not serialize weak references.
+
+If multiple keys point at the same `Arc<V>`, data-only serialization will
+duplicate the value unless the caller provides a higher-level interning scheme.
+That is acceptable; cachekit should not infer value identity.
+
+## Schema Discipline
+
+For serialized artifacts controlled by cachekit (benchmark JSON, metrics
+snapshots), use explicit schema rules:
+
+- Additive optional fields are minor schema changes.
+- Removing or renaming required fields is a major schema change.
+- Stable identifiers should be constants, not string literals.
+- Include enough metadata for interpretation: version, feature set where
+  relevant, timestamp, and config.
+
+For serde-derived Rust structs, `#[non_exhaustive]` is not enough for external
+JSON compatibility. A downstream JSON consumer still sees fields. If stable
+wire compatibility matters, introduce an explicit versioned artifact type
+rather than serializing internal structs directly.
+
+## What Not To Derive
+
+Do not add `#[derive(Serialize, Deserialize)]` to a policy type just because it
+compiles. Check:
+
+- Does the serialized form expose private pointers, slot ids, or tombstones?
+- Can deserialization validate every invariant?
+- What happens if the target version has different metadata layout?
+- Are hash seeds or time ticks being persisted accidentally?
+- Does restoring this type produce a live, safe cache or only a bag of entries?
+
+If any answer is unclear, add a separate DTO (`SerializableLruCache`) and a
+fallible `try_from` restore path.
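+
+A minimal sketch of that DTO pattern, without the `serde` derives and with a
+toy `ToyLru` type standing in for a real policy. The field layout, the
+`RestoreError` variants, and the specific invariants checked are assumptions
+chosen for illustration.
+
+```rust
+use std::collections::HashSet;
+use std::convert::TryFrom;
+
+/// Plain data form: no pointers, slot ids, hash seeds, or clock ticks.
+pub struct SerializableLruCache {
+    pub capacity: usize,
+    /// Entries in restore order, oldest first.
+    pub entries: Vec<(String, u64)>,
+}
+
+#[derive(Debug, PartialEq, Eq)]
+pub enum RestoreError {
+    ZeroCapacity,
+    OverCapacity { len: usize, capacity: usize },
+    DuplicateKey(String),
+}
+
+/// Toy live cache; it can only be built from validated data.
+#[derive(Debug)]
+pub struct ToyLru {
+    capacity: usize,
+    order: Vec<(String, u64)>,
+}
+
+impl ToyLru {
+    pub fn len(&self) -> usize { self.order.len() }
+    pub fn capacity(&self) -> usize { self.capacity }
+}
+
+impl TryFrom<SerializableLruCache> for ToyLru {
+    type Error = RestoreError;
+
+    fn try_from(dto: SerializableLruCache) -> Result<Self, Self::Error> {
+        if dto.capacity == 0 {
+            return Err(RestoreError::ZeroCapacity);
+        }
+        if dto.entries.len() > dto.capacity {
+            return Err(RestoreError::OverCapacity {
+                len: dto.entries.len(),
+                capacity: dto.capacity,
+            });
+        }
+        {
+            // Reject duplicate keys before any live state is constructed.
+            let mut seen = HashSet::new();
+            for (key, _) in &dto.entries {
+                if !seen.insert(key.as_str()) {
+                    return Err(RestoreError::DuplicateKey(key.clone()));
+                }
+            }
+        }
+        Ok(ToyLru { capacity: dto.capacity, order: dto.entries })
+    }
+}
+```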
+
+## See Also
+
+- [Metrics](metrics.md) - current serde-supported snapshot types
+- [TTL design](ttl.md) - relative TTL serialization recommendation
+- [Hashing and key identity](hashing.md) - hash seeds and map order
+- [Error model](error-model.md) - fallible restore should use `Result`
+- [`src/metrics/snapshot.rs`](../../src/metrics/snapshot.rs)
+- [`bench-support/src/json_results.rs`](../../bench-support/src/json_results.rs)
diff --git a/docs/design/sharding.md b/docs/design/sharding.md
new file mode 100644
index 0000000..ef5c3ea
--- /dev/null
+++ b/docs/design/sharding.md
@@ -0,0 +1,157 @@
+# Sharding
+
+> Status: design rationale for sharded data structures that exist today and
+> roadmap notes for sharded cache policies. Companion to
+> [`concurrency.md`](concurrency.md) and [`hashing.md`](hashing.md).
+
+Sharding reduces contention by splitting one shared structure into N independent
+substructures, each with its own lock and capacity accounting. cachekit already
+uses this pattern at the data-structure and store layers. It does **not** yet
+ship a generic `ShardedCache<C>` or sharded policy wrapper.
+
+## Current Sharded Primitives
+
+| Type | Layer | Purpose |
+|---|---|---|
+| `ShardedHashMapStore<K, V, S>` | store | N locked hash maps with global size counter |
+| `ShardedSlotArena<T>` | data structure | N arenas addressed by `ShardedSlotId` |
+| `ShardedFrequencyBuckets<K>` | data structure | N frequency bucket sets for concurrent LFU-style metadata |
+| `ShardSelector` | helper | keyed hash routing from key to shard |
+
+The sharded primitives are building blocks, not full cache policies. A future
+`ShardedLruCache` would have to compose a sharded key index, per-shard recency
+metadata, and global capacity semantics. That composition is where the hard
+policy questions live.
+
+## Why Shard?
+
+A single `RwLock` wrapper is simple and often fast enough. It stops scaling when:
+
+- many threads mutate policy metadata (`get` on LRU, LFU, Clock);
+- read paths still need atomics or lock acquisition;
+- one hot lock dominates profile samples;
+- cores spend more time waiting than doing cache work.
+
+Sharding turns one contended lock into N less-contended locks. The cost is that
+each shard is now a smaller cache with less global knowledge.
+
+## Shard Routing
+
+All routing should go through [`ShardSelector`](../../src/ds/shard.rs):
+
+```rust,ignore
+let selector = ShardSelector::randomized(16);
+let shard = selector.shard_for_key(&key);
+```
+
+Routing requirements:
+
+- Deterministic within a selector: same key maps to same shard.
+- Uniform: no systematic bias toward lower shards.
+- Keyed: adversaries should not be able to craft keys that all land on shard 0.
+- Bounded: shard count is clamped to `[1, MAX_SHARDS]`.
+
+Use `ShardSelector::randomized` unless reproducibility is required. If using
+`ShardSelector::new(shards, seed)`, treat `seed` as secret when keys are
+user-controlled.
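+
+The routing requirements above can be sketched with the standard library alone.
+This is a hedged illustration, not cachekit's `ShardSelector`: `DemoSelector`
+and `DEMO_MAX_SHARDS` are hypothetical names, the bound of 1024 is an assumed
+value, and `RandomState` stands in for whatever keyed hasher the real selector
+uses.
+
+```rust
+use std::collections::hash_map::RandomState;
+use std::hash::{BuildHasher, Hash};
+
+/// Assumed upper bound, for illustration only.
+pub const DEMO_MAX_SHARDS: usize = 1024;
+
+/// Hypothetical selector: a per-instance random key plus a clamped shard count.
+pub struct DemoSelector {
+    state: RandomState,
+    shards: usize,
+}
+
+impl DemoSelector {
+    /// Keyed: `RandomState::new()` draws fresh random keys per instance.
+    /// Bounded: the shard count is clamped to `[1, DEMO_MAX_SHARDS]`.
+    pub fn randomized(shards: usize) -> Self {
+        let shards = shards.max(1).min(DEMO_MAX_SHARDS);
+        DemoSelector { state: RandomState::new(), shards }
+    }
+
+    pub fn shard_count(&self) -> usize {
+        self.shards
+    }
+
+    /// Deterministic within one selector: the same key always hashes the same.
+    pub fn shard_for_key<K: Hash + ?Sized>(&self, key: &K) -> usize {
+        (self.state.hash_one(key) % self.shards as u64) as usize
+    }
+}
+```
+
+Reducing a 64-bit hash with `%` carries a small modulo bias when the shard
+count is not a power of two; at these shard counts it is negligible.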
+
+## Capacity Semantics
+
+Two capacity models are possible:
+
+| Model | Behaviour | Pros | Cons |
+|---|---|---|---|
+| Per-shard capacity | total capacity split across shards | simple, one lock per op | hit-rate fragmentation |
+| Global capacity | one shared capacity budget | better utilization | cross-shard locking or global victim selection |
+
+The primitives today mostly follow **per-shard local state with global gauges**:
+each shard owns its data; aggregate `len` is tracked separately where needed.
+This keeps operations single-lock. It also means a full shard can evict even if
+another shard has spare room.
+
+That is acceptable for stores and metadata primitives. For a full cache policy,
+it is a hit-rate trade-off and must be documented at the policy level.
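+
+To make the per-shard model concrete, here is one way a total capacity could be
+split so the per-shard budgets sum to exactly the requested total. The
+`split_capacity` helper is hypothetical; the text above does not say how
+cachekit's primitives divide capacity.
+
+```rust
+/// Splits `total` into `shards` budgets that sum to exactly `total`.
+/// The first `total % shards` shards receive one extra slot.
+pub fn split_capacity(total: usize, shards: usize) -> Vec<usize> {
+    let shards = shards.max(1); // avoid division by zero
+    let base = total / shards;
+    let extra = total % shards;
+    (0..shards).map(|i| base + usize::from(i < extra)).collect()
+}
+```
+
+With a total of 100 entries over 16 shards, four shards get 7 slots and twelve
+get 6. A shard that reaches its 6 or 7 entries starts evicting even while
+others sit empty, which is the fragmentation cost named in the table.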
+
+## Locking Discipline
+
+Current sharded operations acquire at most **one shard lock**. This is the most
+important invariant:
+
+- No deadlock cycles.
+- Lock hold time stays bounded by one shard operation.
+- Callers do not need a global lock ordering table.
+
+Any future operation that touches two shards must define an ordering rule, for
+example "lock lower shard index first." Avoid two-shard operations unless the
+hit-rate improvement justifies the concurrency risk.
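+
+A sketch of the "lock lower shard index first" rule, assuming `std::sync::Mutex`
+shards and a hypothetical `lock_two` helper. No such two-shard operation exists
+in cachekit today; this only shows what the ordering rule would look like.
+
+```rust
+use std::sync::{Mutex, MutexGuard};
+
+/// Locks shards `a` and `b`, always acquiring the lower index first so two
+/// threads locking the same pair in opposite argument order cannot deadlock.
+/// Guards come back in argument order; the second is `None` when `a == b`.
+pub fn lock_two<'a, T>(
+    shards: &'a [Mutex<T>],
+    a: usize,
+    b: usize,
+) -> (MutexGuard<'a, T>, Option<MutexGuard<'a, T>>) {
+    if a == b {
+        // Same shard: lock once; locking twice would self-deadlock.
+        return (shards[a].lock().unwrap(), None);
+    }
+    let (lo, hi) = if a < b { (a, b) } else { (b, a) };
+    let lo_guard = shards[lo].lock().unwrap();
+    let hi_guard = shards[hi].lock().unwrap();
+    if a < b {
+        (lo_guard, Some(hi_guard))
+    } else {
+        (hi_guard, Some(lo_guard))
+    }
+}
+```
+
+Every caller must go through the same helper. One code path that locks in
+argument order instead reintroduces the deadlock cycle.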
+
+## `ShardedSlotId`
+
+`ShardedSlotArena<T>` cannot use a plain `SlotId`. A slot id must identify both
+the shard and the local slot:
+
+```text
+ShardedSlotId = (shard_index, local_slot_id)
+```
+
+This is why sharding lives at the data-structure layer instead of being hidden
+behind a generic wrapper. Once a policy stores handles, the handle type is part
+of the policy's metadata layout.
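+
+As an illustration of the pair above, a compound handle might look like the
+following. `DemoShardedSlotId` and its two `u32` fields are assumptions made
+for this sketch; the actual `ShardedSlotId` layout is defined in
+`src/ds/slot_arena.rs`.
+
+```rust
+/// Hypothetical compound handle: which shard, and which slot inside it.
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+pub struct DemoShardedSlotId {
+    shard: u32,
+    slot: u32,
+}
+
+impl DemoShardedSlotId {
+    pub fn new(shard: u32, slot: u32) -> Self {
+        DemoShardedSlotId { shard, slot }
+    }
+
+    pub fn shard(self) -> usize {
+        self.shard as usize
+    }
+
+    pub fn slot(self) -> usize {
+        self.slot as usize
+    }
+
+    /// Packs both halves into one `u64`: shard in the high 32 bits.
+    pub fn to_bits(self) -> u64 {
+        (u64::from(self.shard) << 32) | u64::from(self.slot)
+    }
+
+    /// Inverse of `to_bits`.
+    pub fn from_bits(bits: u64) -> Self {
+        DemoShardedSlotId { shard: (bits >> 32) as u32, slot: bits as u32 }
+    }
+}
+```
+
+A policy that stores such handles must route every lookup through the shard
+half first, which is why the handle type cannot be swapped in transparently.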
+
+## Global Metrics
+
+Sharded types should expose aggregate metrics but record locally when possible.
+The rule:
+
+- Per-operation counters can be local or atomic.
+- Gauges like total `len` need either an atomic aggregate or a shard scan.
+- Snapshot consistency is best-effort; do not lock every shard just to make a
+  metrics snapshot globally atomic.
+
+This matches the metrics design: observability must not dominate the hot path.
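+
+The two options for a `len` gauge can be compared in a small sketch. The
+`DemoShardedSet` type is hypothetical, uses `std::sync::Mutex`, and routes by
+plain modulo purely to keep the example short (real routing should be keyed).
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+use std::sync::Mutex;
+
+/// Hypothetical sharded set with an atomic aggregate length gauge.
+pub struct DemoShardedSet {
+    shards: Vec<Mutex<Vec<u64>>>,
+    len: AtomicUsize,
+}
+
+impl DemoShardedSet {
+    pub fn new(shards: usize) -> Self {
+        let shards = (0..shards.max(1)).map(|_| Mutex::new(Vec::new())).collect();
+        DemoShardedSet { shards, len: AtomicUsize::new(0) }
+    }
+
+    /// Takes exactly one shard lock, then bumps the shared gauge.
+    pub fn insert(&self, value: u64) -> bool {
+        let idx = (value % self.shards.len() as u64) as usize;
+        let mut shard = self.shards[idx].lock().unwrap();
+        if shard.contains(&value) {
+            return false;
+        }
+        shard.push(value);
+        self.len.fetch_add(1, Ordering::Relaxed);
+        true
+    }
+
+    /// Atomic aggregate: one load, no shard locks.
+    pub fn len(&self) -> usize {
+        self.len.load(Ordering::Relaxed)
+    }
+
+    /// Shard scan: locks shards one at a time, so the total is best-effort
+    /// while writers are active.
+    pub fn len_by_scan(&self) -> usize {
+        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
+    }
+}
+```
+
+Neither reading is a globally atomic snapshot under concurrent writes. The
+atomic gauge is simply cheaper to read, at the cost of one extra atomic
+operation per mutation.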
+
+## Roadmap: `ShardedCache<C>`
+
+A generic sharded cache wrapper would look roughly like:
+
+```rust,ignore
+pub struct ShardedCache<C, K> {
+    shards: Vec<RwLock<C>>,
+    selector: ShardSelector,
+    capacity_per_shard: usize,
+    _key: PhantomData<K>,
+}
+```
+
+Open questions:
+
+- Does `C` have to be constructible by `CacheFactory`, or does the builder own
+  all construction?
+- Is capacity split evenly, weighted by shard traffic, or global?
+- Do policies expose per-shard metrics only, or aggregate metrics too?
+- How does `DynCache` integrate: `DynCache::Sharded(Box<...>)` or a sibling
+  `DynShardedCache`?
+- Should shard count be caller-specified, CPU-count-derived, or both?
+
+The conservative first version should use per-shard capacity and one-lock
+operations. Global victim selection should wait for benchmark evidence.
+
+## When Not To Shard
+
+- A single lock already serves the workload without measurable contention.
+- Hit rate matters more than write throughput.
+- Workload has a small hot set: all hot keys may still map to one shard.
+- Cache capacity is small: per-shard fragmentation dominates.
+- You need globally strict eviction order (true global LRU, ARC target `p`).
+
+Sharding is a concurrency optimization, not a policy upgrade.
+
+## See Also
+
+- [Concurrency](concurrency.md)
+- [Hashing and key identity](hashing.md)
+- [Metrics](metrics.md)
+- [`src/ds/shard.rs`](../../src/ds/shard.rs)
+- [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+- [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- [`src/ds/frequency_buckets.rs`](../../src/ds/frequency_buckets.rs)
diff --git a/docs/index.md b/docs/index.md
index 87d50a8..cbc9c14 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,6 +18,11 @@ Key features:
 - [Weighted eviction](design/weighted-eviction.md) — `WeightStore`, dual limits, GDS/GDSF pre-staging
 - [Metrics](design/metrics.md) — Recorder / snapshot / exporter split, Prometheus integration
 - [Error model](design/error-model.md) — Panic vs `Result` discipline, four error types
+- [Benchmarking design](design/benchmarking.md) — Benchmark layers, policy registry, JSON artifacts
+- [Hashing and key identity](design/hashing.md) — Hasher choices, key interning, shard routing
+- [Sharding](design/sharding.md) — Sharded primitives, routing, capacity semantics
+- [Serialization](design/serialization.md) — `serde` surface and cache-state persistence boundaries
+- [Non-goals](design/non-goals.md) — Explicit boundaries and out-of-scope features
 - [TTL design](design/ttl.md) — Worked example of every principle in one feature
 - [API surface](guides/api-surface.md) — Module map and entrypoints
 

From 0feec6b75836cbeb2ae9d61c2aa2b93a17039652 Mon Sep 17 00:00:00 2001
From: Thomas Korrison <thomas_korrison@hotmail.com>
Date: Wed, 13 May 2026 21:41:55 +0100
Subject: [PATCH 3/3] docs: update design documentation to reflect the addition
 of CAR policy and clarify policy counts

- Updated the builder and dynamic dispatch documentation to indicate that cachekit now ships 18 implemented eviction policies, with CAR being a concrete policy not yet exposed through `CachePolicy` / `DynCache`.
- Clarified the distinction between implemented policies and runtime-dispatch variants, ensuring accurate representation of the current state of the library.
- Revised concurrency and trait hierarchy documentation to reflect the updated policy count, enhancing clarity for users regarding available features and capabilities.

These changes improve the accuracy and comprehensiveness of the design documentation, aiding developers in understanding the current state of cachekit's policy implementations.
---
 docs/design/builder-and-dyn-dispatch.md | 42 +++++++++----
 docs/design/concurrency.md              | 80 ++++++++++++++++---------
 docs/design/metrics.md                  | 59 +++++++++---------
 docs/design/trait-hierarchy.md          |  2 +-
 docs/design/weighted-eviction.md        |  2 +-
 5 files changed, 114 insertions(+), 71 deletions(-)

diff --git a/docs/design/builder-and-dyn-dispatch.md b/docs/design/builder-and-dyn-dispatch.md
index 5f222f7..c7d8032 100644
--- a/docs/design/builder-and-dyn-dispatch.md
+++ b/docs/design/builder-and-dyn-dispatch.md
@@ -5,12 +5,14 @@
 > Companion to [`design.md`](design.md) §13, [`trait-hierarchy.md`](trait-hierarchy.md),
 > and [`concurrency.md`](concurrency.md).
 
-cachekit ships 17 implemented eviction policies. Most application code
-wants to pick one of them — possibly at runtime, based on configuration
-— without writing 17 monomorphized call sites. This document explains
-why that runtime choice is delivered through an enum dispatcher rather
-than a `Box<dyn Cache>`, what the user-visible cost is, and how to
-extend the surface when a new policy lands.
+cachekit ships 18 implemented eviction policies. The runtime dispatcher
+currently wires 17 of them; CAR exists as a concrete policy but is not yet a
+`CachePolicy` / `DynCache` variant. Most application code wants to pick a
+policy — possibly at runtime, based on configuration — without writing one
+monomorphized call site per policy. This document explains why that runtime
+choice is delivered through an enum dispatcher rather than a `Box<dyn Cache>`,
+what the user-visible cost is, and how to extend the surface when a new policy
+lands.
 
 ## The problem
 
@@ -22,8 +24,8 @@ cache.insert(key, value);
 cache.get(&key);
 ```
 
-without enumerating the 17 policies at every call site. The cache type
-must therefore be **uniform across policies** — the concrete type the
+without enumerating every builder-wired policy at each call site. The cache
+type must therefore be **uniform across policies** — the concrete type the
 caller holds cannot depend on which policy was chosen.
 
 Two Rust mechanisms give a uniform type:
@@ -158,6 +160,21 @@ enum CacheInner<K, V> /* same bounds */ {
   impossible, which forces feature requests through method additions
   rather than match-arm proliferation in user code.
 
+### CAR builder gap
+
+CAR is implemented as a concrete policy (`src/policy/car.rs`) and has a
+`policy-car` feature flag, but this branch does **not** currently expose it
+through `CachePolicy` / `DynCache`. Users who want CAR instantiate the concrete
+`CarCore<K, V>` type directly. Closing the gap means adding a
+`CachePolicy::Car` variant, a `CacheInner::Car(CarCore<K, V>)` variant, and the
+usual method / builder / test arms listed in [Adding a new policy](#adding-a-new-policy).
+
+Until that lands, read "implemented policies" and "`DynCache` variants" as two
+different sets:
+
+- **Implemented concrete policies:** 18.
+- **Runtime-dispatch variants:** 17.
+
 ## Type bounds: heavier than `Cache<K, V>`
 
 `Cache<K, V>` requires only what each individual policy implementation
@@ -313,7 +330,8 @@ The dispatcher's runtime cost is small. The **maintenance** cost is
 real:
 
 - **17 inner variants** × **~10 `DynCache` methods** = **~170 match
-  arms** that must stay in sync.
+  arms** that must stay in sync today. CAR will make this 18 variants
+  once it is wired into the dispatcher.
 - A `Debug` impl, a `default()` (where applicable), and a
   `validate_policy` arm per variant.
 - A `Cargo.toml` feature flag per variant.
@@ -393,9 +411,9 @@ Distinctness makes `Expiring<Expiring<DynCache>>` structurally
 unrepresentable, which prevents the "two clocks, two indexes"
 double-wrapping bug at the type level.
 
-The duplication is real: a parallel ~170 arms for the expiring
-variant. It is bounded (one type per cross-cutting capability) and
-the trade favours type-level safety over deduplication.
+The duplication is real: a parallel ~170 arms today, rising with the
+dispatcher variant count. It is bounded (one type per cross-cutting
+capability) and the trade favours type-level safety over deduplication.
 
 ## When not to use `DynCache`
 
diff --git a/docs/design/concurrency.md b/docs/design/concurrency.md
index 920b7ad..e418976 100644
--- a/docs/design/concurrency.md
+++ b/docs/design/concurrency.md
@@ -23,48 +23,70 @@ and where the gaps are.
 
 ## The dominant pattern: sequential core, concurrent wrapper
 
-Every concurrent type in cachekit follows the same shape:
+cachekit's concurrent types all keep the sequential core unaware of locking,
+but they do **not** all have the same struct shape. There are three families.
+
+### Cloneable policy handles
+
+Policy-level wrappers are shared handles around a locked policy core:
 
 ```text
-ConcurrentX<K, V> { inner: Arc<RwLock<X<K, V>>> }
+ConcurrentPolicy<K, V> { inner: Arc<RwLock<Policy<K, V>>> }
 ```
 
-where `X` is the single-threaded core (`LruCore`, `FifoCache`,
-`S3FifoCache`, `SlotArena`, `IntrusiveList`, `ClockRing`,
-`HashMapStore`, `SlabStore`, `WeightStore`, `HandleStore`,
-`FrequencyBuckets`). The wrapper:
+This shape is used by:
 
-1. holds the core behind an `Arc<RwLock<…>>`,
-2. presents owned/`Arc<V>` returns instead of borrowed `&V`,
-3. is `Clone` via `Arc::clone` so callers can hand copies to threads,
-4. is `Send + Sync` because the inner core's `Send + Sync` impls
-   auto-derive through `Arc<RwLock<…>>`.
+- `ConcurrentLruCache` — [`src/policy/lru.rs`](../../src/policy/lru.rs)
+- `ConcurrentFifoCache` — [`src/policy/fifo.rs`](../../src/policy/fifo.rs)
+- `ConcurrentS3FifoCache` — [`src/policy/s3_fifo.rs`](../../src/policy/s3_fifo.rs)
 
-The pattern is verbose but consistent and was chosen for three reasons:
+These types implement `Clone` via `Arc::clone`, so callers can hand cheap
+handles to threads. They expose owned / `Arc<V>` returns instead of borrowed
+`&V` because no reference can safely outlive the lock guard it came from.
 
-- **No `&mut self` in the public API.** Sharing requires interior
-  mutability; an `RwLock` is the cheapest tool that exposes both shared
-  reads and exclusive writes through `&self`.
-- **The sequential core stays unaware of locking.** Policy code under
-  [`src/policy/`](../../src/policy) is single-threaded and easier to
-  reason about. The locking discipline lives in one place per type.
-- **The lock is replaceable.** Swapping `parking_lot::RwLock` for a
-  different primitive (`std::sync::RwLock`, a sharded lock, a seqlock)
-  is a local change because the inner core has no opinion on it.
+### Owning store and data-structure wrappers
 
-The 11 concurrent wrappers shipped today live in:
+Store and data-structure wrappers usually own the lock directly:
+
+```text
+ConcurrentX<K, V> { inner: RwLock<X<K, V>>, ... }
+```
+
+Examples:
 
-- `ConcurrentLruCache` — [`src/policy/lru.rs`](../../src/policy/lru.rs)
-- `ConcurrentFifoCache` — [`src/policy/fifo.rs`](../../src/policy/fifo.rs)
-- `ConcurrentS3FifoCache` — [`src/policy/s3_fifo.rs`](../../src/policy/s3_fifo.rs)
 - `ConcurrentHashMapStore`, `ShardedHashMapStore` — [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
 - `ConcurrentSlabStore` — [`src/store/slab.rs`](../../src/store/slab.rs)
 - `ConcurrentWeightStore` — [`src/store/weight.rs`](../../src/store/weight.rs)
 - `ConcurrentHandleStore` — [`src/store/handle.rs`](../../src/store/handle.rs)
-- `ConcurrentSlotArena`, `ShardedSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
+- `ConcurrentSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
 - `ConcurrentIntrusiveList` — [`src/ds/intrusive_list.rs`](../../src/ds/intrusive_list.rs)
 - `ConcurrentClockRing` — [`src/ds/clock_ring.rs`](../../src/ds/clock_ring.rs)
+
+These wrappers are not necessarily cloneable handles. A caller who wants shared
+ownership can wrap the whole type in `Arc<_>`. Keeping the `Arc` out of the
+struct avoids imposing a refcount on users who only need a single owner.
+
+### Sharded primitives
+
+Sharded types own multiple independently locked shards:
+
+```text
+ShardedX<K, V> {
+    shards: Vec<RwLock<ShardState<K, V>>>,
+    selector: ShardSelector,
+}
+```
+
+Examples:
+
+- `ShardedSlotArena` — [`src/ds/slot_arena.rs`](../../src/ds/slot_arena.rs)
 - `ShardedFrequencyBuckets` — [`src/ds/frequency_buckets.rs`](../../src/ds/frequency_buckets.rs)
+- `ShardedHashMapStore` — [`src/store/hashmap.rs`](../../src/store/hashmap.rs)
+
+The common design is not "`Arc<RwLock<_>>` everywhere"; it is **lock at the
+wrapper boundary and keep the sequential core lock-free**. The exact ownership
+shape depends on whether the type is intended to be a cloneable cache handle,
+an owning concurrent store, or a sharded primitive.
 
 ## Why `Concurrent*` does not implement `Cache<K, V>`
 
@@ -271,8 +293,8 @@ When sharding is **not** what you want:
 
 ## Concurrent policy coverage
 
-Of the 17 implemented policies, **3 ship with a `Concurrent*` wrapper
-today**: LRU, FIFO, S3-FIFO. The remaining 14 require external locking
+Of the 18 implemented policies, **3 ship with a `Concurrent*` wrapper
+today**: LRU, FIFO, S3-FIFO. The remaining 15 require external locking
 by the caller — typically `Arc<parking_lot::RwLock<CacheCore>>`. The
 relevant rustdoc on those policies (e.g. `LfuCache`, `HeapLfuCache`,
 `MfuCache`) calls this out.
@@ -282,7 +304,7 @@ mechanical: wrap the sequential core in `Arc<RwLock<…>>`, expose the
 `&self` API with `Arc<V>` returns, decide read-lock vs. write-lock per
 method, implement `Clone` via `Arc::clone`, and implement
 `unsafe impl ConcurrentCache`. The work is bounded; what's missing is
-the discipline to do it consistently across all 17 policies.
+the discipline to do it consistently across all 18 policies.
 
 ## Failure modes
 
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
index 3a296d7..88ace2b 100644
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -164,10 +164,10 @@ Two questions this design avoided:
 - **"Why not just `AtomicU64` for everything?"** Because counters
   on `&mut self` paths (the majority — `insert`, `get`, `evict`)
   do not need atomic semantics; the policy already holds exclusive
-  access. Using `AtomicU64` everywhere would impose memory-fence
-  cost on the hot path for no concurrency benefit. The split
-  reserves atomic-ish behaviour (`MetricsCell` + external lock) for
-  read paths only.
+  access. However, `MetricsCell` is only sound when `&self` metric
+  increments are protected by exclusive synchronization or are known
+  to be single-threaded. It is **not** a substitute for atomics under
+  shared `RwLock::read` access.
 
 ## `MetricsCell`: interior mutability under external lock
 
@@ -180,18 +180,20 @@ unsafe impl Sync for MetricsCell {}
 unsafe impl Send for MetricsCell {}
 ```
 
-This is the only `unsafe impl Sync` in the metrics surface. The
-contract:
-
-- **External synchronization is required.** `MetricsCell` lives
-  inside a policy struct that is itself behind an `RwLock` in any
-  concurrent wrapper (see [`concurrency.md`](concurrency.md)). The
-  read lock serializes concurrent `&self` access; the cell is
-  manipulated under that lock.
-- **The cell is observation-only.** Lost increments are
-  acceptable; the worst-case outcome is undercounting a metric,
-  which is a precision issue, not a correctness one. Cache hits and
-  evictions still behave correctly.
+This is the only `unsafe impl Sync` in the metrics surface, so its
+contract must be narrow:
+
+- **Exclusive external synchronization is required.** A shared
+  `RwLock::read` guard does **not** serialize readers, so it is not
+  sufficient protection for `Cell<u64>`. `MetricsCell` may be used
+  on single-threaded policy paths, or behind a write lock / mutex,
+  but not for counters mutated concurrently through read-locked
+  `&self` methods.
+- **Observation-only does not relax Rust's aliasing rules.** It is
+  acceptable for metrics to be approximate; it is not acceptable for
+  approximation to be implemented as unsynchronized `Cell` mutation.
+  Concurrent read-path counters must use `AtomicU64`, take an
+  exclusive lock, or be disabled for that path.
 - **`pub(crate)`.** The type does not escape the crate.
   Down-stream code can read counters through the snapshot API but
   cannot construct `MetricsCell` itself, which prevents misuse from
@@ -200,14 +202,15 @@ contract:
 The alternatives considered and rejected:
 
 - `Mutex<u64>` — cost dominates the counter increment.
-- `AtomicU64` — works, but imposes fence cost where no concurrency
-  exists for the increment itself.
+- `AtomicU64` — the correct choice for counters that can be
+  incremented concurrently through shared references; unnecessary
+  for single-threaded or exclusively locked counters.
 - `RefCell<u64>` — runtime borrow checking with panic on contention;
   not desirable on a metrics increment path.
 
-`MetricsCell` is the smallest tool that says "we know about the
-sync requirement; trust the external lock; pay no per-increment
-cost beyond a `Cell::set`."
+`MetricsCell` is the smallest tool for single-threaded or exclusively
+locked metric counters. Any policy or wrapper that records metrics
+from a read-locked path must not rely on `MetricsCell` for soundness.
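+
+For contrast, a read-path counter that stays sound under shared
+`RwLock::read` access could use an atomic. This is a sketch with
+hypothetical names (`DemoCore`, `peek_shared`) and `std::sync::RwLock`;
+it is not cachekit's recorder.
+
+```rust
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::RwLock;
+
+/// Hypothetical core whose `&self` read path records a hit.
+pub struct DemoCore {
+    value: u64,
+    hits: AtomicU64,
+}
+
+impl DemoCore {
+    pub fn new(value: u64) -> Self {
+        DemoCore { value, hits: AtomicU64::new(0) }
+    }
+
+    /// Safe to call from many read-locked threads at once: the
+    /// increment is an atomic read-modify-write, not a `Cell::set`.
+    pub fn peek(&self) -> u64 {
+        self.hits.fetch_add(1, Ordering::Relaxed);
+        self.value
+    }
+
+    pub fn hits(&self) -> u64 {
+        self.hits.load(Ordering::Relaxed)
+    }
+}
+
+/// Reads through a shared guard; concurrent callers are not serialized.
+pub fn peek_shared(cache: &RwLock<DemoCore>) -> u64 {
+    cache.read().unwrap().peek()
+}
+```
+
+`Ordering::Relaxed` is enough for a counter that is only observed,
+because no other memory is published through it.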
 
 ## Snapshots: cheap, copyable, optionally serializable
 
@@ -482,11 +485,11 @@ What it does **not** guarantee:
   sequentially. A reader can observe `hits = 100, misses = 99`
   while a concurrent writer is mid-update; the next snapshot may
   show `hits = 100, misses = 101`. There is no "snapshot epoch."
-- **Lossless recording under contention.** `MetricsCell`
-  increments under a held read lock are safe; multiple read locks
-  are not serialized against each other. Concurrent `&self`
-  recorder calls on the same `MetricsCell` can lose increments.
-  This is the "best-effort observability" caveat.
+- **Concurrent `MetricsCell` recording.** `MetricsCell` must not be
+  incremented from multiple read-locked callers. Shared read locks do
+  not serialize readers, so those paths must use atomics or acquire an
+  exclusive lock before recording. Metrics may be best-effort, but
+  the implementation still has to be data-race-free.
 - **Wrap-safe arithmetic in release.** Release profile sets
   `overflow-checks = false`. Counters wrap silently. At one billion
   events per second, `u64` wraps in ~585 years — practically a
@@ -498,8 +501,8 @@ What it does **not** guarantee:
   principles level
 - [Cache trait hierarchy](trait-hierarchy.md) — `&self` / `&mut self`
   split that drives the read-vs-mutate recorder fork
-- [Concurrency](concurrency.md) — read/write lock model behind
-  `MetricsCell`'s soundness
+- [Concurrency](concurrency.md) — read/write lock model that
+  constrains where `MetricsCell` may be used
 - [Error model](error-model.md) — panic discipline shared by the
   exporter's poisoning behaviour
 - [`src/metrics/`](../../src/metrics) — the canonical implementation
diff --git a/docs/design/trait-hierarchy.md b/docs/design/trait-hierarchy.md
index 70a786b..2eee6ea 100644
--- a/docs/design/trait-hierarchy.md
+++ b/docs/design/trait-hierarchy.md
@@ -18,7 +18,7 @@ The trait surface optimizes for four things, roughly in order:
 
 1. **Code written against the kernel survives a policy swap.** Users
    writing `fn warm<C: Cache<K, V>>(c: &mut C, …)` can pick any of
-   the 17 implemented policies without changing call sites.
+   the 18 implemented concrete policies without changing call sites.
 2. **Optional behaviour is visible only when present.** A policy that
    doesn't track frequency should not have a `frequency()` method that
    returns garbage or panics. Capability traits exist so this remains
diff --git a/docs/design/weighted-eviction.md b/docs/design/weighted-eviction.md
index 7ce0910..7a8dbac 100644
--- a/docs/design/weighted-eviction.md
+++ b/docs/design/weighted-eviction.md
@@ -181,7 +181,7 @@ the module documentation.
 
 ## Why weight is at the **store** layer, not the policy layer
 
-The 17 implemented policies in `src/policy/` are all weight-unaware.
+The 18 implemented policies in `src/policy/` are all weight-unaware.
 They count entries and evict by entry. `WeightStore` is below them in
 the layering: