diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md new file mode 100644 index 000000000..a153f1697 --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -0,0 +1,468 @@ +# S3 PUT admission control + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the workload-class +> isolation proposal (`docs/design/2026_04_24_proposed_workload_isolation.md`). +> Where PR #636 fixes the *per-entry* memory accounting and the +> isolation doc keeps Raft's CPU budget separate from heavy command +> paths, this doc bounds the *aggregate* in-flight memory that S3 +> PUT body bytes can pin in the leader's Raft pipeline regardless of +> how many clients are uploading at once. + +--- + +## 1. Problem + +Even with PR #636 reducing the per-entry size to 4 MiB and aligning +with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is +governed by client concurrency, which the server cannot currently +limit: + +```text +leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size + ≈ concurrent_PUTs × MaxInflight × 4 MiB +``` + +A single PUT's pipeline is already capped by `MaxInflight = 1024` Raft +in-flight messages times 4 MiB per entry = 4 GiB / peer (the bound PR +#593 advertises). What multiplies that bound is concurrent uploads: + +- 4 simultaneous 5 GiB PUTs on a 4-vCPU 8 GiB-RAM node can pin + `4 × min(MaxInflight × 4 MiB, body_remaining)` on the leader. If a + follower stalls (GC pause, slow disk fsync) for several seconds, that + worst case is realised before backpressure kicks in. +- The 2026-04-24 incident showed that one workload class can wedge the + Go runtime for the entire cluster. S3 PUT body bytes pinned in the + Raft pipeline play the same role as XREAD blobs in the Redis path: + they grow with client behaviour, not with anything the operator can + rate-limit per-handler. +- `GOMEMLIMIT = 1800 MiB` plus `--memory = 2500m` (PR #617) set the + upper bound on heap, but they do not provide *flow control*. The Go + GC starts thrashing well before the limit, and memwatch (PR #612) + triggers graceful shutdown — a recovery mechanism, not a steady-state + control. + +There is no hard ceiling today on the aggregate body bytes that S3 +puts onto the Raft pipeline. PR #636 makes the per-entry slot +predictable; this proposal makes the *number of slots in flight* +predictable too. + +## 2. Goals & non-goals + +**Goals** + +- Hard cap on the total S3 PUT body bytes that have been read from + HTTP but not yet committed to Raft. This cap is a tunable property + of the S3 server, not of any single PUT. +- Steady-state behaviour where new PUTs are admitted at the rate Raft + can drain. Bursting clients see HTTP 503 with `Retry-After`, not + silent ballooning of leader memory. +- Composable with existing memwatch + GOMEMLIMIT. Admission failure + must be visible (metric + log line per rejected request) so an + operator can size the cap against a given memory budget. +- Read path is unaffected. GET / HEAD / LIST do not consume the same + budget. + +**Non-goals** + +- No global QPS quota across all S3 verbs. Other verbs (GET, multipart + initiate / complete) have well-bounded memory and don't need this. +- No per-bucket / per-key fairness. Fairness across tenants is a + separate workload-class concern (handled by the workload isolation + proposal). 
This doc only fixes the aggregate ceiling. +- No backpressure-via-TCP (`SO_RCVBUF` shrink). Decoupling from kernel + buffers keeps the budget testable and avoids cross-OS surprises. +- No multi-region / multi-cluster admission. One leader's perspective. + +## 3. Design + +### 3.1 Where the budget is checked + +Two natural insertion points exist on the PUT path: + +| Insertion point | Granularity | When the request is rejected | +|---|---|---| +| (A) Before `prepareStreamingPutBody` accepts the first byte | Whole-object | Pre-Raft, before any reads from the body | +| (B) Inside the `flushBatch` loop, before `coordinator.Dispatch` | Per-batch | Mid-stream, after `s3ChunkBatchOps` chunks are buffered | + +Recommended: **both, at different scales**. + +- **(A) is the request-admission cap.** Use the `Content-Length` + header (rejected when absent for PUTs that are not aws-chunked) to + pre-charge the budget. If the budget would be exceeded we reply 503 + immediately. This matches how AWS S3 itself handles + `SlowDown` / `ServiceUnavailable` and is the cheapest case to + surface — the body is never read from the socket. +- **(B) is the in-flight cap.** Even after a request is admitted, its + 4 MiB batches can pile up if Raft becomes slow. The flush loop + acquires a per-batch lease (4 MiB) from a shared semaphore *before* + reading the next chunk window. If the semaphore is empty for longer + than `dispatchAdmissionTimeout` (proposed default 30 s), the PUT + fails with 503 mid-stream. Dispatch latency stays bounded by the + semaphore size rather than by client behaviour. + +```text +client ─[Content-Length]─► (A) reserve full body bytes + │ + ▼ + prepareStreamingPutBody + │ + ▼ + (B) acquire 4 MiB slot ◄┐ (released on Dispatch ack) + │ + ▼ + coordinator.Dispatch ───┘ +``` + +### 3.2 Concrete cap values + +Default cap: + +```go +const ( + // s3PutAdmissionMaxInflightBytes is the hard ceiling on S3 PUT body + // bytes accepted by this node but not yet committed to Raft. Picked + // so that, on a 3-node cluster with MaxInflight × MaxSizePerMsg = + // 4 GiB / peer per PR #593, all in-flight PUT bytes plus their Raft + // replication shadow stay under one quarter of GOMEMLIMIT (1800 MiB + // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble + // memtables. + s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // s3RaftEntryByteBudget is the per-batch unit acquired and + // released against the semaphore. It must equal the byte + // budget of one Raft entry produced by PutObject / + // UploadPart's flush loop (PR #636: s3ChunkSize × + // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing + // overhead — kept abstract here so the admission contract + // does not lock the entry-size choice). The semaphore's + // capacity is `s3PutAdmissionMaxInflightBytes / + // s3ChunkSize` 1 MiB-units; per-batch acquire takes + // `s3RaftEntryByteBudget / s3ChunkSize` units at a time + // (= 4 units on the default tunables). + s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps + // dispatchAdmissionTimeout is how long a per-batch flush will wait + // for a slot before giving up. The 256 MiB cap drains in ~2 s at + // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), + // so 30 s is *not* sized against normal drain — it is the budget + // for a transiently stalled follower (GC pause, slow disk fsync, + // bounded leader re-election) to recover before we conclude the + // cluster is genuinely overloaded. 
Longer stalls surface as 503, + // which is the right signal: at that point the right action is + // operator intervention (scale out, investigate the stall), not + // continued accumulation. + dispatchAdmissionTimeout = 30 * time.Second +) +``` + +The cap is intentionally **per-node** rather than cluster-wide: +admission is enforced on the node receiving the HTTP request, which is +also the node whose memory is at risk. Clients hitting a different +node simply get a different budget. + +Both values are exposed as env vars (`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES`, +`ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT`) with the constants as +defaults, following the pattern PR #593 established for +`ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` etc. + +### 3.3 Data structure + +```go +type putAdmission struct { + semaphore chan struct{} // capacity == max / s3ChunkSize, charges in 1 MiB units + inflight atomic.Int64 // metric mirror; semaphore is the source of truth +} + +func newPutAdmission(maxBytes int64) *putAdmission { … } + +// peekHeadroom is admission A. It returns ErrAdmissionExhausted +// without acquiring slots when the requested byte count exceeds +// the *current* free capacity of the semaphore. It does NOT take +// out a reservation — the only effect is "fail fast at request +// entry instead of partway through the body" — so it cannot +// double-count against admission B. +func (a *putAdmission) peekHeadroom(bytes int64) error { … } + +// acquire is admission B. It blocks until (bytes / s3ChunkSize) +// slots are available or ctx fires. The returned release closure +// MUST be called exactly once. If `bytes > capacity * s3ChunkSize` +// (a malformed client whose frame exceeds the entire budget), +// returns ErrAdmissionExhausted *immediately* without waiting — +// otherwise we would block until ctx (typically +// dispatchAdmissionTimeout) for a request that can never fit. +func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) { … } +``` + +The two-step contract avoids the double-charge / unbounded-window +hazard the obvious "A is also an acquire" design would have: + +1. **Admission A — request-entry headroom check (peek only).** When + a PUT arrives with `Content-Length: N`, the handler calls + `peekHeadroom(N)`. If the result is `ErrAdmissionExhausted`, + reply 503 immediately and never read from the body. If `nil`, + admission A is done — no slots have been taken out of the + semaphore. This is intentionally racy with concurrent PUTs: + admission A only promises "at the moment we asked, the budget + *would have fit* this request"; it does not reserve the budget. +2. **Admission B — per-batch acquire/release (the only path that + touches the semaphore).** The PUT handler then loops the body + in `s3ChunkSize × s3ChunkBatchOps = 4 MiB` windows; each window + acquires 4 MiB worth of slots via `acquire`, reads the next + window from the body, dispatches, and releases on Dispatch + ack. This is the bound that actually holds — at any instant the + sum of held slots across all in-flight PUTs cannot exceed the + semaphore's capacity. If admission A's racy estimate turns out + to be wrong (another PUT raced in between A's check and the + first B-acquire), the first B-acquire blocks until the + contending PUT releases or `dispatchAdmissionTimeout` fires. + +The semaphore is therefore charged **only by admission B**. Bytes +in flight = `held_B_slots × s3ChunkSize`, full stop; admission A is +a fast-fail gate, not a reservation. 
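+
+A minimal sketch of how the stubbed bodies above could satisfy this
+contract (illustrative only — the slot rounding, the error value, and
+the `sync.Once` release guard are choices of this sketch, not part of
+the proposal; assumes imports `context`, `errors`, `sync`):
+
+```go
+var ErrAdmissionExhausted = errors.New("s3 put admission: inflight byte budget exhausted")
+
+func newPutAdmission(maxBytes int64) *putAdmission {
+	return &putAdmission{semaphore: make(chan struct{}, int(maxBytes/s3ChunkSize))}
+}
+
+// peekHeadroom is admission A: a racy snapshot comparison, never a
+// reservation (step 1 above).
+func (a *putAdmission) peekHeadroom(bytes int64) error {
+	need := (bytes + s3ChunkSize - 1) / s3ChunkSize // round up to whole slots
+	if need > int64(cap(a.semaphore)-len(a.semaphore)) {
+		return ErrAdmissionExhausted
+	}
+	return nil
+}
+
+// acquire is admission B. It charges slot-by-slot; if ctx fires
+// (typically dispatchAdmissionTimeout) the partial hold is rolled
+// back so a timed-out PUT never strands budget.
+func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) {
+	need := (bytes + s3ChunkSize - 1) / s3ChunkSize
+	if need > int64(cap(a.semaphore)) {
+		return nil, ErrAdmissionExhausted // can never fit: fail fast, never wait
+	}
+	for taken := int64(0); taken < need; taken++ {
+		select {
+		case a.semaphore <- struct{}{}:
+		case <-ctx.Done():
+			for ; taken > 0; taken-- { // roll back the partial hold
+				<-a.semaphore
+			}
+			return nil, ctx.Err()
+		}
+	}
+	a.inflight.Add(need * s3ChunkSize)
+	var once sync.Once // enforces the release-exactly-once contract
+	return func() {
+		once.Do(func() {
+			for i := int64(0); i < need; i++ {
+				<-a.semaphore
+			}
+			a.inflight.Add(-need * s3ChunkSize)
+		})
+	}, nil
+}
+```
+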
This is the model +implementations MUST follow — a "both A and B charge" design would +double-count every PUT against itself and an admission-A-only +design would lose its bound the moment a PUT's body exceeded its +declared `Content-Length` (chunked transfers, malformed clients). + +### 3.3.1 aws-chunked transfers (`Content-Length: -1`) + +A naïve "reserve `s3MaxObjectSizeBytes` (5 GiB) up front" is rejected: +on default tunables (`s3PutAdmissionMaxInflightBytes = 256 MiB`) a +single chunked PUT would consume **20×** the entire budget at request +admission time, head-of-line-blocking every other PUT until the +chunked stream finishes — exactly the failure mode admission control +exists to prevent. We therefore split chunked admission across two +mechanisms instead of pre-charging: + +1. **Bootstrap headroom check** at request entry. Calls + `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A + contract: a fast-fail check that 4 MiB *would have fit* at the + moment we asked. **No slot is acquired.** This is intentionally + racy with concurrent PUTs (same as fixed-length admission A); + its job is to fail at request entry rather than partway + through the first decoded frame. Chunked PUTs are not "free" + — they still must beat the same admission queue as fixed-length + PUTs at the per-frame level. +2. **Pay-as-you-decode** thereafter, charged via an + `awsChunkedReader` progress callback. The callback **buffers + decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is + accumulated**, then calls `acquire(s3ChunkSize)` on the + semaphore (same path as fixed-length admission B). This keeps + the slot unit coherent: the semaphore's capacity is + `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so + acquiring at sub-MiB granularity is not representable. The + slot is released once the corresponding + `coordinator.Dispatch` acks the chunk. The buffer never holds + more than `s3ChunkSize - 1` decoded bytes, so the worst-case + memory overhead beyond the semaphore-tracked bytes is bounded + by 1 MiB per concurrent chunked PUT. + +Failure modes: + +- If the awsChunkedReader produces decoded bytes faster than Raft + drains, the next 1 MiB acquire blocks (capped by + `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503 + closes the connection. The legacy "reserve 5 GiB" approach + would have surfaced as 503 *at request entry* for unrelated + PUTs; this approach surfaces as mid-stream 503 for the chunked + PUT itself, which is the right blame attribution. +- The bootstrap check at step 1 is racy: another PUT can consume + the headroom between the check and the first per-frame acquire. + When that happens the first acquire blocks (or 503s on + timeout) — the same path the fixed-length admission B handles + for the contending case. The race is intentional: making the + check a real reservation would multiply per-request slot hold + by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit + with no corresponding payload, reintroducing a head-of-line + hazard. +- If the awsChunkedReader produces a single frame whose decoded + size never accumulates to a full `s3ChunkSize`, the buffer + flushes on stream EOF: a final `acquire(actual_buffered_bytes)` + rounded up to one slot is taken (semaphore charges in 1-slot + units regardless of actual byte count), so the bound holds. +- A malformed client that decodes bytes faster than Raft drains + *cannot* trigger the immediate-503 path the way a fixed-length + PUT can. 
The accumulation design (callback always calls
+  `acquire(s3ChunkSize)`, never larger) means the per-frame
+  acquire request is bounded by 1 MiB — the
+  "if `bytes > capacity * s3ChunkSize`" early-return in
+  `acquire`'s spec is never hit on the chunked path. Instead,
+  successive 1 MiB acquires block under Raft pressure and the
+  PUT eventually surfaces 503 on `dispatchAdmissionTimeout` —
+  the same path a slow follower triggers. The "immediate 503 for
+  oversized request" failure mode applies only to fixed-length
+  PUTs (via `peekHeadroom(Content-Length > 256 MiB)`).
+
+This change moves chunked admission from M4 (originally "deferred
+optimisation") into M1 (the first shippable milestone). M1 ships
+with the progress-callback wired *unconditionally* for all chunked
+PUTs; an env-var switch falls back to "bootstrap-only" charging
+without the per-decode credit if a corner case requires it
+(`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, default
+`true`). The fallback path keeps the 5 GiB-reservation hazard
+behind an explicit operator decision rather than letting it
+materialise by default.
+
+### 3.4 Failure mode
+
+- `503 Service Unavailable` with `Retry-After: 1` (small, jittered).
+  AWS S3 SDK clients (boto3, aws-sdk-go-v2) auto-retry this code with
+  exponential backoff out of the box.
+- Body for the response is the standard S3 XML error envelope:
+  `<Code>SlowDown</Code><Message>Reduce your request rate</Message>`.
+  `SlowDown` is the AWS-defined code for admission rejection and
+  matches what real S3 returns.
+- Mid-stream rejection (admission B) closes the connection with a
+  `Connection: close` header so partial body reads do not corrupt the
+  client's pipeline. The PUT handler also calls
+  `cleanupManifestBlobs` for any partial blobs that already landed in
+  Pebble.
+
+### 3.5 Metrics
+
+```text
+elastickv_s3_put_admission_inflight_bytes   gauge
+elastickv_s3_put_admission_rejections_total counter (labels:
+    stage = "prereserve" | "perbatch",
+    protocol = "fixed-length" | "chunked")
+elastickv_s3_put_admission_wait_seconds     histogram (labels: stage, protocol)
+```
+
+The `protocol` label distinguishes fixed-length PUTs (those with a
+declared `Content-Length`, hitting admission A's `peekHeadroom`)
+from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode).
+This split is what makes the chunked-PUT 503 surface (§6) and the
+rolling-upgrade alerting story actionable: a spike on
+`stage="perbatch", protocol="chunked"` points at "chunked clients
+beat Raft drain"; a spike on `stage="prereserve",
+protocol="fixed-length"` points at "client concurrency exceeds
+the per-node aggregate cap." Without the dimension the two
+failure modes are indistinguishable in a single counter.
+
+Grafana panel: inflight gauge with the cap as a horizontal line so
+the operator sees how often the system saturates. A sustained
+rejection rate suggests bumping the cap or scaling out (more nodes
+spread PUT load).
+
+## 4. Interaction with related subsystems
+
+- **PR #636 (entry size alignment).** The 4 MiB per-batch unit is
+  the natural admission grain. The two changes are independent:
+  alignment is necessary for the per-peer Raft bound to be correct;
+  admission is necessary for the *aggregate* bound to be hard.
+- **PR #612 (memwatch graceful shutdown).** Continues to function as
+  the last-resort safety net. 
Admission control should fire at well + below the memwatch threshold (`s3PutAdmissionMaxInflightBytes` is + ~14% of `GOMEMLIMIT`) so memwatch sees a much lower steady-state + pressure and the graceful-shutdown path stays a rare event. +- **Workload isolation proposal.** That doc proposes per-class CPU + reservation for Raft. Admission control is the memory-axis sibling. + Both are needed — limiting CPU does not bound queue depth. +- **`coordinator.Dispatch` retries.** Today the S3 path has its + own retry loop (`s3TxnRetryMaxAttempts = 8` with exponential + backoff capped at `s3TxnRetryMaxBackoff = 32 ms`). The admission + contract is **hold-through-retry**: the per-batch slot acquired + in admission B is released exactly once, on the *final* outcome + of the retry chain (success ack, terminal error, or + `dispatchAdmissionTimeout` expiring), not between attempts. + Rationale: the bytes are still buffered in the PUT handler's + pendingBatch slice for the entire retry window, so the budget + must reflect them; a release-between-retries scheme would let a + second PUT proceed while the first is still memory-resident, + breaking the bound. The S3 PUT path uses the inbound + `*http.Request` context for `coordinator.Dispatch` (no + S3-specific Dispatch timeout — the HTTP server's + `writeTimeout` / client-side cancellation is the upper bound on + one Dispatch attempt), so the wall-clock cost of holding the + slot through one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)` + where `single_dispatch_budget` is whatever the request context + permits at that moment. If the retry chain duration ever + exceeds `dispatchAdmissionTimeout` the per-batch acquire on the + *next* batch surfaces as 503 — the right failure mode + (chronic dispatch failure → caller learns instead of silently + consuming the budget). + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. | +| M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | +| M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. | + +### Rolling upgrade + +Admission is purely additive on the request entry path: a node +without the cap behaves identically to a node with the cap set +infinitely high. A mixed cluster (some nodes M1, some still on +`main`) is therefore safe — clients hitting the upgraded node see +admission, clients hitting an old node see no admission, but +neither path corrupts state. 
The default cap is intentionally
+generous enough that even single-node M1 traffic falls below the
+threshold under typical load, so the rollout signature is
+"503 SlowDown rate goes from 0 to negligible" rather than a step
+function. Operators can pin
+`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES=$(((1<<63)-1))`
+(max int64 — note that `$((1<<63))` overflows to a negative value)
+to effectively disable the cap on M1 nodes during the burn-in
+window if desired.
+
+The aws-chunked progress-callback path is the only behaviour
+change visible to clients: a chunked PUT that would have succeeded
+under the old "no admission" code can now 503 mid-stream when Raft
+drain falls behind. This is by design — that is the failure mode
+admission control exists to surface — but operators should expect
+to see chunked-upload 503s where there were none before. The
+`stage="perbatch", protocol="chunked"` rejection-counter label
+isolates this signal; bumping the cap or
+`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false` (with the
+HoL hazard re-introduced as a known trade-off) restores legacy
+behaviour during incident response.
+
+Acceptance criteria:
+
+- `go test ./adapter/ -short -run TestS3PutAdmission` covers reject /
+  admit / mid-stream-timeout paths.
+- A load test that opens 32 concurrent PUTs of 100 MiB each must
+  hold leader memory below `s3PutAdmissionMaxInflightBytes + epsilon`
+  for the duration of the test.
+- No regression in `Test_grpc_transaction` (which is currently the
+  longest leader-stress test).
+
+## 6. Risks
+
+- **Tail-latency for legitimate clients.** A long-running PUT that
+  loses a 4 MiB slot mid-stream returns 503 even though it is making
+  progress. Mitigated by `dispatchAdmissionTimeout = 30s`, well above
+  the steady-state Raft drain time. If we observe spurious 503s in
+  practice, tune the timeout down via
+  `ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT` (§3.2).
+- **Operator confusion.** "Why does S3 return 503 when CPU is at
+  20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After`
+  value so SDK behaviour is predictable.
+- **New chunked-PUT 503 surface.** Pay-as-you-decode admission
+  (§3.3.1) ships in M1 alongside fixed-length admission, so the
+  legacy 5 GiB pre-charge hazard does not materialise as a
+  steady-state risk. The residual risk it introduces is the
+  inverse: a chunked PUT that would have silently succeeded under
+  the no-admission code can now 503 mid-stream when Raft drain
+  falls behind. This is by design — that is the failure mode
+  admission control exists to surface — but it is the only
+  client-visible behaviour change in M1 and is what operators
+  should expect to see in dashboards. The
+  `stage="perbatch", protocol="chunked"` label on the rejection
+  counter (§3.5) isolates the signal; the operator escape hatch
+  is `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, which
+  reverts to bootstrap-only charging at the cost of
+  re-introducing the 5 GiB head-of-line hazard.
+
+## 7. Out of scope (future work)
+
+- Per-bucket admission classes (e.g. system buckets get their own
+  budget). Punted to the workload-isolation rollout.
+- Coordinated admission across the multi-region read replica path
+  proposed in `docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.
+- Token-bucket rate-shaping (e.g. bytes-per-second). The current
+  proposal only bounds *concurrent* bytes; rate-shaping is a separate
+  policy choice. 
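+
+## Appendix: PUT-handler admission sketch (non-normative)
+
+A compressed sketch of how the two admission stages compose on the
+PUT path, tying §3.1's insertion points to §4's hold-through-retry
+contract. `handlePut`, `readWindow`, and `writeSlowDown` are
+illustrative names for this sketch, not existing code; error
+handling and manifest cleanup are trimmed (assumes imports
+`context`, `net/http`):
+
+```go
+func (s *s3Server) handlePut(w http.ResponseWriter, r *http.Request) {
+	// Admission A: fixed-length PUTs pre-check the declared body
+	// size; aws-chunked PUTs peek one batch instead (§3.3.1 step 1).
+	peek := r.ContentLength
+	if peek < 0 {
+		peek = s3RaftEntryByteBudget
+	}
+	if err := s.admission.peekHeadroom(peek); err != nil {
+		writeSlowDown(w) // 503 SlowDown + Retry-After; body never read
+		return
+	}
+	for {
+		// Admission B: one 4 MiB lease per batch, wait bounded by
+		// dispatchAdmissionTimeout.
+		actx, cancel := context.WithTimeout(r.Context(), dispatchAdmissionTimeout)
+		release, err := s.admission.acquire(actx, s3RaftEntryByteBudget)
+		cancel()
+		if err != nil {
+			writeSlowDown(w) // mid-stream 503, Connection: close (§3.4)
+			return
+		}
+		batch, eof, rerr := readWindow(r.Body, s3RaftEntryByteBudget)
+		if rerr != nil {
+			release()
+			return
+		}
+		// Hold-through-retry (§4): the slot spans Dispatch's whole
+		// internal retry chain and is released once, on its final
+		// outcome — never between attempts.
+		derr := s.coordinator.Dispatch(r.Context(), batch)
+		release()
+		if derr != nil {
+			writeSlowDown(w)
+			return
+		}
+		if eof {
+			break
+		}
+	}
+	w.WriteHeader(http.StatusOK)
+}
+```
+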
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
new file mode 100644
index 000000000..3d02154a2
--- /dev/null
+++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
@@ -0,0 +1,678 @@
+# S3 raft blob offload — keep large object payloads out of the Raft log
+
+> **Status: Proposed**
+> Author: bootjp
+> Date: 2026-04-25
+>
+> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned
+> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the S3 PUT
+> admission-control proposal
+> (`docs/design/2026_04_25_proposed_s3_admission_control.md`).
+>
+> PR #636 caps the *per-entry* size; admission control caps the
+> *aggregate in-flight* memory; this doc removes large blob payloads
+> from the *Raft log itself* so that snapshots and follower catch-up
+> stay bounded as the data set grows.
+
+---
+
+## 1. Problem
+
+Today every byte of an S3 object travels through the Raft log:
+
+```text
+HTTP PUT body ─► s3ChunkSize (1 MiB) chunks
+              ─► s3ChunkBatchOps × 1 MiB Raft entry
+              ─► Raft log entry (Pebble WAL on every node)
+              ─► applyLoop → s3keys.BlobKey(...) → MVCCStore Put
+```
+
+`s3keys.BlobKey(bucket, generation, objectKey, uploadID, partNo, chunkNo)`
+is the key actually written to Pebble; the log entry that proposed it
+contains the full chunk value. Two consequences:
+
+1. **WAL & snapshot growth scales with object bytes.** A node that
+   serves 100 GiB of S3 PUT traffic ends up with a 100 GiB Pebble WAL
+   plus the same 100 GiB persisted in the engine's state machine.
+   Snapshot transfer for a falling-behind follower carries the
+   payload twice — once as Raft log replication (during catch-up
+   inside the WAL window) and once as a snapshot dump if the leader
+   has already truncated.
+2. **Follower catch-up after a long absence is expensive.** Right now
+   a follower that misses 5 GiB of PUT traffic re-applies it as Raft
+   log entries, each going through the full `applyRequests` →
+   `MVCCStore.Put` path on a single thread. The ApplyLoop becomes a
+   bottleneck that keeps the rest of the cluster waiting on the
+   `commitIndex` advance.
+
+S3 blob payloads are special: they are **idempotent,
+content-addressable, large, and rarely re-read inside the leader's
+apply window**. They look more like attachments than like Raft log
+records. Treating them as Raft entries is overkill — Raft only needs
+to linearise *what changed* (the manifest), not *the chunk bytes
+themselves*.
+
+## 2. Goals & non-goals
+
+**Goals**
+
+- Raft log entries for S3 PUT carry a *reference* to the chunk
+  payload, not the payload itself. Replication traffic and WAL
+  size become O(manifest size), not O(object size).
+- Followers fetch blob payloads out-of-band when they apply a
+  manifest reference. Apply order remains Raft-defined; only the
+  bytes are pulled lazily.
+- Snapshot transfer for a falling-behind follower is O(manifest
+  count + small blob index), not O(stored bytes).
+- The new path is opt-in and lives alongside the current direct-Raft
+  path until parity is proven. Existing S3 traffic is unaffected
+  during rollout.
+
+**Non-goals**
+
+- No external object store dependency (S3, MinIO, Ceph). The blob
+  offload uses Pebble itself plus a peer-to-peer fetch protocol;
+  introducing an external dependency would re-create the operational
+  surface elastickv exists to replace.
+- No *user-facing* deduplication API or storage-accounting credit. 
The `chunkblob` keyspace is content-addressed by SHA-256 (§3.1)
+  and reference-counted (§3.5), which means two distinct objects
+  whose chunks happen to hash identically will share one
+  `chunkblob` row at the storage layer — that is a structural
+  property of content addressing, not a feature we expose. We do
+  *not*: surface dedup ratios, charge storage by post-dedup bytes,
+  rebalance dedup credit across tenants, or treat dedup hits as
+  semantically observable from S3 verbs. The reference layer
+  (`chunkref`) keeps `(bucket, objectKey, uploadID, partNo,
+  chunkNo)` granular and per-object, so DELETE / lifecycle still
+  reason about objects independently. Authorisation enforcement
+  remains on `chunkref` reads, never on `chunkblob` reads
+  (§3.3 covers the proxy-on-miss path; ACL checks fire before the
+  blob fetch is initiated, so a tenant cannot dereference a peer's
+  `chunkblob` by guessing a SHA — see §6 *Cross-tenant blob fetch*).
+- No change to MVCC semantics. Manifest commits remain the
+  serialisation point; blob fetch is a side channel that does not
+  change visibility rules.
+- No removal of the existing `BlobKey` path in this doc. We will
+  ship in two stages (manifest-only writes through Raft, blob
+  payload via a side channel), and the legacy path stays available
+  until enough operational evidence accumulates.
+
+## 3. Design
+
+### 3.1 New keyspace
+
+```text
+!s3|chunkref|<bucket>|<generation>|<objectKey>|<uploadID>|<partNo>|<chunkNo>
+  → ChunkRef{
+      ContentSHA256 [32]byte
+      Size          uint64
+      // Optional: leader-locality hints. Followers without the
+      // payload locally fetch from a peer that advertises the
+      // chunk in its catalog.
+      SourcePeer    NodeID
+    }
+
+!s3|chunkblob|<contentSHA256>
+  → raw bytes (the chunk payload)
+```
+
+Two separate keyspaces:
+
+- `chunkref` is replicated through Raft. Cheap (32 B + small header
+  per chunk) and ordered against the manifest commit.
+- `chunkblob` is **not** written through Raft. It is written
+  directly to Pebble on the receiving node and pulled by peers via
+  the new fetch protocol when they apply the corresponding
+  `chunkref`.
+
+### 3.2 PUT path
+
+```text
+client ─► HTTP PUT body
+       ─► chunk loop (s3ChunkSize):
+            1. compute SHA-256 of chunk
+            2. write chunk to LOCAL Pebble at
+               !s3|chunkblob|<sha> (fsync)  ─┐ pipelined: bytes also stream out
+            3. PushChunkBlob to             ─┤ to chunkBlobMinReplicas-1 followers
+               followers in parallel        ─┘ (one RPC per follower; bytes
+               start flowing the moment the leader has them — not
+               after local fsync completes)
+            4. wait until BOTH local fsync AND a quorum of follower
+               fsync-acks have returned (= chunkBlobMinReplicas
+               durable copies including the leader)
+            5. queue ChunkRef into pendingBatch
+       ─► flushBatch:
+            coordinator.Dispatch(OperationGroup{
+              Elems: [ chunkref Puts ...
+                       manifest Put ],
+            })
+       ─► HTTP 200 OK once Dispatch acks
+```
+
+Step 3 — synchronous chunkblob replication before the chunkref
+commit — is the difference between "Raft-equivalent durability"
+and "leader-only durability." Without it, a leader crash between
+the chunkref commit and the eventual async fetch would leave a
+committed manifest pointing at a chunkblob nobody else has — Raft's
+quorum guarantees the chunkref but tells you nothing about the
+blob payload. We close that gap by treating the chunkblob like a
+mini-Raft entry of its own with **semi-synchronous quorum**:
+
+1. Leader starts writing the chunkblob to local Pebble (fsync in
+   flight).
+2. 
**Concurrently** with step 1, the leader streams the chunkblob + to `chunkBlobMinReplicas - 1` followers via parallel + `S3BlobFetch.PushChunkBlob` RPCs. Pushes are **fanned out in + parallel**, not sequential — each follower's RPC is started + immediately, and the leader waits on a quorum of fsync-acks + rather than serially blocking on each one. (`PushChunkBlob` is + the leader-initiated counterpart to the follower-initiated + `FetchChunkBlob` defined in §3.6.) +3. The leader waits for *both* the local fsync AND a quorum of + follower fsync-acks. The dominant cost is therefore + `max(local_fsync, slowest_quorum_follower_fsync)` — typically + ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write. + This is what makes the p99 latency claim below load-bearing: + if step 1 and step 2 were *sequential* (write local → then + push to followers → then wait), per-chunk latency would be + `chunkBlobMinReplicas × fsync_latency` and silently double the + PUT p99 vs. the legacy path. The pipelined / parallel model + is part of the contract, not an optimization. +4. Only after the chunkblob is durable on a quorum of nodes does + the leader propose the chunkref through Raft. + +`chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a +quorum of 2 includes the leader and one follower). For larger +clusters the floor is `(N/2)+1` to match Raft's quorum size; the +operator can opt into N for stronger-than-Raft durability. A +follower that crashes after acking the push but before the chunkref +commits is fine — the chunkref will be retried by the leader on the +next attempt because it has not yet entered the Raft log. + +**Behaviour during cluster shrink / partial outage.** A controlled +decommission (5 → 3 nodes) or a transient partition can leave the +leader with fewer reachable peers than `(N/2)+1` configured at the +last membership commit. Blocking PUTs until the configured minimum +is reachable would surface as an indefinite hang, which is worse +than the legacy "every byte through Raft" path (which fails when +Raft itself loses quorum, but otherwise succeeds). The contract is +therefore degraded availability with a hard floor: + +- If reachable peers ≥ `chunkBlobMinReplicas`, normal path: ack + after the configured minimum. +- If reachable peers < `chunkBlobMinReplicas` but ≥ `floor(N/2)+1` + (Raft quorum is intact), degrade to "as many as currently + available, but never fewer than 2 — i.e. leader + at least one + follower." Emit `s3_chunkblob_replication_degraded_total` so + operators see the degradation. This matches Raft's own behaviour + during the same window: Raft would still commit at quorum, just + with one fewer redundant ack. +- If reachable peers < 2 (leader-only durability), fail the PUT + with 503 — single-node durability is what `BlobKey`-on-Raft + already loses on leader crash, and is the regression this design + exists to prevent. + +The floor of 2 is the strict invariant: leader-only writes are the +case the design refuses to accept regardless of operator +configuration. Tuning `chunkBlobMinReplicas` higher trades PUT +availability for stronger durability; tuning lower than 2 is +rejected at config-load. + +**Important durability note for N > 3 clusters.** On a 3-node +cluster the degraded floor of 2 chunkblob copies happens to match +Raft's quorum-of-2, so a single node failure is tolerated by both +the chunkref *and* the chunkblob. 
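+
+Expressed as code, the degraded-availability contract above reduces
+to a small decision function (a sketch; the `reachableFollowers`
+counting convention and the error value are assumptions of this
+illustration — the chunkref proposal itself still needs Raft quorum,
+which Raft enforces independently):
+
+```go
+var errChunkBlobQuorumUnavailable = errors.New("chunkblob replication: fewer than 2 reachable copies")
+
+// requiredChunkBlobAcks returns how many durable chunkblob copies
+// (counting the leader's own) a PUT must collect before proposing
+// the chunkref, or an error when the PUT must fail with 503.
+func requiredChunkBlobAcks(configuredMin, reachableFollowers int) (int, error) {
+	reachable := reachableFollowers + 1 // the leader always counts itself
+	switch {
+	case reachable >= configuredMin:
+		return configuredMin, nil // normal path: ack at the configured minimum
+	case reachable >= 2:
+		// Degraded path: as many as available, never fewer than
+		// leader + one follower. Callers emit
+		// s3_chunkblob_replication_degraded_total here.
+		return reachable, nil
+	default:
+		// Leader-only durability is the hard floor the design refuses.
+		return 0, errChunkBlobQuorumUnavailable
+	}
+}
+```
+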
For N > 3 this is no longer
+true: a 5-node cluster has Raft quorum 3 and tolerates 2
+simultaneous failures for the chunkref, but the degraded
+chunkblob path (leader + 1 follower) tolerates only 1. If the
+leader and the chunkblob-holding follower both fail during the
+degraded window, the surviving Raft quorum elects a new leader,
+finds a committed chunkref, and discovers that no surviving node
+holds the chunkblob — the chunkref is durable but the object data
+is lost. This is **weaker than the legacy "every byte through
+Raft" path**, which loses data only when Raft itself loses quorum
+(3 simultaneous failures on N=5). Operators on N > 3 clusters who
+need the legacy "blob durability == Raft durability" guarantee
+should configure `chunkBlobMinReplicas = N` (full replication;
+trades some PUT availability — any single peer outage stalls
+PUTs — for the strongest durability the cluster can offer).
+The default `(N/2)+1` is sized for "match Raft quorum," not "match
+Raft fault tolerance"; this distinction is invisible at N=3 but
+material at N≥5 and is what makes this configuration knob
+operationally meaningful.
+
+The trade-off is PUT latency: a PUT now blocks on
+`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft
+quorum write of the chunkref. Empirically the chunkblob fsync is
+the dominant cost (1 MiB write, ~5–10 ms on consumer SSD), so PUT
+p99 is roughly equivalent to today's "every byte through Raft"
+latency — we are paying the same fsync cost, just to a different
+keyspace.
+
+The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024
+of them = ~100 KiB of Raft log payload. Compared with today's
+1 GiB through Raft, that is a **10⁴× reduction** in Raft log
+write amplification — even with semi-synchronous chunkblob
+replication, the Raft log itself is unaffected by chunk size, so
+log replay time, snapshot transfer, and follower catch-up still
+collapse to O(manifest count).
+
+### 3.3 Follower apply path
+
+When a follower's apply loop sees a `chunkref` Put:
+
+1. Stage the `chunkref` key in MVCCStore as usual.
+2. Schedule an async fetch of `!s3|chunkblob|<sha>` from
+   `SourcePeer` (or a quorum-style fanout to all known peers).
+3. The fetch worker writes the chunk to local Pebble at
+   `!s3|chunkblob|<sha>` once the body arrives and verifies the
+   SHA-256 (mismatch → drop and retry from another peer).
+
+`SourcePeer` is a **best-effort hint** captured at write time
+(§3.1). It is *not* authoritative — the recorded peer may have
+crashed, restarted, evicted the blob via §3.5 GC, or simply lost
+the local copy to disk failure. Callers MUST treat
+`FetchChunkBlob → NOT_FOUND` from `SourcePeer` as a normal fallback
+trigger, not an error: drop to fanout against the rest of the
+peer set and accept the first peer that returns the bytes (with
+SHA-256 verification at the receiving end). Treating
+`NOT_FOUND` as a hard failure would make a single peer's GC tick
+pin clients on a bad source.
+
+GET / range-read on the follower checks the local `chunkblob`
+first; if absent (because the async fetch is still pending), it
+either:
+
+- proxies the read to a peer that holds the chunk (using the
+  `SourcePeer` hint, falling back to fanout on `NOT_FOUND`), or
+- replies 503 with `Retry-After`, identical to S3's behaviour
+  during a region failover.
+
+The choice is per-deployment; Phase-1 ships proxy-on-miss.
+
+#### 3.3.1 GET vs. GC delete race
+
+Even with the §3.5 grace window, a GET that proxies to peer X for
+a chunkblob can lose a race against peer X's sweeper if the
+sweeper deletes the local copy *between* the GET arriving on the
+caller's node and the proxy RPC reaching peer X. The blob remains
+reachable globally (other peers still hold it; the chunkref is
+unchanged), so the right behaviour is for the caller to fall back
+to fanout. We therefore mandate:
+
+- `FetchChunkBlob` on a peer whose local sweeper just removed the
+  blob returns `NOT_FOUND`, **not** an internal error.
+- The caller's GET handler treats both `NOT_FOUND` and
+  `INVALID_ARGUMENT (sha mismatch)` as fallback triggers and
+  cycles through the remaining peers in randomised order.
+- If the entire peer set returns `NOT_FOUND` for an SHA whose
+  `chunkref` is still present, that is a genuine durability
+  failure (every replica including the leader's GC raced); the
+  GET surfaces 500 *and* the read path emits a
+  `s3_chunkblob_unrecoverable_total` metric so operators detect
+  the underlying GC bug. With the §3.2 quorum-write durability
+  and the §3.5 grace window, this case requires a coincident
+  failure across a quorum of nodes within a 1-hour window — vastly
+  more unlikely than the per-peer 404 the fanout absorbs as a
+  matter of course.
+
+### 3.4 Snapshot
+
+Today a follower snapshot dump includes every `BlobKey` Pebble has
+ever stored. Under the new design:
+
+- The Raft snapshot serialises only `chunkref` keys plus the rest
+  of the MVCC state (manifests, bucket meta, ACLs).
+- A separate **blob catalog snapshot** lists every locally-held
+  `chunkblob` SHA. This is included in the snapshot stream as a
+  manifest of "blobs you should fetch from me on demand."
+- Once the follower has consumed the Raft snapshot and the blob
+  catalog, it begins serving GET / HEAD by proxying chunkblob
+  fetches to peers as in 3.3.
+
+A 100 GiB cluster's Raft snapshot drops from ~100 GiB to a few
+megabytes (one `chunkref` per chunk, plus the manifest set). The
+blob catalog adds 32 B × N_chunks = ~3 MiB per 100 GiB. Snapshot
+stream time falls by orders of magnitude; the recovery cost
+shifts from "leader's WAL dump" to "follower's lazy blob fetch
+amortised across reads."
+
+### 3.5 Garbage collection
+
+A blob whose `chunkref` has been deleted (DELETE, lifecycle policy,
+object version pruned, manifest aborted) is reclaimable. The
+sweeper needs to know not just *that* the RC reached zero but
+*when*, otherwise the documented grace window is unimplementable
+(a plain counter at zero carries no time signal). We make the
+"became eligible at T" fact a first-class Raft entry:
+
+1. **Reference counting** via `!s3|chunkref-rc|<sha>`, a counter
+   updated inside the same Raft txn that adds / removes a
+   `chunkref`. The atomic `(chunkref change, RC update)` pair is
+   the linearisation point for "this blob is now / no longer
+   reachable."
+2. **GC eligibility queue.** When the txn that decrements an RC
+   would drive it to zero, the *same* txn additionally writes
+   `!s3|chunkblob-gc-queue|<commitTS>|<sha>` → empty. The
+   commitTS-prefixed key is the time signal: the queue is
+   naturally sorted by eligibility-start time, and any node can
+   determine the grace-period boundary by scanning the queue with
+   `endKey = !s3|chunkblob-gc-queue|<now - gracePeriod>|`. If a
+   subsequent txn re-references the same SHA before the sweeper
+   runs (e.g. 
an upload reuses a content hash), that txn deletes
+   the queue entry as part of incrementing the RC; the queue
+   therefore reflects "currently RC==0" rather than "ever was zero."
+3. **Node-local sweeper.** Each node runs an independent sweeper
+   every `chunkBlobGCInterval` (proposed default 5 minutes) that:
+   a. scans the queue range
+      `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now - gracePeriod>|)`
+      for entries whose grace window has elapsed,
+   b. for each `<sha>` returned, re-checks the RC counter at the
+      sweeper's read timestamp.
+
+      *The deletion is two-phase across two storage layers and is
+      NOT a single transaction* — `!s3|chunkblob|<sha>` is local
+      Pebble (per §3.1, never written through Raft), while
+      `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
+      MUST run in this order:
+
+      i. **Raft phase first — conditional delete.** Delete the
+         queue entry through a Raft txn that is **conditional on
+         the queue entry existing AND the RC counter still being
+         0 at the txn's read timestamp**. The conditional form is
+         load-bearing: if a re-reference txn has committed
+         between the sweeper's queue scan and this txn (driving
+         RC back to 1 and atomically removing the queue entry —
+         see item 2's atomic invariant above), the conditional
+         delete fails and the sweeper aborts before reaching the
+         local phase. An *unconditional* delete would silently
+         succeed on the now-absent queue entry and let the sweeper
+         proceed to local-delete a chunkblob that is currently
+         live (RC=1) — a **correctness bug, not just a space
+         leak**. Concurrent sweepers also serialise on this txn
+         (write-write-conflict on the queue key); only the winner
+         proceeds.
+      ii. **Local phase second.** Delete the local
+         `!s3|chunkblob|<sha>` from Pebble. No Raft round-trip.
+         Reaching this phase implies (i) succeeded, which
+         implies the RC was 0 at the txn read timestamp and
+         remained 0 throughout the txn's commit window — i.e.
+         the blob is genuinely unreachable.
+
+      The phase ordering is the load-bearing detail. If we did
+      local-first then Raft, a crash between the two phases would
+      leave the chunkblob gone locally but the queue entry still
+      present — every subsequent sweep would re-attempt the local
+      delete (no-op) and the queue entry would never get removed
+      until manual intervention. The Raft-first ordering trades
+      that for the inverse failure mode: a crash between the two
+      phases leaves the queue entry deleted but the local
+      chunkblob still on disk — a **bounded local space leak,
+      not a correctness bug**. A periodic "orphan scan" reclaims
+      these.
+
+      The orphan scan covers two distinct sources of orphans:
+
+      - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the
+        case described above; queue entry was removed via Raft
+        but the local Pebble delete never fired.
+      - **PUT failure before chunkref Dispatch** — chunkblob
+        bytes were written to local Pebble in §3.2 step 2, then
+        the PUT aborted before reaching `coordinator.Dispatch`
+        (admission control 503, client disconnect, `PushChunkBlob`
+        quorum failure, request context cancel). In that
+        scenario neither an RC entry nor a GC queue entry was
+        ever written, so the sweeper's queue-range scan never
+        sees these orphans — only the orphan scan does.
+
+      Detection criterion (covers both): `!s3|chunkblob|<sha>`
+      keys whose SHA has either no RC entry at all, or RC=0 with
+      no corresponding queue entry. 
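+
+      Expressed as a predicate (a sketch; `rcOf` and
+      `queueEntryExists` stand in for MVCC reads over the
+      Raft-replicated `chunkref-rc` and `chunkblob-gc-queue`
+      keyspaces — illustrative helpers, not existing code):
+
+      ```go
+      // isOrphan reports whether a locally-held chunkblob is
+      // reclaimable by the orphan scan, per the criterion above.
+      func isOrphan(sha [32]byte) bool {
+          rc, rcExists := rcOf(sha) // !s3|chunkref-rc|<sha>
+          if !rcExists {
+              return true // PUT aborted before the chunkref Dispatch
+          }
+          if rc == 0 && !queueEntryExists(sha) {
+              return true // sweeper crashed between Phase (3b.i) and (3b.ii)
+          }
+          return false
+      }
+      ```
+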
The orphan scan runs at low + priority out of band from the sweeper (proposed default + `chunkBlobOrphanScanInterval = 1 hour`); it is the safety + net behind both the sweeper crash path and the PUT-abort + cleanup path, so the PUT handler does not need its own + best-effort local-delete on the abort path. + c. if the RC has bounced above 0 in the meantime, the queue + entry is stale (a re-reference txn forgot to remove it, or + the sweeper raced) and the sweeper deletes only the queue + entry through a Raft txn, leaving the chunkblob in place. + +The queue is the authoritative "blob is GC-eligible since T" +signal *and* the global "we are GC-ing this SHA" lock — its +Raft-replicated single-writer-per-key property is what makes +concurrent sweepers safe across nodes. The RC is the +authoritative "is reachable" signal, also Raft-replicated. Local +chunkblob deletes are deliberately *not* replicated: each node +deletes its own copy independently after the queue-entry txn +commits, because that's the whole point of the architecture. + +`chunkBlobGCGracePeriod` defaults to 1 hour. The grace window +absorbs in-flight reads (a peer that has already started fetching +the blob completes its fetch before the sweeper runs), in-flight +upload aborts (a multipart abort flips RCs to zero, then a retry +creates a new manifest with the same chunks and bumps them back), +and clock skew between nodes (we use the Raft `commitTS` from the +RC-update txn, not wall clock, so skew is bounded to the +HLC-physical-shift the cluster already tolerates). + +### 3.6 Fetch protocol + +Two RPCs on the existing internal raft transport service — one +follower-initiated (lazy fetch on miss), one leader-initiated +(synchronous replication before chunkref commit, see §3.2 step 3): + +```protobuf +service S3BlobFetch { + // FetchChunkBlob returns the bytes of a chunkblob this peer holds + // locally. Caller must verify SHA-256. Used by followers on the + // proxy-on-miss GET path (§3.3) and during snapshot catch-up. + rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); + + // PushChunkBlob streams a chunkblob from the leader to a follower + // and acks once the bytes are durable in the receiver's Pebble. + // Used by §3.2 step 3 to make chunkblob writes survive a leader + // crash without depending on the async fetch path catching up. + // The receiver SHOULD verify SHA-256 against the request header + // before fsync; mismatch fails the RPC and the leader retries. + rpc PushChunkBlob(stream PushChunkBlobRequest) returns (PushChunkBlobResponse); +} + +message FetchChunkBlobRequest { + bytes content_sha256 = 1; +} + +message FetchChunkBlobResponse { + bytes payload = 1; + bool eof = 2; +} + +message PushChunkBlobRequest { + bytes content_sha256 = 1; // sent in the first frame only + bytes payload = 2; + bool eof = 3; +} + +message PushChunkBlobResponse { + bool durable = 1; // true == fsynced +} +``` + +Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The +existing gRPC `MaxRecvMsgSize = 64 MiB` (PR #593 → `internal.GRPCCallOptions`) +already covers this in a single RPC, but streaming keeps the +implementation symmetric with how the future Raft streaming +transport (proposed under +`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`) +handles large payloads. + +### 3.7 Backwards compatibility & rollout + +The legacy `BlobKey` path remains available. 
New PUTs use the +offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data +keeps reading through the legacy `BlobKey` path until a background +migrator (separate proposal) rewrites it. Mixed keyspace coexistence +works because `!s3|chunkblob|*` and the legacy +`!s3|blob|||...` namespaces are disjoint. + +The opt-in flag stays for at least one full release cycle so we can +revert by flipping a single env var if any of the following surface: + +- a SHA-256 collision (~zero probability but a hard kill criterion), +- a follower fetch storm overwhelming peer-to-peer bandwidth, +- a GC bug that leaks reachable blobs, +- semi-synchronous `PushChunkBlob` latency exceeding the legacy + PUT p99 by an unacceptable margin (the soak-test acceptance + criterion in §5). + +### 3.8 Mixed-version cluster behaviour + +Until *every* node in the cluster speaks the offload protocol, PUTs +on the offload path cannot proceed safely: a node that does not +implement `PushChunkBlob` cannot ack a quorum write, and a follower +that does not implement `FetchChunkBlob` cannot resolve a chunkref +on apply. We therefore gate the offload path on cluster-level +feature negotiation rather than a single env var: + +- A node advertises offload capability by setting + `feature_s3_blob_offload=true` in the `AdminServer.GetClusterOverview` + response (alongside the existing role / version metadata). +- The leader inspects every peer's advertised capabilities at PUT + admission time. If any peer is missing the capability, the PUT + falls back to the legacy `BlobKey` path for that request — even + if the leader has the env var enabled. +- During an upgrade window the leader continues to emit legacy + writes; once the last peer rolls and re-advertises, subsequent + PUTs flip to offload automatically. A roll-back works the same + way in reverse: the first downgraded peer drops its capability + flag and the leader resumes legacy emission within the next + capability-refresh interval (default 30 s). +- Reads always succeed regardless of mixed state, because both + keyspaces are namespaced and the GET path checks legacy then + offload (or vice-versa) and serves whichever resolves. + +This gives operators a **strict two-step rolling upgrade** with no +PUT data path that depends on a half-upgraded cluster: + +1. Roll out the new binary with `ELASTICKV_S3_BLOB_OFFLOAD=false` + on every node. PUTs continue on the legacy path. Validate + stability for a soak window (24 h on the canary cluster in + §5's M0 acceptance criteria). +2. Flip `ELASTICKV_S3_BLOB_OFFLOAD=true` on the leader, then on + followers. Once every node advertises capability, PUTs switch + to the offload path. + +Roll-back: flip the env var to `false` on any node; the leader's +capability check sees the disagreement and falls back to legacy +within ≤ refresh interval. The migrator (M4) is independently +gated and never runs during a roll-back window. + +A node with `ELASTICKV_S3_BLOB_OFFLOAD=true` running against a +cluster where offload is disabled (e.g. a stuck rollout) is safe +— it advertises capability but the leader's per-PUT capability +check sees other peers missing it and routes through legacy. No +data is written into the offload keyspace until a quorum of +capability-advertising peers exists. + +## 4. Interaction with related subsystems + +- **PR #636 + admission control.** The admission control budget + drops in importance under the offload path because Raft entries + are tiny (~100 B per chunkref). 
However the *body bytes still + flow through HTTP* and `prepareStreamingPutBody` continues to + hold them in memory until the local Pebble write returns. The + admission cap must stay; only the per-peer Raft-side worst-case + bound (`MaxInflight × MaxSizePerMsg = 4 GiB`) gets *much* easier + to honour. +- **PR #589 (snapshot tuning) and PR #614 (etcd-snapshot-disk-offload).** + Already implemented. The offload path makes those tunables more + effective by reducing the per-snapshot byte count. +- **`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.** + The blob-fetch RPC reuses the same chunked-streaming approach + proposed for Raft transport. We can land both behind the same + abstraction. +- **Lease read & MVCC snapshot reads.** No change. Manifests remain + the linearisation point; chunk bytes are immutable once committed + (content-addressable), so a stale local copy on a follower is + still correct. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. **`FetchChunkBlob` and `PushChunkBlob` RPCs ship in this milestone** because both M1 PUT (semi-synchronous push) and M1 GET (proxy-on-miss) depend on them — without them M1 GET could only serve local-hit or 503. | Medium (race ordering, RPC plumbing on the request goroutine). | +| M2 | Async fetch worker pool for follower apply (catch-up after a long absence). Independent of M1's synchronous `FetchChunkBlob` use on the GET path. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC (the queue-based scheme in §3.5). | Medium (correctness of RC under concurrent ops). | +| M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | + +Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): + +- WAL growth per GiB of S3 PUT < 1 MiB on a one-week soak test. +- Snapshot transfer for a 100 GiB-cluster follower restart completes + in < 60 s on a 1 Gbps interconnect. +- No regression in PUT p99 latency or GET p99 latency vs. the legacy + path (measured on the 24 h pre-cutover window). + +## 6. Risks + +- **Race between local chunkblob write and the chunkref commit.** + Mitigated by writing the chunkblob to a local Pebble batch with + fsync *before* the chunkref enters `coordinator.Dispatch`. The + manifest commit is the linearisation point; if the chunkblob is + durable on the leader, peers can fetch it as soon as they apply. +- **Follower fetch storm.** A new follower that catches up sees a + flood of `chunkref` Puts and could DDoS the source peer with + fetches. Mitigation: bounded fetch worker pool + token bucket + per-source. The Raft apply loop does *not* block on the fetch — + it stages `chunkref` and lets the fetch lag — so apply latency + stays bounded. +- **SHA-256 collision.** Operationally improbable; shipped with a + metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option + for paranoid operators. 
+- **Leader-only durability before chunkref commit.** Without + intervention, a leader crash between writing the chunkblob to its + own Pebble and the eventual async fetch on followers would leave + a Raft-committed chunkref pointing at a chunkblob no surviving + node has. Mitigation: §3.2 step 3 — synchronous semi-quorum + replication via `PushChunkBlob` before the chunkref enters Raft. + `chunkBlobMinReplicas` defaults to a Raft-quorum-equivalent + floor; operators who want N-way durability bump it explicitly. + This restores end-to-end durability parity with the legacy + "every byte through Raft" path at the cost of one extra fsync + per chunkblob on the followers in the quorum. +- **Cross-tenant blob fetch via SHA-256 guessing.** Because + `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant + could (a) guess a victim tenant's chunk SHA and (b) bypass + authorisation, they could exfiltrate the chunk. SHA-256 guessing + is computationally infeasible for non-trivial content, but we + remove the second prerequisite by enforcing authorisation + *exclusively at the `chunkref` layer*. The `S3BlobFetch.FetchChunkBlob` + RPC is internal-only (raft-transport credentials, not exposed to + S3 clients); user-facing GET resolves through `chunkref` first, + which carries the bucket / key tenancy context, and only after + the ACL check does the server proxy to a peer for the + corresponding `chunkblob`. A future design that exposes + blob fetch on a public surface would need to reintroduce + tenant-scoped authorisation at the blob layer; this proposal + intentionally does not. + +## 7. Out of scope (future work) + +- Cross-cluster blob replication (CRR / disaster recovery). +- Tiered storage (cold blobs to S3-IA / Glacier-equivalent). +- Erasure coding for blob payloads. +- Compression. The current S3 spec is "the bytes the client sent"; + any compression layer is a separate negotiated feature. + +## 8. Open questions + +- Do we need a per-follower bandwidth cap on blob fetch? If the + cluster network is constrained, a runaway catch-up could starve + user-path GET / Raft heartbeat traffic. Probably yes — defer to + the workload-isolation rollout. +- Is content-addressing at the chunk granularity (`chunkSize = 1 MiB`) + the right unit, or should we content-address whole objects and + range-fetch sub-chunks? The chunk granularity matches what + `prefetchObjectChunks` already does and keeps content addressing + predictable; whole-object addressing would require re-hashing on + partial reads. Tentatively: chunk granularity for v1.
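+
+## Appendix: proxy-on-miss read sketch (non-normative)
+
+To close, a compressed sketch of the §3.3 / §3.3.1 read path on a
+local chunkblob miss — SourcePeer hint first, then randomised
+fanout, with `NOT_FOUND` and sha-mismatch as fallback triggers
+rather than hard failures. `localBlob`, `fetchFrom`, `peersExcept`,
+`shuffle`, and the error value are illustrative helpers, not
+existing code (assumes imports `context`, `crypto/sha256`, `errors`):
+
+```go
+var errChunkBlobUnrecoverable = errors.New("chunkblob unrecoverable: chunkref present but no replica holds the payload")
+
+func (s *s3Server) chunkBlob(ctx context.Context, ref ChunkRef) ([]byte, error) {
+	if b, ok := s.localBlob(ref.ContentSHA256); ok {
+		return b, nil // async fetch already landed it (§3.3)
+	}
+	// Best-effort hint first, then the rest of the peer set in
+	// randomised order (§3.3.1).
+	peers := append([]NodeID{ref.SourcePeer}, shuffle(s.peersExcept(ref.SourcePeer))...)
+	for _, p := range peers {
+		b, err := s.fetchFrom(ctx, p, ref.ContentSHA256) // FetchChunkBlob RPC
+		if err != nil {
+			continue // NOT_FOUND / transport error → next peer
+		}
+		if sha256.Sum256(b) != ref.ContentSHA256 {
+			continue // sha mismatch → drop, try another peer
+		}
+		return b, nil
+	}
+	// Whole peer set missed while the chunkref exists: genuine
+	// durability failure — surface 500 and bump
+	// s3_chunkblob_unrecoverable_total (§3.3.1).
+	return nil, errChunkBlobUnrecoverable
+}
+```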