diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md
new file mode 100644
index 000000000..a153f1697
--- /dev/null
+++ b/docs/design/2026_04_25_proposed_s3_admission_control.md
@@ -0,0 +1,468 @@
+# S3 PUT admission control
+
+> **Status: Proposed**
+> Author: bootjp
+> Date: 2026-04-25
+>
+> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned
+> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the workload-class
+> isolation proposal (`docs/design/2026_04_24_proposed_workload_isolation.md`).
+> Where PR #636 fixes the *per-entry* memory accounting and the
+> isolation doc keeps Raft's CPU budget separate from heavy command
+> paths, this doc bounds the *aggregate* in-flight memory that S3
+> PUT body bytes can pin in the leader's Raft pipeline regardless of
+> how many clients are uploading at once.
+
+---
+
+## 1. Problem
+
+Even with PR #636 reducing the per-entry size to 4 MiB and aligning
+with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is
+governed by client concurrency, which the server cannot currently
+limit:
+
+```text
+leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size
+ ≈ concurrent_PUTs × MaxInflight × 4 MiB
+```
+
+A single PUT's pipeline is already capped by `MaxInflight = 1024` Raft
+in-flight messages times 4 MiB per entry = 4 GiB / peer (the bound PR
+#593 advertises). What multiplies that bound is concurrent uploads:
+
+- 4 simultaneous 5 GiB PUTs on a 4-vCPU 8 GiB-RAM node can pin
+ `4 × min(MaxInflight × 4 MiB, body_remaining)` on the leader. If a
+ follower stalls (GC pause, slow disk fsync) for several seconds, that
+ worst case is realised before backpressure kicks in.
+- The 2026-04-24 incident showed that one workload class can wedge the
+ Go runtime for the entire cluster. S3 PUT body bytes pinned in the
+ Raft pipeline play the same role as XREAD blobs in the Redis path:
+ they grow with client behaviour, not with anything the operator can
+ rate-limit per-handler.
+- `GOMEMLIMIT = 1800 MiB` plus `--memory = 2500m` (PR #617) set the
+ upper bound on heap, but they do not provide *flow control*. The Go
+ GC starts thrashing well before the limit, and memwatch (PR #612)
+ triggers graceful shutdown — a recovery mechanism, not a steady-state
+ control.
+
+There is no hard ceiling today on the aggregate body bytes that S3
+puts onto the Raft pipeline. PR #636 makes the per-entry slot
+predictable; this proposal makes the *number of slots in flight*
+predictable too.
+
+## 2. Goals & non-goals
+
+**Goals**
+
+- Hard cap on the total S3 PUT body bytes that have been read from
+ HTTP but not yet committed to Raft. This cap is a tunable property
+ of the S3 server, not of any single PUT.
+- Steady-state behaviour where new PUTs are admitted at the rate Raft
+ can drain. Bursting clients see HTTP 503 with `Retry-After`, not
+ silent ballooning of leader memory.
+- Composable with existing memwatch + GOMEMLIMIT. Admission failure
+ must be visible (metric + log line per rejected request) so an
+ operator can size the cap against a given memory budget.
+- Read path is unaffected. GET / HEAD / LIST do not consume the same
+ budget.
+
+**Non-goals**
+
+- No global QPS quota across all S3 verbs. Other verbs (GET, multipart
+ initiate / complete) have well-bounded memory and don't need this.
+- No per-bucket / per-key fairness. Fairness across tenants is a
+ separate workload-class concern (handled by the workload isolation
+ proposal). This doc only fixes the aggregate ceiling.
+- No backpressure-via-TCP (`SO_RCVBUF` shrink). Decoupling from kernel
+ buffers keeps the budget testable and avoids cross-OS surprises.
+- No multi-region / multi-cluster admission. One leader's perspective.
+
+## 3. Design
+
+### 3.1 Where the budget is checked
+
+Two natural insertion points exist on the PUT path:
+
+| Insertion point | Granularity | When the request is rejected |
+|---|---|---|
+| (A) Before `prepareStreamingPutBody` accepts the first byte | Whole-object | Pre-Raft, before any reads from the body |
+| (B) Inside the `flushBatch` loop, before `coordinator.Dispatch` | Per-batch | Mid-stream, after `s3ChunkBatchOps` chunks are buffered |
+
+Recommended: **both, at different scales**.
+
+- **(A) is the request-admission cap.** Use the `Content-Length`
+  header (a PUT that is not aws-chunked and has no `Content-Length`
+  is rejected) to check the whole body against the budget's current
+  headroom; this is a peek, not a reservation (see §3.3). If the
+  body would not fit we reply 503
+ immediately. This matches how AWS S3 itself handles
+ `SlowDown` / `ServiceUnavailable` and is the cheapest case to
+ surface — the body is never read from the socket.
+- **(B) is the in-flight cap.** Even after a request is admitted, its
+ 4 MiB batches can pile up if Raft becomes slow. The flush loop
+ acquires a per-batch lease (4 MiB) from a shared semaphore *before*
+  reading the next chunk window. If no slot becomes free within
+  `dispatchAdmissionTimeout` (proposed default 30 s), the PUT
+ fails with 503 mid-stream. Dispatch latency stays bounded by the
+ semaphore size rather than by client behaviour.
+
+```text
+client ─[Content-Length]─► (A) reserve full body bytes
+ │
+ ▼
+ prepareStreamingPutBody
+ │
+ ▼
+ (B) acquire 4 MiB slot ◄┐ (released on Dispatch ack)
+ │
+ ▼
+ coordinator.Dispatch ───┘
+```
+
+### 3.2 Concrete cap values
+
+Default cap:
+
+```go
+const (
+ // s3PutAdmissionMaxInflightBytes is the hard ceiling on S3 PUT body
+ // bytes accepted by this node but not yet committed to Raft. Picked
+ // so that, on a 3-node cluster with MaxInflight × MaxSizePerMsg =
+ // 4 GiB / peer per PR #593, all in-flight PUT bytes plus their Raft
+ // replication shadow stay under one quarter of GOMEMLIMIT (1800 MiB
+ // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble
+ // memtables.
+ s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB
+ // s3RaftEntryByteBudget is the per-batch unit acquired and
+ // released against the semaphore. It must equal the byte
+ // budget of one Raft entry produced by PutObject /
+ // UploadPart's flush loop (PR #636: s3ChunkSize ×
+ // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing
+ // overhead — kept abstract here so the admission contract
+ // does not lock the entry-size choice). The semaphore's
+ // capacity is `s3PutAdmissionMaxInflightBytes /
+ // s3ChunkSize` 1 MiB-units; per-batch acquire takes
+ // `s3RaftEntryByteBudget / s3ChunkSize` units at a time
+ // (= 4 units on the default tunables).
+ s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps
+ // dispatchAdmissionTimeout is how long a per-batch flush will wait
+ // for a slot before giving up. The 256 MiB cap drains in ~2 s at
+ // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s),
+ // so 30 s is *not* sized against normal drain — it is the budget
+ // for a transiently stalled follower (GC pause, slow disk fsync,
+ // bounded leader re-election) to recover before we conclude the
+ // cluster is genuinely overloaded. Longer stalls surface as 503,
+ // which is the right signal: at that point the right action is
+ // operator intervention (scale out, investigate the stall), not
+ // continued accumulation.
+ dispatchAdmissionTimeout = 30 * time.Second
+)
+```
+
+The cap is intentionally **per-node** rather than cluster-wide:
+admission is enforced on the node receiving the HTTP request, which is
+also the node whose memory is at risk. Clients hitting a different
+node simply get a different budget.
+
+Both values are exposed as env vars (`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES`,
+`ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT`) with the constants as
+defaults, following the pattern PR #593 established for
+`ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` etc.
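+
+A minimal sketch of that override plumbing, assuming hypothetical
+helpers (`envInt64`, `envDuration`) and only the standard library
+(`os`, `strconv`, `time`); the real code should reuse whatever
+parsing helper the `ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` path already
+has:
+
+```go
+// Hypothetical env-var helpers; reuse the existing plumbing from
+// PR #593 rather than introducing new ones.
+func envInt64(key string, def int64) int64 {
+    if v, ok := os.LookupEnv(key); ok {
+        if n, err := strconv.ParseInt(v, 10, 64); err == nil {
+            return n
+        }
+    }
+    return def
+}
+
+func envDuration(key string, def time.Duration) time.Duration {
+    if v, ok := os.LookupEnv(key); ok {
+        if d, err := time.ParseDuration(v); err == nil {
+            return d
+        }
+    }
+    return def
+}
+
+var (
+    putAdmissionMaxInflightBytes = envInt64(
+        "ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES", s3PutAdmissionMaxInflightBytes)
+    putDispatchAdmissionTimeout = envDuration(
+        "ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT", dispatchAdmissionTimeout)
+)
+```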
+
+### 3.3 Data structure
+
+```go
+type putAdmission struct {
+ semaphore chan struct{} // capacity == max / s3ChunkSize, charges in 1 MiB units
+ inflight atomic.Int64 // metric mirror; semaphore is the source of truth
+}
+
+func newPutAdmission(maxBytes int64) *putAdmission { … }
+
+// peekHeadroom is admission A. It returns ErrAdmissionExhausted
+// without acquiring slots when the requested byte count exceeds
+// the *current* free capacity of the semaphore. It does NOT take
+// out a reservation — the only effect is "fail fast at request
+// entry instead of partway through the body" — so it cannot
+// double-count against admission B.
+func (a *putAdmission) peekHeadroom(bytes int64) error { … }
+
+// acquire is admission B. It blocks until (bytes / s3ChunkSize)
+// slots are available or ctx fires. The returned release closure
+// MUST be called exactly once. If `bytes > capacity * s3ChunkSize`
+// (a malformed client whose frame exceeds the entire budget),
+// returns ErrAdmissionExhausted *immediately* without waiting —
+// otherwise we would block until ctx (typically
+// dispatchAdmissionTimeout) for a request that can never fit.
+func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) { … }
+```
+
+The two-step contract avoids the double-charge / unbounded-window
+hazard the obvious "A is also an acquire" design would have:
+
+1. **Admission A — request-entry headroom check (peek only).** When
+ a PUT arrives with `Content-Length: N`, the handler calls
+ `peekHeadroom(N)`. If the result is `ErrAdmissionExhausted`,
+ reply 503 immediately and never read from the body. If `nil`,
+ admission A is done — no slots have been taken out of the
+ semaphore. This is intentionally racy with concurrent PUTs:
+ admission A only promises "at the moment we asked, the budget
+ *would have fit* this request"; it does not reserve the budget.
+2. **Admission B — per-batch acquire/release (the only path that
+ touches the semaphore).** The PUT handler then loops the body
+ in `s3ChunkSize × s3ChunkBatchOps = 4 MiB` windows; each window
+ acquires 4 MiB worth of slots via `acquire`, reads the next
+ window from the body, dispatches, and releases on Dispatch
+ ack. This is the bound that actually holds — at any instant the
+ sum of held slots across all in-flight PUTs cannot exceed the
+ semaphore's capacity. If admission A's racy estimate turns out
+ to be wrong (another PUT raced in between A's check and the
+ first B-acquire), the first B-acquire blocks until the
+ contending PUT releases or `dispatchAdmissionTimeout` fires.
+
+The semaphore is therefore charged **only by admission B**. Bytes
+in flight = `held_B_slots × s3ChunkSize`, full stop; admission A is
+a fast-fail gate, not a reservation. This is the model
+implementations MUST follow — a "both A and B charge" design would
+double-count every PUT against itself and an admission-A-only
+design would lose its bound the moment a PUT's body exceeded its
+declared `Content-Length` (chunked transfers, malformed clients).
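+
+A minimal sketch of the semaphore behind this contract, assuming
+only the standard library (`context`, `errors`, `sync`); the slot
+arithmetic and error value are illustrative, not the final
+implementation:
+
+```go
+var ErrAdmissionExhausted = errors.New("s3 put admission: inflight byte budget exhausted")
+
+func newPutAdmission(maxBytes int64) *putAdmission {
+    return &putAdmission{
+        semaphore: make(chan struct{}, int(maxBytes/int64(s3ChunkSize))), // 1 MiB slot units
+    }
+}
+
+func (a *putAdmission) slotsFor(bytes int64) int {
+    return int((bytes + int64(s3ChunkSize) - 1) / int64(s3ChunkSize)) // round up to whole slots
+}
+
+// peekHeadroom (admission A): fail fast when `bytes` would not fit the
+// semaphore's current free capacity. No slots are taken.
+func (a *putAdmission) peekHeadroom(bytes int64) error {
+    if a.slotsFor(bytes) > cap(a.semaphore)-len(a.semaphore) {
+        return ErrAdmissionExhausted
+    }
+    return nil
+}
+
+// acquire (admission B): block until the per-batch slots are free or
+// ctx fires; oversized requests fail immediately instead of waiting.
+func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) {
+    slots := a.slotsFor(bytes)
+    if slots > cap(a.semaphore) {
+        return nil, ErrAdmissionExhausted
+    }
+    for taken := 0; taken < slots; taken++ {
+        select {
+        case a.semaphore <- struct{}{}:
+        case <-ctx.Done():
+            for i := 0; i < taken; i++ { // roll back the partial acquisition
+                <-a.semaphore
+            }
+            return nil, ctx.Err()
+        }
+    }
+    a.inflight.Add(int64(slots) * int64(s3ChunkSize))
+    var once sync.Once
+    release := func() {
+        once.Do(func() {
+            for i := 0; i < slots; i++ {
+                <-a.semaphore
+            }
+            a.inflight.Add(-int64(slots) * int64(s3ChunkSize))
+        })
+    }
+    return release, nil
+}
+```
+
+The `sync.Once` wrapper is what keeps the "released exactly once"
+contract cheap to uphold across the retry chain discussed in §4.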
+
+### 3.3.1 aws-chunked transfers (`Content-Length: -1`)
+
+A naïve "reserve `s3MaxObjectSizeBytes` (5 GiB) up front" is rejected:
+on default tunables (`s3PutAdmissionMaxInflightBytes = 256 MiB`) a
+single chunked PUT would consume **20×** the entire budget at request
+admission time, head-of-line-blocking every other PUT until the
+chunked stream finishes — exactly the failure mode admission control
+exists to prevent. We therefore split chunked admission across two
+mechanisms instead of pre-charging:
+
+1. **Bootstrap headroom check** at request entry. Calls
+ `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A
+ contract: a fast-fail check that 4 MiB *would have fit* at the
+ moment we asked. **No slot is acquired.** This is intentionally
+ racy with concurrent PUTs (same as fixed-length admission A);
+ its job is to fail at request entry rather than partway
+ through the first decoded frame. Chunked PUTs are not "free"
+ — they still must beat the same admission queue as fixed-length
+ PUTs at the per-frame level.
+2. **Pay-as-you-decode** thereafter, charged via an
+ `awsChunkedReader` progress callback. The callback **buffers
+ decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is
+ accumulated**, then calls `acquire(s3ChunkSize)` on the
+ semaphore (same path as fixed-length admission B). This keeps
+ the slot unit coherent: the semaphore's capacity is
+ `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so
+ acquiring at sub-MiB granularity is not representable. The
+ slot is released once the corresponding
+ `coordinator.Dispatch` acks the chunk. The buffer never holds
+ more than `s3ChunkSize - 1` decoded bytes, so the worst-case
+ memory overhead beyond the semaphore-tracked bytes is bounded
+ by 1 MiB per concurrent chunked PUT.
+
+Failure modes:
+
+- If the awsChunkedReader produces decoded bytes faster than Raft
+ drains, the next 1 MiB acquire blocks (capped by
+ `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503
+ closes the connection. The legacy "reserve 5 GiB" approach
+ would have surfaced as 503 *at request entry* for unrelated
+ PUTs; this approach surfaces as mid-stream 503 for the chunked
+ PUT itself, which is the right blame attribution.
+- The bootstrap check at step 1 is racy: another PUT can consume
+ the headroom between the check and the first per-frame acquire.
+ When that happens the first acquire blocks (or 503s on
+ timeout) — the same path the fixed-length admission B handles
+ for the contending case. The race is intentional: making the
+ check a real reservation would multiply per-request slot hold
+ by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit
+ with no corresponding payload, reintroducing a head-of-line
+ hazard.
+- If the awsChunkedReader produces a single frame whose decoded
+ size never accumulates to a full `s3ChunkSize`, the buffer
+ flushes on stream EOF: a final `acquire(actual_buffered_bytes)`
+ rounded up to one slot is taken (semaphore charges in 1-slot
+ units regardless of actual byte count), so the bound holds.
+- A malformed client that decodes bytes faster than Raft drains
+ *cannot* trigger the immediate-503 path the way a fixed-length
+ PUT can. The accumulation design (callback always calls
+ `acquire(s3ChunkSize)`, never larger) means the per-frame
+ acquire request is bounded by 1 MiB — the
+ "if `bytes > capacity * s3ChunkSize`" early-return in
+ `acquire`'s spec is never hit on the chunked path. Instead,
+ successive 1 MiB acquires block under Raft pressure and the
+ PUT eventually surfaces 503 on `dispatchAdmissionTimeout` —
+ the same path a slow follower triggers. The "immediate 503 for
+ oversized request" failure mode applies only to fixed-length
+ PUTs (via `peekHeadroom(Content-Length > 256 MiB)`).
+
+This change moves chunked admission from M4 (originally "deferred
+optimisation") into M1 (the first shippable milestone). M1 ships
+with the progress-callback wired *unconditionally* for all chunked
+PUTs; an env-var switch falls back to "bootstrap-only" charging
+without the per-decode credit if a corner case requires it
+(`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, default
+`true`). The fallback path keeps the 5 GiB-reservation hazard
+behind an explicit operator decision rather than letting it
+materialise by default.
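+
+A minimal sketch of the pay-as-you-decode accumulator from step 2
+of §3.3.1 above, assuming a hypothetical per-decode callback hook on
+`awsChunkedReader`; the caller pairs each returned release with the
+Dispatch ack for the corresponding chunk:
+
+```go
+// chunkedAdmission buffers decoded aws-chunked bytes and charges the
+// shared semaphore in whole s3ChunkSize units. Hook names are
+// hypothetical.
+type chunkedAdmission struct {
+    admission *putAdmission
+    buffered  int64    // decoded bytes not yet charged, always < s3ChunkSize
+    releases  []func() // handed to the flush loop, called on Dispatch ack
+}
+
+// onDecoded is the assumed awsChunkedReader progress callback, invoked
+// each time n payload bytes have been decoded.
+func (c *chunkedAdmission) onDecoded(ctx context.Context, n int64) error {
+    c.buffered += n
+    for c.buffered >= int64(s3ChunkSize) {
+        release, err := c.admission.acquire(ctx, int64(s3ChunkSize))
+        if err != nil {
+            return err // mid-stream 503 after dispatchAdmissionTimeout
+        }
+        c.releases = append(c.releases, release)
+        c.buffered -= int64(s3ChunkSize)
+    }
+    return nil
+}
+
+// flushEOF charges the final partial unit at stream EOF; acquire rounds
+// the byte count up to one whole slot, so the bound still holds.
+func (c *chunkedAdmission) flushEOF(ctx context.Context) error {
+    if c.buffered == 0 {
+        return nil
+    }
+    release, err := c.admission.acquire(ctx, c.buffered)
+    if err != nil {
+        return err
+    }
+    c.releases = append(c.releases, release)
+    c.buffered = 0
+    return nil
+}
+```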
+
+### 3.4 Failure mode
+
+- `503 Service Unavailable` with `Retry-After: 1` (small, jittered).
+ AWS S3 SDK clients (boto3, aws-sdk-go-v2) auto-retry this code with
+ exponential backoff out of the box.
+- Body for the response is the standard S3 XML error envelope:
+  `<Error><Code>SlowDown</Code><Message>Reduce your request rate</Message></Error>`.
+ This is the AWS-defined code for admission rejection and matches
+ what real S3 returns.
+- Mid-stream rejection (admission B) closes the connection with a
+ `connection: close` header so partial body reads do not corrupt the
+ client's pipeline. The PUT handler also calls
+ `cleanupManifestBlobs` for any partial blobs that already landed in
+ Pebble.
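+
+A minimal sketch of the rejection response, using only the standard
+library; the envelope fields beyond `Code` / `Message` (RequestId,
+Resource) and the exact jitter are illustrative:
+
+```go
+type s3Error struct {
+    XMLName xml.Name `xml:"Error"`
+    Code    string   `xml:"Code"`
+    Message string   `xml:"Message"`
+}
+
+func writeAdmissionRejection(w http.ResponseWriter, midStream bool) {
+    // Small jitter so a fleet of SDK clients does not retry in lockstep.
+    w.Header().Set("Retry-After", strconv.Itoa(1+rand.Intn(2)))
+    if midStream {
+        // Admission-B rejection: part of the body was consumed, so force
+        // the client to reopen the connection instead of reusing it.
+        w.Header().Set("Connection", "close")
+    }
+    w.Header().Set("Content-Type", "application/xml")
+    w.WriteHeader(http.StatusServiceUnavailable)
+    _ = xml.NewEncoder(w).Encode(s3Error{
+        Code:    "SlowDown",
+        Message: "Reduce your request rate",
+    })
+}
+```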
+
+### 3.5 Metrics
+
+```text
+elastickv_s3_put_admission_inflight_bytes gauge
+elastickv_s3_put_admission_rejections_total counter (labels:
+ stage = "prereserve" | "perbatch",
+ protocol = "fixed-length" | "chunked")
+elastickv_s3_put_admission_wait_seconds histogram (labels: stage, protocol)
+```
+
+The `protocol` label distinguishes fixed-length PUTs (those with a
+declared `Content-Length`, hitting admission A's `peekHeadroom`)
+from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode).
+This split is what makes the chunked-PUT 503 surface (§6) and the
+rolling-upgrade alerting story actionable: a spike on
+`stage="perbatch", protocol="chunked"` points at "chunked clients
+beat Raft drain"; a spike on `stage="prereserve",
+protocol="fixed-length"` points at "client concurrency exceeds
+the per-node aggregate cap." Without the dimension the two
+failure modes are indistinguishable in a single counter.
+
+Grafana panel: inflight gauge with the cap as a horizontal line so
+the operator sees how often the system saturates. Rejection rate
+suggests bumping the cap or scaling out (more nodes spreads PUT load).
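+
+A minimal sketch of the metric wiring, assuming the node exposes
+Prometheus metrics via `prometheus/client_golang`; bucket boundaries
+are illustrative:
+
+```go
+var (
+    admissionInflightBytes = prometheus.NewGauge(prometheus.GaugeOpts{
+        Name: "elastickv_s3_put_admission_inflight_bytes",
+        Help: "S3 PUT body bytes admitted but not yet committed to Raft.",
+    })
+    admissionRejections = prometheus.NewCounterVec(prometheus.CounterOpts{
+        Name: "elastickv_s3_put_admission_rejections_total",
+        Help: "PUTs rejected by admission control.",
+    }, []string{"stage", "protocol"})
+    admissionWaitSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+        Name:    "elastickv_s3_put_admission_wait_seconds",
+        Help:    "Time spent waiting for an admission slot.",
+        Buckets: prometheus.ExponentialBuckets(0.001, 2, 16), // 1 ms .. ~32 s
+    }, []string{"stage", "protocol"})
+)
+
+func init() {
+    prometheus.MustRegister(admissionInflightBytes, admissionRejections,
+        admissionWaitSeconds)
+}
+```
+
+The rejection paths then increment, for example,
+`admissionRejections.WithLabelValues("perbatch", "chunked")`, and the
+acquire path observes its wait time into `admissionWaitSeconds`.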
+
+## 4. Interaction with related subsystems
+
+- **PR #636 (entry size alignment).** The 4 MiB per-batch unit is
+ the natural admission grain. The two changes are independent:
+ alignment is necessary for the per-peer Raft bound to be correct;
+ admission is necessary for the *aggregate* bound to be hard.
+- **PR #612 (memwatch graceful shutdown).** Continues to function as
+ the last-resort safety net. Admission control should fire at well
+ below the memwatch threshold (`s3PutAdmissionMaxInflightBytes` is
+ ~14% of `GOMEMLIMIT`) so memwatch sees a much lower steady-state
+ pressure and the graceful-shutdown path stays a rare event.
+- **Workload isolation proposal.** That doc proposes per-class CPU
+ reservation for Raft. Admission control is the memory-axis sibling.
+ Both are needed — limiting CPU does not bound queue depth.
+- **`coordinator.Dispatch` retries.** Today the S3 path has its
+ own retry loop (`s3TxnRetryMaxAttempts = 8` with exponential
+ backoff capped at `s3TxnRetryMaxBackoff = 32 ms`). The admission
+ contract is **hold-through-retry**: the per-batch slot acquired
+ in admission B is released exactly once, on the *final* outcome
+ of the retry chain (success ack, terminal error, or
+ `dispatchAdmissionTimeout` expiring), not between attempts.
+ Rationale: the bytes are still buffered in the PUT handler's
+ pendingBatch slice for the entire retry window, so the budget
+ must reflect them; a release-between-retries scheme would let a
+ second PUT proceed while the first is still memory-resident,
+ breaking the bound. The S3 PUT path uses the inbound
+ `*http.Request` context for `coordinator.Dispatch` (no
+ S3-specific Dispatch timeout — the HTTP server's
+ `writeTimeout` / client-side cancellation is the upper bound on
+ one Dispatch attempt), so the wall-clock cost of holding the
+ slot through one full retry chain is bounded by
+ `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)`
+ where `single_dispatch_budget` is whatever the request context
+ permits at that moment. If the retry chain duration ever
+ exceeds `dispatchAdmissionTimeout` the per-batch acquire on the
+ *next* batch surfaces as 503 — the right failure mode
+ (chronic dispatch failure → caller learns instead of silently
+ consuming the budget).
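+
+A minimal sketch of that hold-through-retry contract; `dispatch`
+stands in for the `coordinator.Dispatch` call and the backoff shape
+mirrors the constants named above:
+
+```go
+// flushBatchWithRetry acquires the admission slot once per batch and
+// releases it exactly once, on the final outcome of the retry chain.
+func flushBatchWithRetry(ctx context.Context, adm *putAdmission, batch []byte,
+    dispatch func(context.Context, []byte) error) error {
+
+    release, err := adm.acquire(ctx, int64(len(batch)))
+    if err != nil {
+        return err // surfaces as 503 to the client
+    }
+    defer release() // exactly once, after the whole retry chain resolves
+
+    backoff := time.Millisecond
+    var lastErr error
+    for attempt := 0; attempt < s3TxnRetryMaxAttempts; attempt++ {
+        if lastErr = dispatch(ctx, batch); lastErr == nil {
+            return nil // success ack: slot released by the deferred call
+        }
+        select {
+        case <-time.After(backoff):
+        case <-ctx.Done():
+            return ctx.Err()
+        }
+        backoff *= 2
+        if backoff > s3TxnRetryMaxBackoff {
+            backoff = s3TxnRetryMaxBackoff
+        }
+    }
+    return lastErr // terminal error: slot still released by the deferred call
+}
+```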
+
+## 5. Implementation plan
+
+| Milestone | Scope | Risk |
+|---|---|---|
+| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. |
+| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. |
+| M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. |
+| M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. |
+
+### Rolling upgrade
+
+Admission is purely additive on the request entry path: a node
+without the cap behaves identically to a node with the cap set
+infinitely high. A mixed cluster (some nodes M1, some still on
+`main`) is therefore safe — clients hitting the upgraded node see
+admission, clients hitting an old node see no admission, but
+neither path corrupts state. The default cap is intentionally
+generous enough that even single-node M1 traffic falls below the
+threshold under typical load, so the rollout signature is
+"503 SlowDown rate goes from 0 to negligible" rather than a step
+function. Operators can pin
+`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES=$(((1<<63)-1))` to
+disable the cap on M1 nodes during the burn-in window if desired.
+
+The aws-chunked progress-callback path is the only behaviour
+change visible to clients: a chunked PUT that would have succeeded
+under the old "no admission" code can now 503 mid-stream when Raft
+drain falls behind. This is by design — that is the failure mode
+admission control exists to surface — but operators should expect
+to see chunked-upload 503s where there were none before. The
+`stage="perbatch", protocol="chunked"` rejection-counter label
+isolates this signal; bumping the cap or
+`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false` (with the
+HoL hazard re-introduced as a known trade-off) restores legacy
+behaviour during incident response.
+
+Acceptance criteria:
+
+- `go test ./adapter/ -short -run TestS3PutAdmission` covers reject /
+ admit / mid-stream-timeout paths.
+- A load test that opens 32 concurrent PUTs of 100 MiB each must
+ hold leader memory below `s3PutAdmissionMaxInflightBytes + epsilon`
+ for the duration of the test.
+- No regression in `Test_grpc_transaction` (which is currently the
+ longest leader-stress test).
+
+## 6. Risks
+
+- **Tail-latency for legitimate clients.** A long-running PUT that
+ loses a 4 MiB slot mid-stream returns 503 even though it is making
+ progress. Mitigated by `dispatchAdmissionTimeout = 30s`, well above
+ the steady-state Raft drain time. If we observe spurious 503s in
+ practice, drop the timeout into a config knob and tune.
+- **Operator confusion.** "Why does S3 return 503 when CPU is at
+ 20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After`
+ value so SDK behaviour is predictable.
+- **New chunked-PUT 503 surface.** Pay-as-you-decode admission
+ (§3.3.1) ships in M1 alongside fixed-length admission, so the
+ legacy 5 GiB pre-charge hazard does not materialise as a
+ steady-state risk. The residual risk it introduces is the
+ inverse: a chunked PUT that would have silently succeeded under
+ the no-admission code can now 503 mid-stream when Raft drain
+ falls behind. This is by design — that is the failure mode
+ admission control exists to surface — but it is the only
+ client-visible behaviour change in M1 and is what operators
+ should expect to see in dashboards. The
+ `stage="perbatch", protocol="chunked"` label on the rejection
+ counter (§3.5) isolates the signal; the operator escape hatch
+ is `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, which
+ reverts to bootstrap-only charging at the cost of
+ re-introducing the 5 GiB head-of-line hazard.
+
+## 7. Out of scope (future work)
+
+- Per-bucket admission classes (e.g. system buckets get their own
+ budget). Punted to the workload-isolation rollout.
+- Coordinated admission across the multi-region read replica path
+ proposed in `docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.
+- Token-bucket rate-shaping (e.g. bytes-per-second). The current
+ proposal only bounds *concurrent* bytes; rate-shaping is a separate
+ policy choice.
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
new file mode 100644
index 000000000..3d02154a2
--- /dev/null
+++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
@@ -0,0 +1,678 @@
+# S3 raft blob offload — keep large object payloads out of the Raft log
+
+> **Status: Proposed**
+> Author: bootjp
+> Date: 2026-04-25
+>
+> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned
+> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the S3 PUT
+> admission-control proposal
+> (`docs/design/2026_04_25_proposed_s3_admission_control.md`).
+>
+> PR #636 caps the *per-entry* size; admission control caps the
+> *aggregate in-flight* memory; this doc removes large blob payloads
+> from the *Raft log itself* so that snapshots and follower catch-up
+> stay bounded as the data set grows.
+
+---
+
+## 1. Problem
+
+Today every byte of an S3 object travels through the Raft log:
+
+```text
+HTTP PUT body ─► s3ChunkSize (1 MiB) chunks
+ ─► s3ChunkBatchOps × 1 MiB Raft entry
+ ─► Raft log entry (Pebble WAL on every node)
+ ─► applyLoop → s3keys.BlobKey(...) → MVCCStore Put
+```
+
+`s3keys.BlobKey(bucket, generation, objectKey, uploadID, partNo, chunkNo)`
+is the key actually written to Pebble; the log entry that proposed it
+contains the full chunk value. Two consequences:
+
+1. **WAL & snapshot growth scales with object bytes.** A node that
+ serves 100 GiB of S3 PUT traffic ends up with a 100 GiB Pebble WAL
+ plus the same 100 GiB persisted in the engine's state machine.
+ Snapshot transfer for a falling-behind follower carries the
+ payload twice — once as Raft log replication (during catch-up
+ inside the WAL window) and once as a snapshot dump if the leader
+ has already truncated.
+2. **Follower catch-up after a long absence is expensive.** Right now
+ a follower that misses 5 GiB of PUT traffic re-applies it as Raft
+ log entries, each going through the full `applyRequests` →
+ `MVCCStore.Put` path on a single thread. The ApplyLoop becomes a
+ bottleneck that holds the rest of the cluster waiting for the
+ `commitIndex` advance.
+
+S3 blob payloads are special: they are **idempotent,
+content-addressable, large, and rarely re-read inside the leader's apply
+window**. They look more like attachments than like Raft log records.
+Treating them as Raft entries is overkill — Raft only needs to
+linearise *what changed* (the manifest), not *the chunk bytes
+themselves*.
+
+## 2. Goals & non-goals
+
+**Goals**
+
+- Raft log entries for S3 PUT carry a *reference* to the chunk
+ payload, not the payload itself. Replication traffic and WAL
+ size become O(manifest size), not O(object size).
+- Followers fetch blob payloads out-of-band when they apply a
+ manifest reference. Apply order remains Raft-defined; only the
+ bytes are pulled lazily.
+- Snapshot transfer for a falling-behind follower is O(manifest
+ count + small blob index), not O(stored bytes).
+- The new path is opt-in and lives alongside the current direct-Raft
+ path until parity is proven. Existing S3 traffic is unaffected
+ during rollout.
+
+**Non-goals**
+
+- No external object store dependency (S3, MinIO, Ceph). The blob
+ offload uses Pebble itself plus a peer-to-peer fetch protocol;
+ introducing an external dependency would re-create the operational
+ surface elastickv exists to replace.
+- No *user-facing* deduplication API or storage-accounting credit.
+ The `chunkblob` keyspace is content-addressed by SHA-256 (§3.1)
+ and reference-counted (§3.5), which means two distinct objects
+ whose chunks happen to hash identically will share one
+ `chunkblob` row at the storage layer — that is a structural
+ property of content addressing, not a feature we expose. We do
+ *not*: surface dedup ratios, charge storage by post-dedup bytes,
+ rebalance dedup credit across tenants, or treat dedup hits as
+ semantically observable from S3 verbs. The reference layer
+ (`chunkref`) keeps `(bucket, objectKey, uploadID, partNo,
+ chunkNo)` granular and per-object, so DELETE / lifecycle still
+ reason about objects independently. Authorisation enforcement
+ remains on `chunkref` reads, never on `chunkblob` reads
+ (§3.3 covers the proxy-on-miss path; ACL checks fire before the
+ blob fetch is initiated, so a tenant cannot dereference a peer's
+ `chunkblob` by guessing a SHA — see §6 *Cross-tenant blob fetch*).
+- No change to MVCC semantics. Manifest commits remain the
+ serialisation point; blob fetch is a side channel that does not
+ change visibility rules.
+- No removal of the existing `BlobKey` path in this doc. We will
+ ship in two stages (manifest-only writes through Raft, blob
+ payload via a side channel), and the legacy path stays available
+ until enough operational evidence accumulates.
+
+## 3. Design
+
+### 3.1 New keyspace
+
+```text
+!s3|chunkref|<bucket>|<objectKey>|<uploadID>|<partNo>|<chunkNo>
+ → ChunkRef{
+ ContentSHA256 [32]byte
+ Size uint64
+ // Optional: leader-locality hints. Followers without the
+ // payload locally fetch from a peer that advertises the
+ // chunk in its catalog.
+ SourcePeer NodeID
+ }
+
+!s3|chunkblob|<contentSHA256>
+ → raw bytes (the chunk payload)
+```
+
+Two separate keyspaces:
+
+- `chunkref` is replicated through Raft. Cheap (32 B + small header
+ per chunk) and ordered against the manifest commit.
+- `chunkblob` is **not** written through Raft. It is written
+ directly to Pebble on the receiving node and pulled by peers via
+ the new fetch protocol when they apply the corresponding
+ `chunkref`.
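+
+A minimal sketch of the key construction, with hypothetical helpers
+alongside the existing `s3keys` package; the real separators and
+escaping must follow whatever `s3keys.BlobKey` already does:
+
+```go
+// Hypothetical key builders for the two keyspaces above.
+func chunkRefKey(bucket, objectKey, uploadID string, partNo, chunkNo uint32) []byte {
+    return []byte(fmt.Sprintf("!s3|chunkref|%s|%s|%s|%d|%d",
+        bucket, objectKey, uploadID, partNo, chunkNo))
+}
+
+func chunkBlobKey(contentSHA256 [32]byte) []byte {
+    return []byte(fmt.Sprintf("!s3|chunkblob|%x", contentSHA256))
+}
+```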
+
+### 3.2 PUT path
+
+```text
+client ─► HTTP PUT body
+ ─► chunk loop (s3ChunkSize):
+ 1. compute SHA-256 of chunk
+         2. write chunk to LOCAL Pebble at !s3|chunkblob|<sha256>
+ (fsync) ─┐ pipelined: bytes also stream out
+ 3. PushChunkBlob to ─┤ to chunkBlobMinReplicas-1 followers
+ followers in parallel (one RPC per follower; bytes
+ start flowing the moment the leader has them — not
+ after local fsync completes)
+ 4. wait until BOTH local fsync AND a quorum of follower
+ fsync-acks have returned (= chunkBlobMinReplicas
+ durable copies including the leader)
+ 5. queue ChunkRef into pendingBatch
+ ─► flushBatch:
+ coordinator.Dispatch(OperationGroup{
+ Elems: [ chunkref Puts ... + manifest Put ],
+ })
+ ─► HTTP 200 OK once Dispatch acks
+```
+
+Step 3 — synchronous chunkblob replication before the chunkref
+commit — is the difference between "Raft-equivalent durability"
+and "leader-only durability." Without it, a leader crash between
+the chunkref commit and the eventual async fetch would leave a
+committed manifest pointing at a chunkblob nobody else has — Raft's
+quorum guarantees the chunkref but tells you nothing about the
+blob payload. We close that gap by treating the chunkblob like a
+mini-Raft entry of its own with **semi-synchronous quorum**:
+
+1. Leader starts writing the chunkblob to local Pebble (fsync in
+ flight).
+2. **Concurrently** with step 1, the leader streams the chunkblob
+ to `chunkBlobMinReplicas - 1` followers via parallel
+ `S3BlobFetch.PushChunkBlob` RPCs. Pushes are **fanned out in
+ parallel**, not sequential — each follower's RPC is started
+ immediately, and the leader waits on a quorum of fsync-acks
+ rather than serially blocking on each one. (`PushChunkBlob` is
+ the leader-initiated counterpart to the follower-initiated
+ `FetchChunkBlob` defined in §3.6.)
+3. The leader waits for *both* the local fsync AND a quorum of
+ follower fsync-acks. The dominant cost is therefore
+ `max(local_fsync, slowest_quorum_follower_fsync)` — typically
+ ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write.
+ This is what makes the p99 latency claim below load-bearing:
+ if step 1 and step 2 were *sequential* (write local → then
+ push to followers → then wait), per-chunk latency would be
+ `chunkBlobMinReplicas × fsync_latency` and silently double the
+ PUT p99 vs. the legacy path. The pipelined / parallel model
+ is part of the contract, not an optimization.
+4. Only after the chunkblob is durable on a quorum of nodes does
+ the leader propose the chunkref through Raft.
+
+`chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a
+quorum of 2 includes the leader and one follower). For larger
+clusters the floor is `(N/2)+1` to match Raft's quorum size; the
+operator can opt into N for stronger-than-Raft durability. A
+follower that crashes after acking the push but before the chunkref
+commits is fine — the chunkref will be retried by the leader on the
+next attempt because it has not yet entered the Raft log.
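+
+A minimal sketch of the pipelined fan-out in steps 1 to 4 above,
+with `writeLocal` and `pushChunkBlob` standing in for the local
+Pebble write and the `PushChunkBlob` RPC client; the error
+accounting is illustrative, and the degraded-availability rules
+below would adjust `minReplicas` before this is called:
+
+```go
+// replicateChunkBlob starts the local fsync and all follower pushes
+// concurrently, then waits for the leader's copy plus enough follower
+// fsync-acks to reach minReplicas durable copies.
+func replicateChunkBlob(ctx context.Context, peers []string, sha [32]byte,
+    chunk []byte, minReplicas int,
+    writeLocal func(context.Context, [32]byte, []byte) error,
+    pushChunkBlob func(context.Context, string, [32]byte, []byte) error) error {
+
+    needed := minReplicas - 1 // follower acks; the leader's copy is the +1
+    if needed > len(peers) {
+        return fmt.Errorf("chunkblob replication: need %d followers, have %d", needed, len(peers))
+    }
+
+    localDone := make(chan error, 1)
+    go func() { localDone <- writeLocal(ctx, sha, chunk) }()
+
+    followerAcks := make(chan error, len(peers))
+    for _, p := range peers {
+        p := p
+        go func() { followerAcks <- pushChunkBlob(ctx, p, sha, chunk) }()
+    }
+
+    if err := <-localDone; err != nil { // the local fsync is mandatory
+        return err
+    }
+    acked, failed := 0, 0
+    for acked < needed {
+        select {
+        case err := <-followerAcks:
+            if err != nil {
+                failed++
+                if failed > len(peers)-needed {
+                    return fmt.Errorf("chunkblob replication: quorum unreachable")
+                }
+                continue
+            }
+            acked++
+        case <-ctx.Done():
+            return ctx.Err()
+        }
+    }
+    return nil // safe to propose the chunkref through Raft
+}
+```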
+
+**Behaviour during cluster shrink / partial outage.** A controlled
+decommission (5 → 3 nodes) or a transient partition can leave the
+leader with fewer reachable peers than `(N/2)+1` configured at the
+last membership commit. Blocking PUTs until the configured minimum
+is reachable would surface as an indefinite hang, which is worse
+than the legacy "every byte through Raft" path (which fails when
+Raft itself loses quorum, but otherwise succeeds). The contract is
+therefore degraded availability with a hard floor:
+
+- If reachable peers ≥ `chunkBlobMinReplicas`, normal path: ack
+ after the configured minimum.
+- If reachable peers < `chunkBlobMinReplicas` but ≥ `floor(N/2)+1`
+ (Raft quorum is intact), degrade to "as many as currently
+ available, but never fewer than 2 — i.e. leader + at least one
+ follower." Emit `s3_chunkblob_replication_degraded_total` so
+ operators see the degradation. This matches Raft's own behaviour
+ during the same window: Raft would still commit at quorum, just
+ with one fewer redundant ack.
+- If reachable peers < 2 (leader-only durability), fail the PUT
+ with 503 — single-node durability is what `BlobKey`-on-Raft
+ already loses on leader crash, and is the regression this design
+ exists to prevent.
+
+The floor of 2 is the strict invariant: leader-only writes are the
+case the design refuses to accept regardless of operator
+configuration. Tuning `chunkBlobMinReplicas` higher trades PUT
+availability for stronger durability; tuning lower than 2 is
+rejected at config-load.
+
+**Important durability note for N > 3 clusters.** On a 3-node
+cluster the degraded floor of 2 chunkblob copies happens to match
+Raft's quorum-of-2, so a single node failure is tolerated by both
+the chunkref *and* the chunkblob. For N > 3 this is no longer
+true: a 5-node cluster has Raft quorum 3 and tolerates 2
+simultaneous failures for the chunkref, but the degraded
+chunkblob path (leader + 1 follower) tolerates only 1. If the
+leader and the chunkblob-holding follower both fail during the
+degraded window, the surviving Raft quorum elects a new leader,
+finds a committed chunkref, and discovers that no surviving node
+holds the chunkblob — the chunkref is durable but the object data
+is lost. This is **weaker than the legacy "every byte through
+Raft" path**, which loses data only when Raft itself loses quorum
+(3 simultaneous failures on N=5). Operators on N > 3 clusters who
+need the legacy "blob durability == Raft durability" guarantee
+should configure `chunkBlobMinReplicas = N` (full replication;
+trades some PUT availability — any single peer outage stalls
+PUTs — for the strongest durability the cluster can offer).
+The default `(N/2)+1` is sized for "match Raft quorum," not "match
+Raft fault tolerance"; this distinction is invisible at N=3 but
+material at N≥5 and is what makes this configuration knob
+operationally meaningful.
+
+The trade-off is PUT latency: a PUT now blocks on
+`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft
+quorum write of the chunkref. Empirically the chunkblob fsync is
+the dominant cost (1 MiB write, ~5–10 ms on consumer SSD), so PUT
+p99 is roughly equivalent to today's "every byte through Raft"
+latency — we are paying the same fsync cost, just to a different
+keyspace.
+
+The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024
+of them = ~100 KiB of Raft log payload. Compared with today's
+1 GiB through Raft, that is a **10⁴× reduction** in Raft log
+write amplification — even with semi-synchronous chunkblob
+replication, the Raft log itself is unaffected by chunk size, so
+log replay time, snapshot transfer, and follower catch-up still
+collapse to O(manifest count).
+
+### 3.3 Follower apply path
+
+When a follower's apply loop sees a `chunkref` Put:
+
+1. Stage the `chunkref` key in MVCCStore as usual.
+2. Schedule an async fetch of `!s3|chunkblob|<sha256>` from
+ `SourcePeer` (or a quorum-style fanout to all known peers).
+3. The fetch worker writes the chunk to local Pebble at
+   `!s3|chunkblob|<sha256>` once the body arrives and verifies the
+ SHA-256 (mismatch → drop and retry from another peer).
+
+`SourcePeer` is a **best-effort hint** captured at write time
+(§3.1). It is *not* authoritative — the recorded peer may have
+crashed, restarted, evicted the blob via §3.5 GC, or simply lost
+the local copy to disk failure. Callers MUST treat
+`FetchChunkBlob → NOT_FOUND` from `SourcePeer` as a normal fallback
+trigger, not an error: drop to fanout against the rest of the
+peer set and accept the first peer that returns the bytes (with
+SHA-256 verification at the receiving end). Treating
+`NOT_FOUND` as a hard failure would make a single peer's GC tick
+pin clients on a bad source.
+
+GET / range-read on the follower checks the local `chunkblob`
+first; if absent (because the async fetch is still pending), it
+either:
+
+- proxies the read to a peer that holds the chunk (using the
+ `SourcePeer` hint, falling back to fanout on `NOT_FOUND`), or
+- replies 503 with `Retry-After`, identical to S3's behaviour
+ during a region failover.
+
+The choice is per-deployment; Phase-1 ships proxy-on-miss.
+
+#### 3.3.1 GET vs. GC delete race
+
+Even with the §3.5 grace window, a GET that proxies to peer X for
+a chunkblob can lose a race against peer X's sweeper if the
+sweeper deletes the local copy *between* the GET arriving on the
+caller's node and the proxy RPC reaching peer X. The blob remains
+reachable globally (other peers still hold it; the chunkref is
+unchanged), so the right behaviour is for the caller to fall back
+to fanout. We therefore mandate:
+
+- `FetchChunkBlob` on a peer whose local sweeper just removed the
+ blob returns `NOT_FOUND`, **not** an internal error.
+- The caller's GET handler treats both `NOT_FOUND` and
+ `INVALID_ARGUMENT (sha mismatch)` as fallback triggers and
+ cycles through the remaining peers in randomised order.
+- If the entire peer set returns `NOT_FOUND` for an SHA whose
+ `chunkref` is still present, that is a genuine durability
+ failure (every replica including the leader's GC raced); the
+ GET surfaces 500 *and* the read path emits a
+ `s3_chunkblob_unrecoverable_total` metric so operators detect
+ the underlying GC bug. With the §3.2 quorum-write durability
+ and the §3.5 grace window, this case requires a coincident
+ failure across a quorum of nodes within a 1-hour window — vastly
+ more unlikely than the per-peer 404 the fanout absorbs as a
+ matter of course.
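+
+A minimal sketch of the fetch rules in §3.3 and §3.3.1, with
+`fetchFromPeer` standing in for the `FetchChunkBlob` RPC client and
+`NOT_FOUND` mapped onto an ordinary error return; only the standard
+library (`crypto/sha256`, `math/rand`) is assumed:
+
+```go
+// fetchChunkBlobWithFallback tries the SourcePeer hint first, treats
+// any per-peer failure (NOT_FOUND, SHA mismatch, peer down) as a
+// fallback trigger, and cycles the remaining peers in random order.
+func fetchChunkBlobWithFallback(ctx context.Context, sourcePeer string,
+    peers []string, want [32]byte,
+    fetchFromPeer func(context.Context, string, [32]byte) ([]byte, error)) ([]byte, error) {
+
+    candidates := append([]string{sourcePeer}, shuffled(peers)...) // a duplicate hint try is harmless
+    for _, peer := range candidates {
+        payload, err := fetchFromPeer(ctx, peer, want)
+        if err != nil {
+            continue // NOT_FOUND or unreachable: fall back to the next peer
+        }
+        if sha256.Sum256(payload) != want {
+            continue // corrupted or wrong blob: same treatment as NOT_FOUND
+        }
+        return payload, nil
+    }
+    // Every replica missing while the chunkref exists: the genuine
+    // durability failure that s3_chunkblob_unrecoverable_total counts.
+    return nil, fmt.Errorf("chunkblob %x unrecoverable from any peer", want)
+}
+
+func shuffled(peers []string) []string {
+    out := append([]string(nil), peers...)
+    rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
+    return out
+}
+```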
+
+### 3.4 Snapshot
+
+Today a follower snapshot dump includes every `BlobKey` Pebble has
+ever stored. Under the new design:
+
+- The Raft snapshot serialises only `chunkref` keys plus the rest
+ of the MVCC state (manifests, bucket meta, ACLs).
+- A separate **blob catalog snapshot** lists every locally-held
+ `chunkblob` SHA. This is included in the snapshot stream as a
+ manifest of "blobs you should fetch from me on demand."
+- Once the follower has consumed the Raft snapshot and the blob
+ catalog, it begins serving GET / HEAD by proxying chunkblob
+ fetches to peers as in 3.3.
+
+A 100 GiB cluster's Raft snapshot drops from ~100 GiB to a few
+megabytes (one `chunkref` per chunk, plus the manifest set). The
+blob catalog adds 32 B × N_chunks = ~3 MiB per 100 GiB. Snapshot
+stream time falls by orders of magnitude; the recovery cost
+shifts from "leader's WAL dump" to "follower's lazy blob fetch
+amortised across reads."
+
+### 3.5 Garbage collection
+
+A blob whose `chunkref` has been deleted (DELETE, lifecycle policy,
+object version pruned, manifest aborted) is reclaimable. The
+sweeper needs to know not just *that* the RC reached zero but
+*when*, otherwise the documented grace window is unimplementable
+(a plain counter at zero carries no time signal). We make the
+"became eligible at T" fact a first-class Raft entry:
+
+1. **Reference counting** via `!s3|chunkref-rc|<sha256>`, a counter
+ updated inside the same Raft txn that adds / removes a
+ `chunkref`. The atomic `(chunkref change, RC update)` pair is
+ the linearisation point for "this blob is now / no longer
+ reachable."
+2. **GC eligibility queue.** When the txn that decrements an RC
+ would drive it to zero, the *same* txn additionally writes
+   `!s3|chunkblob-gc-queue|<commitTS>|<sha256>` → empty. The
+ commitTS-prefixed key is the time signal: the queue is
+ naturally sorted by eligibility-start time, and any node can
+ determine the grace-period boundary by scanning the queue with
+   `endKey = !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|`. If a
+ subsequent txn re-references the same SHA before the sweeper
+ runs (e.g. an upload reuses a content hash), that txn deletes
+ the queue entry as part of incrementing the RC; the queue
+ therefore reflects "currently RC==0" rather than "ever was zero."
+3. **Node-local sweeper.** Each node runs an independent sweeper
+ every `chunkBlobGCInterval` (proposed default 5 minutes) that:
+   a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|)`
+ for entries whose grace window has elapsed,
+   b. for each `<sha256>` returned, re-checks the RC counter at the
+ sweeper's read timestamp.
+
+ *The deletion is two-phase across two storage layers and is
+   NOT a single transaction* — `!s3|chunkblob|<sha256>` is local
+ Pebble (per §3.1, never written through Raft), while
+ `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
+ MUST run in this order:
+
+ i. **Raft phase first — conditional delete.** Delete the
+ queue entry through a Raft txn that is **conditional on
+ the queue entry existing AND the RC counter still being
+ 0 at the txn's read timestamp**. The conditional form is
+ load-bearing: if a re-reference txn has committed
+ between the sweeper's queue scan and this txn (driving
+ RC back to 1 and atomically removing the queue entry —
+      see step 2's atomic invariant), the conditional delete
+ fails and the sweeper aborts before reaching the local
+ phase. An *unconditional* delete would silently succeed
+ on the now-absent queue entry and let the sweeper
+ proceed to local-delete a chunkblob that is currently
+ live (RC=1) — a **correctness bug, not just a space
+ leak**. Concurrent sweepers also serialise on this txn
+ (write-write-conflict on the queue key); only the winner
+ proceeds.
+ ii. **Local phase second.** Delete the local
+       `!s3|chunkblob|<sha256>` from Pebble. No Raft round-trip.
+ Reaching this phase implies (i) succeeded, which
+ implies the RC was 0 at the txn read timestamp and
+ remained 0 throughout the txn's commit window — i.e.
+ the blob is genuinely unreachable.
+
+ The phase ordering is the load-bearing detail. If we did
+ local-first then Raft, a crash between the two phases would
+ leave the chunkblob gone locally but the queue entry still
+ present — every subsequent sweep would re-attempt the local
+ delete (no-op) and the queue entry would never get removed
+ until manual intervention. The Raft-first ordering trades
+ that for the inverse failure mode: a crash between the two
+ phases leaves the queue entry deleted but the local
+ chunkblob still on disk — a **bounded local space leak,
+ not a correctness bug**. A periodic "orphan scan" reclaims
+ these.
+
+ The orphan scan covers two distinct sources of orphans:
+
+ - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the
+ case described above; queue entry was removed via Raft
+ but the local Pebble delete never fired.
+ - **PUT failure before chunkref Dispatch** — chunkblob
+ bytes were written to local Pebble in §3.2 step 2, then
+ the PUT aborted before reaching `coordinator.Dispatch`
+ (admission control 503, client disconnect, `PushChunkBlob`
+ quorum failure, request context cancel). In that
+ scenario neither an RC entry nor a GC queue entry was
+ ever written, so the sweeper's queue-range scan never
+ sees these orphans — only the orphan scan does.
+
+   Detection criterion (covers both): `!s3|chunkblob|<sha256>`
+ keys whose SHA has either no RC entry at all, or RC=0 with
+ no corresponding queue entry. The orphan scan runs at low
+ priority out of band from the sweeper (proposed default
+ `chunkBlobOrphanScanInterval = 1 hour`); it is the safety
+ net behind both the sweeper crash path and the PUT-abort
+ cleanup path, so the PUT handler does not need its own
+ best-effort local-delete on the abort path.
+ c. if the RC has bounced above 0 in the meantime, the queue
+ entry is stale (a re-reference txn forgot to remove it, or
+ the sweeper raced) and the sweeper deletes only the queue
+ entry through a Raft txn, leaving the chunkblob in place.
+
+The queue is the authoritative "blob is GC-eligible since T"
+signal *and* the global "we are GC-ing this SHA" lock — its
+Raft-replicated single-writer-per-key property is what makes
+concurrent sweepers safe across nodes. The RC is the
+authoritative "is reachable" signal, also Raft-replicated. Local
+chunkblob deletes are deliberately *not* replicated: each node
+deletes its own copy independently after the queue-entry txn
+commits, because that's the whole point of the architecture.
+
+`chunkBlobGCGracePeriod` defaults to 1 hour. The grace window
+absorbs in-flight reads (a peer that has already started fetching
+the blob completes its fetch before the sweeper runs), in-flight
+upload aborts (a multipart abort flips RCs to zero, then a retry
+creates a new manifest with the same chunks and bumps them back),
+and clock skew between nodes (we use the Raft `commitTS` from the
+RC-update txn, not wall clock, so skew is bounded to the
+HLC-physical-shift the cluster already tolerates).
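+
+A minimal sketch of one sweeper pass, with the storage and
+coordinator plumbing abstracted behind hypothetical function values;
+the conditional Raft delete folds in the RC re-check exactly as
+phase (i) requires:
+
+```go
+// sweepOnce runs the Raft phase first and the local phase second, so a
+// crash between them can only leak local disk space, never delete a
+// live chunkblob.
+func sweepOnce(ctx context.Context, gracePeriod time.Duration,
+    scanEligible func(context.Context, time.Time) ([][32]byte, error),
+    raftConditionalDeleteQueueEntry func(context.Context, [32]byte) (bool, error),
+    deleteLocalChunkBlob func(context.Context, [32]byte) error) error {
+
+    eligible, err := scanEligible(ctx, time.Now().Add(-gracePeriod))
+    if err != nil {
+        return err
+    }
+    for _, sha := range eligible {
+        // Phase i: Raft txn, conditional on "queue entry still present AND
+        // RC == 0". A false return means a re-reference (or a concurrent
+        // sweeper on another node) won the race; skip the blob.
+        won, err := raftConditionalDeleteQueueEntry(ctx, sha)
+        if err != nil || !won {
+            continue
+        }
+        // Phase ii: local-only delete. If this fails or the node crashes
+        // here, the orphan scan reclaims the leftover copy later.
+        if err := deleteLocalChunkBlob(ctx, sha); err != nil {
+            continue
+        }
+    }
+    return nil
+}
+```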
+
+### 3.6 Fetch protocol
+
+Two RPCs on the existing internal raft transport service — one
+follower-initiated (lazy fetch on miss), one leader-initiated
+(synchronous replication before chunkref commit, see §3.2 step 3):
+
+```protobuf
+service S3BlobFetch {
+ // FetchChunkBlob returns the bytes of a chunkblob this peer holds
+ // locally. Caller must verify SHA-256. Used by followers on the
+ // proxy-on-miss GET path (§3.3) and during snapshot catch-up.
+ rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse);
+
+ // PushChunkBlob streams a chunkblob from the leader to a follower
+ // and acks once the bytes are durable in the receiver's Pebble.
+ // Used by §3.2 step 3 to make chunkblob writes survive a leader
+ // crash without depending on the async fetch path catching up.
+ // The receiver SHOULD verify SHA-256 against the request header
+ // before fsync; mismatch fails the RPC and the leader retries.
+ rpc PushChunkBlob(stream PushChunkBlobRequest) returns (PushChunkBlobResponse);
+}
+
+message FetchChunkBlobRequest {
+ bytes content_sha256 = 1;
+}
+
+message FetchChunkBlobResponse {
+ bytes payload = 1;
+ bool eof = 2;
+}
+
+message PushChunkBlobRequest {
+ bytes content_sha256 = 1; // sent in the first frame only
+ bytes payload = 2;
+ bool eof = 3;
+}
+
+message PushChunkBlobResponse {
+ bool durable = 1; // true == fsynced
+}
+```
+
+Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The
+existing gRPC `MaxRecvMsgSize = 64 MiB` (PR #593 → `internal.GRPCCallOptions`)
+already covers this in a single RPC, but streaming keeps the
+implementation symmetric with how the future Raft streaming
+transport (proposed under
+`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`)
+handles large payloads.
+
+### 3.7 Backwards compatibility & rollout
+
+The legacy `BlobKey` path remains available. New PUTs use the
+offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data
+keeps reading through the legacy `BlobKey` path until a background
+migrator (separate proposal) rewrites it. Mixed keyspace coexistence
+works because `!s3|chunkblob|*` and the legacy
+`!s3|blob|<bucket>|<generation>|...` namespaces are disjoint.
+
+The opt-in flag stays for at least one full release cycle so we can
+revert by flipping a single env var if any of the following surface:
+
+- a SHA-256 collision (~zero probability but a hard kill criterion),
+- a follower fetch storm overwhelming peer-to-peer bandwidth,
+- a GC bug that leaks reachable blobs,
+- semi-synchronous `PushChunkBlob` latency exceeding the legacy
+ PUT p99 by an unacceptable margin (the soak-test acceptance
+ criterion in §5).
+
+### 3.8 Mixed-version cluster behaviour
+
+Until *every* node in the cluster speaks the offload protocol, PUTs
+on the offload path cannot proceed safely: a node that does not
+implement `PushChunkBlob` cannot ack a quorum write, and a follower
+that does not implement `FetchChunkBlob` cannot resolve a chunkref
+on apply. We therefore gate the offload path on cluster-level
+feature negotiation rather than a single env var:
+
+- A node advertises offload capability by setting
+ `feature_s3_blob_offload=true` in the `AdminServer.GetClusterOverview`
+ response (alongside the existing role / version metadata).
+- The leader inspects every peer's advertised capabilities at PUT
+ admission time. If any peer is missing the capability, the PUT
+ falls back to the legacy `BlobKey` path for that request — even
+ if the leader has the env var enabled.
+- During an upgrade window the leader continues to emit legacy
+ writes; once the last peer rolls and re-advertises, subsequent
+ PUTs flip to offload automatically. A roll-back works the same
+ way in reverse: the first downgraded peer drops its capability
+ flag and the leader resumes legacy emission within the next
+ capability-refresh interval (default 30 s).
+- Reads always succeed regardless of mixed state, because both
+ keyspaces are namespaced and the GET path checks legacy then
+ offload (or vice-versa) and serves whichever resolves.
+
+This gives operators a **strict two-step rolling upgrade** with no
+PUT data path that depends on a half-upgraded cluster:
+
+1. Roll out the new binary with `ELASTICKV_S3_BLOB_OFFLOAD=false`
+ on every node. PUTs continue on the legacy path. Validate
+ stability for a soak window (24 h on the canary cluster in
+ §5's M0 acceptance criteria).
+2. Flip `ELASTICKV_S3_BLOB_OFFLOAD=true` on the leader, then on
+ followers. Once every node advertises capability, PUTs switch
+ to the offload path.
+
+Roll-back: flip the env var to `false` on any node; the leader's
+capability check sees the disagreement and falls back to legacy
+within ≤ refresh interval. The migrator (M4) is independently
+gated and never runs during a roll-back window.
+
+A node with `ELASTICKV_S3_BLOB_OFFLOAD=true` running against a
+cluster where offload is disabled (e.g. a stuck rollout) is safe
+— it advertises capability but the leader's per-PUT capability
+check sees other peers missing it and routes through legacy. No
+data is written into the offload keyspace until a quorum of
+capability-advertising peers exists.
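+
+A minimal sketch of the per-PUT gate, assuming a membership cache
+(the hypothetical `peerCapabilities` map) refreshed from
+`AdminServer.GetClusterOverview` every capability-refresh interval:
+
+```go
+// offloadAllowed: use the offload path only when the local flag is on
+// AND every peer currently advertises the capability.
+func offloadAllowed(localFlagEnabled bool, peerCapabilities map[string]bool) bool {
+    if !localFlagEnabled {
+        return false
+    }
+    for _, hasOffload := range peerCapabilities {
+        if !hasOffload {
+            return false // any non-advertising peer forces the legacy BlobKey path
+        }
+    }
+    return true
+}
+```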
+
+## 4. Interaction with related subsystems
+
+- **PR #636 + admission control.** The admission control budget
+ drops in importance under the offload path because Raft entries
+ are tiny (~100 B per chunkref). However the *body bytes still
+ flow through HTTP* and `prepareStreamingPutBody` continues to
+ hold them in memory until the local Pebble write returns. The
+ admission cap must stay; only the per-peer Raft-side worst-case
+ bound (`MaxInflight × MaxSizePerMsg = 4 GiB`) gets *much* easier
+ to honour.
+- **PR #589 (snapshot tuning) and PR #614 (etcd-snapshot-disk-offload).**
+ Already implemented. The offload path makes those tunables more
+ effective by reducing the per-snapshot byte count.
+- **`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.**
+ The blob-fetch RPC reuses the same chunked-streaming approach
+ proposed for Raft transport. We can land both behind the same
+ abstraction.
+- **Lease read & MVCC snapshot reads.** No change. Manifests remain
+ the linearisation point; chunk bytes are immutable once committed
+ (content-addressable), so a stale local copy on a follower is
+ still correct.
+
+## 5. Implementation plan
+
+| Milestone | Scope | Risk |
+|---|---|---|
+| M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). |
+| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. **`FetchChunkBlob` and `PushChunkBlob` RPCs ship in this milestone** because both M1 PUT (semi-synchronous push) and M1 GET (proxy-on-miss) depend on them — without them M1 GET could only serve local-hit or 503. | Medium (race ordering, RPC plumbing on the request goroutine). |
+| M2 | Async fetch worker pool for follower apply (catch-up after a long absence). Independent of M1's synchronous `FetchChunkBlob` use on the GET path. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). |
+| M3 | Reference-count + grace-period GC (the queue-based scheme in §3.5). | Medium (correctness of RC under concurrent ops). |
+| M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). |
+
+Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default):
+
+- WAL growth per GiB of S3 PUT < 1 MiB on a one-week soak test.
+- Snapshot transfer for a follower restart in a 100 GiB cluster completes
+ in < 60 s on a 1 Gbps interconnect.
+- No regression in PUT p99 latency or GET p99 latency vs. the legacy
+ path (measured on the 24 h pre-cutover window).
+
+## 6. Risks
+
+- **Race between local chunkblob write and the chunkref commit.**
+ Mitigated by writing the chunkblob to a local Pebble batch with
+ fsync *before* the chunkref enters `coordinator.Dispatch`. The
+ manifest commit is the linearisation point; if the chunkblob is
+ durable on the leader, peers can fetch it as soon as they apply.
+- **Follower fetch storm.** A new follower that catches up sees a
+ flood of `chunkref` Puts and could DDoS the source peer with
+ fetches. Mitigation: bounded fetch worker pool + token bucket
+ per-source. The Raft apply loop does *not* block on the fetch —
+ it stages `chunkref` and lets the fetch lag — so apply latency
+ stays bounded.
+- **SHA-256 collision.** Operationally improbable; shipped with a
+ metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option
+ for paranoid operators.
+- **Leader-only durability before chunkref commit.** Without
+ intervention, a leader crash between writing the chunkblob to its
+ own Pebble and the eventual async fetch on followers would leave
+ a Raft-committed chunkref pointing at a chunkblob no surviving
+ node has. Mitigation: §3.2 step 3 — synchronous semi-quorum
+ replication via `PushChunkBlob` before the chunkref enters Raft.
+ `chunkBlobMinReplicas` defaults to a Raft-quorum-equivalent
+ floor; operators who want N-way durability bump it explicitly.
+ This restores end-to-end durability parity with the legacy
+ "every byte through Raft" path at the cost of one extra fsync
+ per chunkblob on the followers in the quorum.
+- **Cross-tenant blob fetch via SHA-256 guessing.** Because
+ `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant
+ could (a) guess a victim tenant's chunk SHA and (b) bypass
+ authorisation, they could exfiltrate the chunk. SHA-256 guessing
+ is computationally infeasible for non-trivial content, but we
+ remove the second prerequisite by enforcing authorisation
+ *exclusively at the `chunkref` layer*. The `S3BlobFetch.FetchChunkBlob`
+ RPC is internal-only (raft-transport credentials, not exposed to
+ S3 clients); user-facing GET resolves through `chunkref` first,
+ which carries the bucket / key tenancy context, and only after
+ the ACL check does the server proxy to a peer for the
+ corresponding `chunkblob`. A future design that exposes
+ blob fetch on a public surface would need to reintroduce
+ tenant-scoped authorisation at the blob layer; this proposal
+ intentionally does not.
+
+## 7. Out of scope (future work)
+
+- Cross-cluster blob replication (CRR / disaster recovery).
+- Tiered storage (cold blobs to S3-IA / Glacier-equivalent).
+- Erasure coding for blob payloads.
+- Compression. The current S3 spec is "the bytes the client sent";
+ any compression layer is a separate negotiated feature.
+
+## 8. Open questions
+
+- Do we need a per-follower bandwidth cap on blob fetch? If the
+ cluster network is constrained, a runaway catch-up could starve
+ user-path GET / Raft heartbeat traffic. Probably yes — defer to
+ the workload-isolation rollout.
+- Is content-addressing at the chunk granularity (`s3ChunkSize = 1 MiB`)
+ the right unit, or should we content-address whole objects and
+ range-fetch sub-chunks? The chunk granularity matches what
+ `prefetchObjectChunks` already does and keeps content addressing
+ predictable; whole-object addressing would require re-hashing on
+ partial reads. Tentatively: chunk granularity for v1.