From 577106d6f0e03e6f9e784b519674550287850bff Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 23:36:01 +0900 Subject: [PATCH 1/8] docs(design): propose S3 admission control + raft blob offload MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two companion proposals to PR #636 (s3ChunkBatchOps=4, Raft entry size aligned with MaxSizePerMsg). PR #636 fixes the per-entry memory accounting; these docs cover the two follow-up axes the in-PR discussion identified: - 2026_04_25_proposed_s3_admission_control.md Hard cap on the aggregate S3 PUT body bytes a node accepts but has not yet committed to Raft. Per-PUT pipeline is already capped by MaxInflight × MaxSizePerMsg; this proposal bounds the multiplier from concurrent PUTs so leader-side worst-case memory has a hard ceiling regardless of client behaviour. Two-level admission (Content-Length pre-charge + per-batch slot) with HTTP 503 SlowDown rejection. - 2026_04_25_proposed_s3_raft_blob_offload.md Take large chunk payloads off the Raft log entirely. Raft replicates ChunkRefs (32-byte SHA256 + size) plus the manifest; chunk bytes travel via a side-channel gRPC fetch protocol between peers. Snapshot size becomes O(manifest count), follower catch-up is amortised by lazy fetch, and WAL growth no longer scales with stored bytes. Both docs are filed as proposals only; no code changes. They follow the docs/design/README.md filename + header conventions and cross-reference each other plus the prerequisite PRs (#593, #600, #612, #617, #636) and the workload isolation roadmap. 
--- ...026_04_25_proposed_s3_admission_control.md | 280 +++++++++++++++ ...026_04_25_proposed_s3_raft_blob_offload.md | 332 ++++++++++++++++++ 2 files changed, 612 insertions(+) create mode 100644 docs/design/2026_04_25_proposed_s3_admission_control.md create mode 100644 docs/design/2026_04_25_proposed_s3_raft_blob_offload.md diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md new file mode 100644 index 000000000..6700be7f4 --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -0,0 +1,280 @@ +# S3 PUT admission control + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the workload-class +> isolation proposal (`docs/design/2026_04_24_proposed_workload_isolation.md`). +> Where PR #636 fixes the *per-entry* memory accounting and the +> isolation doc keeps Raft's CPU budget separate from heavy command +> paths, this doc bounds the *aggregate* in-flight memory that S3 +> PUT body bytes can pin in the leader's Raft pipeline regardless of +> how many clients are uploading at once. + +--- + +## 1. Problem + +Even with PR #636 reducing the per-entry size to 4 MiB and aligning +with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is +governed by client concurrency, which the server cannot currently +limit: + +``` +leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size + ≈ concurrent_PUTs × MaxInflight × 4 MiB +``` + +A single PUT's pipeline is already capped by `MaxInflight = 1024` Raft +in-flight messages times 4 MiB per entry = 4 GiB / peer (the bound PR +#593 advertises). What multiplies that bound is concurrent uploads: + +- 4 simultaneous 5 GiB PUTs on a 4-vCPU 8 GiB-RAM node can pin + `4 × min(MaxInflight × 4 MiB, body_remaining)` on the leader. 
If a + follower stalls (GC pause, slow disk fsync) for several seconds, that + worst case is realised before backpressure kicks in. +- The 2026-04-24 incident showed that one workload class can wedge the + Go runtime for the entire cluster. S3 PUT body bytes pinned in the + Raft pipeline play the same role as XREAD blobs in the Redis path: + they grow with client behaviour, not with anything the operator can + rate-limit per-handler. +- `GOMEMLIMIT = 1800 MiB` plus `--memory = 2500m` (PR #617) set the + upper bound on heap, but they do not provide *flow control*. The Go + GC starts thrashing well before the limit, and memwatch (PR #612) + triggers graceful shutdown — a recovery mechanism, not a steady-state + control. + +There is no hard ceiling today on the aggregate body bytes that S3 +puts onto the Raft pipeline. PR #636 makes the per-entry slot +predictable; this proposal makes the *number of slots in flight* +predictable too. + +## 2. Goals & non-goals + +**Goals** + +- Hard cap on the total S3 PUT body bytes that have been read from + HTTP but not yet committed to Raft. This cap is a tunable property + of the S3 server, not of any single PUT. +- Steady-state behaviour where new PUTs are admitted at the rate Raft + can drain. Bursting clients see HTTP 503 with `Retry-After`, not + silent ballooning of leader memory. +- Composable with existing memwatch + GOMEMLIMIT. Admission failure + must be visible (metric + log line per rejected request) so an + operator can size the cap against a given memory budget. +- Read path is unaffected. GET / HEAD / LIST do not consume the same + budget. + +**Non-goals** + +- No global QPS quota across all S3 verbs. Other verbs (GET, multipart + initiate / complete) have well-bounded memory and don't need this. +- No per-bucket / per-key fairness. Fairness across tenants is a + separate workload-class concern (handled by the workload isolation + proposal). This doc only fixes the aggregate ceiling. 
+- No backpressure-via-TCP (`SO_RCVBUF` shrink). Decoupling from kernel + buffers keeps the budget testable and avoids cross-OS surprises. +- No multi-region / multi-cluster admission. One leader's perspective. + +## 3. Design + +### 3.1 Where the budget is checked + +Two natural insertion points exist on the PUT path: + +| Insertion point | Granularity | When the request is rejected | +|---|---|---| +| (A) Before `prepareStreamingPutBody` accepts the first byte | Whole-object | Pre-Raft, before any reads from the body | +| (B) Inside the `flushBatch` loop, before `coordinator.Dispatch` | Per-batch | Mid-stream, after `s3ChunkBatchOps` chunks are buffered | + +Recommended: **both, at different scales**. + +- **(A) is the request-admission cap.** Use the `Content-Length` + header (rejected when absent for PUTs that are not aws-chunked) to + pre-charge the budget. If the budget would be exceeded we reply 503 + immediately. This matches how AWS S3 itself handles + `SlowDown` / `ServiceUnavailable` and is the cheapest case to + surface — the body is never read from the socket. +- **(B) is the in-flight cap.** Even after a request is admitted, its + 4 MiB batches can pile up if Raft becomes slow. The flush loop + acquires a per-batch lease (4 MiB) from a shared semaphore *before* + reading the next chunk window. If the semaphore is empty for longer + than `dispatchAdmissionTimeout` (proposed default 30 s), the PUT + fails with 503 mid-stream. Dispatch latency stays bounded by the + semaphore size rather than by client behaviour. + +``` +client ─[Content-Length]─► (A) reserve full body bytes + │ + ▼ + prepareStreamingPutBody + │ + ▼ + (B) acquire 4 MiB slot ◄┐ (released on Dispatch ack) + │ + ▼ + coordinator.Dispatch ───┘ +``` + +### 3.2 Concrete cap values + +Default cap: + +```go +const ( + // s3PutAdmissionMaxInflightBytes is the hard ceiling on S3 PUT body + // bytes accepted by this node but not yet committed to Raft. 
Picked + // so that, on a 3-node cluster with MaxInflight × MaxSizePerMsg = + // 4 GiB / peer per PR #593, all in-flight PUT bytes plus their Raft + // replication shadow stay under one quarter of GOMEMLIMIT (1800 MiB + // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble + // memtables. + s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // dispatchAdmissionTimeout is how long a per-batch flush will wait + // for a slot before giving up. Sized comparable to the Raft + // in-flight queue's full drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s + // ≈ 33 s) so a transient stall does not surface as 503; longer + // stalls indicate the cluster is genuinely overloaded and rejection + // is the right signal. + dispatchAdmissionTimeout = 30 * time.Second +) +``` + +The cap is intentionally **per-node** rather than cluster-wide: +admission is enforced on the node receiving the HTTP request, which is +also the node whose memory is at risk. Clients hitting a different +node simply get a different budget. + +Both values are exposed as env vars (`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES`, +`ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT`) with the constants as +defaults, following the pattern PR #593 established for +`ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` etc. + +### 3.3 Data structure + +```go +type putAdmission struct { + semaphore chan struct{} // capacity == max / s3ChunkSize, charges in 1 MiB units + inflight atomic.Int64 // metric mirror; semaphore is the source of truth +} + +func newPutAdmission(maxBytes int64) *putAdmission { … } + +// reserve charges (bytes) units against the semaphore. Returns a +// release closure that the caller MUST call exactly once. Returns +// ErrAdmissionExhausted if the deadline elapses before the budget +// is available. +func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) { … } +``` + +`reserve` is non-trivial because we need partial-charge semantics: + +1.
Pre-charge `Content-Length` total at request entry (admission A). + The release fires in a deferred handler on the request goroutine + regardless of success / failure. +2. Per-batch sub-lease (admission B) is *not* a separate budget — it's + a synchronisation primitive on the existing semaphore. The PUT + handler acquires `s3ChunkSize × s3ChunkBatchOps = 4 MiB` units + before reading the next 4 MiB window from the body and releases + them on `coordinator.Dispatch` ack. This way the same budget covers + both the "request is in flight" and "this batch is buffered for + Raft" phases. + +For aws-chunked transfers (`Content-Length == -1`), the request-entry +charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB) +reservation. The downside is that one chunked PUT can monopolise the +budget; the upside is correctness without re-reading headers. +We will instrument a metric to find out empirically how large that +hit actually is before optimising. + +### 3.4 Failure mode + +- `503 Service Unavailable` with `Retry-After: 1` (small, jittered). + AWS S3 SDK clients (boto3, aws-sdk-go-v2) auto-retry this code with + exponential backoff out of the box. +- Body for the response is the standard S3 XML error envelope: + `<Code>SlowDown</Code><Message>Reduce your request rate</Message>`. + This is the AWS-defined code for admission rejection and matches + what real S3 returns. +- Mid-stream rejection (admission B) closes the connection with a + `connection: close` header so partial body reads do not corrupt the + client's pipeline. The PUT handler also calls + `cleanupManifestBlobs` for any partial blobs that already landed in + Pebble. + +### 3.5 Metrics + +``` +elastickv_s3_put_admission_inflight_bytes gauge +elastickv_s3_put_admission_rejections_total counter (label: stage="prereserve" | "perbatch") +elastickv_s3_put_admission_wait_seconds histogram (label: stage) +``` + +Grafana panel: inflight gauge with the cap as a horizontal line so +the operator sees how often the system saturates.
Rejection rate +suggests bumping the cap or scaling out (more nodes spread PUT load). + +## 4. Interaction with related subsystems + +- **PR #636 (entry size alignment).** The 4 MiB per-batch unit is + the natural admission grain. The two changes are independent: + alignment is necessary for the per-peer Raft bound to be correct; + admission is necessary for the *aggregate* bound to be hard. +- **PR #612 (memwatch graceful shutdown).** Continues to function as + the last-resort safety net. Admission control should fire at well + below the memwatch threshold (`s3PutAdmissionMaxInflightBytes` is + ~14% of `GOMEMLIMIT`) so memwatch sees a much lower steady-state + pressure and the graceful-shutdown path stays a rare event. +- **Workload isolation proposal.** That doc proposes per-class CPU + reservation for Raft. Admission control is the memory-axis sibling. + Both are needed — limiting CPU does not bound queue depth. +- **`coordinator.Dispatch` retries.** Today the S3 path has its own + retry loop (`s3TxnRetryMaxAttempts = 8`). Admission must release + its budget around retries, otherwise a long retry chain +double-counts. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M1 | Add `putAdmission` type + per-node singleton + `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. Metric scaffolding only (gauge + counter). | Low. No mid-stream cancellation yet. | +| M2 | Per-batch admission B inside `flushBatch`. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. | Medium. Cleanup path on partial failure. | +| M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | +| M4 | aws-chunked path: emit a warn log when the conservative 5 GiB reservation kicks in, and add a follow-up to charge the actually-decoded byte count incrementally. | Medium. Needs `awsChunkedReader` to expose a progress callback.
| + +Acceptance criteria: + +- `go test ./adapter/ -short -run TestS3PutAdmission` covers reject / + admit / mid-stream-timeout paths. +- A load test that opens 32 concurrent PUTs of 100 MiB each must + hold leader memory below `s3PutAdmissionMaxInflightBytes + epsilon` + for the duration of the test. +- No regression in `Test_grpc_transaction` (which is currently the + longest leader-stress test). + +## 6. Risks + +- **Tail-latency for legitimate clients.** A long-running PUT that + fails to acquire a 4 MiB slot mid-stream returns 503 even though it is making + progress. Mitigated by `dispatchAdmissionTimeout = 30s`, well above + the steady-state Raft drain time. If we observe spurious 503s in + practice, make the timeout a config knob and tune it. +- **Operator confusion.** "Why does S3 return 503 when CPU is at + 20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After` + value so SDK behaviour is predictable. +- **aws-chunked overcharge.** A small chunked PUT charges 5 GiB + against the budget until M4 lands. We accept this temporarily + because aws-chunked traffic is rare in practice and the + conservative cap fails safe. + +## 7. Out of scope (future work) + +- Per-bucket admission classes (e.g. system buckets get their own + budget). Punted to the workload-isolation rollout. +- Coordinated admission across the multi-region read replica path + proposed in `docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`. +- Token-bucket rate-shaping (e.g. bytes-per-second). The current + proposal only bounds *concurrent* bytes; rate-shaping is a separate + policy choice.
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md new file mode 100644 index 000000000..e4aa37bfb --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -0,0 +1,332 @@ +# S3 raft blob offload — keep large object payloads out of the Raft log + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the S3 PUT +> admission-control proposal +> (`docs/design/2026_04_25_proposed_s3_admission_control.md`). +> +> PR #636 caps the *per-entry* size; admission control caps the +> *aggregate in-flight* memory; this doc removes large blob payloads +> from the *Raft log itself* so that snapshots and follower catch-up +> stay bounded as the data set grows. + +--- + +## 1. Problem + +Today every byte of an S3 object travels through the Raft log: + +``` +HTTP PUT body ─► s3ChunkSize (1 MiB) chunks + ─► s3ChunkBatchOps × 1 MiB Raft entry + ─► Raft log entry (Pebble WAL on every node) + ─► applyLoop → s3keys.BlobKey(...) → MVCCStore Put +``` + +`s3keys.BlobKey(bucket, generation, objectKey, uploadID, partNo, chunkNo)` +is the key actually written to Pebble; the log entry that proposed it +contains the full chunk value. Two consequences: + +1. **WAL & snapshot growth scales with object bytes.** A node that + serves 100 GiB of S3 PUT traffic ends up with a 100 GiB Pebble WAL + plus the same 100 GiB persisted in the engine's state machine. + Snapshot transfer for a falling-behind follower carries the + payload twice — once as Raft log replication (during catch-up + inside the WAL window) and once as a snapshot dump if the leader + has already truncated. +2. 
**Follower catch-up after a long absence is expensive.** Right now + a follower that misses 5 GiB of PUT traffic re-applies it as Raft + log entries, each going through the full `applyRequests` → + `MVCCStore.Put` path on a single thread. The ApplyLoop becomes a + bottleneck that holds the rest of the cluster waiting for the + `commitIndex` advance. + +S3 blob payloads are special: they are **idempotent, content- +addressable, large, and rarely re-read inside the leader's apply +window**. They look more like attachments than like Raft log records. +Treating them as Raft entries is overkill — Raft only needs to +linearise *what changed* (the manifest), not *the chunk bytes +themselves*. + +## 2. Goals & non-goals + +**Goals** + +- Raft log entries for S3 PUT carry a *reference* to the chunk + payload, not the payload itself. Replication traffic and WAL + size become O(manifest size), not O(object size). +- Followers fetch blob payloads out-of-band when they apply a + manifest reference. Apply order remains Raft-defined; only the + bytes are pulled lazily. +- Snapshot transfer for a falling-behind follower is O(manifest + count + small blob index), not O(stored bytes). +- The new path is opt-in and lives alongside the current direct-Raft + path until parity is proven. Existing S3 traffic is unaffected + during rollout. + +**Non-goals** + +- No external object store dependency (S3, MinIO, Ceph). The blob + offload uses Pebble itself plus a peer-to-peer fetch protocol; + introducing an external dependency would re-create the operational + surface elastickv exists to replace. +- No deduplication across objects. Each `(bucket, objectKey, + uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does + not deduplicate at this layer. +- No change to MVCC semantics. Manifest commits remain the + serialisation point; blob fetch is a side channel that does not + change visibility rules. +- No removal of the existing `BlobKey` path in this doc. 
We will + ship in two stages (manifest-only writes through Raft, blob + payload via a side channel), and the legacy path stays available + until enough operational evidence accumulates. + +## 3. Design + +### 3.1 New keyspace + +``` +!s3|chunkref|<bucket>|<generation>|<objectKey>|<uploadID>|<partNo>|<chunkNo> + → ChunkRef{ + ContentSHA256 [32]byte + Size uint64 + // Optional: leader-locality hints. Followers without the + // payload locally fetch from a peer that advertises the + // chunk in its catalog. + SourcePeer NodeID + } + +!s3|chunkblob|<contentSHA256> + → raw bytes (the chunk payload) +``` + +Two separate keyspaces: + +- `chunkref` is replicated through Raft. Cheap (32 B + small header + per chunk) and ordered against the manifest commit. +- `chunkblob` is **not** written through Raft. It is written + directly to Pebble on the receiving node and pulled by peers via + the new fetch protocol when they apply the corresponding + `chunkref`. + +### 3.2 PUT path + +``` +client ─► HTTP PUT body + ─► chunk loop (s3ChunkSize): + 1. compute SHA-256 of chunk + 2. write chunk to LOCAL Pebble at !s3|chunkblob|<contentSHA256> + 3. queue ChunkRef into pendingBatch + ─► flushBatch: + coordinator.Dispatch(OperationGroup{ + Elems: [ chunkref Puts ... + manifest Put ], + }) + ─► HTTP 200 OK once Dispatch acks +``` + +The order matters: the chunk *must* be persisted in local Pebble +before its `chunkref` is committed through Raft, otherwise a +follower that applies the `chunkref` and immediately re-fetches +might race the leader's local write. Pebble's WAL fsync gives the +ordering guarantee; the chunk write returns only after the data is +durable. + +The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024 +of them = ~100 KiB of Raft log payload. Compared with today's +1 GiB through Raft, that is a **10⁴× reduction** in Raft log +write amplification. + +### 3.3 Follower apply path + +When a follower's apply loop sees a `chunkref` Put: + +1. Stage the `chunkref` key in MVCCStore as usual. +2.
Schedule an async fetch of `!s3|chunkblob|<contentSHA256>` from + `SourcePeer` (or a quorum-style fanout to all known peers). +3. The fetch worker writes the chunk to local Pebble at + `!s3|chunkblob|<contentSHA256>` once the body arrives and verifies the + SHA-256 (mismatch → drop and retry from another peer). + +GET / range-read on the follower checks the local `chunkblob` +first; if absent (because the async fetch is still pending), it +either: + +- proxies the read to a peer that *does* have the chunk (using the + `SourcePeer` hint and falling back to a fanout), or +- replies 503 with `Retry-After`, identical to S3's behaviour + during a region failover. + +The choice is per-deployment; Phase-1 ships proxy-on-miss. + +### 3.4 Snapshot + +Today a follower snapshot dump includes every `BlobKey` Pebble has +ever stored. Under the new design: + +- The Raft snapshot serialises only `chunkref` keys plus the rest + of the MVCC state (manifests, bucket meta, ACLs). +- A separate **blob catalog snapshot** lists every locally-held + `chunkblob` SHA. This is included in the snapshot stream as a + manifest of "blobs you should fetch from me on demand." +- Once the follower has consumed the Raft snapshot and the blob + catalog, it begins serving GET / HEAD by proxying chunkblob + fetches to peers as in 3.3. + +A 100 GiB cluster's Raft snapshot drops from ~100 GiB to a few +megabytes (one `chunkref` per chunk, plus the manifest set). The +blob catalog adds 32 B × N_chunks = ~3 MiB per 100 GiB. Snapshot +stream time falls by orders of magnitude; the recovery cost +shifts from "leader's WAL dump" to "follower's lazy blob fetch +amortised across reads." + +### 3.5 Garbage collection + +A blob whose `chunkref` has been deleted (DELETE, lifecycle policy, +object version pruned, manifest aborted) is reclaimable. We handle +this in two stages: + +1. Reference counting via a `!s3|chunkref-rc|<contentSHA256>` counter + updated inside the same Raft txn that adds / removes a + `chunkref`. RC == 0 marks the blob as eligible. +2.
A node-local sweeper periodically (e.g. every 5 minutes) + scans `!s3|chunkblob|*` and deletes blobs whose RC is 0 and + whose RC has been 0 for at least `chunkBlobGCGracePeriod` + (proposed default 1 hour). The grace window covers in-flight + reads and avoids deleting a blob a peer is just about to fetch. + +Sweeper deletion is local — it does not pass through Raft, because +the authoritative state ("is this blob unreachable?") is the +already-committed RC. Followers run independent sweepers and arrive +at the same conclusion. + +### 3.6 Fetch protocol + +A new gRPC method on the existing internal raft transport service: + +```protobuf +service S3BlobFetch { + // FetchChunkBlob returns the bytes of a chunkblob this peer holds + // locally. Caller must verify SHA-256. + rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); +} + +message FetchChunkBlobRequest { + bytes content_sha256 = 1; +} + +message FetchChunkBlobResponse { + bytes payload = 1; + bool eof = 2; +} +``` + +Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The +existing gRPC `MaxRecvMsgSize = 64 MiB` (PR #593 → `internal.GRPCCallOptions`) +already covers this in a single RPC, but streaming keeps the +implementation symmetric with how the future Raft streaming +transport (proposed under +`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`) +handles large payloads. + +### 3.7 Backwards compatibility & rollout + +The legacy `BlobKey` path remains available. New PUTs use the +offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data +keeps reading through the legacy `BlobKey` path until a background +migrator (separate proposal) rewrites it. Mixed clusters work +because both keyspaces are namespaced. 
+ +The opt-in flag stays for at least one full release cycle so we can +revert by flipping a single env var if any of the following surface: + +- a SHA-256 collision (~zero probability but a hard kill criterion), +- a follower fetch storm overwhelming peer-to-peer bandwidth, +- a GC bug that leaks reachable blobs. + +## 4. Interaction with related subsystems + +- **PR #636 + admission control.** The admission control budget + drops in importance under the offload path because Raft entries + are tiny (~100 B per chunkref). However the *body bytes still + flow through HTTP* and `prepareStreamingPutBody` continues to + hold them in memory until the local Pebble write returns. The + admission cap must stay; only the per-peer Raft-side worst-case + bound (`MaxInflight × MaxSizePerMsg = 4 GiB`) gets *much* easier + to honour. +- **PR #589 (snapshot tuning) and PR #614 (etcd-snapshot-disk-offload).** + Already implemented. The offload path makes those tunables more + effective by reducing the per-snapshot byte count. +- **`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.** + The blob-fetch RPC reuses the same chunked-streaming approach + proposed for Raft transport. We can land both behind the same + abstraction. +- **Lease read & MVCC snapshot reads.** No change. Manifests remain + the linearisation point; chunk bytes are immutable once committed + (content-addressable), so a stale local copy on a follower is + still correct. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. GET path checks chunkblob locally with proxy-on-miss. | Medium (race ordering). | +| M2 | Follower fetch protocol + async fetch worker pool. 
SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC. | Medium (correctness of RC under concurrent ops). | +| M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | + +Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): + +- WAL growth per GiB of S3 PUT < 1 MiB on a one-week soak test. +- Snapshot transfer for a 100 GiB-cluster follower restart completes + in < 60 s on a 1 Gbps interconnect. +- No regression in PUT p99 latency or GET p99 latency vs. the legacy + path (measured on the 24 h pre-cutover window). + +## 6. Risks + +- **Race between local chunkblob write and the chunkref commit.** + Mitigated by writing the chunkblob to a local Pebble batch with + fsync *before* the chunkref enters `coordinator.Dispatch`. The + manifest commit is the linearisation point; if the chunkblob is + durable on the leader, peers can fetch it as soon as they apply. +- **Follower fetch storm.** A new follower that catches up sees a + flood of `chunkref` Puts and could DDoS the source peer with + fetches. Mitigation: bounded fetch worker pool + token bucket + per-source. The Raft apply loop does *not* block on the fetch — + it stages `chunkref` and lets the fetch lag — so apply latency + stays bounded. +- **SHA-256 collision.** Operationally improbable; shipped with a + metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option + for paranoid operators. +- **Local-blob-only on a single node.** If only one node has a + given chunkblob and that node fails before peers fetch it, data + is lost. Mitigation: PUT path replicates the chunkblob to *N* + peers asynchronously before returning 200 (e.g. quorum write + outside of Raft). N defaults to 2 (one extra copy = parity with + Raft's quorum durability). + +## 7. 
Out of scope (future work) + +- Cross-cluster blob replication (CRR / disaster recovery). +- Tiered storage (cold blobs to S3-IA / Glacier-equivalent). +- Erasure coding for blob payloads. +- Compression. The current S3 spec is "the bytes the client sent"; + any compression layer is a separate negotiated feature. + +## 8. Open questions + +- Do we need a per-follower bandwidth cap on blob fetch? If the + cluster network is constrained, a runaway catch-up could starve + user-path GET / Raft heartbeat traffic. Probably yes — defer to + the workload-isolation rollout. +- Is content-addressing at the chunk granularity (`chunkSize = 1 MiB`) + the right unit, or should we content-address whole objects and + range-fetch sub-chunks? The chunk granularity matches what + `prefetchObjectChunks` already does and keeps content addressing + predictable; whole-object addressing would require re-hashing on + partial reads. Tentatively: chunk granularity for v1. From bfd4a56f616f8a8abd2134a6e6d7237a19b78783 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 00:39:23 +0900 Subject: [PATCH 2/8] docs(design): clarify dedup non-goal for blob offload (Gemini medium) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Gemini medium on PR #637 — the original "No deduplication across objects" non-goal contradicts §3.1 (chunkblob is content-addressed by SHA-256) and §3.5 (reference-counted GC). Two distinct objects whose chunks happen to hash identically *will* share a chunkblob row; that is a structural property of content addressing. Rewrite the bullet to be precise about what we are *not* promising without contradicting what the design does: - Surface no dedup API / metering / billing surface. - Reference layer (chunkref) stays per-object so DELETE / lifecycle still reason about objects independently. 
- Authorisation enforcement remains on chunkref reads, never on chunkblob reads — see the new "Cross-tenant blob fetch" risk entry that pins the threat model. Add a matching risk entry under §6 covering the SHA-256-guessing attack the dedup non-goal alludes to: feasible only if a tenant can both guess a victim's chunk SHA *and* bypass auth. We block the second by keeping S3BlobFetch internal-only and gating GET through chunkref ACL before the proxy-on-miss path is invoked. A future design that exposes blob fetch on a public surface would need to reintroduce tenant-scoped authorisation at the blob layer; this proposal explicitly does not. No code changes — design doc only. --- ...026_04_25_proposed_s3_raft_blob_offload.md | 34 +++++++++++++++++-- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index e4aa37bfb..52106f3a1 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -74,9 +74,22 @@ themselves*. offload uses Pebble itself plus a peer-to-peer fetch protocol; introducing an external dependency would re-create the operational surface elastickv exists to replace. -- No deduplication across objects. Each `(bucket, objectKey, - uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does - not deduplicate at this layer. +- No *user-facing* deduplication API or storage-accounting credit. + The `chunkblob` keyspace is content-addressed by SHA-256 (§3.1) + and reference-counted (§3.5), which means two distinct objects + whose chunks happen to hash identically will share one + `chunkblob` row at the storage layer — that is a structural + property of content addressing, not a feature we expose. 
We do + *not*: surface dedup ratios, charge storage by post-dedup bytes, + rebalance dedup credit across tenants, or treat dedup hits as + semantically observable from S3 verbs. The reference layer + (`chunkref`) keeps `(bucket, objectKey, uploadID, partNo, + chunkNo)` granular and per-object, so DELETE / lifecycle still + reason about objects independently. Authorisation enforcement + remains on `chunkref` reads, never on `chunkblob` reads + (§3.3 covers the proxy-on-miss path; ACL checks fire before the + blob fetch is initiated, so a tenant cannot dereference a peer's + `chunkblob` by guessing a SHA — see §6 *Cross-tenant blob fetch*). - No change to MVCC semantics. Manifest commits remain the serialisation point; blob fetch is a side channel that does not change visibility rules. @@ -309,6 +322,21 @@ Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD= peers asynchronously before returning 200 (e.g. quorum write outside of Raft). N defaults to 2 (one extra copy = parity with Raft's quorum durability). +- **Cross-tenant blob fetch via SHA-256 guessing.** Because + `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant + could (a) guess a victim tenant's chunk SHA and (b) bypass + authorisation, they could exfiltrate the chunk. SHA-256 guessing + is computationally infeasible for non-trivial content, but we + remove the second prerequisite by enforcing authorisation + *exclusively at the `chunkref` layer*. The `S3BlobFetch.FetchChunkBlob` + RPC is internal-only (raft-transport credentials, not exposed to + S3 clients); user-facing GET resolves through `chunkref` first, + which carries the bucket / key tenancy context, and only after + the ACL check does the server proxy to a peer for the + corresponding `chunkblob`. A future design that exposes + blob fetch on a public surface would need to reintroduce + tenant-scoped authorisation at the blob layer; this proposal + intentionally does not. ## 7. 
Out of scope (future work) From 4d0707b251cac10770b3446622868a730a93a3b4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 00:44:34 +0900 Subject: [PATCH 3/8] docs(design): durability + rolling-upgrade + chunked admission MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round-2 review on PR #637 — Gemini high (durability, rolling upgrade) + medium (aws-chunked head-of-line blocking). s3_raft_blob_offload.md (Gemini high): - §3.2: chunkblob writes now go through semi-synchronous quorum replication via a new PushChunkBlob RPC before the chunkref enters Raft. Without it, a leader crash between the chunkref commit and the eventual async fetch on followers would leave a Raft-committed manifest pointing at a chunkblob no surviving node has — Raft's quorum bound the chunkref but said nothing about the blob payload. chunkBlobMinReplicas defaults to a Raft-quorum-equivalent floor, restoring end-to-end durability parity with the legacy "every byte through Raft" path. - §3.6: add the PushChunkBlob RPC alongside the existing FetchChunkBlob, with a Push-side fsync ack and a streaming request shape symmetric to the fetch path. - §3.8 (new): explicit two-step rolling-upgrade strategy gated on cluster-wide capability advertisement (feature_s3_blob_offload in GetClusterOverview). PUTs only flip to the offload path when *every* peer advertises the capability; any disagreement causes the leader to route the PUT through the legacy BlobKey path. Roll-back works the same way in reverse with no data loss because both keyspaces are namespaced and reads serve whichever resolves first. - §6: replace the optimistic "Local-blob-only on a single node" bullet with the explicit "Leader-only durability before chunkref commit" risk + the §3.2 mitigation. 
s3_admission_control.md (Gemini medium): - §3.3.1 (new): aws-chunked transfers no longer pre-charge s3MaxObjectSizeBytes (5 GiB on default tunables, 20× over the 256 MiB cap and a guaranteed head-of-line block for every other PUT). Bootstrap reservation is now s3RaftEntryByteBudget (4 MiB) and the rest is paid-as-decoded via an awsChunkedReader progress callback that hooks into the same per-batch admission the fixed-length path uses. The legacy 5 GiB pre-charge stays available behind ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false for incident response, but ships disabled by default. - §5: chunked progress callback moves from M4 to M1 so the HoL hazard is closed in the first shippable milestone. Adds an explicit "Rolling upgrade" subsection covering mixed-cluster safety (additive cap, no protocol change) and the new observable client behaviour (mid-stream 503 SlowDown for chunked PUTs that beat Raft drain). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 87 ++++++++-- ...026_04_25_proposed_s3_raft_blob_offload.md | 148 +++++++++++++++--- 2 files changed, 207 insertions(+), 28 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 6700be7f4..4f9be3587 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -182,12 +182,54 @@ func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) both the "request is in flight" and "this batch is buffered for Raft" phases. -For aws-chunked transfers (`Content-Length == -1`), the request-entry -charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB) -reservation. The downside is that one chunked PUT can monopolise the -budget; the upside is correctness without re-reading headers. -We will instrument a metric to find out empirically how large that -hit actually is before optimising. 
+### 3.3.1 aws-chunked transfers (`Content-Length: -1`) + +A naïve "reserve `s3MaxObjectSizeBytes` (5 GiB) up front" is rejected: +on default tunables (`s3PutAdmissionMaxInflightBytes = 256 MiB`) a +single chunked PUT would consume **20×** the entire budget at request +admission time, head-of-line-blocking every other PUT until the +chunked stream finishes — exactly the failure mode admission control +exists to prevent. We therefore split chunked admission across two +mechanisms instead of pre-charging: + +1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at + request entry. This is enough to admit the request and let the + awsChunkedReader produce its first decoded window. Chunked PUTs + are not "free" — they still must beat the same admission queue + as fixed-length PUTs at the per-batch level. +2. **Pay-as-you-decode** thereafter, charged via an + `awsChunkedReader` progress callback. Each decoded chunk frame + (typically up to 64 KiB on the wire after framing overhead) + acquires a slot equal to the bytes about to flow into Pebble; the + slot is released once the corresponding `coordinator.Dispatch` + acks. This is the same path admission B uses for fixed-length + PUTs — chunked traffic just hooks into it incrementally instead of + pre-charging the worst case. + +Failure modes: + +- If the awsChunkedReader produces frames faster than Raft drains, the + per-batch acquire blocks (capped by `dispatchAdmissionTimeout`). + Beyond that timeout, mid-stream 503 closes the connection. The + legacy "reserve 5 GiB" approach would have surfaced as 503 *at + request entry* for unrelated PUTs; this approach surfaces as + mid-stream 503 for the chunked PUT itself, which is the right + blame attribution. +- If the awsChunkedReader frame size ever exceeds + `s3RaftEntryByteBudget` (a malformed client), the per-batch acquire + asks for more than the cap allows and we 503 immediately — same + as a fixed-length PUT whose `Content-Length` exceeds the global + cap. 
+ +This change moves chunked admission from M4 (originally "deferred +optimisation") into M1 (the first shippable milestone). M1 ships +with the progress-callback wired *unconditionally* for all chunked +PUTs; an env-var switch falls back to "bootstrap-only" charging +without the per-decode credit if a corner case requires it +(`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, default +`true`). The fallback path keeps the 5 GiB-reservation hazard +behind an explicit operator decision rather than letting it +materialise by default. ### 3.4 Failure mode @@ -239,10 +281,37 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). | Milestone | Scope | Risk | |---|---|---| -| M1 | Add `putAdmission` type + per-node singleton + `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. Metric scaffolding only (gauge + counter). | Low. No mid-stream cancellation yet. | -| M2 | Per-batch admission B inside `flushBatch`. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. | Medium. Cleanup path on partial failure. | +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. | | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. 
|
-| M4 | aws-chunked path: emit a warn log when the conservative 5 GiB reservation kicks in, and add a follow-up to charge the actually-decoded byte count incrementally. | Medium. Needs `awsChunkedReader` to expose a progress callback. |
+| M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. |
+
+### Rolling upgrade
+
+Admission is purely additive on the request entry path: a node
+without the cap behaves identically to a node with the cap set
+infinitely high. A mixed cluster (some nodes M1, some still on
+`main`) is therefore safe — clients hitting the upgraded node see
+admission, clients hitting an old node see no admission, but
+neither path corrupts state. The default cap is intentionally
+generous enough that even single-node M1 traffic falls below the
+threshold under typical load, so the rollout signature is
+"503 SlowDown rate goes from 0 to negligible" rather than a step
+function. Operators can pin
+`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES=$(((1<<63)-1))`
+(max int64; plain `$((1<<63))` overflows to a negative value in
+64-bit shell arithmetic) to disable the cap on M1 nodes during
+the burn-in window if desired.
+
+The aws-chunked progress-callback path is the only behaviour
+change visible to clients: a chunked PUT that would have succeeded
+under the old "no admission" code can now 503 mid-stream when Raft
+drain falls behind. This is by design — that is the failure mode
+admission control exists to surface — but operators should expect
+to see chunked-upload 503s where there were none before. The
+`stage="perbatch", protocol="chunked"` rejection-counter label
+isolates this signal; bumping the cap or
+`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false` (with the
+HoL hazard re-introduced as a known trade-off) restores legacy
+behaviour during incident response.
Acceptance criteria: diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 52106f3a1..e8f0096af 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -133,7 +133,8 @@ client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk 2. write chunk to LOCAL Pebble at !s3|chunkblob| - 3. queue ChunkRef into pendingBatch + 3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers *** + 4. queue ChunkRef into pendingBatch ─► flushBatch: coordinator.Dispatch(OperationGroup{ Elems: [ chunkref Puts ... + manifest Put ], @@ -141,17 +142,47 @@ client ─► HTTP PUT body ─► HTTP 200 OK once Dispatch acks ``` -The order matters: the chunk *must* be persisted in local Pebble -before its `chunkref` is committed through Raft, otherwise a -follower that applies the `chunkref` and immediately re-fetches -might race the leader's local write. Pebble's WAL fsync gives the -ordering guarantee; the chunk write returns only after the data is -durable. +Step 3 — synchronous chunkblob replication before the chunkref +commit — is the difference between "Raft-equivalent durability" +and "leader-only durability." Without it, a leader crash between +the chunkref commit and the eventual async fetch would leave a +committed manifest pointing at a chunkblob nobody else has — Raft's +quorum guarantees the chunkref but tells you nothing about the +blob payload. We close that gap by treating the chunkblob like a +mini-Raft entry of its own with **semi-synchronous quorum**: + +1. Leader writes the chunkblob to local Pebble (fsync). +2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1` + followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for + each follower's "fsync ack." (`PushChunkBlob` is the leader- + initiated counterpart to the follower-initiated `FetchChunkBlob` + defined in §3.6.) +3. 
Only after the chunkblob is durable on a quorum of nodes does + the leader propose the chunkref through Raft. + +`chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a +quorum of 2 includes the leader and one follower). For larger +clusters the floor is `(N/2)+1` to match Raft's quorum size; the +operator can opt into N for stronger-than-Raft durability. A +follower that crashes after acking the push but before the chunkref +commits is fine — the chunkref will be retried by the leader on the +next attempt because it has not yet entered the Raft log. + +The trade-off is PUT latency: a PUT now blocks on +`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft +quorum write of the chunkref. Empirically the chunkblob fsync is +the dominant cost (1 MiB write, ~5–10 ms on consumer SSD), so PUT +p99 is roughly equivalent to today's "every byte through Raft" +latency — we are paying the same fsync cost, just to a different +keyspace. The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024 of them = ~100 KiB of Raft log payload. Compared with today's 1 GiB through Raft, that is a **10⁴× reduction** in Raft log -write amplification. +write amplification — even with semi-synchronous chunkblob +replication, the Raft log itself is unaffected by chunk size, so +log replay time, snapshot transfer, and follower catch-up still +collapse to O(manifest count). ### 3.3 Follower apply path @@ -218,13 +249,24 @@ at the same conclusion. ### 3.6 Fetch protocol -A new gRPC method on the existing internal raft transport service: +Two RPCs on the existing internal raft transport service — one +follower-initiated (lazy fetch on miss), one leader-initiated +(synchronous replication before chunkref commit, see §3.2 step 3): ```protobuf service S3BlobFetch { // FetchChunkBlob returns the bytes of a chunkblob this peer holds - // locally. Caller must verify SHA-256. + // locally. Caller must verify SHA-256. 
Used by followers on the + // proxy-on-miss GET path (§3.3) and during snapshot catch-up. rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); + + // PushChunkBlob streams a chunkblob from the leader to a follower + // and acks once the bytes are durable in the receiver's Pebble. + // Used by §3.2 step 3 to make chunkblob writes survive a leader + // crash without depending on the async fetch path catching up. + // The receiver SHOULD verify SHA-256 against the request header + // before fsync; mismatch fails the RPC and the leader retries. + rpc PushChunkBlob(stream PushChunkBlobRequest) returns (PushChunkBlobResponse); } message FetchChunkBlobRequest { @@ -235,6 +277,16 @@ message FetchChunkBlobResponse { bytes payload = 1; bool eof = 2; } + +message PushChunkBlobRequest { + bytes content_sha256 = 1; // sent in the first frame only + bytes payload = 2; + bool eof = 3; +} + +message PushChunkBlobResponse { + bool durable = 1; // true == fsynced +} ``` Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The @@ -250,15 +302,68 @@ handles large payloads. The legacy `BlobKey` path remains available. New PUTs use the offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data keeps reading through the legacy `BlobKey` path until a background -migrator (separate proposal) rewrites it. Mixed clusters work -because both keyspaces are namespaced. +migrator (separate proposal) rewrites it. Mixed keyspace coexistence +works because `!s3|chunkblob|*` and the legacy +`!s3|blob|||...` namespaces are disjoint. The opt-in flag stays for at least one full release cycle so we can revert by flipping a single env var if any of the following surface: - a SHA-256 collision (~zero probability but a hard kill criterion), - a follower fetch storm overwhelming peer-to-peer bandwidth, -- a GC bug that leaks reachable blobs. 
+- a GC bug that leaks reachable blobs, +- semi-synchronous `PushChunkBlob` latency exceeding the legacy + PUT p99 by an unacceptable margin (the soak-test acceptance + criterion in §5). + +### 3.8 Mixed-version cluster behaviour + +Until *every* node in the cluster speaks the offload protocol, PUTs +on the offload path cannot proceed safely: a node that does not +implement `PushChunkBlob` cannot ack a quorum write, and a follower +that does not implement `FetchChunkBlob` cannot resolve a chunkref +on apply. We therefore gate the offload path on cluster-level +feature negotiation rather than a single env var: + +- A node advertises offload capability by setting + `feature_s3_blob_offload=true` in the `AdminServer.GetClusterOverview` + response (alongside the existing role / version metadata). +- The leader inspects every peer's advertised capabilities at PUT + admission time. If any peer is missing the capability, the PUT + falls back to the legacy `BlobKey` path for that request — even + if the leader has the env var enabled. +- During an upgrade window the leader continues to emit legacy + writes; once the last peer rolls and re-advertises, subsequent + PUTs flip to offload automatically. A roll-back works the same + way in reverse: the first downgraded peer drops its capability + flag and the leader resumes legacy emission within the next + capability-refresh interval (default 30 s). +- Reads always succeed regardless of mixed state, because both + keyspaces are namespaced and the GET path checks legacy then + offload (or vice-versa) and serves whichever resolves. + +This gives operators a **strict two-step rolling upgrade** with no +PUT data path that depends on a half-upgraded cluster: + +1. Roll out the new binary with `ELASTICKV_S3_BLOB_OFFLOAD=false` + on every node. PUTs continue on the legacy path. Validate + stability for a soak window (24 h on the canary cluster in + §5's M0 acceptance criteria). +2. 
Flip `ELASTICKV_S3_BLOB_OFFLOAD=true` on the leader, then on + followers. Once every node advertises capability, PUTs switch + to the offload path. + +Roll-back: flip the env var to `false` on any node; the leader's +capability check sees the disagreement and falls back to legacy +within ≤ refresh interval. The migrator (M4) is independently +gated and never runs during a roll-back window. + +A node with `ELASTICKV_S3_BLOB_OFFLOAD=true` running against a +cluster where offload is disabled (e.g. a stuck rollout) is safe +— it advertises capability but the leader's per-PUT capability +check sees other peers missing it and routes through legacy. No +data is written into the offload keyspace until a quorum of +capability-advertising peers exists. ## 4. Interaction with related subsystems @@ -316,12 +421,17 @@ Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD= - **SHA-256 collision.** Operationally improbable; shipped with a metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option for paranoid operators. -- **Local-blob-only on a single node.** If only one node has a - given chunkblob and that node fails before peers fetch it, data - is lost. Mitigation: PUT path replicates the chunkblob to *N* - peers asynchronously before returning 200 (e.g. quorum write - outside of Raft). N defaults to 2 (one extra copy = parity with - Raft's quorum durability). +- **Leader-only durability before chunkref commit.** Without + intervention, a leader crash between writing the chunkblob to its + own Pebble and the eventual async fetch on followers would leave + a Raft-committed chunkref pointing at a chunkblob no surviving + node has. Mitigation: §3.2 step 3 — synchronous semi-quorum + replication via `PushChunkBlob` before the chunkref enters Raft. + `chunkBlobMinReplicas` defaults to a Raft-quorum-equivalent + floor; operators who want N-way durability bump it explicitly. 
+ This restores end-to-end durability parity with the legacy + "every byte through Raft" path at the cost of one extra fsync + per chunkblob on the followers in the quorum. - **Cross-tenant blob fetch via SHA-256 guessing.** Because `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant could (a) guess a victim tenant's chunk SHA and (b) bypass From 7e594099d80b5a745a1331f7353e31e87f8fc0b4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:15:24 +0900 Subject: [PATCH 4/8] docs(design): address Claude bot review (2 HIGH, 4 MEDIUM, 4 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_admission_control.md — HIGH 1: stale §6 entry "aws-chunked overcharge persists until M4" contradicted §3.3.1 / §5 milestone table which now ship pay-as-you-decode in M1. Replace with the actual residual M1 risk: a chunked PUT that previously succeeded silently can now 503 mid-stream when Raft drain falls behind. s3_admission_control.md — MEDIUM 1: §3.3 A+B semaphore semantics were ambiguous. Pin the contract: A is a check-only headroom gate (`peekHeadroom`, no acquire); B is the actual acquire/release on the semaphore. The bytes-in-flight bound is enforced by B alone; A is a fast-fail at request entry. Includes the explicit "if bytes > capacity * unitSize, fail immediately without waiting" rule for malformed oversized frames (LOW 2 from the same review). s3_admission_control.md — MEDIUM 2: §4 retry-budget release was underspecified. Fix the contract to hold-through-retry: per-batch slot is released exactly once on the *final* outcome of the retry chain, never between attempts. The bytes are still in pendingBatch across retries, so the budget must reflect them; release-between- retries would let a second PUT proceed while the first is still memory-resident. 
s3_admission_control.md — LOW 1: dispatchAdmissionTimeout comment referenced the wrong drain baseline (33 s = full MaxInflight queue, not the 256 MiB cap which drains in ~2 s). Rewrite to make clear 30 s is the budget for a transiently stalled follower to recover, not normal drain time. s3_raft_blob_offload.md — HIGH 2: §3.5 GC grace window was unimplementable as written — a plain RC counter at zero carries no "became eligible at T" signal so the sweeper cannot enforce the 1-hour grace. Switch to a queue-based scheme: `!s3|chunkblob-gc-queue||` written in the same Raft txn that drives RC to zero. The sweeper scans the queue range older than the grace window, re-checks RC at read time, and deletes via Raft so concurrent sweepers serialise on write-write-conflict. s3_raft_blob_offload.md — MEDIUM 3: M1 milestone shipped proxy-on-miss but the FetchChunkBlob RPC was scheduled in M2, so M1 GET could only serve local-hit or 503. Move both FetchChunkBlob and PushChunkBlob into M1 (PushChunkBlob is already needed by §3.2 quorum write); M2 narrows to the async catch-up worker pool that does not exist on the request goroutine. s3_raft_blob_offload.md — MEDIUM 4: §3.3 had no explicit GET-vs-GC race section. Add §3.3.1 specifying that FetchChunkBlob NOT_FOUND on a peer mid-GC is a normal fallback trigger (cycle through remaining peers in randomised order) and that a fanout-wide NOT_FOUND on an SHA whose chunkref is still present is the genuine durability failure that surfaces 500 + metric. s3_raft_blob_offload.md — LOW 3: SourcePeer was implicitly a hint but never said so. Document explicitly that SourcePeer is best- effort, callers MUST handle NOT_FOUND from FetchChunkBlob on that peer as a normal fallback to fanout, and that treating it as a hard error would pin clients on a momentarily-bad source. s3_raft_blob_offload.md — LOW 4: chunkBlobMinReplicas behaviour during cluster shrink / partial outage was undefined. 
Specify the degradation ladder: ≥ minReplicas → normal; < minReplicas but Raft quorum intact → degrade to "as many as available, never < 2" with a metric; < 2 reachable peers → 503 (leader-only durability is the regression this design exists to prevent). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 118 +++++++++++---- ...026_04_25_proposed_s3_raft_blob_offload.md | 140 +++++++++++++++--- 2 files changed, 209 insertions(+), 49 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 4f9be3587..a98492706 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -133,11 +133,15 @@ const ( // memtables. s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB // dispatchAdmissionTimeout is how long a per-batch flush will wait - // for a slot before giving up. Sized comfortably above the Raft - // in-flight queue's drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s - // ≈ 33 s) so a transient stall does not surface as 503; longer - // stalls indicate the cluster is genuinely overloaded and rejection - // is the right signal. + // for a slot before giving up. The 256 MiB cap drains in ~2 s at + // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), + // so 30 s is *not* sized against normal drain — it is the budget + // for a transiently stalled follower (GC pause, slow disk fsync, + // bounded leader re-election) to recover before we conclude the + // cluster is genuinely overloaded. Longer stalls surface as 503, + // which is the right signal: at that point the right action is + // operator intervention (scale out, investigate the stall), not + // continued accumulation. 
dispatchAdmissionTimeout = 30 * time.Second ) ``` @@ -162,25 +166,54 @@ type putAdmission struct { func newPutAdmission(maxBytes int64) *putAdmission { … } -// reserve charges (bytes) units against the semaphore. Returns a -// release closure that the caller MUST call exactly once. Returns -// ErrAdmissionExhausted if the deadline elapses before the budget -// is available. -func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) { … } +// peekHeadroom is admission A. It returns ErrAdmissionExhausted +// without acquiring slots when the requested byte count exceeds +// the *current* free capacity of the semaphore. It does NOT take +// out a reservation — the only effect is "fail fast at request +// entry instead of partway through the body" — so it cannot +// double-count against admission B. +func (a *putAdmission) peekHeadroom(bytes int64) error { … } + +// acquire is admission B. It blocks until (bytes / s3ChunkSize) +// slots are available or ctx fires. The returned release closure +// MUST be called exactly once. If `bytes > capacity * s3ChunkSize` +// (a malformed client whose frame exceeds the entire budget), +// returns ErrAdmissionExhausted *immediately* without waiting — +// otherwise we would block until ctx (typically +// dispatchAdmissionTimeout) for a request that can never fit. +func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) { … } ``` -`reserve` is non-trivial because we need partial-charge semantics: - -1. Pre-charge `Content-Length` total at request entry (admission A). - The release fires in a deferred handler on the request goroutine - regardless of success / failure. -2. Per-batch sub-lease (admission B) is *not* a separate budget — it's - a synchronisation primitive on the existing semaphore. The PUT - handler acquires `s3ChunkSize × s3ChunkBatchOps = 4 MiB` units - before reading the next 4 MiB window from the body and releases - them on `coordinator.Dispatch` ack. 
This way the same budget covers - both the "request is in flight" and "this batch is buffered for - Raft" phases. +The two-step contract avoids the double-charge / unbounded-window +hazard the obvious "A is also an acquire" design would have: + +1. **Admission A — request-entry headroom check (peek only).** When + a PUT arrives with `Content-Length: N`, the handler calls + `peekHeadroom(N)`. If the result is `ErrAdmissionExhausted`, + reply 503 immediately and never read from the body. If `nil`, + admission A is done — no slots have been taken out of the + semaphore. This is intentionally racy with concurrent PUTs: + admission A only promises "at the moment we asked, the budget + *would have fit* this request"; it does not reserve the budget. +2. **Admission B — per-batch acquire/release (the only path that + touches the semaphore).** The PUT handler then loops the body + in `s3ChunkSize × s3ChunkBatchOps = 4 MiB` windows; each window + acquires 4 MiB worth of slots via `acquire`, reads the next + window from the body, dispatches, and releases on Dispatch + ack. This is the bound that actually holds — at any instant the + sum of held slots across all in-flight PUTs cannot exceed the + semaphore's capacity. If admission A's racy estimate turns out + to be wrong (another PUT raced in between A's check and the + first B-acquire), the first B-acquire blocks until the + contending PUT releases or `dispatchAdmissionTimeout` fires. + +The semaphore is therefore charged **only by admission B**. Bytes +in flight = `held_B_slots × s3ChunkSize`, full stop; admission A is +a fast-fail gate, not a reservation. This is the model +implementations MUST follow — a "both A and B charge" design would +double-count every PUT against itself and an admission-A-only +design would lose its bound the moment a PUT's body exceeded its +declared `Content-Length` (chunked transfers, malformed clients). 
### 3.3.1 aws-chunked transfers (`Content-Length: -1`) @@ -272,10 +305,24 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). - **Workload isolation proposal.** That doc proposes per-class CPU reservation for Raft. Admission control is the memory-axis sibling. Both are needed — limiting CPU does not bound queue depth. -- **`coordinator.Dispatch` retries.** Today the S3 path has its own - retry loop (`s3TxnRetryMaxAttempts = 8`). Admission must release - its budget around retries, otherwise a long retry chain double- - counts. +- **`coordinator.Dispatch` retries.** Today the S3 path has its + own retry loop (`s3TxnRetryMaxAttempts = 8` with exponential + backoff capped at `s3TxnRetryMaxBackoff = 32 ms`). The admission + contract is **hold-through-retry**: the per-batch slot acquired + in admission B is released exactly once, on the *final* outcome + of the retry chain (success ack, terminal error, or + `dispatchAdmissionTimeout` expiring), not between attempts. + Rationale: the bytes are still buffered in the PUT handler's + pendingBatch slice for the entire retry window, so the budget + must reflect them; a release-between-retries scheme would let a + second PUT proceed while the first is still memory-resident, + breaking the bound. The total wall-clock cost of holding through + one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`; + if that ever exceeds `dispatchAdmissionTimeout` the per-batch + acquire on the *next* batch surfaces as 503, which is the right + failure mode (chronic dispatch failure → caller learns instead of + silently consuming the budget). ## 5. Implementation plan @@ -333,10 +380,21 @@ Acceptance criteria: - **Operator confusion.** "Why does S3 return 503 when CPU is at 20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After` value so SDK behaviour is predictable. 
-- **aws-chunked overcharge.** A small chunked PUT charges 5 GiB - against the budget until M4 lands. We accept this temporarily - because aws-chunked traffic is rare in practice and the - conservative cap fails safe. +- **New chunked-PUT 503 surface.** Pay-as-you-decode admission + (§3.3.1) ships in M1 alongside fixed-length admission, so the + legacy 5 GiB pre-charge hazard does not materialise as a + steady-state risk. The residual risk it introduces is the + inverse: a chunked PUT that would have silently succeeded under + the no-admission code can now 503 mid-stream when Raft drain + falls behind. This is by design — that is the failure mode + admission control exists to surface — but it is the only + client-visible behaviour change in M1 and is what operators + should expect to see in dashboards. The + `stage="perbatch", protocol="chunked"` label on the rejection + counter (§3.5) isolates the signal; the operator escape hatch + is `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, which + reverts to bootstrap-only charging at the cost of + re-introducing the 5 GiB head-of-line hazard. ## 7. Out of scope (future work) diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index e8f0096af..ed9e1eb6e 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -168,6 +168,35 @@ follower that crashes after acking the push but before the chunkref commits is fine — the chunkref will be retried by the leader on the next attempt because it has not yet entered the Raft log. +**Behaviour during cluster shrink / partial outage.** A controlled +decommission (5 → 3 nodes) or a transient partition can leave the +leader with fewer reachable peers than `(N/2)+1` configured at the +last membership commit. 
Blocking PUTs until the configured minimum +is reachable would surface as an indefinite hang, which is worse +than the legacy "every byte through Raft" path (which fails when +Raft itself loses quorum, but otherwise succeeds). The contract is +therefore degraded availability with a hard floor: + +- If reachable peers ≥ `chunkBlobMinReplicas`, normal path: ack + after the configured minimum. +- If reachable peers < `chunkBlobMinReplicas` but ≥ `floor(N/2)+1` + (Raft quorum is intact), degrade to "as many as currently + available, but never fewer than 2 — i.e. leader + at least one + follower." Emit `s3_chunkblob_replication_degraded_total` so + operators see the degradation. This matches Raft's own behaviour + during the same window: Raft would still commit at quorum, just + with one fewer redundant ack. +- If reachable peers < 2 (leader-only durability), fail the PUT + with 503 — single-node durability is what `BlobKey`-on-Raft + already loses on leader crash, and is the regression this design + exists to prevent. + +The floor of 2 is the strict invariant: leader-only writes are the +case the design refuses to accept regardless of operator +configuration. Tuning `chunkBlobMinReplicas` higher trades PUT +availability for stronger durability; tuning lower than 2 is +rejected at config-load. + The trade-off is PUT latency: a PUT now blocks on `chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft quorum write of the chunkref. Empirically the chunkblob fsync is @@ -195,17 +224,54 @@ When a follower's apply loop sees a `chunkref` Put: `!s3|chunkblob|` once the body arrives and verifies the SHA-256 (mismatch → drop and retry from another peer). +`SourcePeer` is a **best-effort hint** captured at write time +(§3.1). It is *not* authoritative — the recorded peer may have +crashed, restarted, evicted the blob via §3.5 GC, or simply lost +the local copy to disk failure. 
Callers MUST treat +`FetchChunkBlob → NOT_FOUND` from `SourcePeer` as a normal fallback +trigger, not an error: drop to fanout against the rest of the +peer set and accept the first peer that returns the bytes (with +SHA-256 verification at the receiving end). Treating +`NOT_FOUND` as a hard failure would make a single peer's GC tick +pin clients on a bad source. + GET / range-read on the follower checks the local `chunkblob` first; if absent (because the async fetch is still pending), it either: -- proxies the read to a peer that *does* have the chunk (using the - `SourcePeer` hint and falling back to a fanout), or +- proxies the read to a peer that holds the chunk (using the + `SourcePeer` hint, falling back to fanout on `NOT_FOUND`), or - replies 503 with `Retry-After`, identical to S3's behaviour during a region failover. The choice is per-deployment; Phase-1 ships proxy-on-miss. +#### 3.3.1 GET vs. GC delete race + +Even with the §3.5 grace window, a GET that proxies to peer X for +a chunkblob can lose a race against peer X's sweeper if the +sweeper deletes the local copy *between* the GET arriving on the +caller's node and the proxy RPC reaching peer X. The blob remains +reachable globally (other peers still hold it; the chunkref is +unchanged), so the right behaviour is for the caller to fall back +to fanout. We therefore mandate: + +- `FetchChunkBlob` on a peer whose local sweeper just removed the + blob returns `NOT_FOUND`, **not** an internal error. +- The caller's GET handler treats both `NOT_FOUND` and + `INVALID_ARGUMENT (sha mismatch)` as fallback triggers and + cycles through the remaining peers in randomised order. +- If the entire peer set returns `NOT_FOUND` for an SHA whose + `chunkref` is still present, that is a genuine durability + failure (every replica including the leader's GC raced); the + GET surfaces 500 *and* the read path emits a + `s3_chunkblob_unrecoverable_total` metric so operators detect + the underlying GC bug. 
With the §3.2 quorum-write durability + and the §3.5 grace window, this case requires a coincident + failure across a quorum of nodes within a 1-hour window — vastly + more unlikely than the per-peer 404 the fanout absorbs as a + matter of course. + ### 3.4 Snapshot Today a follower snapshot dump includes every `BlobKey` Pebble has @@ -230,22 +296,58 @@ amortised across reads." ### 3.5 Garbage collection A blob whose `chunkref` has been deleted (DELETE, lifecycle policy, -object version pruned, manifest aborted) is reclaimable. We handle -this two-staged: +object version pruned, manifest aborted) is reclaimable. The +sweeper needs to know not just *that* the RC reached zero but +*when*, otherwise the documented grace window is unimplementable +(a plain counter at zero carries no time signal). We make the +"became eligible at T" fact a first-class Raft entry: -1. Reference counting via a `!s3|chunkref-rc|` counter +1. **Reference counting** via `!s3|chunkref-rc|`, a counter updated inside the same Raft txn that adds / removes a - `chunkref`. RC == 0 marks the blob as eligible. -2. A node-local sweeper periodically (e.g. every 5 minutes) - scans `!s3|chunkblob|*` and deletes blobs whose RC is 0 and - whose RC has been 0 for at least `chunkBlobGCGracePeriod` - (proposed default 1 hour). The grace window covers in-flight - reads and avoids deleting a blob a peer is just about to fetch. - -Sweeper deletion is local — it does not pass through Raft, because -the authoritative state ("is this blob unreachable?") is the -already-committed RC. Followers run independent sweepers and arrive -at the same conclusion. + `chunkref`. The atomic `(chunkref change, RC update)` pair is + the linearisation point for "this blob is now / no longer + reachable." +2. **GC eligibility queue.** When the txn that decrements an RC + would drive it to zero, the *same* txn additionally writes + `!s3|chunkblob-gc-queue|<commitTS>|<sha>` → empty.
The + commitTS-prefixed key is the time signal: the queue is + naturally sorted by eligibility-start time, and any node can + determine the grace-period boundary by scanning the queue with + `endKey = !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|`. If a + subsequent txn re-references the same SHA before the sweeper + runs (e.g. an upload reuses a content hash), that txn deletes + the queue entry as part of incrementing the RC; the queue + therefore reflects "currently RC==0" rather than "ever was zero." +3. **Node-local sweeper.** Each node runs an independent sweeper + every `chunkBlobGCInterval` (proposed default 5 minutes) that: + a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|)` + for entries whose grace window has elapsed, + b. for each `<sha>` returned, re-checks the RC counter at the + sweeper's read timestamp; if the RC is still 0, deletes the + local `!s3|chunkblob|` AND deletes the queue entry — + both as a single Raft txn so the queue stays consistent with + the keyspace, + c. if the RC has bounced above 0 in the meantime, the queue + entry is stale (a re-reference txn forgot to remove it, or + the sweeper raced) and the sweeper deletes only the queue + entry, leaving the chunkblob in place. + +The queue is the authoritative "blob is GC-eligible since T" +signal; the RC is the authoritative "is reachable" signal. Both +are Raft-replicated, so every node arrives at the same set of +sweepable SHAs and the same eligibility window. Different nodes +running sweepers concurrently is safe because step (3b) commits +through Raft and the txn fails with a write-write conflict on the +second sweeper, leaving the first sweeper's deletion authoritative. + +`chunkBlobGCGracePeriod` defaults to 1 hour.
The grace window +absorbs in-flight reads (a peer that has already started fetching +the blob completes its fetch before the sweeper runs), in-flight +upload aborts (a multipart abort flips RCs to zero, then a retry +creates a new manifest with the same chunks and bumps them back), +and clock skew between nodes (we use the Raft `commitTS` from the +RC-update txn, not wall clock, so skew is bounded to the +HLC-physical-shift the cluster already tolerates). ### 3.6 Fetch protocol @@ -392,9 +494,9 @@ capability-advertising peers exists. | Milestone | Scope | Risk | |---|---|---| | M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | -| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. GET path checks chunkblob locally with proxy-on-miss. | Medium (race ordering). | -| M2 | Follower fetch protocol + async fetch worker pool. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | -| M3 | Reference-count + grace-period GC. | Medium (correctness of RC under concurrent ops). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. **`FetchChunkBlob` and `PushChunkBlob` RPCs ship in this milestone** because both M1 PUT (semi-synchronous push) and M1 GET (proxy-on-miss) depend on them — without them M1 GET could only serve local-hit or 503. | Medium (race ordering, RPC plumbing on the request goroutine). | +| M2 | Async fetch worker pool for follower apply (catch-up after a long absence). Independent of M1's synchronous `FetchChunkBlob` use on the GET path. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC (the queue-based scheme in §3.5). | Medium (correctness of RC under concurrent ops). 
| | M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): From 747506002296d0ef1da04c595945af754ac3715e Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:25:22 +0900 Subject: [PATCH 5/8] docs(design): address Claude bot round-4 review (1 HIGH, 3 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the local chunkblob AND the queue entry as a single Raft txn" — but chunkblob is local-only Pebble per §3.1 and the queue is Raft-replicated, so the two ops live in different storage layers and cannot share a Raft txn. Rewrite the step to specify a two-phase ordering: i. Raft phase first: delete the queue entry through Raft. Concurrent sweepers serialise here on write-write-conflict; the queue is therefore the global "we are GC-ing this SHA" lock. ii. Local phase second: delete the local chunkblob from Pebble. Document the failure mode of the inverse ordering (local-first would orphan the queue entry on crash) and of the chosen ordering (crash between phases leaves a local space leak — bounded, no correctness consequence — recoverable by a periodic orphan scan). s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph said "Both are Raft-replicated" referring to the queue and RC. That phrasing implied the chunkblob deletes were Raft-replicated too. Rewrite to explicitly distinguish: queue + RC are Raft-replicated; local chunkblob deletes are deliberately node-local because that is the whole point of the architecture. s3_admission_control.md — LOW: §4 retry-budget bound formula referenced `redisDispatchTimeout`, a Redis-path constant copy- pasted into the S3 design. 
The S3 PUT path actually uses the inbound `*http.Request` context (no S3-specific Dispatch timeout), so the formula now reads `single_dispatch_budget` with an explicit note that the upper bound is whatever the request context allows at that moment. s3_admission_control.md — LOW: §3.5 metrics spec defined only `stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade subsection both reference a `stage="perbatch", protocol="chunked"` label combination for isolating chunked-PUT rejection events. Add the `protocol="fixed-length" | "chunked"` label dimension to `elastickv_s3_put_admission_rejections_total` and `elastickv_s3_put_admission_wait_seconds`, with a brief paragraph explaining why the split is operationally meaningful (chunked HoL events vs. fixed-length client-concurrency events). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 37 ++++++++++---- ...026_04_25_proposed_s3_raft_blob_offload.md | 48 ++++++++++++++----- 2 files changed, 65 insertions(+), 20 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index a98492706..53a482e76 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -283,10 +283,23 @@ materialise by default. ``` elastickv_s3_put_admission_inflight_bytes gauge -elastickv_s3_put_admission_rejections_total counter (label: stage="prereserve" | "perbatch") -elastickv_s3_put_admission_wait_seconds histogram (label: stage) +elastickv_s3_put_admission_rejections_total counter (labels: + stage = "prereserve" | "perbatch", + protocol = "fixed-length" | "chunked") +elastickv_s3_put_admission_wait_seconds histogram (labels: stage, protocol) ``` +The `protocol` label distinguishes fixed-length PUTs (those with a +declared `Content-Length`, hitting admission A's `peekHeadroom`) +from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode). 
+This split is what makes the chunked-PUT 503 surface (§6) and the +rolling-upgrade alerting story actionable: a spike on +`stage="perbatch", protocol="chunked"` points at "chunked clients +beat Raft drain"; a spike on `stage="prereserve", +protocol="fixed-length"` points at "client concurrency exceeds +the per-node aggregate cap." Without the dimension the two +failure modes are indistinguishable in a single counter. + Grafana panel: inflight gauge with the cap as a horizontal line so the operator sees how often the system saturates. Rejection rate suggests bumping the cap or scaling out (more nodes spreads PUT load). @@ -316,13 +329,19 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). pendingBatch slice for the entire retry window, so the budget must reflect them; a release-between-retries scheme would let a second PUT proceed while the first is still memory-resident, - breaking the bound. The total wall-clock cost of holding through - one full retry chain is bounded by - `s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`; - if that ever exceeds `dispatchAdmissionTimeout` the per-batch - acquire on the *next* batch surfaces as 503, which is the right - failure mode (chronic dispatch failure → caller learns instead of - silently consuming the budget). + breaking the bound. The S3 PUT path uses the inbound + `*http.Request` context for `coordinator.Dispatch` (no + S3-specific Dispatch timeout — the HTTP server's + `writeTimeout` / client-side cancellation is the upper bound on + one Dispatch attempt), so the wall-clock cost of holding the + slot through one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)` + where `single_dispatch_budget` is whatever the request context + permits at that moment. 
If the retry chain duration ever + exceeds `dispatchAdmissionTimeout` the per-batch acquire on the + *next* batch surfaces as 503 — the right failure mode + (chronic dispatch failure → caller learns instead of silently + consuming the budget). ## 5. Implementation plan diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index ed9e1eb6e..beb82e42c 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -323,22 +323,48 @@ sweeper needs to know not just *that* the RC reached zero but a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue||)` for entries whose grace window has elapsed, b. for each `` returned, re-checks the RC counter at the - sweeper's read timestamp; if the RC is still 0, deletes the - local `!s3|chunkblob|` AND deletes the queue entry — - both as a single Raft txn so the queue stays consistent with - the keyspace, + sweeper's read timestamp. + + *The deletion is two-phase across two storage layers and is + NOT a single transaction* — `!s3|chunkblob|` is local + Pebble (per §3.1, never written through Raft), while + `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases + MUST run in this order: + + i. **Raft phase first.** Delete the queue entry through a + Raft txn. Concurrent sweepers serialise here on + write-write-conflict; only the winner proceeds to the + local phase. This makes the queue the global "we are + GC-ing this SHA" lock. + ii. **Local phase second.** Delete the local + `!s3|chunkblob|` from Pebble. No Raft round-trip. + + The phase ordering is the load-bearing detail. If we did + local-first then Raft, a crash between the two phases would + leave the chunkblob gone locally but the queue entry still + present — every subsequent sweep would re-attempt the local + delete (no-op) and the queue entry would never get removed + until manual intervention. 
The Raft-first ordering trades + that for the inverse failure mode: a crash between the two + phases leaves the queue entry deleted but the local + chunkblob still on disk — a **bounded local space leak, + not a correctness bug**. A periodic "orphan scan" (chunkblob + keys whose SHA has RC=0 *and* no queue entry) reclaims + these without urgency. The orphan scan runs at low priority + out of band from the sweeper. c. if the RC has bounced above 0 in the meantime, the queue entry is stale (a re-reference txn forgot to remove it, or the sweeper raced) and the sweeper deletes only the queue - entry, leaving the chunkblob in place. + entry through a Raft txn, leaving the chunkblob in place. The queue is the authoritative "blob is GC-eligible since T" -signal; the RC is the authoritative "is reachable" signal. Both -are Raft-replicated, so every node arrives at the same set of -sweepable SHAs and the same eligibility window. Different nodes -running sweepers concurrently is safe because step (3b) commits -through Raft and the txn fails with a write-write conflict on the -second sweeper, leaving the first sweeper's deletion authoritative. +signal *and* the global "we are GC-ing this SHA" lock — its +Raft-replicated single-writer-per-key property is what makes +concurrent sweepers safe across nodes. The RC is the +authoritative "is reachable" signal, also Raft-replicated. Local +chunkblob deletes are deliberately *not* replicated: each node +deletes its own copy independently after the queue-entry txn +commits, because that's the whole point of the architecture. `chunkBlobGCGracePeriod` defaults to 1 hour. 
The grace window absorbs in-flight reads (a peer that has already started fetching From fabc81e40612359841a6aba50a1f1d8bc81693ca Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:36:19 +0900 Subject: [PATCH 6/8] docs(design): address Claude bot round-5 review (2 MEDIUM, 2 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation" was ambiguous between peek and acquire. Pin it as a peek (`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition, matching admission A's contract) and rename the heading to "Bootstrap headroom check." Document why it must be a peek (an acquire would multiply per-request slot hold by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit with no corresponding payload, reintroducing the head-of-line hazard the design exists to prevent). s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB" was incoherent with the §3.3 semaphore's 1 MiB slot unit (a channel-backed semaphore can't acquire fractional slots). Clarify that the awsChunkedReader progress callback **buffers decoded bytes until a full s3ChunkSize is accumulated, then calls acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer flushes via one final acquire rounded up to one slot. Also adds `s3RaftEntryByteBudget` to §3.2's constant block (it was used throughout §3.3.1 but never defined) with a comment showing the derivation (s3ChunkSize × s3ChunkBatchOps). s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3 clusters. On a 5-node cluster Raft tolerates 2 simultaneous failures for the chunkref but the degraded chunkblob path (leader + 1 follower) tolerates only 1. 
Add an explicit note acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N` for operators who need the legacy "blob durability == Raft durability" guarantee, and clarify that the default `(N/2)+1` is sized for "match Raft quorum" not "match Raft fault tolerance" — a distinction that is invisible at N=3 and material at N≥5. s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify that the queue-entry delete is **conditional** on (a) the entry existing and (b) the RC counter still being 0 at the txn's read timestamp. An unconditional delete would silently succeed on a queue entry that a re-reference txn has just removed, then proceed to phase (3b.ii) and local-delete a chunkblob whose RC has bounced back to 1 — a correctness bug, not just a space leak. The conditional form is what makes the sweeper safe against the re-reference race. No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 82 +++++++++++++------ ...026_04_25_proposed_s3_raft_blob_offload.md | 48 +++++++++-- 2 files changed, 102 insertions(+), 28 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 53a482e76..c5c937b7d 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -132,6 +132,18 @@ const ( // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble // memtables. s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // s3RaftEntryByteBudget is the per-batch unit acquired and + // released against the semaphore. It must equal the byte + // budget of one Raft entry produced by PutObject / + // UploadPart's flush loop (PR #636: s3ChunkSize × + // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing + // overhead — kept abstract here so the admission contract + // does not lock the entry-size choice). 
The semaphore's + // capacity is `s3PutAdmissionMaxInflightBytes / + // s3ChunkSize` 1 MiB-units; per-batch acquire takes + // `s3RaftEntryByteBudget / s3ChunkSize` units at a time + // (= 4 units on the default tunables). + s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps // dispatchAdmissionTimeout is how long a per-batch flush will wait // for a slot before giving up. The 256 MiB cap drains in ~2 s at // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), @@ -225,34 +237,58 @@ chunked stream finishes — exactly the failure mode admission control exists to prevent. We therefore split chunked admission across two mechanisms instead of pre-charging: -1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at - request entry. This is enough to admit the request and let the - awsChunkedReader produce its first decoded window. Chunked PUTs - are not "free" — they still must beat the same admission queue - as fixed-length PUTs at the per-batch level. +1. **Bootstrap headroom check** at request entry. Calls + `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A + contract: a fast-fail check that 4 MiB *would have fit* at the + moment we asked. **No slot is acquired.** This is intentionally + racy with concurrent PUTs (same as fixed-length admission A); + its job is to fail at request entry rather than partway + through the first decoded frame. Chunked PUTs are not "free" + — they still must beat the same admission queue as fixed-length + PUTs at the per-frame level. 2. **Pay-as-you-decode** thereafter, charged via an - `awsChunkedReader` progress callback. Each decoded chunk frame - (typically up to 64 KiB on the wire after framing overhead) - acquires a slot equal to the bytes about to flow into Pebble; the - slot is released once the corresponding `coordinator.Dispatch` - acks. This is the same path admission B uses for fixed-length - PUTs — chunked traffic just hooks into it incrementally instead of - pre-charging the worst case. 
+ `awsChunkedReader` progress callback. The callback **buffers + decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is + accumulated**, then calls `acquire(s3ChunkSize)` on the + semaphore (same path as fixed-length admission B). This keeps + the slot unit coherent: the semaphore's capacity is + `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so + acquiring at sub-MiB granularity is not representable. The + slot is released once the corresponding + `coordinator.Dispatch` acks the chunk. The buffer never holds + more than `s3ChunkSize - 1` decoded bytes, so the worst-case + memory overhead beyond the semaphore-tracked bytes is bounded + by 1 MiB per concurrent chunked PUT. Failure modes: -- If the awsChunkedReader produces frames faster than Raft drains, the - per-batch acquire blocks (capped by `dispatchAdmissionTimeout`). - Beyond that timeout, mid-stream 503 closes the connection. The - legacy "reserve 5 GiB" approach would have surfaced as 503 *at - request entry* for unrelated PUTs; this approach surfaces as - mid-stream 503 for the chunked PUT itself, which is the right - blame attribution. +- If the awsChunkedReader produces decoded bytes faster than Raft + drains, the next 1 MiB acquire blocks (capped by + `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503 + closes the connection. The legacy "reserve 5 GiB" approach + would have surfaced as 503 *at request entry* for unrelated + PUTs; this approach surfaces as mid-stream 503 for the chunked + PUT itself, which is the right blame attribution. +- The bootstrap check at step 1 is racy: another PUT can consume + the headroom between the check and the first per-frame acquire. + When that happens the first acquire blocks (or 503s on + timeout) — the same path the fixed-length admission B handles + for the contending case. 
The race is intentional: making the + check a real reservation would multiply per-request slot hold + by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit + with no corresponding payload, reintroducing a head-of-line + hazard. +- If the awsChunkedReader produces a single frame whose decoded + size never accumulates to a full `s3ChunkSize`, the buffer + flushes on stream EOF: a final `acquire(actual_buffered_bytes)` + rounded up to one slot is taken (semaphore charges in 1-slot + units regardless of actual byte count), so the bound holds. - If the awsChunkedReader frame size ever exceeds - `s3RaftEntryByteBudget` (a malformed client), the per-batch acquire - asks for more than the cap allows and we 503 immediately — same - as a fixed-length PUT whose `Content-Length` exceeds the global - cap. + `s3RaftEntryByteBudget` (a malformed client whose decoded + cumulative output between framing acks already exceeds 4 MiB), + the first per-frame acquire asks for more than the cap allows + and we 503 immediately — same as a fixed-length PUT whose + `Content-Length` exceeds the global cap. This change moves chunked admission from M4 (originally "deferred optimisation") into M1 (the first shippable milestone). M1 ships diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index beb82e42c..270ed51af 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -197,6 +197,29 @@ configuration. Tuning `chunkBlobMinReplicas` higher trades PUT availability for stronger durability; tuning lower than 2 is rejected at config-load. +**Important durability note for N > 3 clusters.** On a 3-node +cluster the degraded floor of 2 chunkblob copies happens to match +Raft's quorum-of-2, so a single node failure is tolerated by both +the chunkref *and* the chunkblob. 
For N > 3 this is no longer +true: a 5-node cluster has Raft quorum 3 and tolerates 2 +simultaneous failures for the chunkref, but the degraded +chunkblob path (leader + 1 follower) tolerates only 1. If the +leader and the chunkblob-holding follower both fail during the +degraded window, the surviving Raft quorum elects a new leader, +finds a committed chunkref, and discovers that no surviving node +holds the chunkblob — the chunkref is durable but the object data +is lost. This is **weaker than the legacy "every byte through +Raft" path**, which loses data only when Raft itself loses quorum +(3 simultaneous failures on N=5). Operators on N > 3 clusters who +need the legacy "blob durability == Raft durability" guarantee +should configure `chunkBlobMinReplicas = N` (full replication; +trades some PUT availability — any single peer outage stalls +PUTs — for the strongest durability the cluster can offer). +The default `(N/2)+1` is sized for "match Raft quorum," not "match +Raft fault tolerance"; this distinction is invisible at N=3 but +material at N≥5 and is what makes this configuration knob +operationally meaningful. + The trade-off is PUT latency: a PUT now blocks on `chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft quorum write of the chunkref. Empirically the chunkblob fsync is @@ -331,13 +354,28 @@ sweeper needs to know not just *that* the RC reached zero but `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases MUST run in this order: - i. **Raft phase first.** Delete the queue entry through a - Raft txn. Concurrent sweepers serialise here on - write-write-conflict; only the winner proceeds to the - local phase. This makes the queue the global "we are - GC-ing this SHA" lock. + i. **Raft phase first — conditional delete.** Delete the + queue entry through a Raft txn that is **conditional on + the queue entry existing AND the RC counter still being + 0 at the txn's read timestamp**. 
The conditional form is + load-bearing: if a re-reference txn has committed + between the sweeper's queue scan and this txn (driving + RC back to 1 and atomically removing the queue entry — + see §3.1's atomic invariant), the conditional delete + fails and the sweeper aborts before reaching the local + phase. An *unconditional* delete would silently succeed + on the now-absent queue entry and let the sweeper + proceed to local-delete a chunkblob that is currently + live (RC=1) — a **correctness bug, not just a space + leak**. Concurrent sweepers also serialise on this txn + (write-write-conflict on the queue key); only the winner + proceeds. ii. **Local phase second.** Delete the local `!s3|chunkblob|` from Pebble. No Raft round-trip. + Reaching this phase implies (i) succeeded, which + implies the RC was 0 at the txn read timestamp and + remained 0 throughout the txn's commit window — i.e. + the blob is genuinely unreachable. The phase ordering is the load-bearing detail. If we did local-first then Raft, a crash between the two phases would From 1fb39ba1b31feb70ca0203770cf94f33f8a8f6bb Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:47:10 +0900 Subject: [PATCH 7/8] docs(design): address Claude bot round-5 review (1 MEDIUM, 4 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim ("PUT p99 ≈ legacy") was load-bearing but the local-write/peer-push pipeline was unspecified. Sequential ordering would silently double the per-chunk latency (`chunkBlobMinReplicas × fsync_latency`). Pin the contract: local Pebble write and PushChunkBlob fan out **concurrently**; multiple followers' pushes are **fanned out in parallel**, not sequentially. Update the flow diagram to show the pipeline explicitly and call out that this is part of the contract, not an optional optimization. 
s3_admission_control.md — LOW: §3.3.1 malformed-client failure mode said "we 503 immediately" but the accumulation design (`acquire(s3ChunkSize)` only) means the immediate-503 path is never reachable on the chunked side — the per-frame acquire is always 1 MiB, well under the 256 MiB cap. Reword to specify the actual path: successive 1 MiB acquires block under Raft pressure and the PUT eventually surfaces 503 on `dispatchAdmissionTimeout`. The "immediate 503 for oversized request" path is fixed-length only. s3_admission_control.md — LOW: §5 milestone table had M2 saying "Add `dispatchAdmissionTimeout`" but M1 already ships the chunked per-frame admission B path which is gated on it. Move the constant into M1; M2 narrows to "add fixed-length per-batch admission B + cleanup," with chunked already using the path from M1. s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3 phrasing "synchronously replicate to ≥ chunkBlobMinReplicas peers" was inconsistent with the prose's "chunkBlobMinReplicas - 1 followers." Resolved as part of the MEDIUM rewrite — the diagram now reads "PushChunkBlob to chunkBlobMinReplicas-1 followers" with parallel fan-out, matching the prose count. s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as the recovery path for "sweeper crash between Phase i and ii," but it implicitly also covers a more common scenario — chunkblobs written to local Pebble by §3.2 step 2 when the PUT then fails before `coordinator.Dispatch` is called (admission 503, client disconnect, PushChunkBlob quorum failure). In that case neither RC nor GC queue entry exists, so the sweeper never sees the orphan; only the orphan scan does. Make this case explicit so the PUT-handler abort path can rely on the orphan scan rather than needing its own best-effort local delete. No code changes; design docs only. 
--- ...026_04_25_proposed_s3_admission_control.md | 22 +++--- ...026_04_25_proposed_s3_raft_blob_offload.md | 68 +++++++++++++++---- 2 files changed, 69 insertions(+), 21 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index c5c937b7d..54bfb9dbe 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -283,12 +283,18 @@ Failure modes: flushes on stream EOF: a final `acquire(actual_buffered_bytes)` rounded up to one slot is taken (semaphore charges in 1-slot units regardless of actual byte count), so the bound holds. -- If the awsChunkedReader frame size ever exceeds - `s3RaftEntryByteBudget` (a malformed client whose decoded - cumulative output between framing acks already exceeds 4 MiB), - the first per-frame acquire asks for more than the cap allows - and we 503 immediately — same as a fixed-length PUT whose - `Content-Length` exceeds the global cap. +- A malformed client that decodes bytes faster than Raft drains + *cannot* trigger the immediate-503 path the way a fixed-length + PUT can. The accumulation design (callback always calls + `acquire(s3ChunkSize)`, never larger) means the per-frame + acquire request is bounded by 1 MiB — the + "if `bytes > capacity * s3ChunkSize`" early-return in + `acquire`'s spec is never hit on the chunked path. Instead, + successive 1 MiB acquires block under Raft pressure and the + PUT eventually surfaces 503 on `dispatchAdmissionTimeout` — + the same path a slow follower triggers. The "immediate 503 for + oversized request" failure mode applies only to fixed-length + PUTs (via `peekHeadroom(Content-Length > 256 MiB)`). This change moves chunked admission from M4 (originally "deferred optimisation") into M1 (the first shippable milestone). M1 ships @@ -383,8 +389,8 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). 
| Milestone | Scope | Risk | |---|---|---| -| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | -| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. | +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. | | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | | M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. 
| diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 270ed51af..90c7657a0 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -133,8 +133,15 @@ client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk 2. write chunk to LOCAL Pebble at !s3|chunkblob| - 3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers *** - 4. queue ChunkRef into pendingBatch + (fsync) ─┐ pipelined: bytes also stream out + 3. PushChunkBlob to ─┤ to chunkBlobMinReplicas-1 followers + followers in parallel (one RPC per follower; bytes + start flowing the moment the leader has them — not + after local fsync completes) + 4. wait until BOTH local fsync AND a quorum of follower + fsync-acks have returned (= chunkBlobMinReplicas + durable copies including the leader) + 5. queue ChunkRef into pendingBatch ─► flushBatch: coordinator.Dispatch(OperationGroup{ Elems: [ chunkref Puts ... + manifest Put ], @@ -151,13 +158,27 @@ quorum guarantees the chunkref but tells you nothing about the blob payload. We close that gap by treating the chunkblob like a mini-Raft entry of its own with **semi-synchronous quorum**: -1. Leader writes the chunkblob to local Pebble (fsync). -2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1` - followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for - each follower's "fsync ack." (`PushChunkBlob` is the leader- - initiated counterpart to the follower-initiated `FetchChunkBlob` - defined in §3.6.) -3. Only after the chunkblob is durable on a quorum of nodes does +1. Leader starts writing the chunkblob to local Pebble (fsync in + flight). +2. **Concurrently** with step 1, the leader streams the chunkblob + to `chunkBlobMinReplicas - 1` followers via parallel + `S3BlobFetch.PushChunkBlob` RPCs. 
Pushes are **fanned out in + parallel**, not sequential — each follower's RPC is started + immediately, and the leader waits on a quorum of fsync-acks + rather than serially blocking on each one. (`PushChunkBlob` is + the leader-initiated counterpart to the follower-initiated + `FetchChunkBlob` defined in §3.6.) +3. The leader waits for *both* the local fsync AND a quorum of + follower fsync-acks. The dominant cost is therefore + `max(local_fsync, slowest_quorum_follower_fsync)` — typically + ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write. + This is what makes the p99 latency claim below load-bearing: + if step 1 and step 2 were *sequential* (write local → then + push to followers → then wait), per-chunk latency would be + `chunkBlobMinReplicas × fsync_latency` and silently double the + PUT p99 vs. the legacy path. The pipelined / parallel model + is part of the contract, not an optimization. +4. Only after the chunkblob is durable on a quorum of nodes does the leader propose the chunkref through Raft. `chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a @@ -386,10 +407,31 @@ sweeper needs to know not just *that* the RC reached zero but that for the inverse failure mode: a crash between the two phases leaves the queue entry deleted but the local chunkblob still on disk — a **bounded local space leak, - not a correctness bug**. A periodic "orphan scan" (chunkblob - keys whose SHA has RC=0 *and* no queue entry) reclaims - these without urgency. The orphan scan runs at low priority - out of band from the sweeper. + not a correctness bug**. A periodic "orphan scan" reclaims + these. + + The orphan scan covers two distinct sources of orphans: + + - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the + case described above; queue entry was removed via Raft + but the local Pebble delete never fired. 
+ - **PUT failure before chunkref Dispatch** — chunkblob + bytes were written to local Pebble in §3.2 step 2, then + the PUT aborted before reaching `coordinator.Dispatch` + (admission control 503, client disconnect, `PushChunkBlob` + quorum failure, request context cancel). In that + scenario neither an RC entry nor a GC queue entry was + ever written, so the sweeper's queue-range scan never + sees these orphans — only the orphan scan does. + + Detection criterion (covers both): `!s3|chunkblob|` + keys whose SHA has either no RC entry at all, or RC=0 with + no corresponding queue entry. The orphan scan runs at low + priority out of band from the sweeper (proposed default + `chunkBlobOrphanScanInterval = 1 hour`); it is the safety + net behind both the sweeper crash path and the PUT-abort + cleanup path, so the PUT handler does not need its own + best-effort local-delete on the abort path. c. if the RC has bounced above 0 in the meantime, the queue entry is stale (a re-reference txn forgot to remove it, or the sweeper raced) and the sweeper deletes only the queue From aa418cef4985342f5ec22a9fa51db7f4f8bbc865 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:55:50 +0900 Subject: [PATCH 8/8] docs(design): address CodeRabbit review (typo + code-fence language tags) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit round 1 on PR #637: - s3_raft_blob_offload.md:358 typo "elegibility" → "eligibility" (single character). - Add `text` language identifier to the six previously-bare fenced code blocks (the §1 worst-case formula in admission_control, the §3.3 client/admission flow diagram in admission_control, the §3.5 metrics block in admission_control, the §1 Raft-log flow in blob_offload, the §3.1 keyspace block in blob_offload, and the §3.2 PUT pipeline diagram in blob_offload). 
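The §3.5 orphan-detection criterion added in the hunk above ("no RC entry at all, or RC=0 with no corresponding queue entry") is mechanical enough to pin down in a few lines. A hedged Go sketch, with in-memory maps standing in for the `!s3|chunkref`, `!s3|chunkblob`, and `!s3|chunkblob-gc-queue` keyspaces and every identifier invented for illustration:

```go
package main

// state is a toy stand-in for the three keyspaces the scan consults.
type state struct {
	rc      map[string]int  // sha -> refcount (absent = PUT aborted pre-Dispatch)
	gcQueue map[string]bool // sha -> a gc-queue entry exists
	blobs   []string        // SHAs present under local !s3|chunkblob|
}

// orphans returns the chunkblob SHAs the sweeper can never reach:
// either the PUT aborted before coordinator.Dispatch (no RC entry at
// all), or the RC hit 0 but the queue entry is already gone (sweeper
// crash between the Raft phase and the local delete).
func orphans(s state) []string {
	var out []string
	for _, sha := range s.blobs {
		n, hasRC := s.rc[sha]
		if !hasRC || (n == 0 && !s.gcQueue[sha]) {
			out = append(out, sha)
		}
	}
	return out
}
```

A blob with RC=0 that still has its queue entry is deliberately excluded: that one belongs to the sweeper, and deleting it here would race the grace-period logic.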
Markdownlint flags bare fences; tagging them enables both linting compliance and consistent syntax highlighting in the rendered docs. No code changes; design docs only. --- docs/design/2026_04_25_proposed_s3_admission_control.md | 6 +++--- docs/design/2026_04_25_proposed_s3_raft_blob_offload.md | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 54bfb9dbe..a153f1697 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -22,7 +22,7 @@ with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is governed by client concurrency, which the server cannot currently limit: -``` +```text leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size ≈ concurrent_PUTs × MaxInflight × 4 MiB ``` @@ -105,7 +105,7 @@ Recommended: **both, at different scales**. fails with 503 mid-stream. Dispatch latency stays bounded by the semaphore size rather than by client behaviour. -``` +```text client ─[Content-Length]─► (A) reserve full body bytes │ ▼ @@ -323,7 +323,7 @@ materialise by default. ### 3.5 Metrics -``` +```text elastickv_s3_put_admission_inflight_bytes gauge elastickv_s3_put_admission_rejections_total counter (labels: stage = "prereserve" | "perbatch", diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 90c7657a0..3d02154a2 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -20,7 +20,7 @@ Today every byte of an S3 object travels through the Raft log: -``` +```text HTTP PUT body ─► s3ChunkSize (1 MiB) chunks ─► s3ChunkBatchOps × 1 MiB Raft entry ─► Raft log entry (Pebble WAL on every node) @@ -102,7 +102,7 @@ themselves*. 
### 3.1 New keyspace -``` +```text !s3|chunkref|||||| → ChunkRef{ ContentSHA256 [32]byte @@ -128,7 +128,7 @@ Two separate keyspaces: ### 3.2 PUT path -``` +```text client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk @@ -355,7 +355,7 @@ sweeper needs to know not just *that* the RC reached zero but would drive it to zero, the *same* txn additionally writes `!s3|chunkblob-gc-queue||` → empty. The commitTS-prefixed key is the time signal: the queue is - naturally sorted by elegibility-start time, and any node can + naturally sorted by eligibility-start time, and any node can determine the grace-period boundary by scanning the queue with `endKey = !s3|chunkblob-gc-queue||`. If a subsequent txn re-references the same SHA before the sweeper