diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md new file mode 100644 index 000000000..a153f1697 --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -0,0 +1,468 @@ +# S3 PUT admission control + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the workload-class +> isolation proposal (`docs/design/2026_04_24_proposed_workload_isolation.md`). +> Where PR #636 fixes the *per-entry* memory accounting and the +> isolation doc keeps Raft's CPU budget separate from heavy command +> paths, this doc bounds the *aggregate* in-flight memory that S3 +> PUT body bytes can pin in the leader's Raft pipeline regardless of +> how many clients are uploading at once. + +--- + +## 1. Problem + +Even with PR #636 reducing the per-entry size to 4 MiB and aligning +with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is +governed by client concurrency, which the server cannot currently +limit: + +```text +leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size + ≈ concurrent_PUTs × MaxInflight × 4 MiB +``` + +A single PUT's pipeline is already capped by `MaxInflight = 1024` Raft +in-flight messages times 4 MiB per entry = 4 GiB / peer (the bound PR +#593 advertises). What multiplies that bound is concurrent uploads: + +- 4 simultaneous 5 GiB PUTs on a 4-vCPU 8 GiB-RAM node can pin + `4 × min(MaxInflight × 4 MiB, body_remaining)` on the leader. If a + follower stalls (GC pause, slow disk fsync) for several seconds, that + worst case is realised before backpressure kicks in. +- The 2026-04-24 incident showed that one workload class can wedge the + Go runtime for the entire cluster. S3 PUT body bytes pinned in the + Raft pipeline play the same role as XREAD blobs in the Redis path: + they grow with client behaviour, not with anything the operator can + rate-limit per-handler. +- `GOMEMLIMIT = 1800 MiB` plus `--memory = 2500m` (PR #617) set the + upper bound on heap, but they do not provide *flow control*. The Go + GC starts thrashing well before the limit, and memwatch (PR #612) + triggers graceful shutdown — a recovery mechanism, not a steady-state + control. + +There is no hard ceiling today on the aggregate body bytes that S3 +puts onto the Raft pipeline. PR #636 makes the per-entry slot +predictable; this proposal makes the *number of slots in flight* +predictable too. + +## 2. Goals & non-goals + +**Goals** + +- Hard cap on the total S3 PUT body bytes that have been read from + HTTP but not yet committed to Raft. This cap is a tunable property + of the S3 server, not of any single PUT. +- Steady-state behaviour where new PUTs are admitted at the rate Raft + can drain. Bursting clients see HTTP 503 with `Retry-After`, not + silent ballooning of leader memory. +- Composable with existing memwatch + GOMEMLIMIT. Admission failure + must be visible (metric + log line per rejected request) so an + operator can size the cap against a given memory budget. +- Read path is unaffected. GET / HEAD / LIST do not consume the same + budget. + +**Non-goals** + +- No global QPS quota across all S3 verbs. Other verbs (GET, multipart + initiate / complete) have well-bounded memory and don't need this. +- No per-bucket / per-key fairness. Fairness across tenants is a + separate workload-class concern (handled by the workload isolation + proposal). 
This doc only fixes the aggregate ceiling. +- No backpressure-via-TCP (`SO_RCVBUF` shrink). Decoupling from kernel + buffers keeps the budget testable and avoids cross-OS surprises. +- No multi-region / multi-cluster admission. One leader's perspective. + +## 3. Design + +### 3.1 Where the budget is checked + +Two natural insertion points exist on the PUT path: + +| Insertion point | Granularity | When the request is rejected | +|---|---|---| +| (A) Before `prepareStreamingPutBody` accepts the first byte | Whole-object | Pre-Raft, before any reads from the body | +| (B) Inside the `flushBatch` loop, before `coordinator.Dispatch` | Per-batch | Mid-stream, after `s3ChunkBatchOps` chunks are buffered | + +Recommended: **both, at different scales**. + +- **(A) is the request-admission cap.** Use the `Content-Length` + header (rejected when absent for PUTs that are not aws-chunked) to + pre-charge the budget. If the budget would be exceeded we reply 503 + immediately. This matches how AWS S3 itself handles + `SlowDown` / `ServiceUnavailable` and is the cheapest case to + surface — the body is never read from the socket. +- **(B) is the in-flight cap.** Even after a request is admitted, its + 4 MiB batches can pile up if Raft becomes slow. The flush loop + acquires a per-batch lease (4 MiB) from a shared semaphore *before* + reading the next chunk window. If the semaphore is empty for longer + than `dispatchAdmissionTimeout` (proposed default 30 s), the PUT + fails with 503 mid-stream. Dispatch latency stays bounded by the + semaphore size rather than by client behaviour. + +```text +client ─[Content-Length]─► (A) reserve full body bytes + │ + ▼ + prepareStreamingPutBody + │ + ▼ + (B) acquire 4 MiB slot ◄┐ (released on Dispatch ack) + │ + ▼ + coordinator.Dispatch ───┘ +``` + +### 3.2 Concrete cap values + +Default cap: + +```go +const ( + // s3PutAdmissionMaxInflightBytes is the hard ceiling on S3 PUT body + // bytes accepted by this node but not yet committed to Raft. Picked + // so that, on a 3-node cluster with MaxInflight × MaxSizePerMsg = + // 4 GiB / peer per PR #593, all in-flight PUT bytes plus their Raft + // replication shadow stay under one quarter of GOMEMLIMIT (1800 MiB + // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble + // memtables. + s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // s3RaftEntryByteBudget is the per-batch unit acquired and + // released against the semaphore. It must equal the byte + // budget of one Raft entry produced by PutObject / + // UploadPart's flush loop (PR #636: s3ChunkSize × + // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing + // overhead — kept abstract here so the admission contract + // does not lock the entry-size choice). The semaphore's + // capacity is `s3PutAdmissionMaxInflightBytes / + // s3ChunkSize` 1 MiB-units; per-batch acquire takes + // `s3RaftEntryByteBudget / s3ChunkSize` units at a time + // (= 4 units on the default tunables). + s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps + // dispatchAdmissionTimeout is how long a per-batch flush will wait + // for a slot before giving up. The 256 MiB cap drains in ~2 s at + // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), + // so 30 s is *not* sized against normal drain — it is the budget + // for a transiently stalled follower (GC pause, slow disk fsync, + // bounded leader re-election) to recover before we conclude the + // cluster is genuinely overloaded. 
Longer stalls surface as 503, + // which is the right signal: at that point the right action is + // operator intervention (scale out, investigate the stall), not + // continued accumulation. + dispatchAdmissionTimeout = 30 * time.Second +) +``` + +The cap is intentionally **per-node** rather than cluster-wide: +admission is enforced on the node receiving the HTTP request, which is +also the node whose memory is at risk. Clients hitting a different +node simply get a different budget. + +Both values are exposed as env vars (`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES`, +`ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT`) with the constants as +defaults, following the pattern PR #593 established for +`ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` etc. + +### 3.3 Data structure + +```go +type putAdmission struct { + semaphore chan struct{} // capacity == max / s3ChunkSize, charges in 1 MiB units + inflight atomic.Int64 // metric mirror; semaphore is the source of truth +} + +func newPutAdmission(maxBytes int64) *putAdmission { … } + +// peekHeadroom is admission A. It returns ErrAdmissionExhausted +// without acquiring slots when the requested byte count exceeds +// the *current* free capacity of the semaphore. It does NOT take +// out a reservation — the only effect is "fail fast at request +// entry instead of partway through the body" — so it cannot +// double-count against admission B. +func (a *putAdmission) peekHeadroom(bytes int64) error { … } + +// acquire is admission B. It blocks until (bytes / s3ChunkSize) +// slots are available or ctx fires. The returned release closure +// MUST be called exactly once. If `bytes > capacity * s3ChunkSize` +// (a malformed client whose frame exceeds the entire budget), +// returns ErrAdmissionExhausted *immediately* without waiting — +// otherwise we would block until ctx (typically +// dispatchAdmissionTimeout) for a request that can never fit. +func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) { … } +``` + +The two-step contract avoids the double-charge / unbounded-window +hazard the obvious "A is also an acquire" design would have: + +1. **Admission A — request-entry headroom check (peek only).** When + a PUT arrives with `Content-Length: N`, the handler calls + `peekHeadroom(N)`. If the result is `ErrAdmissionExhausted`, + reply 503 immediately and never read from the body. If `nil`, + admission A is done — no slots have been taken out of the + semaphore. This is intentionally racy with concurrent PUTs: + admission A only promises "at the moment we asked, the budget + *would have fit* this request"; it does not reserve the budget. +2. **Admission B — per-batch acquire/release (the only path that + touches the semaphore).** The PUT handler then loops the body + in `s3ChunkSize × s3ChunkBatchOps = 4 MiB` windows; each window + acquires 4 MiB worth of slots via `acquire`, reads the next + window from the body, dispatches, and releases on Dispatch + ack. This is the bound that actually holds — at any instant the + sum of held slots across all in-flight PUTs cannot exceed the + semaphore's capacity. If admission A's racy estimate turns out + to be wrong (another PUT raced in between A's check and the + first B-acquire), the first B-acquire blocks until the + contending PUT releases or `dispatchAdmissionTimeout` fires. + +The semaphore is therefore charged **only by admission B**. Bytes +in flight = `held_B_slots × s3ChunkSize`, full stop; admission A is +a fast-fail gate, not a reservation. 
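+
+A minimal sketch of how the stubbed bodies above could satisfy this
+contract (illustrative only — the slot rounding, the error value, and
+the `sync.Once` release guard are choices of this sketch, not part of
+the proposal; assumes imports `context`, `errors`, `sync`):
+
+```go
+var ErrAdmissionExhausted = errors.New("s3 put admission: inflight byte budget exhausted")
+
+func newPutAdmission(maxBytes int64) *putAdmission {
+	return &putAdmission{semaphore: make(chan struct{}, int(maxBytes/s3ChunkSize))}
+}
+
+// peekHeadroom is admission A: a racy snapshot comparison, never a
+// reservation (step 1 above).
+func (a *putAdmission) peekHeadroom(bytes int64) error {
+	need := (bytes + s3ChunkSize - 1) / s3ChunkSize // round up to whole slots
+	if need > int64(cap(a.semaphore)-len(a.semaphore)) {
+		return ErrAdmissionExhausted
+	}
+	return nil
+}
+
+// acquire is admission B. It charges slot-by-slot; if ctx fires
+// (typically dispatchAdmissionTimeout) the partial hold is rolled
+// back so a timed-out PUT never strands budget.
+func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) {
+	need := (bytes + s3ChunkSize - 1) / s3ChunkSize
+	if need > int64(cap(a.semaphore)) {
+		return nil, ErrAdmissionExhausted // can never fit: fail fast, never wait
+	}
+	for taken := int64(0); taken < need; taken++ {
+		select {
+		case a.semaphore <- struct{}{}:
+		case <-ctx.Done():
+			for ; taken > 0; taken-- { // roll back the partial hold
+				<-a.semaphore
+			}
+			return nil, ctx.Err()
+		}
+	}
+	a.inflight.Add(need * s3ChunkSize)
+	var once sync.Once // enforces the release-exactly-once contract
+	return func() {
+		once.Do(func() {
+			for i := int64(0); i < need; i++ {
+				<-a.semaphore
+			}
+			a.inflight.Add(-need * s3ChunkSize)
+		})
+	}, nil
+}
+```
+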
This is the model +implementations MUST follow — a "both A and B charge" design would +double-count every PUT against itself and an admission-A-only +design would lose its bound the moment a PUT's body exceeded its +declared `Content-Length` (chunked transfers, malformed clients). + +### 3.3.1 aws-chunked transfers (`Content-Length: -1`) + +A naïve "reserve `s3MaxObjectSizeBytes` (5 GiB) up front" is rejected: +on default tunables (`s3PutAdmissionMaxInflightBytes = 256 MiB`) a +single chunked PUT would consume **20×** the entire budget at request +admission time, head-of-line-blocking every other PUT until the +chunked stream finishes — exactly the failure mode admission control +exists to prevent. We therefore split chunked admission across two +mechanisms instead of pre-charging: + +1. **Bootstrap headroom check** at request entry. Calls + `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A + contract: a fast-fail check that 4 MiB *would have fit* at the + moment we asked. **No slot is acquired.** This is intentionally + racy with concurrent PUTs (same as fixed-length admission A); + its job is to fail at request entry rather than partway + through the first decoded frame. Chunked PUTs are not "free" + — they still must beat the same admission queue as fixed-length + PUTs at the per-frame level. +2. **Pay-as-you-decode** thereafter, charged via an + `awsChunkedReader` progress callback. The callback **buffers + decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is + accumulated**, then calls `acquire(s3ChunkSize)` on the + semaphore (same path as fixed-length admission B). This keeps + the slot unit coherent: the semaphore's capacity is + `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so + acquiring at sub-MiB granularity is not representable. The + slot is released once the corresponding + `coordinator.Dispatch` acks the chunk. The buffer never holds + more than `s3ChunkSize - 1` decoded bytes, so the worst-case + memory overhead beyond the semaphore-tracked bytes is bounded + by 1 MiB per concurrent chunked PUT. + +Failure modes: + +- If the awsChunkedReader produces decoded bytes faster than Raft + drains, the next 1 MiB acquire blocks (capped by + `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503 + closes the connection. The legacy "reserve 5 GiB" approach + would have surfaced as 503 *at request entry* for unrelated + PUTs; this approach surfaces as mid-stream 503 for the chunked + PUT itself, which is the right blame attribution. +- The bootstrap check at step 1 is racy: another PUT can consume + the headroom between the check and the first per-frame acquire. + When that happens the first acquire blocks (or 503s on + timeout) — the same path the fixed-length admission B handles + for the contending case. The race is intentional: making the + check a real reservation would multiply per-request slot hold + by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit + with no corresponding payload, reintroducing a head-of-line + hazard. +- If the awsChunkedReader produces a single frame whose decoded + size never accumulates to a full `s3ChunkSize`, the buffer + flushes on stream EOF: a final `acquire(actual_buffered_bytes)` + rounded up to one slot is taken (semaphore charges in 1-slot + units regardless of actual byte count), so the bound holds. +- A malformed client that decodes bytes faster than Raft drains + *cannot* trigger the immediate-503 path the way a fixed-length + PUT can. 
The accumulation design (callback always calls
+  `acquire(s3ChunkSize)`, never larger) means the per-frame
+  acquire request is bounded by 1 MiB — the
+  "if `bytes > capacity * s3ChunkSize`" early-return in
+  `acquire`'s spec is never hit on the chunked path. Instead,
+  successive 1 MiB acquires block under Raft pressure and the
+  PUT eventually surfaces 503 on `dispatchAdmissionTimeout` —
+  the same path a slow follower triggers. The "immediate 503 for
+  oversized request" failure mode applies only to fixed-length
+  PUTs (via `peekHeadroom(Content-Length > 256 MiB)`).
+
+This change moves chunked admission from M4 (originally "deferred
+optimisation") into M1 (the first shippable milestone). M1 ships
+with the progress-callback wired *unconditionally* for all chunked
+PUTs; an env-var switch falls back to "bootstrap-only" charging
+without the per-decode credit if a corner case requires it
+(`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, default
+`true`). The fallback path keeps the 5 GiB-reservation hazard
+behind an explicit operator decision rather than letting it
+materialise by default.
+
+### 3.4 Failure mode
+
+- `503 Service Unavailable` with `Retry-After: 1` (small, jittered).
+  AWS S3 SDK clients (boto3, aws-sdk-go-v2) auto-retry this code with
+  exponential backoff out of the box.
+- Body for the response is the standard S3 XML error envelope:
+  `<Code>SlowDown</Code><Message>Reduce your request rate</Message>`.
+  `SlowDown` is the AWS-defined code for admission rejection and
+  matches what real S3 returns.
+- Mid-stream rejection (admission B) closes the connection with a
+  `Connection: close` header so partial body reads do not corrupt the
+  client's pipeline. The PUT handler also calls
+  `cleanupManifestBlobs` for any partial blobs that already landed in
+  Pebble.
+
+### 3.5 Metrics
+
+```text
+elastickv_s3_put_admission_inflight_bytes   gauge
+elastickv_s3_put_admission_rejections_total counter (labels:
+    stage = "prereserve" | "perbatch",
+    protocol = "fixed-length" | "chunked")
+elastickv_s3_put_admission_wait_seconds     histogram (labels: stage, protocol)
+```
+
+The `protocol` label distinguishes fixed-length PUTs (those with a
+declared `Content-Length`, hitting admission A's `peekHeadroom`)
+from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode).
+This split is what makes the chunked-PUT 503 surface (§6) and the
+rolling-upgrade alerting story actionable: a spike on
+`stage="perbatch", protocol="chunked"` points at "chunked clients
+beat Raft drain"; a spike on `stage="prereserve",
+protocol="fixed-length"` points at "client concurrency exceeds
+the per-node aggregate cap." Without the dimension the two
+failure modes are indistinguishable in a single counter.
+
+Grafana panel: inflight gauge with the cap as a horizontal line so
+the operator sees how often the system saturates. A sustained
+rejection rate suggests bumping the cap or scaling out (more nodes
+spread PUT load).
+
+## 4. Interaction with related subsystems
+
+- **PR #636 (entry size alignment).** The 4 MiB per-batch unit is
+  the natural admission grain. The two changes are independent:
+  alignment is necessary for the per-peer Raft bound to be correct;
+  admission is necessary for the *aggregate* bound to be hard.
+- **PR #612 (memwatch graceful shutdown).** Continues to function as
+  the last-resort safety net. 
Admission control should fire at well + below the memwatch threshold (`s3PutAdmissionMaxInflightBytes` is + ~14% of `GOMEMLIMIT`) so memwatch sees a much lower steady-state + pressure and the graceful-shutdown path stays a rare event. +- **Workload isolation proposal.** That doc proposes per-class CPU + reservation for Raft. Admission control is the memory-axis sibling. + Both are needed — limiting CPU does not bound queue depth. +- **`coordinator.Dispatch` retries.** Today the S3 path has its + own retry loop (`s3TxnRetryMaxAttempts = 8` with exponential + backoff capped at `s3TxnRetryMaxBackoff = 32 ms`). The admission + contract is **hold-through-retry**: the per-batch slot acquired + in admission B is released exactly once, on the *final* outcome + of the retry chain (success ack, terminal error, or + `dispatchAdmissionTimeout` expiring), not between attempts. + Rationale: the bytes are still buffered in the PUT handler's + pendingBatch slice for the entire retry window, so the budget + must reflect them; a release-between-retries scheme would let a + second PUT proceed while the first is still memory-resident, + breaking the bound. The S3 PUT path uses the inbound + `*http.Request` context for `coordinator.Dispatch` (no + S3-specific Dispatch timeout — the HTTP server's + `writeTimeout` / client-side cancellation is the upper bound on + one Dispatch attempt), so the wall-clock cost of holding the + slot through one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)` + where `single_dispatch_budget` is whatever the request context + permits at that moment. If the retry chain duration ever + exceeds `dispatchAdmissionTimeout` the per-batch acquire on the + *next* batch surfaces as 503 — the right failure mode + (chronic dispatch failure → caller learns instead of silently + consuming the budget). + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. | +| M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | +| M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. | + +### Rolling upgrade + +Admission is purely additive on the request entry path: a node +without the cap behaves identically to a node with the cap set +infinitely high. A mixed cluster (some nodes M1, some still on +`main`) is therefore safe — clients hitting the upgraded node see +admission, clients hitting an old node see no admission, but +neither path corrupts state. 
The default cap is intentionally
+generous enough that even single-node M1 traffic falls below the
+threshold under typical load, so the rollout signature is
+"503 SlowDown rate goes from 0 to negligible" rather than a step
+function. Operators can pin
+`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES=$(((1<<63)-1))`
+(max int64 — note that `$((1<<63))` overflows to a negative value)
+to effectively disable the cap on M1 nodes during the burn-in
+window if desired.
+
+The aws-chunked progress-callback path is the only behaviour
+change visible to clients: a chunked PUT that would have succeeded
+under the old "no admission" code can now 503 mid-stream when Raft
+drain falls behind. This is by design — that is the failure mode
+admission control exists to surface — but operators should expect
+to see chunked-upload 503s where there were none before. The
+`stage="perbatch", protocol="chunked"` rejection-counter label
+isolates this signal; bumping the cap or
+`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false` (with the
+HoL hazard re-introduced as a known trade-off) restores legacy
+behaviour during incident response.
+
+Acceptance criteria:
+
+- `go test ./adapter/ -short -run TestS3PutAdmission` covers reject /
+  admit / mid-stream-timeout paths.
+- A load test that opens 32 concurrent PUTs of 100 MiB each must
+  hold leader memory below `s3PutAdmissionMaxInflightBytes + epsilon`
+  for the duration of the test.
+- No regression in `Test_grpc_transaction` (which is currently the
+  longest leader-stress test).
+
+## 6. Risks
+
+- **Tail-latency for legitimate clients.** A long-running PUT that
+  loses a 4 MiB slot mid-stream returns 503 even though it is making
+  progress. Mitigated by `dispatchAdmissionTimeout = 30s`, well above
+  the steady-state Raft drain time. If we observe spurious 503s in
+  practice, tune the timeout down via
+  `ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT` (§3.2).
+- **Operator confusion.** "Why does S3 return 503 when CPU is at
+  20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After`
+  value so SDK behaviour is predictable.
+- **New chunked-PUT 503 surface.** Pay-as-you-decode admission
+  (§3.3.1) ships in M1 alongside fixed-length admission, so the
+  legacy 5 GiB pre-charge hazard does not materialise as a
+  steady-state risk. The residual risk it introduces is the
+  inverse: a chunked PUT that would have silently succeeded under
+  the no-admission code can now 503 mid-stream when Raft drain
+  falls behind. This is by design — that is the failure mode
+  admission control exists to surface — but it is the only
+  client-visible behaviour change in M1 and is what operators
+  should expect to see in dashboards. The
+  `stage="perbatch", protocol="chunked"` label on the rejection
+  counter (§3.5) isolates the signal; the operator escape hatch
+  is `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, which
+  reverts to bootstrap-only charging at the cost of
+  re-introducing the 5 GiB head-of-line hazard.
+
+## 7. Out of scope (future work)
+
+- Per-bucket admission classes (e.g. system buckets get their own
+  budget). Punted to the workload-isolation rollout.
+- Coordinated admission across the multi-region read replica path
+  proposed in `docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.
+- Token-bucket rate-shaping (e.g. bytes-per-second). The current
+  proposal only bounds *concurrent* bytes; rate-shaping is a separate
+  policy choice. 
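+
+## Appendix: PUT-handler admission sketch (non-normative)
+
+A compressed sketch of how the two admission stages compose on the
+PUT path, tying §3.1's insertion points to §4's hold-through-retry
+contract. `handlePut`, `readWindow`, and `writeSlowDown` are
+illustrative names for this sketch, not existing code; error
+handling and manifest cleanup are trimmed (assumes imports
+`context`, `net/http`):
+
+```go
+func (s *s3Server) handlePut(w http.ResponseWriter, r *http.Request) {
+	// Admission A: fixed-length PUTs pre-check the declared body
+	// size; aws-chunked PUTs peek one batch instead (§3.3.1 step 1).
+	peek := r.ContentLength
+	if peek < 0 {
+		peek = s3RaftEntryByteBudget
+	}
+	if err := s.admission.peekHeadroom(peek); err != nil {
+		writeSlowDown(w) // 503 SlowDown + Retry-After; body never read
+		return
+	}
+	for {
+		// Admission B: one 4 MiB lease per batch, wait bounded by
+		// dispatchAdmissionTimeout.
+		actx, cancel := context.WithTimeout(r.Context(), dispatchAdmissionTimeout)
+		release, err := s.admission.acquire(actx, s3RaftEntryByteBudget)
+		cancel()
+		if err != nil {
+			writeSlowDown(w) // mid-stream 503, Connection: close (§3.4)
+			return
+		}
+		batch, eof, rerr := readWindow(r.Body, s3RaftEntryByteBudget)
+		if rerr != nil {
+			release()
+			return
+		}
+		// Hold-through-retry (§4): the slot spans Dispatch's whole
+		// internal retry chain and is released once, on its final
+		// outcome — never between attempts.
+		derr := s.coordinator.Dispatch(r.Context(), batch)
+		release()
+		if derr != nil {
+			writeSlowDown(w)
+			return
+		}
+		if eof {
+			break
+		}
+	}
+	w.WriteHeader(http.StatusOK)
+}
+```
+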
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
new file mode 100644
index 000000000..3d02154a2
--- /dev/null
+++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
@@ -0,0 +1,678 @@
+# S3 raft blob offload — keep large object payloads out of the Raft log
+
+> **Status: Proposed**
+> Author: bootjp
+> Date: 2026-04-25
+>
+> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned
+> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the S3 PUT
+> admission-control proposal
+> (`docs/design/2026_04_25_proposed_s3_admission_control.md`).
+>
+> PR #636 caps the *per-entry* size; admission control caps the
+> *aggregate in-flight* memory; this doc removes large blob payloads
+> from the *Raft log itself* so that snapshots and follower catch-up
+> stay bounded as the data set grows.
+
+---
+
+## 1. Problem
+
+Today every byte of an S3 object travels through the Raft log:
+
+```text
+HTTP PUT body ─► s3ChunkSize (1 MiB) chunks
+              ─► s3ChunkBatchOps × 1 MiB Raft entry
+              ─► Raft log entry (Pebble WAL on every node)
+              ─► applyLoop → s3keys.BlobKey(...) → MVCCStore Put
+```
+
+`s3keys.BlobKey(bucket, generation, objectKey, uploadID, partNo, chunkNo)`
+is the key actually written to Pebble; the log entry that proposed it
+contains the full chunk value. Two consequences:
+
+1. **WAL & snapshot growth scales with object bytes.** A node that
+   serves 100 GiB of S3 PUT traffic ends up with a 100 GiB Pebble WAL
+   plus the same 100 GiB persisted in the engine's state machine.
+   Snapshot transfer for a falling-behind follower carries the
+   payload twice — once as Raft log replication (during catch-up
+   inside the WAL window) and once as a snapshot dump if the leader
+   has already truncated.
+2. **Follower catch-up after a long absence is expensive.** Right now
+   a follower that misses 5 GiB of PUT traffic re-applies it as Raft
+   log entries, each going through the full `applyRequests` →
+   `MVCCStore.Put` path on a single thread. The ApplyLoop becomes a
+   bottleneck that keeps the rest of the cluster waiting on the
+   `commitIndex` advance.
+
+S3 blob payloads are special: they are **idempotent,
+content-addressable, large, and rarely re-read inside the leader's
+apply window**. They look more like attachments than like Raft log
+records. Treating them as Raft entries is overkill — Raft only needs
+to linearise *what changed* (the manifest), not *the chunk bytes
+themselves*.
+
+## 2. Goals & non-goals
+
+**Goals**
+
+- Raft log entries for S3 PUT carry a *reference* to the chunk
+  payload, not the payload itself. Replication traffic and WAL
+  size become O(manifest size), not O(object size).
+- Followers fetch blob payloads out-of-band when they apply a
+  manifest reference. Apply order remains Raft-defined; only the
+  bytes are pulled lazily.
+- Snapshot transfer for a falling-behind follower is O(manifest
+  count + small blob index), not O(stored bytes).
+- The new path is opt-in and lives alongside the current direct-Raft
+  path until parity is proven. Existing S3 traffic is unaffected
+  during rollout.
+
+**Non-goals**
+
+- No external object store dependency (S3, MinIO, Ceph). The blob
+  offload uses Pebble itself plus a peer-to-peer fetch protocol;
+  introducing an external dependency would re-create the operational
+  surface elastickv exists to replace.
+- No *user-facing* deduplication API or storage-accounting credit. 
The `chunkblob` keyspace is content-addressed by SHA-256 (§3.1)
+  and reference-counted (§3.5), which means two distinct objects
+  whose chunks happen to hash identically will share one
+  `chunkblob` row at the storage layer — that is a structural
+  property of content addressing, not a feature we expose. We do
+  *not*: surface dedup ratios, charge storage by post-dedup bytes,
+  rebalance dedup credit across tenants, or treat dedup hits as
+  semantically observable from S3 verbs. The reference layer
+  (`chunkref`) keeps `(bucket, objectKey, uploadID, partNo,
+  chunkNo)` granular and per-object, so DELETE / lifecycle still
+  reason about objects independently. Authorisation enforcement
+  remains on `chunkref` reads, never on `chunkblob` reads
+  (§3.3 covers the proxy-on-miss path; ACL checks fire before the
+  blob fetch is initiated, so a tenant cannot dereference a peer's
+  `chunkblob` by guessing a SHA — see §6 *Cross-tenant blob fetch*).
+- No change to MVCC semantics. Manifest commits remain the
+  serialisation point; blob fetch is a side channel that does not
+  change visibility rules.
+- No removal of the existing `BlobKey` path in this doc. We will
+  ship in two stages (manifest-only writes through Raft, blob
+  payload via a side channel), and the legacy path stays available
+  until enough operational evidence accumulates.
+
+## 3. Design
+
+### 3.1 New keyspace
+
+```text
+!s3|chunkref|<bucket>|<generation>|<objectKey>|<uploadID>|<partNo>|<chunkNo>
+  → ChunkRef{
+      ContentSHA256 [32]byte
+      Size          uint64
+      // Optional: leader-locality hints. Followers without the
+      // payload locally fetch from a peer that advertises the
+      // chunk in its catalog.
+      SourcePeer    NodeID
+    }
+
+!s3|chunkblob|<contentSHA256>
+  → raw bytes (the chunk payload)
+```
+
+Two separate keyspaces:
+
+- `chunkref` is replicated through Raft. Cheap (32 B + small header
+  per chunk) and ordered against the manifest commit.
+- `chunkblob` is **not** written through Raft. It is written
+  directly to Pebble on the receiving node and pulled by peers via
+  the new fetch protocol when they apply the corresponding
+  `chunkref`.
+
+### 3.2 PUT path
+
+```text
+client ─► HTTP PUT body
+       ─► chunk loop (s3ChunkSize):
+            1. compute SHA-256 of chunk
+            2. write chunk to LOCAL Pebble at
+               !s3|chunkblob|<sha> (fsync)  ─┐ pipelined: bytes also stream out
+            3. PushChunkBlob to             ─┤ to chunkBlobMinReplicas-1 followers
+               followers in parallel        ─┘ (one RPC per follower; bytes
+               start flowing the moment the leader has them — not
+               after local fsync completes)
+            4. wait until BOTH local fsync AND a quorum of follower
+               fsync-acks have returned (= chunkBlobMinReplicas
+               durable copies including the leader)
+            5. queue ChunkRef into pendingBatch
+       ─► flushBatch:
+            coordinator.Dispatch(OperationGroup{
+              Elems: [ chunkref Puts ...
+                       manifest Put ],
+            })
+       ─► HTTP 200 OK once Dispatch acks
+```
+
+Step 3 — synchronous chunkblob replication before the chunkref
+commit — is the difference between "Raft-equivalent durability"
+and "leader-only durability." Without it, a leader crash between
+the chunkref commit and the eventual async fetch would leave a
+committed manifest pointing at a chunkblob nobody else has — Raft's
+quorum guarantees the chunkref but tells you nothing about the
+blob payload. We close that gap by treating the chunkblob like a
+mini-Raft entry of its own with **semi-synchronous quorum**:
+
+1. Leader starts writing the chunkblob to local Pebble (fsync in
+   flight).
+2. 
**Concurrently** with step 1, the leader streams the chunkblob + to `chunkBlobMinReplicas - 1` followers via parallel + `S3BlobFetch.PushChunkBlob` RPCs. Pushes are **fanned out in + parallel**, not sequential — each follower's RPC is started + immediately, and the leader waits on a quorum of fsync-acks + rather than serially blocking on each one. (`PushChunkBlob` is + the leader-initiated counterpart to the follower-initiated + `FetchChunkBlob` defined in §3.6.) +3. The leader waits for *both* the local fsync AND a quorum of + follower fsync-acks. The dominant cost is therefore + `max(local_fsync, slowest_quorum_follower_fsync)` — typically + ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write. + This is what makes the p99 latency claim below load-bearing: + if step 1 and step 2 were *sequential* (write local → then + push to followers → then wait), per-chunk latency would be + `chunkBlobMinReplicas × fsync_latency` and silently double the + PUT p99 vs. the legacy path. The pipelined / parallel model + is part of the contract, not an optimization. +4. Only after the chunkblob is durable on a quorum of nodes does + the leader propose the chunkref through Raft. + +`chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a +quorum of 2 includes the leader and one follower). For larger +clusters the floor is `(N/2)+1` to match Raft's quorum size; the +operator can opt into N for stronger-than-Raft durability. A +follower that crashes after acking the push but before the chunkref +commits is fine — the chunkref will be retried by the leader on the +next attempt because it has not yet entered the Raft log. + +**Behaviour during cluster shrink / partial outage.** A controlled +decommission (5 → 3 nodes) or a transient partition can leave the +leader with fewer reachable peers than `(N/2)+1` configured at the +last membership commit. Blocking PUTs until the configured minimum +is reachable would surface as an indefinite hang, which is worse +than the legacy "every byte through Raft" path (which fails when +Raft itself loses quorum, but otherwise succeeds). The contract is +therefore degraded availability with a hard floor: + +- If reachable peers ≥ `chunkBlobMinReplicas`, normal path: ack + after the configured minimum. +- If reachable peers < `chunkBlobMinReplicas` but ≥ `floor(N/2)+1` + (Raft quorum is intact), degrade to "as many as currently + available, but never fewer than 2 — i.e. leader + at least one + follower." Emit `s3_chunkblob_replication_degraded_total` so + operators see the degradation. This matches Raft's own behaviour + during the same window: Raft would still commit at quorum, just + with one fewer redundant ack. +- If reachable peers < 2 (leader-only durability), fail the PUT + with 503 — single-node durability is what `BlobKey`-on-Raft + already loses on leader crash, and is the regression this design + exists to prevent. + +The floor of 2 is the strict invariant: leader-only writes are the +case the design refuses to accept regardless of operator +configuration. Tuning `chunkBlobMinReplicas` higher trades PUT +availability for stronger durability; tuning lower than 2 is +rejected at config-load. + +**Important durability note for N > 3 clusters.** On a 3-node +cluster the degraded floor of 2 chunkblob copies happens to match +Raft's quorum-of-2, so a single node failure is tolerated by both +the chunkref *and* the chunkblob. 
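+
+Expressed as code, the degraded-availability contract above reduces
+to a small decision function (a sketch; the `reachableFollowers`
+counting convention and the error value are assumptions of this
+illustration — the chunkref proposal itself still needs Raft quorum,
+which Raft enforces independently):
+
+```go
+var errChunkBlobQuorumUnavailable = errors.New("chunkblob replication: fewer than 2 reachable copies")
+
+// requiredChunkBlobAcks returns how many durable chunkblob copies
+// (counting the leader's own) a PUT must collect before proposing
+// the chunkref, or an error when the PUT must fail with 503.
+func requiredChunkBlobAcks(configuredMin, reachableFollowers int) (int, error) {
+	reachable := reachableFollowers + 1 // the leader always counts itself
+	switch {
+	case reachable >= configuredMin:
+		return configuredMin, nil // normal path: ack at the configured minimum
+	case reachable >= 2:
+		// Degraded path: as many as available, never fewer than
+		// leader + one follower. Callers emit
+		// s3_chunkblob_replication_degraded_total here.
+		return reachable, nil
+	default:
+		// Leader-only durability is the hard floor the design refuses.
+		return 0, errChunkBlobQuorumUnavailable
+	}
+}
+```
+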
For N > 3 this is no longer
+true: a 5-node cluster has Raft quorum 3 and tolerates 2
+simultaneous failures for the chunkref, but the degraded
+chunkblob path (leader + 1 follower) tolerates only 1. If the
+leader and the chunkblob-holding follower both fail during the
+degraded window, the surviving Raft quorum elects a new leader,
+finds a committed chunkref, and discovers that no surviving node
+holds the chunkblob — the chunkref is durable but the object data
+is lost. This is **weaker than the legacy "every byte through
+Raft" path**, which loses data only when Raft itself loses quorum
+(3 simultaneous failures on N=5). Operators on N > 3 clusters who
+need the legacy "blob durability == Raft durability" guarantee
+should configure `chunkBlobMinReplicas = N` (full replication;
+trades some PUT availability — any single peer outage stalls
+PUTs — for the strongest durability the cluster can offer).
+The default `(N/2)+1` is sized for "match Raft quorum," not "match
+Raft fault tolerance"; this distinction is invisible at N=3 but
+material at N≥5 and is what makes this configuration knob
+operationally meaningful.
+
+The trade-off is PUT latency: a PUT now blocks on
+`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft
+quorum write of the chunkref. Empirically the chunkblob fsync is
+the dominant cost (1 MiB write, ~5–10 ms on consumer SSD), so PUT
+p99 is roughly equivalent to today's "every byte through Raft"
+latency — we are paying the same fsync cost, just to a different
+keyspace.
+
+The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024
+of them = ~100 KiB of Raft log payload. Compared with today's
+1 GiB through Raft, that is a **10⁴× reduction** in Raft log
+write amplification — even with semi-synchronous chunkblob
+replication, the Raft log itself is unaffected by chunk size, so
+log replay time, snapshot transfer, and follower catch-up still
+collapse to O(manifest count).
+
+### 3.3 Follower apply path
+
+When a follower's apply loop sees a `chunkref` Put:
+
+1. Stage the `chunkref` key in MVCCStore as usual.
+2. Schedule an async fetch of `!s3|chunkblob|<sha>` from
+   `SourcePeer` (or a quorum-style fanout to all known peers).
+3. The fetch worker writes the chunk to local Pebble at
+   `!s3|chunkblob|<sha>` once the body arrives and verifies the
+   SHA-256 (mismatch → drop and retry from another peer).
+
+`SourcePeer` is a **best-effort hint** captured at write time
+(§3.1). It is *not* authoritative — the recorded peer may have
+crashed, restarted, evicted the blob via §3.5 GC, or simply lost
+the local copy to disk failure. Callers MUST treat
+`FetchChunkBlob → NOT_FOUND` from `SourcePeer` as a normal fallback
+trigger, not an error: drop to fanout against the rest of the
+peer set and accept the first peer that returns the bytes (with
+SHA-256 verification at the receiving end). Treating
+`NOT_FOUND` as a hard failure would make a single peer's GC tick
+pin clients on a bad source.
+
+GET / range-read on the follower checks the local `chunkblob`
+first; if absent (because the async fetch is still pending), it
+either:
+
+- proxies the read to a peer that holds the chunk (using the
+  `SourcePeer` hint, falling back to fanout on `NOT_FOUND`), or
+- replies 503 with `Retry-After`, identical to S3's behaviour
+  during a region failover.
+
+The choice is per-deployment; Phase-1 ships proxy-on-miss.
+
+#### 3.3.1 GET vs. GC delete race
+
+Even with the §3.5 grace window, a GET that proxies to peer X for
+a chunkblob can lose a race against peer X's sweeper if the
+sweeper deletes the local copy *between* the GET arriving on the
+caller's node and the proxy RPC reaching peer X. The blob remains
+reachable globally (other peers still hold it; the chunkref is
+unchanged), so the right behaviour is for the caller to fall back
+to fanout. We therefore mandate:
+
+- `FetchChunkBlob` on a peer whose local sweeper just removed the
+  blob returns `NOT_FOUND`, **not** an internal error.
+- The caller's GET handler treats both `NOT_FOUND` and
+  `INVALID_ARGUMENT (sha mismatch)` as fallback triggers and
+  cycles through the remaining peers in randomised order.
+- If the entire peer set returns `NOT_FOUND` for an SHA whose
+  `chunkref` is still present, that is a genuine durability
+  failure (every replica including the leader's GC raced); the
+  GET surfaces 500 *and* the read path emits a
+  `s3_chunkblob_unrecoverable_total` metric so operators detect
+  the underlying GC bug. With the §3.2 quorum-write durability
+  and the §3.5 grace window, this case requires a coincident
+  failure across a quorum of nodes within a 1-hour window — vastly
+  more unlikely than the per-peer 404 the fanout absorbs as a
+  matter of course.
+
+### 3.4 Snapshot
+
+Today a follower snapshot dump includes every `BlobKey` Pebble has
+ever stored. Under the new design:
+
+- The Raft snapshot serialises only `chunkref` keys plus the rest
+  of the MVCC state (manifests, bucket meta, ACLs).
+- A separate **blob catalog snapshot** lists every locally-held
+  `chunkblob` SHA. This is included in the snapshot stream as a
+  manifest of "blobs you should fetch from me on demand."
+- Once the follower has consumed the Raft snapshot and the blob
+  catalog, it begins serving GET / HEAD by proxying chunkblob
+  fetches to peers as in 3.3.
+
+A 100 GiB cluster's Raft snapshot drops from ~100 GiB to a few
+megabytes (one `chunkref` per chunk, plus the manifest set). The
+blob catalog adds 32 B × N_chunks = ~3 MiB per 100 GiB. Snapshot
+stream time falls by orders of magnitude; the recovery cost
+shifts from "leader's WAL dump" to "follower's lazy blob fetch
+amortised across reads."
+
+### 3.5 Garbage collection
+
+A blob whose `chunkref` has been deleted (DELETE, lifecycle policy,
+object version pruned, manifest aborted) is reclaimable. The
+sweeper needs to know not just *that* the RC reached zero but
+*when*, otherwise the documented grace window is unimplementable
+(a plain counter at zero carries no time signal). We make the
+"became eligible at T" fact a first-class Raft entry:
+
+1. **Reference counting** via `!s3|chunkref-rc|<sha>`, a counter
+   updated inside the same Raft txn that adds / removes a
+   `chunkref`. The atomic `(chunkref change, RC update)` pair is
+   the linearisation point for "this blob is now / no longer
+   reachable."
+2. **GC eligibility queue.** When the txn that decrements an RC
+   would drive it to zero, the *same* txn additionally writes
+   `!s3|chunkblob-gc-queue|<commitTS>|<sha>` → empty. The
+   commitTS-prefixed key is the time signal: the queue is
+   naturally sorted by eligibility-start time, and any node can
+   determine the grace-period boundary by scanning the queue with
+   `endKey = !s3|chunkblob-gc-queue|<now - gracePeriod>|`. If a
+   subsequent txn re-references the same SHA before the sweeper
+   runs (e.g. 
an upload reuses a content hash), that txn deletes
+   the queue entry as part of incrementing the RC; the queue
+   therefore reflects "currently RC==0" rather than "ever was zero."
+3. **Node-local sweeper.** Each node runs an independent sweeper
+   every `chunkBlobGCInterval` (proposed default 5 minutes) that:
+   a. scans the queue range
+      `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now - gracePeriod>|)`
+      for entries whose grace window has elapsed,
+   b. for each `<sha>` returned, re-checks the RC counter at the
+      sweeper's read timestamp.
+
+      *The deletion is two-phase across two storage layers and is
+      NOT a single transaction* — `!s3|chunkblob|<sha>` is local
+      Pebble (per §3.1, never written through Raft), while
+      `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases
+      MUST run in this order:
+
+      i. **Raft phase first — conditional delete.** Delete the
+         queue entry through a Raft txn that is **conditional on
+         the queue entry existing AND the RC counter still being
+         0 at the txn's read timestamp**. The conditional form is
+         load-bearing: if a re-reference txn has committed
+         between the sweeper's queue scan and this txn (driving
+         RC back to 1 and atomically removing the queue entry —
+         see item 2's atomic invariant above), the conditional
+         delete fails and the sweeper aborts before reaching the
+         local phase. An *unconditional* delete would silently
+         succeed on the now-absent queue entry and let the sweeper
+         proceed to local-delete a chunkblob that is currently
+         live (RC=1) — a **correctness bug, not just a space
+         leak**. Concurrent sweepers also serialise on this txn
+         (write-write-conflict on the queue key); only the winner
+         proceeds.
+      ii. **Local phase second.** Delete the local
+         `!s3|chunkblob|<sha>` from Pebble. No Raft round-trip.
+         Reaching this phase implies (i) succeeded, which
+         implies the RC was 0 at the txn read timestamp and
+         remained 0 throughout the txn's commit window — i.e.
+         the blob is genuinely unreachable.
+
+      The phase ordering is the load-bearing detail. If we did
+      local-first then Raft, a crash between the two phases would
+      leave the chunkblob gone locally but the queue entry still
+      present — every subsequent sweep would re-attempt the local
+      delete (no-op) and the queue entry would never get removed
+      until manual intervention. The Raft-first ordering trades
+      that for the inverse failure mode: a crash between the two
+      phases leaves the queue entry deleted but the local
+      chunkblob still on disk — a **bounded local space leak,
+      not a correctness bug**. A periodic "orphan scan" reclaims
+      these.
+
+      The orphan scan covers two distinct sources of orphans:
+
+      - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the
+        case described above; queue entry was removed via Raft
+        but the local Pebble delete never fired.
+      - **PUT failure before chunkref Dispatch** — chunkblob
+        bytes were written to local Pebble in §3.2 step 2, then
+        the PUT aborted before reaching `coordinator.Dispatch`
+        (admission control 503, client disconnect, `PushChunkBlob`
+        quorum failure, request context cancel). In that
+        scenario neither an RC entry nor a GC queue entry was
+        ever written, so the sweeper's queue-range scan never
+        sees these orphans — only the orphan scan does.
+
+      Detection criterion (covers both): `!s3|chunkblob|<sha>`
+      keys whose SHA has either no RC entry at all, or RC=0 with
+      no corresponding queue entry. 
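+
+      Expressed as a predicate (a sketch; `rcOf` and
+      `queueEntryExists` stand in for MVCC reads over the
+      Raft-replicated `chunkref-rc` and `chunkblob-gc-queue`
+      keyspaces — illustrative helpers, not existing code):
+
+      ```go
+      // isOrphan reports whether a locally-held chunkblob is
+      // reclaimable by the orphan scan, per the criterion above.
+      func isOrphan(sha [32]byte) bool {
+          rc, rcExists := rcOf(sha) // !s3|chunkref-rc|<sha>
+          if !rcExists {
+              return true // PUT aborted before the chunkref Dispatch
+          }
+          if rc == 0 && !queueEntryExists(sha) {
+              return true // sweeper crashed between Phase (3b.i) and (3b.ii)
+          }
+          return false
+      }
+      ```
+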
The orphan scan runs at low + priority out of band from the sweeper (proposed default + `chunkBlobOrphanScanInterval = 1 hour`); it is the safety + net behind both the sweeper crash path and the PUT-abort + cleanup path, so the PUT handler does not need its own + best-effort local-delete on the abort path. + c. if the RC has bounced above 0 in the meantime, the queue + entry is stale (a re-reference txn forgot to remove it, or + the sweeper raced) and the sweeper deletes only the queue + entry through a Raft txn, leaving the chunkblob in place. + +The queue is the authoritative "blob is GC-eligible since T" +signal *and* the global "we are GC-ing this SHA" lock — its +Raft-replicated single-writer-per-key property is what makes +concurrent sweepers safe across nodes. The RC is the +authoritative "is reachable" signal, also Raft-replicated. Local +chunkblob deletes are deliberately *not* replicated: each node +deletes its own copy independently after the queue-entry txn +commits, because that's the whole point of the architecture. + +`chunkBlobGCGracePeriod` defaults to 1 hour. The grace window +absorbs in-flight reads (a peer that has already started fetching +the blob completes its fetch before the sweeper runs), in-flight +upload aborts (a multipart abort flips RCs to zero, then a retry +creates a new manifest with the same chunks and bumps them back), +and clock skew between nodes (we use the Raft `commitTS` from the +RC-update txn, not wall clock, so skew is bounded to the +HLC-physical-shift the cluster already tolerates). + +### 3.6 Fetch protocol + +Two RPCs on the existing internal raft transport service — one +follower-initiated (lazy fetch on miss), one leader-initiated +(synchronous replication before chunkref commit, see §3.2 step 3): + +```protobuf +service S3BlobFetch { + // FetchChunkBlob returns the bytes of a chunkblob this peer holds + // locally. Caller must verify SHA-256. Used by followers on the + // proxy-on-miss GET path (§3.3) and during snapshot catch-up. + rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); + + // PushChunkBlob streams a chunkblob from the leader to a follower + // and acks once the bytes are durable in the receiver's Pebble. + // Used by §3.2 step 3 to make chunkblob writes survive a leader + // crash without depending on the async fetch path catching up. + // The receiver SHOULD verify SHA-256 against the request header + // before fsync; mismatch fails the RPC and the leader retries. + rpc PushChunkBlob(stream PushChunkBlobRequest) returns (PushChunkBlobResponse); +} + +message FetchChunkBlobRequest { + bytes content_sha256 = 1; +} + +message FetchChunkBlobResponse { + bytes payload = 1; + bool eof = 2; +} + +message PushChunkBlobRequest { + bytes content_sha256 = 1; // sent in the first frame only + bytes payload = 2; + bool eof = 3; +} + +message PushChunkBlobResponse { + bool durable = 1; // true == fsynced +} +``` + +Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The +existing gRPC `MaxRecvMsgSize = 64 MiB` (PR #593 → `internal.GRPCCallOptions`) +already covers this in a single RPC, but streaming keeps the +implementation symmetric with how the future Raft streaming +transport (proposed under +`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`) +handles large payloads. + +### 3.7 Backwards compatibility & rollout + +The legacy `BlobKey` path remains available. 
New PUTs use the +offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data +keeps reading through the legacy `BlobKey` path until a background +migrator (separate proposal) rewrites it. Mixed keyspace coexistence +works because `!s3|chunkblob|*` and the legacy +`!s3|blob|||...` namespaces are disjoint. + +The opt-in flag stays for at least one full release cycle so we can +revert by flipping a single env var if any of the following surface: + +- a SHA-256 collision (~zero probability but a hard kill criterion), +- a follower fetch storm overwhelming peer-to-peer bandwidth, +- a GC bug that leaks reachable blobs, +- semi-synchronous `PushChunkBlob` latency exceeding the legacy + PUT p99 by an unacceptable margin (the soak-test acceptance + criterion in §5). + +### 3.8 Mixed-version cluster behaviour + +Until *every* node in the cluster speaks the offload protocol, PUTs +on the offload path cannot proceed safely: a node that does not +implement `PushChunkBlob` cannot ack a quorum write, and a follower +that does not implement `FetchChunkBlob` cannot resolve a chunkref +on apply. We therefore gate the offload path on cluster-level +feature negotiation rather than a single env var: + +- A node advertises offload capability by setting + `feature_s3_blob_offload=true` in the `AdminServer.GetClusterOverview` + response (alongside the existing role / version metadata). +- The leader inspects every peer's advertised capabilities at PUT + admission time. If any peer is missing the capability, the PUT + falls back to the legacy `BlobKey` path for that request — even + if the leader has the env var enabled. +- During an upgrade window the leader continues to emit legacy + writes; once the last peer rolls and re-advertises, subsequent + PUTs flip to offload automatically. A roll-back works the same + way in reverse: the first downgraded peer drops its capability + flag and the leader resumes legacy emission within the next + capability-refresh interval (default 30 s). +- Reads always succeed regardless of mixed state, because both + keyspaces are namespaced and the GET path checks legacy then + offload (or vice-versa) and serves whichever resolves. + +This gives operators a **strict two-step rolling upgrade** with no +PUT data path that depends on a half-upgraded cluster: + +1. Roll out the new binary with `ELASTICKV_S3_BLOB_OFFLOAD=false` + on every node. PUTs continue on the legacy path. Validate + stability for a soak window (24 h on the canary cluster in + §5's M0 acceptance criteria). +2. Flip `ELASTICKV_S3_BLOB_OFFLOAD=true` on the leader, then on + followers. Once every node advertises capability, PUTs switch + to the offload path. + +Roll-back: flip the env var to `false` on any node; the leader's +capability check sees the disagreement and falls back to legacy +within ≤ refresh interval. The migrator (M4) is independently +gated and never runs during a roll-back window. + +A node with `ELASTICKV_S3_BLOB_OFFLOAD=true` running against a +cluster where offload is disabled (e.g. a stuck rollout) is safe +— it advertises capability but the leader's per-PUT capability +check sees other peers missing it and routes through legacy. No +data is written into the offload keyspace until a quorum of +capability-advertising peers exists. + +## 4. Interaction with related subsystems + +- **PR #636 + admission control.** The admission control budget + drops in importance under the offload path because Raft entries + are tiny (~100 B per chunkref). 
However the *body bytes still + flow through HTTP* and `prepareStreamingPutBody` continues to + hold them in memory until the local Pebble write returns. The + admission cap must stay; only the per-peer Raft-side worst-case + bound (`MaxInflight × MaxSizePerMsg = 4 GiB`) gets *much* easier + to honour. +- **PR #589 (snapshot tuning) and PR #614 (etcd-snapshot-disk-offload).** + Already implemented. The offload path makes those tunables more + effective by reducing the per-snapshot byte count. +- **`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.** + The blob-fetch RPC reuses the same chunked-streaming approach + proposed for Raft transport. We can land both behind the same + abstraction. +- **Lease read & MVCC snapshot reads.** No change. Manifests remain + the linearisation point; chunk bytes are immutable once committed + (content-addressable), so a stale local copy on a follower is + still correct. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. **`FetchChunkBlob` and `PushChunkBlob` RPCs ship in this milestone** because both M1 PUT (semi-synchronous push) and M1 GET (proxy-on-miss) depend on them — without them M1 GET could only serve local-hit or 503. | Medium (race ordering, RPC plumbing on the request goroutine). | +| M2 | Async fetch worker pool for follower apply (catch-up after a long absence). Independent of M1's synchronous `FetchChunkBlob` use on the GET path. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC (the queue-based scheme in §3.5). | Medium (correctness of RC under concurrent ops). | +| M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | + +Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): + +- WAL growth per GiB of S3 PUT < 1 MiB on a one-week soak test. +- Snapshot transfer for a 100 GiB-cluster follower restart completes + in < 60 s on a 1 Gbps interconnect. +- No regression in PUT p99 latency or GET p99 latency vs. the legacy + path (measured on the 24 h pre-cutover window). + +## 6. Risks + +- **Race between local chunkblob write and the chunkref commit.** + Mitigated by writing the chunkblob to a local Pebble batch with + fsync *before* the chunkref enters `coordinator.Dispatch`. The + manifest commit is the linearisation point; if the chunkblob is + durable on the leader, peers can fetch it as soon as they apply. +- **Follower fetch storm.** A new follower that catches up sees a + flood of `chunkref` Puts and could DDoS the source peer with + fetches. Mitigation: bounded fetch worker pool + token bucket + per-source. The Raft apply loop does *not* block on the fetch — + it stages `chunkref` and lets the fetch lag — so apply latency + stays bounded. +- **SHA-256 collision.** Operationally improbable; shipped with a + metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option + for paranoid operators. 
+- **Leader-only durability before chunkref commit.** Without + intervention, a leader crash between writing the chunkblob to its + own Pebble and the eventual async fetch on followers would leave + a Raft-committed chunkref pointing at a chunkblob no surviving + node has. Mitigation: §3.2 step 3 — synchronous semi-quorum + replication via `PushChunkBlob` before the chunkref enters Raft. + `chunkBlobMinReplicas` defaults to a Raft-quorum-equivalent + floor; operators who want N-way durability bump it explicitly. + This restores end-to-end durability parity with the legacy + "every byte through Raft" path at the cost of one extra fsync + per chunkblob on the followers in the quorum. +- **Cross-tenant blob fetch via SHA-256 guessing.** Because + `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant + could (a) guess a victim tenant's chunk SHA and (b) bypass + authorisation, they could exfiltrate the chunk. SHA-256 guessing + is computationally infeasible for non-trivial content, but we + remove the second prerequisite by enforcing authorisation + *exclusively at the `chunkref` layer*. The `S3BlobFetch.FetchChunkBlob` + RPC is internal-only (raft-transport credentials, not exposed to + S3 clients); user-facing GET resolves through `chunkref` first, + which carries the bucket / key tenancy context, and only after + the ACL check does the server proxy to a peer for the + corresponding `chunkblob`. A future design that exposes + blob fetch on a public surface would need to reintroduce + tenant-scoped authorisation at the blob layer; this proposal + intentionally does not. + +## 7. Out of scope (future work) + +- Cross-cluster blob replication (CRR / disaster recovery). +- Tiered storage (cold blobs to S3-IA / Glacier-equivalent). +- Erasure coding for blob payloads. +- Compression. The current S3 spec is "the bytes the client sent"; + any compression layer is a separate negotiated feature. + +## 8. Open questions + +- Do we need a per-follower bandwidth cap on blob fetch? If the + cluster network is constrained, a runaway catch-up could starve + user-path GET / Raft heartbeat traffic. Probably yes — defer to + the workload-isolation rollout. +- Is content-addressing at the chunk granularity (`chunkSize = 1 MiB`) + the right unit, or should we content-address whole objects and + range-fetch sub-chunks? The chunk granularity matches what + `prefetchObjectChunks` already does and keeps content addressing + predictable; whole-object addressing would require re-hashing on + partial reads. Tentatively: chunk granularity for v1.
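+
+## Appendix: proxy-on-miss read sketch (non-normative)
+
+To close, a compressed sketch of the §3.3 / §3.3.1 read path on a
+local chunkblob miss — SourcePeer hint first, then randomised
+fanout, with `NOT_FOUND` and sha-mismatch as fallback triggers
+rather than hard failures. `localBlob`, `fetchFrom`, `peersExcept`,
+`shuffle`, and the error value are illustrative helpers, not
+existing code (assumes imports `context`, `crypto/sha256`, `errors`):
+
+```go
+var errChunkBlobUnrecoverable = errors.New("chunkblob unrecoverable: chunkref present but no replica holds the payload")
+
+func (s *s3Server) chunkBlob(ctx context.Context, ref ChunkRef) ([]byte, error) {
+	if b, ok := s.localBlob(ref.ContentSHA256); ok {
+		return b, nil // async fetch already landed it (§3.3)
+	}
+	// Best-effort hint first, then the rest of the peer set in
+	// randomised order (§3.3.1).
+	peers := append([]NodeID{ref.SourcePeer}, shuffle(s.peersExcept(ref.SourcePeer))...)
+	for _, p := range peers {
+		b, err := s.fetchFrom(ctx, p, ref.ContentSHA256) // FetchChunkBlob RPC
+		if err != nil {
+			continue // NOT_FOUND / transport error → next peer
+		}
+		if sha256.Sum256(b) != ref.ContentSHA256 {
+			continue // sha mismatch → drop, try another peer
+		}
+		return b, nil
+	}
+	// Whole peer set missed while the chunkref exists: genuine
+	// durability failure — surface 500 and bump
+	// s3_chunkblob_unrecoverable_total (§3.3.1).
+	return nil, errChunkBlobUnrecoverable
+}
+```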