From 577106d6f0e03e6f9e784b519674550287850bff Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 25 Apr 2026 23:36:01 +0900 Subject: [PATCH 1/8] docs(design): propose S3 admission control + raft blob offload MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two companion proposals to PR #636 (s3ChunkBatchOps=4, Raft entry size aligned with MaxSizePerMsg). PR #636 fixes the per-entry memory accounting; these docs cover the two follow-up axes the in-PR discussion identified: - 2026_04_25_proposed_s3_admission_control.md Hard cap on the aggregate S3 PUT body bytes a node accepts but has not yet committed to Raft. Per-PUT pipeline is already capped by MaxInflight × MaxSizePerMsg; this proposal bounds the multiplier from concurrent PUTs so leader-side worst-case memory has a hard ceiling regardless of client behaviour. Two-level admission (Content-Length pre-charge + per-batch slot) with HTTP 503 SlowDown rejection. - 2026_04_25_proposed_s3_raft_blob_offload.md Take large chunk payloads off the Raft log entirely. Raft replicates ChunkRefs (32-byte SHA256 + size) plus the manifest; chunk bytes travel via a side-channel gRPC fetch protocol between peers. Snapshot size becomes O(manifest count), follower catch-up is amortised by lazy fetch, and WAL growth no longer scales with stored bytes. Both docs are filed as proposals only; no code changes. They follow the docs/design/README.md filename + header conventions and cross-reference each other plus the prerequisite PRs (#593, #600, #612, #617, #636) and the workload isolation roadmap. 
--- ...026_04_25_proposed_s3_admission_control.md | 280 +++++++++++++++ ...026_04_25_proposed_s3_raft_blob_offload.md | 332 ++++++++++++++++++ 2 files changed, 612 insertions(+) create mode 100644 docs/design/2026_04_25_proposed_s3_admission_control.md create mode 100644 docs/design/2026_04_25_proposed_s3_raft_blob_offload.md diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md new file mode 100644 index 000000000..6700be7f4 --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -0,0 +1,280 @@ +# S3 PUT admission control + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the workload-class +> isolation proposal (`docs/design/2026_04_24_proposed_workload_isolation.md`). +> Where PR #636 fixes the *per-entry* memory accounting and the +> isolation doc keeps Raft's CPU budget separate from heavy command +> paths, this doc bounds the *aggregate* in-flight memory that S3 +> PUT body bytes can pin in the leader's Raft pipeline regardless of +> how many clients are uploading at once. + +--- + +## 1. Problem + +Even with PR #636 reducing the per-entry size to 4 MiB and aligning +with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is +governed by client concurrency, which the server cannot currently +limit: + +``` +leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size + ≈ concurrent_PUTs × MaxInflight × 4 MiB +``` + +A single PUT's pipeline is already capped by `MaxInflight = 1024` Raft +in-flight messages times 4 MiB per entry = 4 GiB / peer (the bound PR +#593 advertises). What multiplies that bound is concurrent uploads: + +- 4 simultaneous 5 GiB PUTs on a 4-vCPU 8 GiB-RAM node can pin + `4 × min(MaxInflight × 4 MiB, body_remaining)` on the leader. 
If a + follower stalls (GC pause, slow disk fsync) for several seconds, that + worst case is realised before backpressure kicks in. +- The 2026-04-24 incident showed that one workload class can wedge the + Go runtime for the entire cluster. S3 PUT body bytes pinned in the + Raft pipeline play the same role as XREAD blobs in the Redis path: + they grow with client behaviour, not with anything the operator can + rate-limit per-handler. +- `GOMEMLIMIT = 1800 MiB` plus `--memory = 2500m` (PR #617) set the + upper bound on heap, but they do not provide *flow control*. The Go + GC starts thrashing well before the limit, and memwatch (PR #612) + triggers graceful shutdown — a recovery mechanism, not a steady-state + control. + +There is no hard ceiling today on the aggregate body bytes that S3 +puts onto the Raft pipeline. PR #636 makes the per-entry slot +predictable; this proposal makes the *number of slots in flight* +predictable too. + +## 2. Goals & non-goals + +**Goals** + +- Hard cap on the total S3 PUT body bytes that have been read from + HTTP but not yet committed to Raft. This cap is a tunable property + of the S3 server, not of any single PUT. +- Steady-state behaviour where new PUTs are admitted at the rate Raft + can drain. Bursting clients see HTTP 503 with `Retry-After`, not + silent ballooning of leader memory. +- Composable with existing memwatch + GOMEMLIMIT. Admission failure + must be visible (metric + log line per rejected request) so an + operator can size the cap against a given memory budget. +- Read path is unaffected. GET / HEAD / LIST do not consume the same + budget. + +**Non-goals** + +- No global QPS quota across all S3 verbs. Other verbs (GET, multipart + initiate / complete) have well-bounded memory and don't need this. +- No per-bucket / per-key fairness. Fairness across tenants is a + separate workload-class concern (handled by the workload isolation + proposal). This doc only fixes the aggregate ceiling. 
+- No backpressure-via-TCP (`SO_RCVBUF` shrink). Decoupling from kernel + buffers keeps the budget testable and avoids cross-OS surprises. +- No multi-region / multi-cluster admission. One leader's perspective. + +## 3. Design + +### 3.1 Where the budget is checked + +Two natural insertion points exist on the PUT path: + +| Insertion point | Granularity | When the request is rejected | +|---|---|---| +| (A) Before `prepareStreamingPutBody` accepts the first byte | Whole-object | Pre-Raft, before any reads from the body | +| (B) Inside the `flushBatch` loop, before `coordinator.Dispatch` | Per-batch | Mid-stream, after `s3ChunkBatchOps` chunks are buffered | + +Recommended: **both, at different scales**. + +- **(A) is the request-admission cap.** Use the `Content-Length` + header (rejected when absent for PUTs that are not aws-chunked) to + pre-charge the budget. If the budget would be exceeded we reply 503 + immediately. This matches how AWS S3 itself handles + `SlowDown` / `ServiceUnavailable` and is the cheapest case to + surface — the body is never read from the socket. +- **(B) is the in-flight cap.** Even after a request is admitted, its + 4 MiB batches can pile up if Raft becomes slow. The flush loop + acquires a per-batch lease (4 MiB) from a shared semaphore *before* + reading the next chunk window. If the semaphore is empty for longer + than `dispatchAdmissionTimeout` (proposed default 30 s), the PUT + fails with 503 mid-stream. Dispatch latency stays bounded by the + semaphore size rather than by client behaviour. + +``` +client ─[Content-Length]─► (A) reserve full body bytes + │ + ▼ + prepareStreamingPutBody + │ + ▼ + (B) acquire 4 MiB slot ◄┐ (released on Dispatch ack) + │ + ▼ + coordinator.Dispatch ───┘ +``` + +### 3.2 Concrete cap values + +Default cap: + +```go +const ( + // s3PutAdmissionMaxInflightBytes is the hard ceiling on S3 PUT body + // bytes accepted by this node but not yet committed to Raft. 
Picked + // so that, on a 3-node cluster with MaxInflight × MaxSizePerMsg = + // 4 GiB / peer per PR #593, all in-flight PUT bytes plus their Raft + // replication shadow stay under one quarter of GOMEMLIMIT (1800 MiB + // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble + // memtables. + s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // dispatchAdmissionTimeout is how long a per-batch flush will wait + // for a slot before giving up. Sized comparable to the Raft + // in-flight queue's full drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s + // ≈ 33 s) so a transient stall does not surface as 503; longer + // stalls indicate the cluster is genuinely overloaded and rejection + // is the right signal. + dispatchAdmissionTimeout = 30 * time.Second +) +``` + +The cap is intentionally **per-node** rather than cluster-wide: +admission is enforced on the node receiving the HTTP request, which is +also the node whose memory is at risk. Clients hitting a different +node simply get a different budget. + +Both values are exposed as env vars (`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES`, +`ELASTICKV_S3_DISPATCH_ADMISSION_TIMEOUT`) with the constants as +defaults, following the pattern PR #593 established for +`ELASTICKV_RAFT_MAX_INFLIGHT_MSGS` etc. + +### 3.3 Data structure + +```go +type putAdmission struct { + semaphore chan struct{} // capacity == max / s3ChunkSize, charges in 1 MiB units + inflight atomic.Int64 // metric mirror; semaphore is the source of truth +} + +func newPutAdmission(maxBytes int64) *putAdmission { … } + +// reserve charges (bytes) units against the semaphore. Returns a +// release closure that the caller MUST call exactly once. Returns +// ErrAdmissionExhausted if the deadline elapses before the budget +// is available. +func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) { … } +``` + +`reserve` is non-trivial because we need partial-charge semantics: + +1.
Pre-charge `Content-Length` total at request entry (admission A). + The release fires in a deferred handler on the request goroutine + regardless of success / failure. +2. Per-batch sub-lease (admission B) is *not* a separate budget — it's + a synchronisation primitive on the existing semaphore. The PUT + handler acquires `s3ChunkSize × s3ChunkBatchOps = 4 MiB` units + before reading the next 4 MiB window from the body and releases + them on `coordinator.Dispatch` ack. This way the same budget covers + both the "request is in flight" and "this batch is buffered for + Raft" phases. + +For aws-chunked transfers (`Content-Length == -1`), the request-entry +charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB) +reservation. The downside is that one chunked PUT can monopolise the +budget; the upside is correctness without re-reading headers. +We will instrument a metric to find out empirically how large that +hit actually is before optimising. + +### 3.4 Failure mode + +- `503 Service Unavailable` with `Retry-After: 1` (small, jittered). + AWS S3 SDK clients (boto3, aws-sdk-go-v2) auto-retry this code with + exponential backoff out of the box. +- Body for the response is the standard S3 XML error envelope: + `<Code>SlowDown</Code><Message>Reduce your request rate</Message>`. + This is the AWS-defined code for admission rejection and matches + what real S3 returns. +- Mid-stream rejection (admission B) closes the connection with a + `connection: close` header so partial body reads do not corrupt the + client's pipeline. The PUT handler also calls + `cleanupManifestBlobs` for any partial blobs that already landed in + Pebble. + +### 3.5 Metrics + +``` +elastickv_s3_put_admission_inflight_bytes gauge +elastickv_s3_put_admission_rejections_total counter (label: stage="prereserve" | "perbatch") +elastickv_s3_put_admission_wait_seconds histogram (label: stage) +``` + +Grafana panel: inflight gauge with the cap as a horizontal line so +the operator sees how often the system saturates.
Rejection rate +suggests bumping the cap or scaling out (more nodes spread PUT load). + +## 4. Interaction with related subsystems + +- **PR #636 (entry size alignment).** The 4 MiB per-batch unit is + the natural admission grain. The two changes are independent: + alignment is necessary for the per-peer Raft bound to be correct; + admission is necessary for the *aggregate* bound to be hard. +- **PR #612 (memwatch graceful shutdown).** Continues to function as + the last-resort safety net. Admission control should fire at well + below the memwatch threshold (`s3PutAdmissionMaxInflightBytes` is + ~14% of `GOMEMLIMIT`) so memwatch sees a much lower steady-state + pressure and the graceful-shutdown path stays a rare event. +- **Workload isolation proposal.** That doc proposes per-class CPU + reservation for Raft. Admission control is the memory-axis sibling. + Both are needed — limiting CPU does not bound queue depth. +- **`coordinator.Dispatch` retries.** Today the S3 path has its own + retry loop (`s3TxnRetryMaxAttempts = 8`). Admission must release + its budget around retries, otherwise a long retry chain +double-counts. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M1 | Add `putAdmission` type + per-node singleton + `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. Metric scaffolding only (gauge + counter). | Low. No mid-stream cancellation yet. | +| M2 | Per-batch admission B inside `flushBatch`. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. | Medium. Cleanup path on partial failure. | +| M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | +| M4 | aws-chunked path: emit a warn log when the conservative 5 GiB reservation kicks in, and add a follow-up to charge the actually-decoded byte count incrementally. | Medium. Needs `awsChunkedReader` to expose a progress callback.
| + +Acceptance criteria: + +- `go test ./adapter/ -short -run TestS3PutAdmission` covers reject / + admit / mid-stream-timeout paths. +- A load test that opens 32 concurrent PUTs of 100 MiB each must + hold leader memory below `s3PutAdmissionMaxInflightBytes + epsilon` + for the duration of the test. +- No regression in `Test_grpc_transaction` (which is currently the + longest leader-stress test). + +## 6. Risks + +- **Tail-latency for legitimate clients.** A long-running PUT that + fails to acquire a 4 MiB slot mid-stream returns 503 even though it is making + progress. Mitigated by `dispatchAdmissionTimeout = 30s`, well above + the steady-state Raft drain time. If we observe spurious 503s in + practice, make the timeout a config knob and tune it. +- **Operator confusion.** "Why does S3 return 503 when CPU is at + 20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After` + value so SDK behaviour is predictable. +- **aws-chunked overcharge.** A small chunked PUT charges 5 GiB + against the budget until M4 lands. We accept this temporarily + because aws-chunked traffic is rare in practice and the + conservative cap fails safe. + +## 7. Out of scope (future work) + +- Per-bucket admission classes (e.g. system buckets get their own + budget). Punted to the workload-isolation rollout. +- Coordinated admission across the multi-region read replica path + proposed in `docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`. +- Token-bucket rate-shaping (e.g. bytes-per-second). The current + proposal only bounds *concurrent* bytes; rate-shaping is a separate + policy choice.
diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md new file mode 100644 index 000000000..e4aa37bfb --- /dev/null +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -0,0 +1,332 @@ +# S3 raft blob offload — keep large object payloads out of the Raft log + +> **Status: Proposed** +> Author: bootjp +> Date: 2026-04-25 +> +> Companion to PR #636 (`s3ChunkBatchOps = 4`, Raft entry size aligned +> with `MaxSizePerMsg = 4 MiB` per PR #593) and to the S3 PUT +> admission-control proposal +> (`docs/design/2026_04_25_proposed_s3_admission_control.md`). +> +> PR #636 caps the *per-entry* size; admission control caps the +> *aggregate in-flight* memory; this doc removes large blob payloads +> from the *Raft log itself* so that snapshots and follower catch-up +> stay bounded as the data set grows. + +--- + +## 1. Problem + +Today every byte of an S3 object travels through the Raft log: + +``` +HTTP PUT body ─► s3ChunkSize (1 MiB) chunks + ─► s3ChunkBatchOps × 1 MiB Raft entry + ─► Raft log entry (Pebble WAL on every node) + ─► applyLoop → s3keys.BlobKey(...) → MVCCStore Put +``` + +`s3keys.BlobKey(bucket, generation, objectKey, uploadID, partNo, chunkNo)` +is the key actually written to Pebble; the log entry that proposed it +contains the full chunk value. Two consequences: + +1. **WAL & snapshot growth scales with object bytes.** A node that + serves 100 GiB of S3 PUT traffic ends up with a 100 GiB Pebble WAL + plus the same 100 GiB persisted in the engine's state machine. + Snapshot transfer for a falling-behind follower carries the + payload twice — once as Raft log replication (during catch-up + inside the WAL window) and once as a snapshot dump if the leader + has already truncated. +2. 
**Follower catch-up after a long absence is expensive.** Right now + a follower that misses 5 GiB of PUT traffic re-applies it as Raft + log entries, each going through the full `applyRequests` → + `MVCCStore.Put` path on a single thread. The ApplyLoop becomes a + bottleneck that holds the rest of the cluster waiting for the + `commitIndex` advance. + +S3 blob payloads are special: they are **idempotent, content- +addressable, large, and rarely re-read inside the leader's apply +window**. They look more like attachments than like Raft log records. +Treating them as Raft entries is overkill — Raft only needs to +linearise *what changed* (the manifest), not *the chunk bytes +themselves*. + +## 2. Goals & non-goals + +**Goals** + +- Raft log entries for S3 PUT carry a *reference* to the chunk + payload, not the payload itself. Replication traffic and WAL + size become O(manifest size), not O(object size). +- Followers fetch blob payloads out-of-band when they apply a + manifest reference. Apply order remains Raft-defined; only the + bytes are pulled lazily. +- Snapshot transfer for a falling-behind follower is O(manifest + count + small blob index), not O(stored bytes). +- The new path is opt-in and lives alongside the current direct-Raft + path until parity is proven. Existing S3 traffic is unaffected + during rollout. + +**Non-goals** + +- No external object store dependency (S3, MinIO, Ceph). The blob + offload uses Pebble itself plus a peer-to-peer fetch protocol; + introducing an external dependency would re-create the operational + surface elastickv exists to replace. +- No deduplication across objects. Each `(bucket, objectKey, + uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does + not deduplicate at this layer. +- No change to MVCC semantics. Manifest commits remain the + serialisation point; blob fetch is a side channel that does not + change visibility rules. +- No removal of the existing `BlobKey` path in this doc. 
We will + ship in two stages (manifest-only writes through Raft, blob + payload via a side channel), and the legacy path stays available + until enough operational evidence accumulates. + +## 3. Design + +### 3.1 New keyspace + +``` +!s3|chunkref|<bucket>|<generation>|<objectKey>|<uploadID>|<partNo>|<chunkNo> + → ChunkRef{ + ContentSHA256 [32]byte + Size uint64 + // Optional: leader-locality hints. Followers without the + // payload locally fetch from a peer that advertises the + // chunk in its catalog. + SourcePeer NodeID + } + +!s3|chunkblob|<contentSHA256> + → raw bytes (the chunk payload) +``` + +Two separate keyspaces: + +- `chunkref` is replicated through Raft. Cheap (32 B + small header + per chunk) and ordered against the manifest commit. +- `chunkblob` is **not** written through Raft. It is written + directly to Pebble on the receiving node and pulled by peers via + the new fetch protocol when they apply the corresponding + `chunkref`. + +### 3.2 PUT path + +``` +client ─► HTTP PUT body + ─► chunk loop (s3ChunkSize): + 1. compute SHA-256 of chunk + 2. write chunk to LOCAL Pebble at !s3|chunkblob|<contentSHA256> + 3. queue ChunkRef into pendingBatch + ─► flushBatch: + coordinator.Dispatch(OperationGroup{ + Elems: [ chunkref Puts ... + manifest Put ], + }) + ─► HTTP 200 OK once Dispatch acks +``` + +The order matters: the chunk *must* be persisted in local Pebble +before its `chunkref` is committed through Raft, otherwise a +follower that applies the `chunkref` and immediately re-fetches +might race the leader's local write. Pebble's WAL fsync gives the +ordering guarantee; the chunk write returns only after the data is +durable. + +The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024 +of them = ~100 KiB of Raft log payload. Compared with today's +1 GiB through Raft, that is a **10⁴× reduction** in Raft log +write amplification. + +### 3.3 Follower apply path + +When a follower's apply loop sees a `chunkref` Put: + +1. Stage the `chunkref` key in MVCCStore as usual. +2.
Schedule an async fetch of `!s3|chunkblob|<contentSHA256>` from + `SourcePeer` (or a quorum-style fanout to all known peers). +3. The fetch worker writes the chunk to local Pebble at + `!s3|chunkblob|<contentSHA256>` once the body arrives and verifies the + SHA-256 (mismatch → drop and retry from another peer). + +GET / range-read on the follower checks the local `chunkblob` +first; if absent (because the async fetch is still pending), it +either: + +- proxies the read to a peer that *does* have the chunk (using the + `SourcePeer` hint and falling back to a fanout), or +- replies 503 with `Retry-After`, identical to S3's behaviour + during a region failover. + +The choice is per-deployment; Phase-1 ships proxy-on-miss. + +### 3.4 Snapshot + +Today a follower snapshot dump includes every `BlobKey` Pebble has +ever stored. Under the new design: + +- The Raft snapshot serialises only `chunkref` keys plus the rest + of the MVCC state (manifests, bucket meta, ACLs). +- A separate **blob catalog snapshot** lists every locally-held + `chunkblob` SHA. This is included in the snapshot stream as a + manifest of "blobs you should fetch from me on demand." +- Once the follower has consumed the Raft snapshot and the blob + catalog, it begins serving GET / HEAD by proxying chunkblob + fetches to peers as in 3.3. + +A 100 GiB cluster's Raft snapshot drops from ~100 GiB to a few +megabytes (one `chunkref` per chunk, plus the manifest set). The +blob catalog adds 32 B × N_chunks = ~3 MiB per 100 GiB. Snapshot +stream time falls by orders of magnitude; the recovery cost +shifts from "leader's WAL dump" to "follower's lazy blob fetch +amortised across reads." + +### 3.5 Garbage collection + +A blob whose `chunkref` has been deleted (DELETE, lifecycle policy, +object version pruned, manifest aborted) is reclaimable. We handle +this in two stages: + +1. Reference counting via a `!s3|chunkref-rc|<contentSHA256>` counter + updated inside the same Raft txn that adds / removes a + `chunkref`. RC == 0 marks the blob as eligible. +2.
A node-local sweeper periodically (e.g. every 5 minutes) + scans `!s3|chunkblob|*` and deletes blobs whose RC is 0 and + whose RC has been 0 for at least `chunkBlobGCGracePeriod` + (proposed default 1 hour). The grace window covers in-flight + reads and avoids deleting a blob a peer is just about to fetch. + +Sweeper deletion is local — it does not pass through Raft, because +the authoritative state ("is this blob unreachable?") is the +already-committed RC. Followers run independent sweepers and arrive +at the same conclusion. + +### 3.6 Fetch protocol + +A new gRPC method on the existing internal raft transport service: + +```protobuf +service S3BlobFetch { + // FetchChunkBlob returns the bytes of a chunkblob this peer holds + // locally. Caller must verify SHA-256. + rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); +} + +message FetchChunkBlobRequest { + bytes content_sha256 = 1; +} + +message FetchChunkBlobResponse { + bytes payload = 1; + bool eof = 2; +} +``` + +Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The +existing gRPC `MaxRecvMsgSize = 64 MiB` (PR #593 → `internal.GRPCCallOptions`) +already covers this in a single RPC, but streaming keeps the +implementation symmetric with how the future Raft streaming +transport (proposed under +`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`) +handles large payloads. + +### 3.7 Backwards compatibility & rollout + +The legacy `BlobKey` path remains available. New PUTs use the +offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data +keeps reading through the legacy `BlobKey` path until a background +migrator (separate proposal) rewrites it. Mixed clusters work +because both keyspaces are namespaced. 
+ +The opt-in flag stays for at least one full release cycle so we can +revert by flipping a single env var if any of the following surface: + +- a SHA-256 collision (~zero probability but a hard kill criterion), +- a follower fetch storm overwhelming peer-to-peer bandwidth, +- a GC bug that leaks reachable blobs. + +## 4. Interaction with related subsystems + +- **PR #636 + admission control.** The admission control budget + drops in importance under the offload path because Raft entries + are tiny (~100 B per chunkref). However the *body bytes still + flow through HTTP* and `prepareStreamingPutBody` continues to + hold them in memory until the local Pebble write returns. The + admission cap must stay; only the per-peer Raft-side worst-case + bound (`MaxInflight × MaxSizePerMsg = 4 GiB`) gets *much* easier + to honour. +- **PR #589 (snapshot tuning) and PR #614 (etcd-snapshot-disk-offload).** + Already implemented. The offload path makes those tunables more + effective by reducing the per-snapshot byte count. +- **`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md`.** + The blob-fetch RPC reuses the same chunked-streaming approach + proposed for Raft transport. We can land both behind the same + abstraction. +- **Lease read & MVCC snapshot reads.** No change. Manifests remain + the linearisation point; chunk bytes are immutable once committed + (content-addressable), so a stale local copy on a follower is + still correct. + +## 5. Implementation plan + +| Milestone | Scope | Risk | +|---|---|---| +| M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. GET path checks chunkblob locally with proxy-on-miss. | Medium (race ordering). | +| M2 | Follower fetch protocol + async fetch worker pool. 
SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC. | Medium (correctness of RC under concurrent ops). | +| M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | + +Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): + +- WAL growth per GiB of S3 PUT < 1 MiB on a one-week soak test. +- Snapshot transfer for a 100 GiB-cluster follower restart completes + in < 60 s on a 1 Gbps interconnect. +- No regression in PUT p99 latency or GET p99 latency vs. the legacy + path (measured on the 24 h pre-cutover window). + +## 6. Risks + +- **Race between local chunkblob write and the chunkref commit.** + Mitigated by writing the chunkblob to a local Pebble batch with + fsync *before* the chunkref enters `coordinator.Dispatch`. The + manifest commit is the linearisation point; if the chunkblob is + durable on the leader, peers can fetch it as soon as they apply. +- **Follower fetch storm.** A new follower that catches up sees a + flood of `chunkref` Puts and could DDoS the source peer with + fetches. Mitigation: bounded fetch worker pool + token bucket + per-source. The Raft apply loop does *not* block on the fetch — + it stages `chunkref` and lets the fetch lag — so apply latency + stays bounded. +- **SHA-256 collision.** Operationally improbable; shipped with a + metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option + for paranoid operators. +- **Local-blob-only on a single node.** If only one node has a + given chunkblob and that node fails before peers fetch it, data + is lost. Mitigation: PUT path replicates the chunkblob to *N* + peers asynchronously before returning 200 (e.g. quorum write + outside of Raft). N defaults to 2 (one extra copy = parity with + Raft's quorum durability). + +## 7. 
Out of scope (future work) + +- Cross-cluster blob replication (CRR / disaster recovery). +- Tiered storage (cold blobs to S3-IA / Glacier-equivalent). +- Erasure coding for blob payloads. +- Compression. The current S3 spec is "the bytes the client sent"; + any compression layer is a separate negotiated feature. + +## 8. Open questions + +- Do we need a per-follower bandwidth cap on blob fetch? If the + cluster network is constrained, a runaway catch-up could starve + user-path GET / Raft heartbeat traffic. Probably yes — defer to + the workload-isolation rollout. +- Is content-addressing at the chunk granularity (`chunkSize = 1 MiB`) + the right unit, or should we content-address whole objects and + range-fetch sub-chunks? The chunk granularity matches what + `prefetchObjectChunks` already does and keeps content addressing + predictable; whole-object addressing would require re-hashing on + partial reads. Tentatively: chunk granularity for v1. From bfd4a56f616f8a8abd2134a6e6d7237a19b78783 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 00:39:23 +0900 Subject: [PATCH 2/8] docs(design): clarify dedup non-goal for blob offload (Gemini medium) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Gemini medium on PR #637 — the original "No deduplication across objects" non-goal contradicts §3.1 (chunkblob is content-addressed by SHA-256) and §3.5 (reference-counted GC). Two distinct objects whose chunks happen to hash identically *will* share a chunkblob row; that is a structural property of content addressing. Rewrite the bullet to be precise about what we are *not* promising without contradicting what the design does: - Surface no dedup API / metering / billing surface. - Reference layer (chunkref) stays per-object so DELETE / lifecycle still reason about objects independently. 
- Authorisation enforcement remains on chunkref reads, never on chunkblob reads — see the new "Cross-tenant blob fetch" risk entry that pins the threat model. Add a matching risk entry under §6 covering the SHA-256-guessing attack the dedup non-goal alludes to: feasible only if a tenant can both guess a victim's chunk SHA *and* bypass auth. We block the second by keeping S3BlobFetch internal-only and gating GET through chunkref ACL before the proxy-on-miss path is invoked. A future design that exposes blob fetch on a public surface would need to reintroduce tenant-scoped authorisation at the blob layer; this proposal explicitly does not. No code changes — design doc only. --- ...026_04_25_proposed_s3_raft_blob_offload.md | 34 +++++++++++++++++-- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index e4aa37bfb..52106f3a1 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -74,9 +74,22 @@ themselves*. offload uses Pebble itself plus a peer-to-peer fetch protocol; introducing an external dependency would re-create the operational surface elastickv exists to replace. -- No deduplication across objects. Each `(bucket, objectKey, - uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does - not deduplicate at this layer. +- No *user-facing* deduplication API or storage-accounting credit. + The `chunkblob` keyspace is content-addressed by SHA-256 (§3.1) + and reference-counted (§3.5), which means two distinct objects + whose chunks happen to hash identically will share one + `chunkblob` row at the storage layer — that is a structural + property of content addressing, not a feature we expose. 
We do + *not*: surface dedup ratios, charge storage by post-dedup bytes, + rebalance dedup credit across tenants, or treat dedup hits as + semantically observable from S3 verbs. The reference layer + (`chunkref`) keeps `(bucket, objectKey, uploadID, partNo, + chunkNo)` granular and per-object, so DELETE / lifecycle still + reason about objects independently. Authorisation enforcement + remains on `chunkref` reads, never on `chunkblob` reads + (§3.3 covers the proxy-on-miss path; ACL checks fire before the + blob fetch is initiated, so a tenant cannot dereference a peer's + `chunkblob` by guessing a SHA — see §6 *Cross-tenant blob fetch*). - No change to MVCC semantics. Manifest commits remain the serialisation point; blob fetch is a side channel that does not change visibility rules. @@ -309,6 +322,21 @@ Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD= peers asynchronously before returning 200 (e.g. quorum write outside of Raft). N defaults to 2 (one extra copy = parity with Raft's quorum durability). +- **Cross-tenant blob fetch via SHA-256 guessing.** Because + `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant + could (a) guess a victim tenant's chunk SHA and (b) bypass + authorisation, they could exfiltrate the chunk. SHA-256 guessing + is computationally infeasible for non-trivial content, but we + remove the second prerequisite by enforcing authorisation + *exclusively at the `chunkref` layer*. The `S3BlobFetch.FetchChunkBlob` + RPC is internal-only (raft-transport credentials, not exposed to + S3 clients); user-facing GET resolves through `chunkref` first, + which carries the bucket / key tenancy context, and only after + the ACL check does the server proxy to a peer for the + corresponding `chunkblob`. A future design that exposes + blob fetch on a public surface would need to reintroduce + tenant-scoped authorisation at the blob layer; this proposal + intentionally does not. ## 7. 
Out of scope (future work) From 4d0707b251cac10770b3446622868a730a93a3b4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 00:44:34 +0900 Subject: [PATCH 3/8] docs(design): durability + rolling-upgrade + chunked admission MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round-2 review on PR #637 — Gemini high (durability, rolling upgrade) + medium (aws-chunked head-of-line blocking). s3_raft_blob_offload.md (Gemini high): - §3.2: chunkblob writes now go through semi-synchronous quorum replication via a new PushChunkBlob RPC before the chunkref enters Raft. Without it, a leader crash between the chunkref commit and the eventual async fetch on followers would leave a Raft-committed manifest pointing at a chunkblob no surviving node has — Raft's quorum bound the chunkref but said nothing about the blob payload. chunkBlobMinReplicas defaults to a Raft-quorum-equivalent floor, restoring end-to-end durability parity with the legacy "every byte through Raft" path. - §3.6: add the PushChunkBlob RPC alongside the existing FetchChunkBlob, with a Push-side fsync ack and a streaming request shape symmetric to the fetch path. - §3.8 (new): explicit two-step rolling-upgrade strategy gated on cluster-wide capability advertisement (feature_s3_blob_offload in GetClusterOverview). PUTs only flip to the offload path when *every* peer advertises the capability; any disagreement causes the leader to route the PUT through the legacy BlobKey path. Roll-back works the same way in reverse with no data loss because both keyspaces are namespaced and reads serve whichever resolves first. - §6: replace the optimistic "Local-blob-only on a single node" bullet with the explicit "Leader-only durability before chunkref commit" risk + the §3.2 mitigation. 
s3_admission_control.md (Gemini medium): - §3.3.1 (new): aws-chunked transfers no longer pre-charge s3MaxObjectSizeBytes (5 GiB on default tunables, 20× over the 256 MiB cap and a guaranteed head-of-line block for every other PUT). Bootstrap reservation is now s3RaftEntryByteBudget (4 MiB) and the rest is paid-as-decoded via an awsChunkedReader progress callback that hooks into the same per-batch admission the fixed-length path uses. The legacy 5 GiB pre-charge stays available behind ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false for incident response, but ships disabled by default. - §5: chunked progress callback moves from M4 to M1 so the HoL hazard is closed in the first shippable milestone. Adds an explicit "Rolling upgrade" subsection covering mixed-cluster safety (additive cap, no protocol change) and the new observable client behaviour (mid-stream 503 SlowDown for chunked PUTs that beat Raft drain). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 87 ++++++++-- ...026_04_25_proposed_s3_raft_blob_offload.md | 148 +++++++++++++++--- 2 files changed, 207 insertions(+), 28 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 6700be7f4..4f9be3587 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -182,12 +182,54 @@ func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) both the "request is in flight" and "this batch is buffered for Raft" phases. -For aws-chunked transfers (`Content-Length == -1`), the request-entry -charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB) -reservation. The downside is that one chunked PUT can monopolise the -budget; the upside is correctness without re-reading headers. -We will instrument a metric to find out empirically how large that -hit actually is before optimising. 
+### 3.3.1 aws-chunked transfers (`Content-Length: -1`) + +A naïve "reserve `s3MaxObjectSizeBytes` (5 GiB) up front" is rejected: +on default tunables (`s3PutAdmissionMaxInflightBytes = 256 MiB`) a +single chunked PUT would consume **20×** the entire budget at request +admission time, head-of-line-blocking every other PUT until the +chunked stream finishes — exactly the failure mode admission control +exists to prevent. We therefore split chunked admission across two +mechanisms instead of pre-charging: + +1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at + request entry. This is enough to admit the request and let the + awsChunkedReader produce its first decoded window. Chunked PUTs + are not "free" — they still must beat the same admission queue + as fixed-length PUTs at the per-batch level. +2. **Pay-as-you-decode** thereafter, charged via an + `awsChunkedReader` progress callback. Each decoded chunk frame + (typically up to 64 KiB on the wire after framing overhead) + acquires a slot equal to the bytes about to flow into Pebble; the + slot is released once the corresponding `coordinator.Dispatch` + acks. This is the same path admission B uses for fixed-length + PUTs — chunked traffic just hooks into it incrementally instead of + pre-charging the worst case. + +Failure modes: + +- If the awsChunkedReader produces frames faster than Raft drains, the + per-batch acquire blocks (capped by `dispatchAdmissionTimeout`). + Beyond that timeout, mid-stream 503 closes the connection. The + legacy "reserve 5 GiB" approach would have surfaced as 503 *at + request entry* for unrelated PUTs; this approach surfaces as + mid-stream 503 for the chunked PUT itself, which is the right + blame attribution. +- If the awsChunkedReader frame size ever exceeds + `s3RaftEntryByteBudget` (a malformed client), the per-batch acquire + asks for more than the cap allows and we 503 immediately — same + as a fixed-length PUT whose `Content-Length` exceeds the global + cap. 
+ +This change moves chunked admission from M4 (originally "deferred +optimisation") into M1 (the first shippable milestone). M1 ships +with the progress-callback wired *unconditionally* for all chunked +PUTs; an env-var switch falls back to "bootstrap-only" charging +without the per-decode credit if a corner case requires it +(`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, default +`true`). The fallback path keeps the 5 GiB-reservation hazard +behind an explicit operator decision rather than letting it +materialise by default. ### 3.4 Failure mode @@ -239,10 +281,37 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). | Milestone | Scope | Risk | |---|---|---| -| M1 | Add `putAdmission` type + per-node singleton + `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. Metric scaffolding only (gauge + counter). | Low. No mid-stream cancellation yet. | -| M2 | Per-batch admission B inside `flushBatch`. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. | Medium. Cleanup path on partial failure. | +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. | | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. 
|
-| M4 | aws-chunked path: emit a warn log when the conservative 5 GiB reservation kicks in, and add a follow-up to charge the actually-decoded byte count incrementally. | Medium. Needs `awsChunkedReader` to expose a progress callback. |
+| M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. |
+
+### Rolling upgrade
+
+Admission is purely additive on the request entry path: a node
+without the cap behaves identically to a node with the cap set
+infinitely high. A mixed cluster (some nodes M1, some still on
+`main`) is therefore safe — clients hitting the upgraded node see
+admission, clients hitting an old node see no admission, but
+neither path corrupts state. The default cap is intentionally
+generous enough that even single-node M1 traffic falls below the
+threshold under typical load, so the rollout signature is
+"503 SlowDown rate goes from 0 to negligible" rather than a step
+function. Operators can pin
+`ELASTICKV_S3_PUT_ADMISSION_MAX_INFLIGHT_BYTES=$(((1<<63)-1))`
+(max int64; plain `$((1<<63))` overflows to a negative value in
+64-bit shell arithmetic) to disable the cap on M1 nodes during
+the burn-in window if desired.
+
+The aws-chunked progress-callback path is the only behaviour
+change visible to clients: a chunked PUT that would have succeeded
+under the old "no admission" code can now 503 mid-stream when Raft
+drain falls behind. This is by design — that is the failure mode
+admission control exists to surface — but operators should expect
+to see chunked-upload 503s where there were none before. The
+`stage="perbatch", protocol="chunked"` rejection-counter label
+isolates this signal; bumping the cap or
+`ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false` (with the
+HoL hazard re-introduced as a known trade-off) restores legacy
+behaviour during incident response.
Acceptance criteria: diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 52106f3a1..e8f0096af 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -133,7 +133,8 @@ client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk 2. write chunk to LOCAL Pebble at !s3|chunkblob| - 3. queue ChunkRef into pendingBatch + 3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers *** + 4. queue ChunkRef into pendingBatch ─► flushBatch: coordinator.Dispatch(OperationGroup{ Elems: [ chunkref Puts ... + manifest Put ], @@ -141,17 +142,47 @@ client ─► HTTP PUT body ─► HTTP 200 OK once Dispatch acks ``` -The order matters: the chunk *must* be persisted in local Pebble -before its `chunkref` is committed through Raft, otherwise a -follower that applies the `chunkref` and immediately re-fetches -might race the leader's local write. Pebble's WAL fsync gives the -ordering guarantee; the chunk write returns only after the data is -durable. +Step 3 — synchronous chunkblob replication before the chunkref +commit — is the difference between "Raft-equivalent durability" +and "leader-only durability." Without it, a leader crash between +the chunkref commit and the eventual async fetch would leave a +committed manifest pointing at a chunkblob nobody else has — Raft's +quorum guarantees the chunkref but tells you nothing about the +blob payload. We close that gap by treating the chunkblob like a +mini-Raft entry of its own with **semi-synchronous quorum**: + +1. Leader writes the chunkblob to local Pebble (fsync). +2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1` + followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for + each follower's "fsync ack." (`PushChunkBlob` is the leader- + initiated counterpart to the follower-initiated `FetchChunkBlob` + defined in §3.6.) +3. 
Only after the chunkblob is durable on a quorum of nodes does + the leader propose the chunkref through Raft. + +`chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a +quorum of 2 includes the leader and one follower). For larger +clusters the floor is `(N/2)+1` to match Raft's quorum size; the +operator can opt into N for stronger-than-Raft durability. A +follower that crashes after acking the push but before the chunkref +commits is fine — the chunkref will be retried by the leader on the +next attempt because it has not yet entered the Raft log. + +The trade-off is PUT latency: a PUT now blocks on +`chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft +quorum write of the chunkref. Empirically the chunkblob fsync is +the dominant cost (1 MiB write, ~5–10 ms on consumer SSD), so PUT +p99 is roughly equivalent to today's "every byte through Raft" +latency — we are paying the same fsync cost, just to a different +keyspace. The `chunkref` keys are < 100 B each. A 1 GiB PUT generates 1024 of them = ~100 KiB of Raft log payload. Compared with today's 1 GiB through Raft, that is a **10⁴× reduction** in Raft log -write amplification. +write amplification — even with semi-synchronous chunkblob +replication, the Raft log itself is unaffected by chunk size, so +log replay time, snapshot transfer, and follower catch-up still +collapse to O(manifest count). ### 3.3 Follower apply path @@ -218,13 +249,24 @@ at the same conclusion. ### 3.6 Fetch protocol -A new gRPC method on the existing internal raft transport service: +Two RPCs on the existing internal raft transport service — one +follower-initiated (lazy fetch on miss), one leader-initiated +(synchronous replication before chunkref commit, see §3.2 step 3): ```protobuf service S3BlobFetch { // FetchChunkBlob returns the bytes of a chunkblob this peer holds - // locally. Caller must verify SHA-256. + // locally. Caller must verify SHA-256. 
Used by followers on the + // proxy-on-miss GET path (§3.3) and during snapshot catch-up. rpc FetchChunkBlob(FetchChunkBlobRequest) returns (stream FetchChunkBlobResponse); + + // PushChunkBlob streams a chunkblob from the leader to a follower + // and acks once the bytes are durable in the receiver's Pebble. + // Used by §3.2 step 3 to make chunkblob writes survive a leader + // crash without depending on the async fetch path catching up. + // The receiver SHOULD verify SHA-256 against the request header + // before fsync; mismatch fails the RPC and the leader retries. + rpc PushChunkBlob(stream PushChunkBlobRequest) returns (PushChunkBlobResponse); } message FetchChunkBlobRequest { @@ -235,6 +277,16 @@ message FetchChunkBlobResponse { bytes payload = 1; bool eof = 2; } + +message PushChunkBlobRequest { + bytes content_sha256 = 1; // sent in the first frame only + bytes payload = 2; + bool eof = 3; +} + +message PushChunkBlobResponse { + bool durable = 1; // true == fsynced +} ``` Streamed because a chunkblob is up to `s3ChunkSize = 1 MiB`. The @@ -250,15 +302,68 @@ handles large payloads. The legacy `BlobKey` path remains available. New PUTs use the offload path when `ELASTICKV_S3_BLOB_OFFLOAD=true`; existing data keeps reading through the legacy `BlobKey` path until a background -migrator (separate proposal) rewrites it. Mixed clusters work -because both keyspaces are namespaced. +migrator (separate proposal) rewrites it. Mixed keyspace coexistence +works because `!s3|chunkblob|*` and the legacy +`!s3|blob|||...` namespaces are disjoint. The opt-in flag stays for at least one full release cycle so we can revert by flipping a single env var if any of the following surface: - a SHA-256 collision (~zero probability but a hard kill criterion), - a follower fetch storm overwhelming peer-to-peer bandwidth, -- a GC bug that leaks reachable blobs. 
+- a GC bug that leaks reachable blobs, +- semi-synchronous `PushChunkBlob` latency exceeding the legacy + PUT p99 by an unacceptable margin (the soak-test acceptance + criterion in §5). + +### 3.8 Mixed-version cluster behaviour + +Until *every* node in the cluster speaks the offload protocol, PUTs +on the offload path cannot proceed safely: a node that does not +implement `PushChunkBlob` cannot ack a quorum write, and a follower +that does not implement `FetchChunkBlob` cannot resolve a chunkref +on apply. We therefore gate the offload path on cluster-level +feature negotiation rather than a single env var: + +- A node advertises offload capability by setting + `feature_s3_blob_offload=true` in the `AdminServer.GetClusterOverview` + response (alongside the existing role / version metadata). +- The leader inspects every peer's advertised capabilities at PUT + admission time. If any peer is missing the capability, the PUT + falls back to the legacy `BlobKey` path for that request — even + if the leader has the env var enabled. +- During an upgrade window the leader continues to emit legacy + writes; once the last peer rolls and re-advertises, subsequent + PUTs flip to offload automatically. A roll-back works the same + way in reverse: the first downgraded peer drops its capability + flag and the leader resumes legacy emission within the next + capability-refresh interval (default 30 s). +- Reads always succeed regardless of mixed state, because both + keyspaces are namespaced and the GET path checks legacy then + offload (or vice-versa) and serves whichever resolves. + +This gives operators a **strict two-step rolling upgrade** with no +PUT data path that depends on a half-upgraded cluster: + +1. Roll out the new binary with `ELASTICKV_S3_BLOB_OFFLOAD=false` + on every node. PUTs continue on the legacy path. Validate + stability for a soak window (24 h on the canary cluster in + §5's M0 acceptance criteria). +2. 
Flip `ELASTICKV_S3_BLOB_OFFLOAD=true` on the leader, then on + followers. Once every node advertises capability, PUTs switch + to the offload path. + +Roll-back: flip the env var to `false` on any node; the leader's +capability check sees the disagreement and falls back to legacy +within ≤ refresh interval. The migrator (M4) is independently +gated and never runs during a roll-back window. + +A node with `ELASTICKV_S3_BLOB_OFFLOAD=true` running against a +cluster where offload is disabled (e.g. a stuck rollout) is safe +— it advertises capability but the leader's per-PUT capability +check sees other peers missing it and routes through legacy. No +data is written into the offload keyspace until a quorum of +capability-advertising peers exists. ## 4. Interaction with related subsystems @@ -316,12 +421,17 @@ Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD= - **SHA-256 collision.** Operationally improbable; shipped with a metric (`s3_chunkblob_sha_mismatch_total`) and a hard-fail option for paranoid operators. -- **Local-blob-only on a single node.** If only one node has a - given chunkblob and that node fails before peers fetch it, data - is lost. Mitigation: PUT path replicates the chunkblob to *N* - peers asynchronously before returning 200 (e.g. quorum write - outside of Raft). N defaults to 2 (one extra copy = parity with - Raft's quorum durability). +- **Leader-only durability before chunkref commit.** Without + intervention, a leader crash between writing the chunkblob to its + own Pebble and the eventual async fetch on followers would leave + a Raft-committed chunkref pointing at a chunkblob no surviving + node has. Mitigation: §3.2 step 3 — synchronous semi-quorum + replication via `PushChunkBlob` before the chunkref enters Raft. + `chunkBlobMinReplicas` defaults to a Raft-quorum-equivalent + floor; operators who want N-way durability bump it explicitly. 
+ This restores end-to-end durability parity with the legacy + "every byte through Raft" path at the cost of one extra fsync + per chunkblob on the followers in the quorum. - **Cross-tenant blob fetch via SHA-256 guessing.** Because `chunkblob` keys are SHA-256-addressed, *if* a malicious tenant could (a) guess a victim tenant's chunk SHA and (b) bypass From 7e594099d80b5a745a1331f7353e31e87f8fc0b4 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:15:24 +0900 Subject: [PATCH 4/8] docs(design): address Claude bot review (2 HIGH, 4 MEDIUM, 4 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_admission_control.md — HIGH 1: stale §6 entry "aws-chunked overcharge persists until M4" contradicted §3.3.1 / §5 milestone table which now ship pay-as-you-decode in M1. Replace with the actual residual M1 risk: a chunked PUT that previously succeeded silently can now 503 mid-stream when Raft drain falls behind. s3_admission_control.md — MEDIUM 1: §3.3 A+B semaphore semantics were ambiguous. Pin the contract: A is a check-only headroom gate (`peekHeadroom`, no acquire); B is the actual acquire/release on the semaphore. The bytes-in-flight bound is enforced by B alone; A is a fast-fail at request entry. Includes the explicit "if bytes > capacity * unitSize, fail immediately without waiting" rule for malformed oversized frames (LOW 2 from the same review). s3_admission_control.md — MEDIUM 2: §4 retry-budget release was underspecified. Fix the contract to hold-through-retry: per-batch slot is released exactly once on the *final* outcome of the retry chain, never between attempts. The bytes are still in pendingBatch across retries, so the budget must reflect them; release-between- retries would let a second PUT proceed while the first is still memory-resident. 
s3_admission_control.md — LOW 1: dispatchAdmissionTimeout comment referenced the wrong drain baseline (33 s = full MaxInflight queue, not the 256 MiB cap which drains in ~2 s). Rewrite to make clear 30 s is the budget for a transiently stalled follower to recover, not normal drain time. s3_raft_blob_offload.md — HIGH 2: §3.5 GC grace window was unimplementable as written — a plain RC counter at zero carries no "became eligible at T" signal so the sweeper cannot enforce the 1-hour grace. Switch to a queue-based scheme: `!s3|chunkblob-gc-queue||` written in the same Raft txn that drives RC to zero. The sweeper scans the queue range older than the grace window, re-checks RC at read time, and deletes via Raft so concurrent sweepers serialise on write-write-conflict. s3_raft_blob_offload.md — MEDIUM 3: M1 milestone shipped proxy-on-miss but the FetchChunkBlob RPC was scheduled in M2, so M1 GET could only serve local-hit or 503. Move both FetchChunkBlob and PushChunkBlob into M1 (PushChunkBlob is already needed by §3.2 quorum write); M2 narrows to the async catch-up worker pool that does not exist on the request goroutine. s3_raft_blob_offload.md — MEDIUM 4: §3.3 had no explicit GET-vs-GC race section. Add §3.3.1 specifying that FetchChunkBlob NOT_FOUND on a peer mid-GC is a normal fallback trigger (cycle through remaining peers in randomised order) and that a fanout-wide NOT_FOUND on an SHA whose chunkref is still present is the genuine durability failure that surfaces 500 + metric. s3_raft_blob_offload.md — LOW 3: SourcePeer was implicitly a hint but never said so. Document explicitly that SourcePeer is best- effort, callers MUST handle NOT_FOUND from FetchChunkBlob on that peer as a normal fallback to fanout, and that treating it as a hard error would pin clients on a momentarily-bad source. s3_raft_blob_offload.md — LOW 4: chunkBlobMinReplicas behaviour during cluster shrink / partial outage was undefined. 
Specify the degradation ladder: ≥ minReplicas → normal; < minReplicas but Raft quorum intact → degrade to "as many as available, never < 2" with a metric; < 2 reachable peers → 503 (leader-only durability is the regression this design exists to prevent). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 118 +++++++++++---- ...026_04_25_proposed_s3_raft_blob_offload.md | 140 +++++++++++++++--- 2 files changed, 209 insertions(+), 49 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 4f9be3587..a98492706 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -133,11 +133,15 @@ const ( // memtables. s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB // dispatchAdmissionTimeout is how long a per-batch flush will wait - // for a slot before giving up. Sized comfortably above the Raft - // in-flight queue's drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s - // ≈ 33 s) so a transient stall does not surface as 503; longer - // stalls indicate the cluster is genuinely overloaded and rejection - // is the right signal. + // for a slot before giving up. The 256 MiB cap drains in ~2 s at + // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), + // so 30 s is *not* sized against normal drain — it is the budget + // for a transiently stalled follower (GC pause, slow disk fsync, + // bounded leader re-election) to recover before we conclude the + // cluster is genuinely overloaded. Longer stalls surface as 503, + // which is the right signal: at that point the right action is + // operator intervention (scale out, investigate the stall), not + // continued accumulation. 
dispatchAdmissionTimeout = 30 * time.Second ) ``` @@ -162,25 +166,54 @@ type putAdmission struct { func newPutAdmission(maxBytes int64) *putAdmission { … } -// reserve charges (bytes) units against the semaphore. Returns a -// release closure that the caller MUST call exactly once. Returns -// ErrAdmissionExhausted if the deadline elapses before the budget -// is available. -func (a *putAdmission) reserve(ctx context.Context, bytes int64) (func(), error) { … } +// peekHeadroom is admission A. It returns ErrAdmissionExhausted +// without acquiring slots when the requested byte count exceeds +// the *current* free capacity of the semaphore. It does NOT take +// out a reservation — the only effect is "fail fast at request +// entry instead of partway through the body" — so it cannot +// double-count against admission B. +func (a *putAdmission) peekHeadroom(bytes int64) error { … } + +// acquire is admission B. It blocks until (bytes / s3ChunkSize) +// slots are available or ctx fires. The returned release closure +// MUST be called exactly once. If `bytes > capacity * s3ChunkSize` +// (a malformed client whose frame exceeds the entire budget), +// returns ErrAdmissionExhausted *immediately* without waiting — +// otherwise we would block until ctx (typically +// dispatchAdmissionTimeout) for a request that can never fit. +func (a *putAdmission) acquire(ctx context.Context, bytes int64) (func(), error) { … } ``` -`reserve` is non-trivial because we need partial-charge semantics: - -1. Pre-charge `Content-Length` total at request entry (admission A). - The release fires in a deferred handler on the request goroutine - regardless of success / failure. -2. Per-batch sub-lease (admission B) is *not* a separate budget — it's - a synchronisation primitive on the existing semaphore. The PUT - handler acquires `s3ChunkSize × s3ChunkBatchOps = 4 MiB` units - before reading the next 4 MiB window from the body and releases - them on `coordinator.Dispatch` ack. 
This way the same budget covers - both the "request is in flight" and "this batch is buffered for - Raft" phases. +The two-step contract avoids the double-charge / unbounded-window +hazard the obvious "A is also an acquire" design would have: + +1. **Admission A — request-entry headroom check (peek only).** When + a PUT arrives with `Content-Length: N`, the handler calls + `peekHeadroom(N)`. If the result is `ErrAdmissionExhausted`, + reply 503 immediately and never read from the body. If `nil`, + admission A is done — no slots have been taken out of the + semaphore. This is intentionally racy with concurrent PUTs: + admission A only promises "at the moment we asked, the budget + *would have fit* this request"; it does not reserve the budget. +2. **Admission B — per-batch acquire/release (the only path that + touches the semaphore).** The PUT handler then loops the body + in `s3ChunkSize × s3ChunkBatchOps = 4 MiB` windows; each window + acquires 4 MiB worth of slots via `acquire`, reads the next + window from the body, dispatches, and releases on Dispatch + ack. This is the bound that actually holds — at any instant the + sum of held slots across all in-flight PUTs cannot exceed the + semaphore's capacity. If admission A's racy estimate turns out + to be wrong (another PUT raced in between A's check and the + first B-acquire), the first B-acquire blocks until the + contending PUT releases or `dispatchAdmissionTimeout` fires. + +The semaphore is therefore charged **only by admission B**. Bytes +in flight = `held_B_slots × s3ChunkSize`, full stop; admission A is +a fast-fail gate, not a reservation. This is the model +implementations MUST follow — a "both A and B charge" design would +double-count every PUT against itself and an admission-A-only +design would lose its bound the moment a PUT's body exceeded its +declared `Content-Length` (chunked transfers, malformed clients). 
### 3.3.1 aws-chunked transfers (`Content-Length: -1`) @@ -272,10 +305,24 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). - **Workload isolation proposal.** That doc proposes per-class CPU reservation for Raft. Admission control is the memory-axis sibling. Both are needed — limiting CPU does not bound queue depth. -- **`coordinator.Dispatch` retries.** Today the S3 path has its own - retry loop (`s3TxnRetryMaxAttempts = 8`). Admission must release - its budget around retries, otherwise a long retry chain double- - counts. +- **`coordinator.Dispatch` retries.** Today the S3 path has its + own retry loop (`s3TxnRetryMaxAttempts = 8` with exponential + backoff capped at `s3TxnRetryMaxBackoff = 32 ms`). The admission + contract is **hold-through-retry**: the per-batch slot acquired + in admission B is released exactly once, on the *final* outcome + of the retry chain (success ack, terminal error, or + `dispatchAdmissionTimeout` expiring), not between attempts. + Rationale: the bytes are still buffered in the PUT handler's + pendingBatch slice for the entire retry window, so the budget + must reflect them; a release-between-retries scheme would let a + second PUT proceed while the first is still memory-resident, + breaking the bound. The total wall-clock cost of holding through + one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`; + if that ever exceeds `dispatchAdmissionTimeout` the per-batch + acquire on the *next* batch surfaces as 503, which is the right + failure mode (chronic dispatch failure → caller learns instead of + silently consuming the budget). ## 5. Implementation plan @@ -333,10 +380,21 @@ Acceptance criteria: - **Operator confusion.** "Why does S3 return 503 when CPU is at 20%?" Mitigated by a sharp Grafana panel and a clear `Retry-After` value so SDK behaviour is predictable. 
-- **aws-chunked overcharge.** A small chunked PUT charges 5 GiB - against the budget until M4 lands. We accept this temporarily - because aws-chunked traffic is rare in practice and the - conservative cap fails safe. +- **New chunked-PUT 503 surface.** Pay-as-you-decode admission + (§3.3.1) ships in M1 alongside fixed-length admission, so the + legacy 5 GiB pre-charge hazard does not materialise as a + steady-state risk. The residual risk it introduces is the + inverse: a chunked PUT that would have silently succeeded under + the no-admission code can now 503 mid-stream when Raft drain + falls behind. This is by design — that is the failure mode + admission control exists to surface — but it is the only + client-visible behaviour change in M1 and is what operators + should expect to see in dashboards. The + `stage="perbatch", protocol="chunked"` label on the rejection + counter (§3.5) isolates the signal; the operator escape hatch + is `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`, which + reverts to bootstrap-only charging at the cost of + re-introducing the 5 GiB head-of-line hazard. ## 7. Out of scope (future work) diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index e8f0096af..ed9e1eb6e 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -168,6 +168,35 @@ follower that crashes after acking the push but before the chunkref commits is fine — the chunkref will be retried by the leader on the next attempt because it has not yet entered the Raft log. +**Behaviour during cluster shrink / partial outage.** A controlled +decommission (5 → 3 nodes) or a transient partition can leave the +leader with fewer reachable peers than `(N/2)+1` configured at the +last membership commit. 
Blocking PUTs until the configured minimum +is reachable would surface as an indefinite hang, which is worse +than the legacy "every byte through Raft" path (which fails when +Raft itself loses quorum, but otherwise succeeds). The contract is +therefore degraded availability with a hard floor: + +- If reachable peers ≥ `chunkBlobMinReplicas`, normal path: ack + after the configured minimum. +- If reachable peers < `chunkBlobMinReplicas` but ≥ `floor(N/2)+1` + (Raft quorum is intact), degrade to "as many as currently + available, but never fewer than 2 — i.e. leader + at least one + follower." Emit `s3_chunkblob_replication_degraded_total` so + operators see the degradation. This matches Raft's own behaviour + during the same window: Raft would still commit at quorum, just + with one fewer redundant ack. +- If reachable peers < 2 (leader-only durability), fail the PUT + with 503 — single-node durability is what `BlobKey`-on-Raft + already loses on leader crash, and is the regression this design + exists to prevent. + +The floor of 2 is the strict invariant: leader-only writes are the +case the design refuses to accept regardless of operator +configuration. Tuning `chunkBlobMinReplicas` higher trades PUT +availability for stronger durability; tuning lower than 2 is +rejected at config-load. + The trade-off is PUT latency: a PUT now blocks on `chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft quorum write of the chunkref. Empirically the chunkblob fsync is @@ -195,17 +224,54 @@ When a follower's apply loop sees a `chunkref` Put: `!s3|chunkblob|` once the body arrives and verifies the SHA-256 (mismatch → drop and retry from another peer). +`SourcePeer` is a **best-effort hint** captured at write time +(§3.1). It is *not* authoritative — the recorded peer may have +crashed, restarted, evicted the blob via §3.5 GC, or simply lost +the local copy to disk failure. 
Callers MUST treat +`FetchChunkBlob → NOT_FOUND` from `SourcePeer` as a normal fallback +trigger, not an error: drop to fanout against the rest of the +peer set and accept the first peer that returns the bytes (with +SHA-256 verification at the receiving end). Treating +`NOT_FOUND` as a hard failure would make a single peer's GC tick +pin clients on a bad source. + GET / range-read on the follower checks the local `chunkblob` first; if absent (because the async fetch is still pending), it either: -- proxies the read to a peer that *does* have the chunk (using the - `SourcePeer` hint and falling back to a fanout), or +- proxies the read to a peer that holds the chunk (using the + `SourcePeer` hint, falling back to fanout on `NOT_FOUND`), or - replies 503 with `Retry-After`, identical to S3's behaviour during a region failover. The choice is per-deployment; Phase-1 ships proxy-on-miss. +#### 3.3.1 GET vs. GC delete race + +Even with the §3.5 grace window, a GET that proxies to peer X for +a chunkblob can lose a race against peer X's sweeper if the +sweeper deletes the local copy *between* the GET arriving on the +caller's node and the proxy RPC reaching peer X. The blob remains +reachable globally (other peers still hold it; the chunkref is +unchanged), so the right behaviour is for the caller to fall back +to fanout. We therefore mandate: + +- `FetchChunkBlob` on a peer whose local sweeper just removed the + blob returns `NOT_FOUND`, **not** an internal error. +- The caller's GET handler treats both `NOT_FOUND` and + `INVALID_ARGUMENT (sha mismatch)` as fallback triggers and + cycles through the remaining peers in randomised order. +- If the entire peer set returns `NOT_FOUND` for an SHA whose + `chunkref` is still present, that is a genuine durability + failure (every replica including the leader's GC raced); the + GET surfaces 500 *and* the read path emits a + `s3_chunkblob_unrecoverable_total` metric so operators detect + the underlying GC bug. 
With the §3.2 quorum-write durability + and the §3.5 grace window, this case requires a coincident + failure across a quorum of nodes within a 1-hour window — vastly + more unlikely than the per-peer 404 the fanout absorbs as a + matter of course. + ### 3.4 Snapshot Today a follower snapshot dump includes every `BlobKey` Pebble has @@ -230,22 +296,58 @@ amortised across reads." ### 3.5 Garbage collection A blob whose `chunkref` has been deleted (DELETE, lifecycle policy, -object version pruned, manifest aborted) is reclaimable. We handle -this two-staged: +object version pruned, manifest aborted) is reclaimable. The +sweeper needs to know not just *that* the RC reached zero but +*when*, otherwise the documented grace window is unimplementable +(a plain counter at zero carries no time signal). We make the +"became eligible at T" fact a first-class Raft entry: -1. Reference counting via a `!s3|chunkref-rc|` counter +1. **Reference counting** via `!s3|chunkref-rc|`, a counter updated inside the same Raft txn that adds / removes a - `chunkref`. RC == 0 marks the blob as eligible. -2. A node-local sweeper periodically (e.g. every 5 minutes) - scans `!s3|chunkblob|*` and deletes blobs whose RC is 0 and - whose RC has been 0 for at least `chunkBlobGCGracePeriod` - (proposed default 1 hour). The grace window covers in-flight - reads and avoids deleting a blob a peer is just about to fetch. - -Sweeper deletion is local — it does not pass through Raft, because -the authoritative state ("is this blob unreachable?") is the -already-committed RC. Followers run independent sweepers and arrive -at the same conclusion. + `chunkref`. The atomic `(chunkref change, RC update)` pair is + the linearisation point for "this blob is now / no longer + reachable." +2. **GC eligibility queue.** When the txn that decrements an RC + would drive it to zero, the *same* txn additionally writes + `!s3|chunkblob-gc-queue|<commitTS>|<sha>` → empty.
The + commitTS-prefixed key is the time signal: the queue is + naturally sorted by eligibility-start time, and any node can + determine the grace-period boundary by scanning the queue with + `endKey = !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|`. If a + subsequent txn re-references the same SHA before the sweeper + runs (e.g. an upload reuses a content hash), that txn deletes + the queue entry as part of incrementing the RC; the queue + therefore reflects "currently RC==0" rather than "ever was zero." +3. **Node-local sweeper.** Each node runs an independent sweeper + every `chunkBlobGCInterval` (proposed default 5 minutes) that: + a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue|<now - chunkBlobGCGracePeriod>|)` + for entries whose grace window has elapsed, + b. for each `<sha>` returned, re-checks the RC counter at the + sweeper's read timestamp; if the RC is still 0, deletes the + local `!s3|chunkblob|` AND deletes the queue entry — + both as a single Raft txn so the queue stays consistent with + the keyspace, + c. if the RC has bounced above 0 in the meantime, the queue + entry is stale (a re-reference txn forgot to remove it, or + the sweeper raced) and the sweeper deletes only the queue + entry, leaving the chunkblob in place. + +The queue is the authoritative "blob is GC-eligible since T" +signal; the RC is the authoritative "is reachable" signal. Both +are Raft-replicated, so every node arrives at the same set of +sweepable SHAs and the same eligibility window. Different nodes +running sweepers concurrently is safe because step (3b) commits +through Raft and the txn fails with a write-write conflict on the +second sweeper, leaving the first sweeper's deletion authoritative. + +`chunkBlobGCGracePeriod` defaults to 1 hour.
The grace window +absorbs in-flight reads (a peer that has already started fetching +the blob completes its fetch before the sweeper runs), in-flight +upload aborts (a multipart abort flips RCs to zero, then a retry +creates a new manifest with the same chunks and bumps them back), +and clock skew between nodes (we use the Raft `commitTS` from the +RC-update txn, not wall clock, so skew is bounded to the +HLC-physical-shift the cluster already tolerates). ### 3.6 Fetch protocol @@ -392,9 +494,9 @@ capability-advertising peers exists. | Milestone | Scope | Risk | |---|---|---| | M0 | Spike: prove the chunkref + chunkblob keyspaces under a feature flag with 1 % traffic. Measure local Pebble write amp & blob fetch latency. | Low (observability only). | -| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. GET path checks chunkblob locally with proxy-on-miss. | Medium (race ordering). | -| M2 | Follower fetch protocol + async fetch worker pool. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | -| M3 | Reference-count + grace-period GC. | Medium (correctness of RC under concurrent ops). | +| M1 | PUT path emits chunkrefs through Raft; chunkblob writes go directly to local Pebble. **`FetchChunkBlob` and `PushChunkBlob` RPCs ship in this milestone** because both M1 PUT (semi-synchronous push) and M1 GET (proxy-on-miss) depend on them — without them M1 GET could only serve local-hit or 503. | Medium (race ordering, RPC plumbing on the request goroutine). | +| M2 | Async fetch worker pool for follower apply (catch-up after a long absence). Independent of M1's synchronous `FetchChunkBlob` use on the GET path. SHA verification + retry from alternate peer on mismatch. | Medium (fanout cost on snapshot apply). | +| M3 | Reference-count + grace-period GC (the queue-based scheme in §3.5). | Medium (correctness of RC under concurrent ops). 
| | M4 | Migrator: rewrite legacy `BlobKey` data in the background. Off by default until M0–M3 burn in for 30 days in production. | High (long-running batch over live traffic). | Acceptance criteria for M3 (the milestone that flips `ELASTICKV_S3_BLOB_OFFLOAD=true` by default): From 747506002296d0ef1da04c595945af754ac3715e Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:25:22 +0900 Subject: [PATCH 5/8] docs(design): address Claude bot round-4 review (1 HIGH, 3 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the local chunkblob AND the queue entry as a single Raft txn" — but chunkblob is local-only Pebble per §3.1 and the queue is Raft-replicated, so the two ops live in different storage layers and cannot share a Raft txn. Rewrite the step to specify a two-phase ordering: i. Raft phase first: delete the queue entry through Raft. Concurrent sweepers serialise here on write-write-conflict; the queue is therefore the global "we are GC-ing this SHA" lock. ii. Local phase second: delete the local chunkblob from Pebble. Document the failure mode of the inverse ordering (local-first would orphan the queue entry on crash) and of the chosen ordering (crash between phases leaves a local space leak — bounded, no correctness consequence — recoverable by a periodic orphan scan). s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph said "Both are Raft-replicated" referring to the queue and RC. That phrasing implied the chunkblob deletes were Raft-replicated too. Rewrite to explicitly distinguish: queue + RC are Raft-replicated; local chunkblob deletes are deliberately node-local because that is the whole point of the architecture. s3_admission_control.md — LOW: §4 retry-budget bound formula referenced `redisDispatchTimeout`, a Redis-path constant copy- pasted into the S3 design. 
The S3 PUT path actually uses the inbound `*http.Request` context (no S3-specific Dispatch timeout), so the formula now reads `single_dispatch_budget` with an explicit note that the upper bound is whatever the request context allows at that moment. s3_admission_control.md — LOW: §3.5 metrics spec defined only `stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade subsection both reference a `stage="perbatch", protocol="chunked"` label combination for isolating chunked-PUT rejection events. Add the `protocol="fixed-length" | "chunked"` label dimension to `elastickv_s3_put_admission_rejections_total` and `elastickv_s3_put_admission_wait_seconds`, with a brief paragraph explaining why the split is operationally meaningful (chunked HoL events vs. fixed-length client-concurrency events). No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 37 ++++++++++---- ...026_04_25_proposed_s3_raft_blob_offload.md | 48 ++++++++++++++----- 2 files changed, 65 insertions(+), 20 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index a98492706..53a482e76 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -283,10 +283,23 @@ materialise by default. ``` elastickv_s3_put_admission_inflight_bytes gauge -elastickv_s3_put_admission_rejections_total counter (label: stage="prereserve" | "perbatch") -elastickv_s3_put_admission_wait_seconds histogram (label: stage) +elastickv_s3_put_admission_rejections_total counter (labels: + stage = "prereserve" | "perbatch", + protocol = "fixed-length" | "chunked") +elastickv_s3_put_admission_wait_seconds histogram (labels: stage, protocol) ``` +The `protocol` label distinguishes fixed-length PUTs (those with a +declared `Content-Length`, hitting admission A's `peekHeadroom`) +from aws-chunked PUTs (admission via §3.3.1's pay-as-you-decode). 
+This split is what makes the chunked-PUT 503 surface (§6) and the +rolling-upgrade alerting story actionable: a spike on +`stage="perbatch", protocol="chunked"` points at "chunked clients +beat Raft drain"; a spike on `stage="prereserve", +protocol="fixed-length"` points at "client concurrency exceeds +the per-node aggregate cap." Without the dimension the two +failure modes are indistinguishable in a single counter. + Grafana panel: inflight gauge with the cap as a horizontal line so the operator sees how often the system saturates. Rejection rate suggests bumping the cap or scaling out (more nodes spreads PUT load). @@ -316,13 +329,19 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). pendingBatch slice for the entire retry window, so the budget must reflect them; a release-between-retries scheme would let a second PUT proceed while the first is still memory-resident, - breaking the bound. The total wall-clock cost of holding through - one full retry chain is bounded by - `s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)`; - if that ever exceeds `dispatchAdmissionTimeout` the per-batch - acquire on the *next* batch surfaces as 503, which is the right - failure mode (chronic dispatch failure → caller learns instead of - silently consuming the budget). + breaking the bound. The S3 PUT path uses the inbound + `*http.Request` context for `coordinator.Dispatch` (no + S3-specific Dispatch timeout — the HTTP server's + `writeTimeout` / client-side cancellation is the upper bound on + one Dispatch attempt), so the wall-clock cost of holding the + slot through one full retry chain is bounded by + `s3TxnRetryMaxAttempts × (single_dispatch_budget + s3TxnRetryMaxBackoff)` + where `single_dispatch_budget` is whatever the request context + permits at that moment. 
If the retry chain duration ever + exceeds `dispatchAdmissionTimeout` the per-batch acquire on the + *next* batch surfaces as 503 — the right failure mode + (chronic dispatch failure → caller learns instead of silently + consuming the budget). ## 5. Implementation plan diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index ed9e1eb6e..beb82e42c 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -323,22 +323,48 @@ sweeper needs to know not just *that* the RC reached zero but a. scans the queue range `[!s3|chunkblob-gc-queue|, !s3|chunkblob-gc-queue||)` for entries whose grace window has elapsed, b. for each `` returned, re-checks the RC counter at the - sweeper's read timestamp; if the RC is still 0, deletes the - local `!s3|chunkblob|` AND deletes the queue entry — - both as a single Raft txn so the queue stays consistent with - the keyspace, + sweeper's read timestamp. + + *The deletion is two-phase across two storage layers and is + NOT a single transaction* — `!s3|chunkblob|` is local + Pebble (per §3.1, never written through Raft), while + `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases + MUST run in this order: + + i. **Raft phase first.** Delete the queue entry through a + Raft txn. Concurrent sweepers serialise here on + write-write-conflict; only the winner proceeds to the + local phase. This makes the queue the global "we are + GC-ing this SHA" lock. + ii. **Local phase second.** Delete the local + `!s3|chunkblob|` from Pebble. No Raft round-trip. + + The phase ordering is the load-bearing detail. If we did + local-first then Raft, a crash between the two phases would + leave the chunkblob gone locally but the queue entry still + present — every subsequent sweep would re-attempt the local + delete (no-op) and the queue entry would never get removed + until manual intervention. 
The Raft-first ordering trades + that for the inverse failure mode: a crash between the two + phases leaves the queue entry deleted but the local + chunkblob still on disk — a **bounded local space leak, + not a correctness bug**. A periodic "orphan scan" (chunkblob + keys whose SHA has RC=0 *and* no queue entry) reclaims + these without urgency. The orphan scan runs at low priority + out of band from the sweeper. c. if the RC has bounced above 0 in the meantime, the queue entry is stale (a re-reference txn forgot to remove it, or the sweeper raced) and the sweeper deletes only the queue - entry, leaving the chunkblob in place. + entry through a Raft txn, leaving the chunkblob in place. The queue is the authoritative "blob is GC-eligible since T" -signal; the RC is the authoritative "is reachable" signal. Both -are Raft-replicated, so every node arrives at the same set of -sweepable SHAs and the same eligibility window. Different nodes -running sweepers concurrently is safe because step (3b) commits -through Raft and the txn fails with a write-write conflict on the -second sweeper, leaving the first sweeper's deletion authoritative. +signal *and* the global "we are GC-ing this SHA" lock — its +Raft-replicated single-writer-per-key property is what makes +concurrent sweepers safe across nodes. The RC is the +authoritative "is reachable" signal, also Raft-replicated. Local +chunkblob deletes are deliberately *not* replicated: each node +deletes its own copy independently after the queue-entry txn +commits, because that's the whole point of the architecture. `chunkBlobGCGracePeriod` defaults to 1 hour. 
The grace window absorbs in-flight reads (a peer that has already started fetching From fabc81e40612359841a6aba50a1f1d8bc81693ca Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:36:19 +0900 Subject: [PATCH 6/8] docs(design): address Claude bot round-5 review (2 MEDIUM, 2 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation" was ambiguous between peek and acquire. Pin it as a peek (`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition, matching admission A's contract) and rename the heading to "Bootstrap headroom check." Document why it must be a peek (an acquire would multiply per-request slot hold by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit with no corresponding payload, reintroducing the head-of-line hazard the design exists to prevent). s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB" was incoherent with the §3.3 semaphore's 1 MiB slot unit (a channel-backed semaphore can't acquire fractional slots). Clarify that the awsChunkedReader progress callback **buffers decoded bytes until a full s3ChunkSize is accumulated, then calls acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer flushes via one final acquire rounded up to one slot. Also adds `s3RaftEntryByteBudget` to §3.2's constant block (it was used throughout §3.3.1 but never defined) with a comment showing the derivation (s3ChunkSize × s3ChunkBatchOps). s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3 clusters. On a 5-node cluster Raft tolerates 2 simultaneous failures for the chunkref but the degraded chunkblob path (leader + 1 follower) tolerates only 1. 
Add an explicit note acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N` for operators who need the legacy "blob durability == Raft durability" guarantee, and clarify that the default `(N/2)+1` is sized for "match Raft quorum" not "match Raft fault tolerance" — a distinction that is invisible at N=3 and material at N≥5. s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify that the queue-entry delete is **conditional** on (a) the entry existing and (b) the RC counter still being 0 at the txn's read timestamp. An unconditional delete would silently succeed on a queue entry that a re-reference txn has just removed, then proceed to phase (3b.ii) and local-delete a chunkblob whose RC has bounced back to 1 — a correctness bug, not just a space leak. The conditional form is what makes the sweeper safe against the re-reference race. No code changes; design docs only. --- ...026_04_25_proposed_s3_admission_control.md | 82 +++++++++++++------ ...026_04_25_proposed_s3_raft_blob_offload.md | 48 +++++++++-- 2 files changed, 102 insertions(+), 28 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 53a482e76..c5c937b7d 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -132,6 +132,18 @@ const ( // per PR #617) — leaving headroom for Lua, scan buffers, and Pebble // memtables. s3PutAdmissionMaxInflightBytes = 256 << 20 // 256 MiB + // s3RaftEntryByteBudget is the per-batch unit acquired and + // released against the semaphore. It must equal the byte + // budget of one Raft entry produced by PutObject / + // UploadPart's flush loop (PR #636: s3ChunkSize × + // s3ChunkBatchOps = 1 MiB × 4 = 4 MiB minus protobuf framing + // overhead — kept abstract here so the admission contract + // does not lock the entry-size choice). 
The semaphore's + // capacity is `s3PutAdmissionMaxInflightBytes / + // s3ChunkSize` 1 MiB-units; per-batch acquire takes + // `s3RaftEntryByteBudget / s3ChunkSize` units at a time + // (= 4 units on the default tunables). + s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps // dispatchAdmissionTimeout is how long a per-batch flush will wait // for a slot before giving up. The 256 MiB cap drains in ~2 s at // 1 Gbps under steady-state Raft throughput (256 MiB / 125 MB/s), @@ -225,34 +237,58 @@ chunked stream finishes — exactly the failure mode admission control exists to prevent. We therefore split chunked admission across two mechanisms instead of pre-charging: -1. **Bootstrap reservation = `s3RaftEntryByteBudget` (4 MiB)** at - request entry. This is enough to admit the request and let the - awsChunkedReader produce its first decoded window. Chunked PUTs - are not "free" — they still must beat the same admission queue - as fixed-length PUTs at the per-batch level. +1. **Bootstrap headroom check** at request entry. Calls + `peekHeadroom(s3RaftEntryByteBudget)` — exactly the admission-A + contract: a fast-fail check that 4 MiB *would have fit* at the + moment we asked. **No slot is acquired.** This is intentionally + racy with concurrent PUTs (same as fixed-length admission A); + its job is to fail at request entry rather than partway + through the first decoded frame. Chunked PUTs are not "free" + — they still must beat the same admission queue as fixed-length + PUTs at the per-frame level. 2. **Pay-as-you-decode** thereafter, charged via an - `awsChunkedReader` progress callback. Each decoded chunk frame - (typically up to 64 KiB on the wire after framing overhead) - acquires a slot equal to the bytes about to flow into Pebble; the - slot is released once the corresponding `coordinator.Dispatch` - acks. This is the same path admission B uses for fixed-length - PUTs — chunked traffic just hooks into it incrementally instead of - pre-charging the worst case. 
+ `awsChunkedReader` progress callback. The callback **buffers + decoded bytes until a full slot unit (`s3ChunkSize = 1 MiB`) is + accumulated**, then calls `acquire(s3ChunkSize)` on the + semaphore (same path as fixed-length admission B). This keeps + the slot unit coherent: the semaphore's capacity is + `s3PutAdmissionMaxInflightBytes / s3ChunkSize` 1 MiB-units, so + acquiring at sub-MiB granularity is not representable. The + slot is released once the corresponding + `coordinator.Dispatch` acks the chunk. The buffer never holds + more than `s3ChunkSize - 1` decoded bytes, so the worst-case + memory overhead beyond the semaphore-tracked bytes is bounded + by 1 MiB per concurrent chunked PUT. Failure modes: -- If the awsChunkedReader produces frames faster than Raft drains, the - per-batch acquire blocks (capped by `dispatchAdmissionTimeout`). - Beyond that timeout, mid-stream 503 closes the connection. The - legacy "reserve 5 GiB" approach would have surfaced as 503 *at - request entry* for unrelated PUTs; this approach surfaces as - mid-stream 503 for the chunked PUT itself, which is the right - blame attribution. +- If the awsChunkedReader produces decoded bytes faster than Raft + drains, the next 1 MiB acquire blocks (capped by + `dispatchAdmissionTimeout`). Beyond that timeout, mid-stream 503 + closes the connection. The legacy "reserve 5 GiB" approach + would have surfaced as 503 *at request entry* for unrelated + PUTs; this approach surfaces as mid-stream 503 for the chunked + PUT itself, which is the right blame attribution. +- The bootstrap check at step 1 is racy: another PUT can consume + the headroom between the check and the first per-frame acquire. + When that happens the first acquire blocks (or 503s on + timeout) — the same path the fixed-length admission B handles + for the contending case. 
The race is intentional: making the + check a real reservation would multiply per-request slot hold + by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit + with no corresponding payload, reintroducing a head-of-line + hazard. +- If the awsChunkedReader produces a single frame whose decoded + size never accumulates to a full `s3ChunkSize`, the buffer + flushes on stream EOF: a final `acquire(actual_buffered_bytes)` + rounded up to one slot is taken (semaphore charges in 1-slot + units regardless of actual byte count), so the bound holds. - If the awsChunkedReader frame size ever exceeds - `s3RaftEntryByteBudget` (a malformed client), the per-batch acquire - asks for more than the cap allows and we 503 immediately — same - as a fixed-length PUT whose `Content-Length` exceeds the global - cap. + `s3RaftEntryByteBudget` (a malformed client whose decoded + cumulative output between framing acks already exceeds 4 MiB), + the first per-frame acquire asks for more than the cap allows + and we 503 immediately — same as a fixed-length PUT whose + `Content-Length` exceeds the global cap. This change moves chunked admission from M4 (originally "deferred optimisation") into M1 (the first shippable milestone). M1 ships diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index beb82e42c..270ed51af 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -197,6 +197,29 @@ configuration. Tuning `chunkBlobMinReplicas` higher trades PUT availability for stronger durability; tuning lower than 2 is rejected at config-load. +**Important durability note for N > 3 clusters.** On a 3-node +cluster the degraded floor of 2 chunkblob copies happens to match +Raft's quorum-of-2, so a single node failure is tolerated by both +the chunkref *and* the chunkblob. 
For N > 3 this is no longer +true: a 5-node cluster has Raft quorum 3 and tolerates 2 +simultaneous failures for the chunkref, but the degraded +chunkblob path (leader + 1 follower) tolerates only 1. If the +leader and the chunkblob-holding follower both fail during the +degraded window, the surviving Raft quorum elects a new leader, +finds a committed chunkref, and discovers that no surviving node +holds the chunkblob — the chunkref is durable but the object data +is lost. This is **weaker than the legacy "every byte through +Raft" path**, which loses data only when Raft itself loses quorum +(3 simultaneous failures on N=5). Operators on N > 3 clusters who +need the legacy "blob durability == Raft durability" guarantee +should configure `chunkBlobMinReplicas = N` (full replication; +trades some PUT availability — any single peer outage stalls +PUTs — for the strongest durability the cluster can offer). +The default `(N/2)+1` is sized for "match Raft quorum," not "match +Raft fault tolerance"; this distinction is invisible at N=3 but +material at N≥5 and is what makes this configuration knob +operationally meaningful. + The trade-off is PUT latency: a PUT now blocks on `chunkBlobMinReplicas - 1` follower fsyncs in addition to the Raft quorum write of the chunkref. Empirically the chunkblob fsync is @@ -331,13 +354,28 @@ sweeper needs to know not just *that* the RC reached zero but `!s3|chunkblob-gc-queue|…` is Raft-replicated. The phases MUST run in this order: - i. **Raft phase first.** Delete the queue entry through a - Raft txn. Concurrent sweepers serialise here on - write-write-conflict; only the winner proceeds to the - local phase. This makes the queue the global "we are - GC-ing this SHA" lock. + i. **Raft phase first — conditional delete.** Delete the + queue entry through a Raft txn that is **conditional on + the queue entry existing AND the RC counter still being + 0 at the txn's read timestamp**. 
The conditional form is + load-bearing: if a re-reference txn has committed + between the sweeper's queue scan and this txn (driving + RC back to 1 and atomically removing the queue entry — + see §3.1's atomic invariant), the conditional delete + fails and the sweeper aborts before reaching the local + phase. An *unconditional* delete would silently succeed + on the now-absent queue entry and let the sweeper + proceed to local-delete a chunkblob that is currently + live (RC=1) — a **correctness bug, not just a space + leak**. Concurrent sweepers also serialise on this txn + (write-write-conflict on the queue key); only the winner + proceeds. ii. **Local phase second.** Delete the local `!s3|chunkblob|` from Pebble. No Raft round-trip. + Reaching this phase implies (i) succeeded, which + implies the RC was 0 at the txn read timestamp and + remained 0 throughout the txn's commit window — i.e. + the blob is genuinely unreachable. The phase ordering is the load-bearing detail. If we did local-first then Raft, a crash between the two phases would From 1fb39ba1b31feb70ca0203770cf94f33f8a8f6bb Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:47:10 +0900 Subject: [PATCH 7/8] docs(design): address Claude bot round-5 review (1 MEDIUM, 4 LOW) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim ("PUT p99 ≈ legacy") was load-bearing but the local-write/peer-push pipeline was unspecified. Sequential ordering would silently double the per-chunk latency (`chunkBlobMinReplicas × fsync_latency`). Pin the contract: local Pebble write and PushChunkBlob fan out **concurrently**; multiple followers' pushes are **fanned out in parallel**, not sequentially. Update the flow diagram to show the pipeline explicitly and call out that this is part of the contract, not an optional optimization. 
s3_admission_control.md — LOW: §3.3.1 malformed-client failure mode said "we 503 immediately" but the accumulation design (`acquire(s3ChunkSize)` only) means the immediate-503 path is never reachable on the chunked side — the per-frame acquire is always 1 MiB, well under the 256 MiB cap. Reword to specify the actual path: successive 1 MiB acquires block under Raft pressure and the PUT eventually surfaces 503 on `dispatchAdmissionTimeout`. The "immediate 503 for oversized request" path is fixed-length only. s3_admission_control.md — LOW: §5 milestone table had M2 saying "Add `dispatchAdmissionTimeout`" but M1 already ships the chunked per-frame admission B path which is gated on it. Move the constant into M1; M2 narrows to "add fixed-length per-batch admission B + cleanup," with chunked already using the path from M1. s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3 phrasing "synchronously replicate to ≥ chunkBlobMinReplicas peers" was inconsistent with the prose's "chunkBlobMinReplicas - 1 followers." Resolved as part of the MEDIUM rewrite — the diagram now reads "PushChunkBlob to chunkBlobMinReplicas-1 followers" with parallel fan-out, matching the prose count. s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as the recovery path for "sweeper crash between Phase i and ii," but it implicitly also covers a more common scenario — chunkblobs written to local Pebble by §3.2 step 2 when the PUT then fails before `coordinator.Dispatch` is called (admission 503, client disconnect, PushChunkBlob quorum failure). In that case neither RC nor GC queue entry exists, so the sweeper never sees the orphan; only the orphan scan does. Make this case explicit so the PUT-handler abort path can rely on the orphan scan rather than needing its own best-effort local delete. No code changes; design docs only. 
--- ...026_04_25_proposed_s3_admission_control.md | 22 +++--- ...026_04_25_proposed_s3_raft_blob_offload.md | 68 +++++++++++++++---- 2 files changed, 69 insertions(+), 21 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index c5c937b7d..54bfb9dbe 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -283,12 +283,18 @@ Failure modes: flushes on stream EOF: a final `acquire(actual_buffered_bytes)` rounded up to one slot is taken (semaphore charges in 1-slot units regardless of actual byte count), so the bound holds. -- If the awsChunkedReader frame size ever exceeds - `s3RaftEntryByteBudget` (a malformed client whose decoded - cumulative output between framing acks already exceeds 4 MiB), - the first per-frame acquire asks for more than the cap allows - and we 503 immediately — same as a fixed-length PUT whose - `Content-Length` exceeds the global cap. +- A malformed client that decodes bytes faster than Raft drains + *cannot* trigger the immediate-503 path the way a fixed-length + PUT can. The accumulation design (callback always calls + `acquire(s3ChunkSize)`, never larger) means the per-frame + acquire request is bounded by 1 MiB — the + "if `bytes > capacity * s3ChunkSize`" early-return in + `acquire`'s spec is never hit on the chunked path. Instead, + successive 1 MiB acquires block under Raft pressure and the + PUT eventually surfaces 503 on `dispatchAdmissionTimeout` — + the same path a slow follower triggers. The "immediate 503 for + oversized request" failure mode applies only to fixed-length + PUTs (via `peekHeadroom(Content-Length > 256 MiB)`). This change moves chunked admission from M4 (originally "deferred optimisation") into M1 (the first shippable milestone). M1 ships @@ -383,8 +389,8 @@ suggests bumping the cap or scaling out (more nodes spreads PUT load). 
| Milestone | Scope | Risk | |---|---|---| -| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission. Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | -| M2 | Per-batch admission B inside `flushBatch` for fixed-length PUTs. Add `dispatchAdmissionTimeout`. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.) | Medium. Cleanup path on partial failure. | +| M1 | Add `putAdmission` type + per-node singleton + fixed-length `Content-Length` admission (`peekHeadroom`). Wire `prepareStreamingPutBody` to acquire / release. **aws-chunked progress-callback admission** (§3.3.1) ships in this milestone too — the conservative 5 GiB pre-charge fallback only sits behind `ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false`. **`dispatchAdmissionTimeout` ships here** (the chunked per-frame `acquire(s3ChunkSize)` path is gated on it from day one), not in M2. Metric scaffolding (gauge + counter). | Medium. Chunked progress callback needs `awsChunkedReader` to expose a hook. | +| M2 | Per-batch admission B inside `flushBatch` for **fixed-length** PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path. | Medium. Cleanup path on partial failure. | | M3 | Env-var tunables. Histogram metric. Grafana panel. | Low. | | M4 | Per-tenant / per-bucket admission classes (handed off to the workload-isolation rollout). | Medium. Out-of-scope for the v1 cap. 
| diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 270ed51af..90c7657a0 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -133,8 +133,15 @@ client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk 2. write chunk to LOCAL Pebble at !s3|chunkblob| - 3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers *** - 4. queue ChunkRef into pendingBatch + (fsync) ─┐ pipelined: bytes also stream out + 3. PushChunkBlob to ─┤ to chunkBlobMinReplicas-1 followers + followers in parallel (one RPC per follower; bytes + start flowing the moment the leader has them — not + after local fsync completes) + 4. wait until BOTH local fsync AND a quorum of follower + fsync-acks have returned (= chunkBlobMinReplicas + durable copies including the leader) + 5. queue ChunkRef into pendingBatch ─► flushBatch: coordinator.Dispatch(OperationGroup{ Elems: [ chunkref Puts ... + manifest Put ], @@ -151,13 +158,27 @@ quorum guarantees the chunkref but tells you nothing about the blob payload. We close that gap by treating the chunkblob like a mini-Raft entry of its own with **semi-synchronous quorum**: -1. Leader writes the chunkblob to local Pebble (fsync). -2. Leader pushes the chunkblob to `chunkBlobMinReplicas - 1` - followers via the `S3BlobFetch.PushChunkBlob` RPC and waits for - each follower's "fsync ack." (`PushChunkBlob` is the leader- - initiated counterpart to the follower-initiated `FetchChunkBlob` - defined in §3.6.) -3. Only after the chunkblob is durable on a quorum of nodes does +1. Leader starts writing the chunkblob to local Pebble (fsync in + flight). +2. **Concurrently** with step 1, the leader streams the chunkblob + to `chunkBlobMinReplicas - 1` followers via parallel + `S3BlobFetch.PushChunkBlob` RPCs. 
Pushes are **fanned out in + parallel**, not sequential — each follower's RPC is started + immediately, and the leader waits on a quorum of fsync-acks + rather than serially blocking on each one. (`PushChunkBlob` is + the leader-initiated counterpart to the follower-initiated + `FetchChunkBlob` defined in §3.6.) +3. The leader waits for *both* the local fsync AND a quorum of + follower fsync-acks. The dominant cost is therefore + `max(local_fsync, slowest_quorum_follower_fsync)` — typically + ≈ 10 ms on consumer SSD, equivalent to a Raft quorum write. + This is what makes the p99 latency claim below load-bearing: + if step 1 and step 2 were *sequential* (write local → then + push to followers → then wait), per-chunk latency would be + `chunkBlobMinReplicas × fsync_latency` and silently double the + PUT p99 vs. the legacy path. The pipelined / parallel model + is part of the contract, not an optimization. +4. Only after the chunkblob is durable on a quorum of nodes does the leader propose the chunkref through Raft. `chunkBlobMinReplicas` defaults to **2** on a 3-node cluster (= a @@ -386,10 +407,31 @@ sweeper needs to know not just *that* the RC reached zero but that for the inverse failure mode: a crash between the two phases leaves the queue entry deleted but the local chunkblob still on disk — a **bounded local space leak, - not a correctness bug**. A periodic "orphan scan" (chunkblob - keys whose SHA has RC=0 *and* no queue entry) reclaims - these without urgency. The orphan scan runs at low priority - out of band from the sweeper. + not a correctness bug**. A periodic "orphan scan" reclaims + these. + + The orphan scan covers two distinct sources of orphans: + + - **Sweeper crash between Phase (3b.i) and (3b.ii)** — the + case described above; queue entry was removed via Raft + but the local Pebble delete never fired. 
+ - **PUT failure before chunkref Dispatch** — chunkblob + bytes were written to local Pebble in §3.2 step 2, then + the PUT aborted before reaching `coordinator.Dispatch` + (admission control 503, client disconnect, `PushChunkBlob` + quorum failure, request context cancel). In that + scenario neither an RC entry nor a GC queue entry was + ever written, so the sweeper's queue-range scan never + sees these orphans — only the orphan scan does. + + Detection criterion (covers both): `!s3|chunkblob|` + keys whose SHA has either no RC entry at all, or RC=0 with + no corresponding queue entry. The orphan scan runs at low + priority out of band from the sweeper (proposed default + `chunkBlobOrphanScanInterval = 1 hour`); it is the safety + net behind both the sweeper crash path and the PUT-abort + cleanup path, so the PUT handler does not need its own + best-effort local-delete on the abort path. c. if the RC has bounced above 0 in the meantime, the queue entry is stale (a re-reference txn forgot to remove it, or the sweeper raced) and the sweeper deletes only the queue From aa418cef4985342f5ec22a9fa51db7f4f8bbc865 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 26 Apr 2026 04:55:50 +0900 Subject: [PATCH 8/8] docs(design): address CodeRabbit review (typo + code-fence language tags) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CodeRabbit round 1 on PR #637: - s3_raft_blob_offload.md:358 typo "elegibility" → "eligibility" (single character). - Add `text` language identifier to the six previously-bare fenced code blocks (the §1 worst-case formula in admission_control, the §3.3 client/admission flow diagram in admission_control, the §3.5 metrics block in admission_control, the §1 Raft-log flow in blob_offload, the §3.1 keyspace block in blob_offload, and the §3.2 PUT pipeline diagram in blob_offload). 
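The §3.5 orphan-detection criterion added in the hunk above ("no RC entry at all, or RC=0 with no corresponding queue entry") is mechanical enough to pin down in a few lines. A hedged Go sketch, with in-memory maps standing in for the `!s3|chunkref`, `!s3|chunkblob`, and `!s3|chunkblob-gc-queue` keyspaces and every identifier invented for illustration:

```go
package main

// state is a toy stand-in for the three keyspaces the scan consults.
type state struct {
	rc      map[string]int  // sha -> refcount (absent = PUT aborted pre-Dispatch)
	gcQueue map[string]bool // sha -> a gc-queue entry exists
	blobs   []string        // SHAs present under local !s3|chunkblob|
}

// orphans returns the chunkblob SHAs the sweeper can never reach:
// either the PUT aborted before coordinator.Dispatch (no RC entry at
// all), or the RC hit 0 but the queue entry is already gone (sweeper
// crash between the Raft phase and the local delete).
func orphans(s state) []string {
	var out []string
	for _, sha := range s.blobs {
		n, hasRC := s.rc[sha]
		if !hasRC || (n == 0 && !s.gcQueue[sha]) {
			out = append(out, sha)
		}
	}
	return out
}
```

A blob with RC=0 that still has its queue entry is deliberately excluded: that one belongs to the sweeper, and deleting it here would race the grace-period logic.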
Markdownlint flags bare fences; tagging them enables both linting compliance and consistent syntax highlighting in the rendered docs. No code changes; design docs only. --- docs/design/2026_04_25_proposed_s3_admission_control.md | 6 +++--- docs/design/2026_04_25_proposed_s3_raft_blob_offload.md | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/design/2026_04_25_proposed_s3_admission_control.md b/docs/design/2026_04_25_proposed_s3_admission_control.md index 54bfb9dbe..a153f1697 100644 --- a/docs/design/2026_04_25_proposed_s3_admission_control.md +++ b/docs/design/2026_04_25_proposed_s3_admission_control.md @@ -22,7 +22,7 @@ with `MaxSizePerMsg`, the **aggregate** in-flight memory bound is governed by client concurrency, which the server cannot currently limit: -``` +```text leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size ≈ concurrent_PUTs × MaxInflight × 4 MiB ``` @@ -105,7 +105,7 @@ Recommended: **both, at different scales**. fails with 503 mid-stream. Dispatch latency stays bounded by the semaphore size rather than by client behaviour. -``` +```text client ─[Content-Length]─► (A) reserve full body bytes │ ▼ @@ -323,7 +323,7 @@ materialise by default. ### 3.5 Metrics -``` +```text elastickv_s3_put_admission_inflight_bytes gauge elastickv_s3_put_admission_rejections_total counter (labels: stage = "prereserve" | "perbatch", diff --git a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md index 90c7657a0..3d02154a2 100644 --- a/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md +++ b/docs/design/2026_04_25_proposed_s3_raft_blob_offload.md @@ -20,7 +20,7 @@ Today every byte of an S3 object travels through the Raft log: -``` +```text HTTP PUT body ─► s3ChunkSize (1 MiB) chunks ─► s3ChunkBatchOps × 1 MiB Raft entry ─► Raft log entry (Pebble WAL on every node) @@ -102,7 +102,7 @@ themselves*. 
### 3.1 New keyspace -``` +```text !s3|chunkref|||||| → ChunkRef{ ContentSHA256 [32]byte @@ -128,7 +128,7 @@ Two separate keyspaces: ### 3.2 PUT path -``` +```text client ─► HTTP PUT body ─► chunk loop (s3ChunkSize): 1. compute SHA-256 of chunk @@ -355,7 +355,7 @@ sweeper needs to know not just *that* the RC reached zero but would drive it to zero, the *same* txn additionally writes `!s3|chunkblob-gc-queue||` → empty. The commitTS-prefixed key is the time signal: the queue is - naturally sorted by elegibility-start time, and any node can + naturally sorted by eligibility-start time, and any node can determine the grace-period boundary by scanning the queue with `endKey = !s3|chunkblob-gc-queue||`. If a subsequent txn re-references the same SHA before the sweeper