docs(design): propose S3 admission control + raft blob offload #637
Conversation
Two companion proposals to PR #636 (s3ChunkBatchOps=4, Raft entry size aligned with MaxSizePerMsg). PR #636 fixes the per-entry memory accounting; these docs cover the two follow-up axes the in-PR discussion identified:

- 2026_04_25_proposed_s3_admission_control.md — Hard cap on the aggregate S3 PUT body bytes a node accepts but has not yet committed to Raft. The per-PUT pipeline is already capped by MaxInflight × MaxSizePerMsg; this proposal bounds the multiplier from concurrent PUTs so leader-side worst-case memory has a hard ceiling regardless of client behaviour. Two-level admission (Content-Length pre-charge + per-batch slot) with HTTP 503 SlowDown rejection.
- 2026_04_25_proposed_s3_raft_blob_offload.md — Take large chunk payloads off the Raft log entirely. Raft replicates ChunkRefs (32-byte SHA256 + size) plus the manifest; chunk bytes travel via a side-channel gRPC fetch protocol between peers. Snapshot size becomes O(manifest count), follower catch-up is amortised by lazy fetch, and WAL growth no longer scales with stored bytes.

Both docs are filed as proposals only; no code changes. They follow the docs/design/README.md filename + header conventions and cross-reference each other, the prerequisite PRs (#593, #600, #612, #617, #636), and the workload isolation roadmap.
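To make the two-level admission concrete, here is a minimal Go sketch of the mechanism the first doc describes: a channel-backed byte-budget semaphore with a check-only gate at request entry and a per-batch acquire that waits for Raft to drain. The constant values mirror the tunables quoted in this thread (s3PutAdmissionMaxInflightBytes, s3RaftEntryByteBudget, dispatchAdmissionTimeout); the type and function names are illustrative assumptions, not the proposal's final API.

```go
package admission

import (
	"context"
	"errors"
	"time"
)

// Tunables named in the thread (values are the defaults it quotes).
const (
	slotBytes                = 1 << 20   // s3ChunkSize: one admission slot = 1 MiB
	maxInflightBytes         = 256 << 20 // s3PutAdmissionMaxInflightBytes
	perBatchBytes            = 4 << 20   // s3RaftEntryByteBudget = s3ChunkSize × s3ChunkBatchOps
	dispatchAdmissionTimeout = 30 * time.Second
)

// ErrSlowDown maps to an HTTP 503 SlowDown response with Retry-After.
var ErrSlowDown = errors.New("admission: node over in-flight PUT byte budget")

// PutAdmission is a channel-backed byte-budget semaphore; one slot = 1 MiB.
type PutAdmission struct {
	slots chan struct{}
}

func New() *PutAdmission {
	return &PutAdmission{slots: make(chan struct{}, maxInflightBytes/slotBytes)}
}

// PeekHeadroom is tier A: a check-only gate at request entry, charged from
// Content-Length. Nothing is acquired; it only fast-fails requests that can
// never fit, or that arrive while the node is already saturated.
func (a *PutAdmission) PeekHeadroom(contentLength int64) error {
	if contentLength > maxInflightBytes {
		return ErrSlowDown // larger than the whole budget: reject immediately
	}
	free := int64(cap(a.slots)-len(a.slots)) * slotBytes
	if contentLength > free {
		return ErrSlowDown
	}
	return nil
}

// AcquireBatch is tier B: the actual charge, taken per Raft-entry-sized
// batch (perBatchBytes) inside flushBatch. It waits up to
// dispatchAdmissionTimeout for earlier batches to drain through Raft
// before surfacing SlowDown.
func (a *PutAdmission) AcquireBatch(ctx context.Context, n int64) error {
	ctx, cancel := context.WithTimeout(ctx, dispatchAdmissionTimeout)
	defer cancel()
	need := int((n + slotBytes - 1) / slotBytes) // round up to whole slots
	for i := 0; i < need; i++ {
		select {
		case a.slots <- struct{}{}:
		case <-ctx.Done():
			a.release(i) // give back the slots already taken for this batch
			return ErrSlowDown
		}
	}
	return nil
}

// ReleaseBatch returns a batch's slots once the Raft apply (or the final
// failure of the retry chain) makes the bytes no longer memory-resident.
func (a *PutAdmission) ReleaseBatch(n int64) {
	a.release(int((n + slotBytes - 1) / slotBytes))
}

func (a *PutAdmission) release(slots int) {
	for i := 0; i < slots; i++ {
		<-a.slots
	}
}
```

The headroom peek is deliberately approximate; the byte bound itself is enforced only by the per-batch acquire, which is the contract the review rounds below converge on.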
📝 Walkthrough: Two new design documents proposing features for S3 request handling: one defines an admission control mechanism with per-node caps on in-flight PUT body bytes using two-tier checking and semaphores; the other outlines moving S3 object chunk payloads from the Raft log into a separate content-addressed keyspace replicated between peers outside Raft.
Code Review
This pull request introduces two design documents aimed at improving the scalability and stability of S3 PUT operations. The first document proposes a memory-based admission control system to prevent leader memory exhaustion, while the second outlines a strategy to offload large blob payloads from the Raft log to reduce write amplification and snapshot sizes. Review feedback highlighted a significant durability risk in the blob offload path where data could be lost if a leader fails before followers fetch the out-of-band data. Concerns were also raised regarding the admission control's handling of chunked uploads, which could cause head-of-line blocking due to an oversized pre-charge against the memory budget, and a design contradiction regarding deduplication in the storage layer.
> - **Local-blob-only on a single node.** If only one node has a given chunkblob and that node fails before peers fetch it, data is lost. Mitigation: PUT path replicates the chunkblob to *N* peers asynchronously before returning 200 (e.g. quorum write outside of Raft). N defaults to 2 (one extra copy = parity with Raft's quorum durability).
If the leader writes to its local Pebble, commits the chunkref through Raft, and is then lost entirely before any follower completes its fetch, the data is gone. Raft's assumption that "quorum agreement = durability" no longer applies to the payload itself. The proposed mitigation ("asynchronously replicate to N peers before returning 200 OK") helps, but strict durability needs a semi-synchronous mechanism: confirm the write has completed on at least one follower before the Raft commit. In addition, since this feature is a major change to the storage layout, please document whether rolling updates are possible and detail a zero-downtime cutover strategy (e.g. dual-write, logical replication).
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
- When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
> For aws-chunked transfers (`Content-Length == -1`), the request-entry charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB) reservation. The downside is that one chunked PUT can monopolise the budget; the upside is correctness without re-reading headers. We will instrument a metric to find out empirically how large that hit actually is before optimising.
Pre-charging the full s3MaxObjectSizeBytes (5 GiB) for aws-chunked transfers (Content-Length: -1) far exceeds the default s3PutAdmissionMaxInflightBytes (256 MiB). The moment one chunked upload starts, every other PUT request is rejected with 503 (head-of-line blocking). Please consider mitigations: bring at least a minimal form of the M4 "charge incrementally as bytes are decoded" scheme into M1, or carve out a dedicated buffer allowance for chunked uploads. Also consider the impact on existing operations and document rolling-update compatibility and how to soften the blast radius.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
> - No deduplication across objects. Each `(bucket, objectKey, uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does not deduplicate at this layer.
Gemini medium on PR #637 — the original "No deduplication across objects" non-goal contradicts §3.1 (chunkblob is content-addressed by SHA-256) and §3.5 (reference-counted GC). Two distinct objects whose chunks happen to hash identically *will* share a chunkblob row; that is a structural property of content addressing. Rewrite the bullet to be precise about what we are *not* promising without contradicting what the design does:

- No dedup API, metering, or billing surface is exposed.
- Reference layer (chunkref) stays per-object so DELETE / lifecycle still reason about objects independently.
- Authorisation enforcement remains on chunkref reads, never on chunkblob reads — see the new "Cross-tenant blob fetch" risk entry that pins the threat model.

Add a matching risk entry under §6 covering the SHA-256-guessing attack the dedup non-goal alludes to: feasible only if a tenant can both guess a victim's chunk SHA *and* bypass auth. We block the second by keeping S3BlobFetch internal-only and gating GET through chunkref ACL before the proxy-on-miss path is invoked. A future design that exposes blob fetch on a public surface would need to reintroduce tenant-scoped authorisation at the blob layer; this proposal explicitly does not.

No code changes — design doc only.
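A small sketch of the authorisation ordering pinned above, assuming two keyspaces shaped like the design's `!s3|chunkref|…` and `!s3|chunkblob|<SHA>` keys. The interfaces (Authorizer, RefStore, BlobStore) and the hex encoding of the SHA are stand-ins for illustration; the point is only that the ACL check happens on the per-object chunkref before the shared blob is ever resolved.

```go
package s3offload

import (
	"context"
	"encoding/hex"
	"fmt"
)

// ChunkKey identifies one chunk of one object; it is what the per-object
// reference layer is keyed by (and what DELETE / lifecycle reason about).
type ChunkKey struct {
	Bucket, Gen, ObjectKey, UploadID string
	PartNo, ChunkNo                  int
}

func (k ChunkKey) refKey() []byte {
	return []byte(fmt.Sprintf("!s3|chunkref|%s|%s|%s|%s|%d|%d",
		k.Bucket, k.Gen, k.ObjectKey, k.UploadID, k.PartNo, k.ChunkNo))
}

// chunkBlobKey is content-addressed: two objects with an identical chunk
// share this key, which is exactly the structural dedup described above.
func chunkBlobKey(sha [32]byte) []byte {
	return []byte("!s3|chunkblob|" + hex.EncodeToString(sha[:]))
}

type ChunkRef struct {
	SHA256 [32]byte
	Size   int64
}

type Authorizer interface {
	AuthorizeRead(ctx context.Context, tenant, bucket, objectKey string) error
}
type RefStore interface {
	Get(ctx context.Context, key []byte) (ChunkRef, error)
}
type BlobStore interface {
	// GetOrFetch serves the blob locally or via the internal-only fetch path.
	GetOrFetch(ctx context.Context, key []byte) ([]byte, error)
}

// getChunk shows the authorisation ordering: the ACL is enforced on the
// per-object chunkref read, and only then is the shared blob resolved.
// A guessed SHA alone never reaches the blob layer from the S3 surface.
func getChunk(ctx context.Context, acl Authorizer, refs RefStore, blobs BlobStore,
	tenant string, k ChunkKey) ([]byte, error) {

	if err := acl.AuthorizeRead(ctx, tenant, k.Bucket, k.ObjectKey); err != nil {
		return nil, err
	}
	ref, err := refs.Get(ctx, k.refKey())
	if err != nil {
		return nil, err
	}
	return blobs.GetOrFetch(ctx, chunkBlobKey(ref.SHA256))
}
```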
Gemini medium: it was indeed contradictory, as pointed out. Fixed in bfd4a56f. What the contradiction was:
Fix approach: rewrite the bullet precisely as "dedup is happening structurally, but we don't expose it," and make explicit the previously implicit assumption that the ACL is enforced at the chunkref layer. The cross-tenant SHA-guessing attack that the dedup non-goal raises is also formally added to the risk section. Changed sections:
No code changes; just an edit to one design doc file. /gemini review
Round-2 review on PR #637 — Gemini high (durability, rolling upgrade) + medium (aws-chunked head-of-line blocking).

s3_raft_blob_offload.md (Gemini high):
- §3.2: chunkblob writes now go through semi-synchronous quorum replication via a new PushChunkBlob RPC before the chunkref enters Raft. Without it, a leader crash between the chunkref commit and the eventual async fetch on followers would leave a Raft-committed manifest pointing at a chunkblob no surviving node has — Raft's quorum bound the chunkref but said nothing about the blob payload. chunkBlobMinReplicas defaults to a Raft-quorum-equivalent floor, restoring end-to-end durability parity with the legacy "every byte through Raft" path.
- §3.6: add the PushChunkBlob RPC alongside the existing FetchChunkBlob, with a Push-side fsync ack and a streaming request shape symmetric to the fetch path.
- §3.8 (new): explicit two-step rolling-upgrade strategy gated on cluster-wide capability advertisement (feature_s3_blob_offload in GetClusterOverview). PUTs only flip to the offload path when *every* peer advertises the capability; any disagreement causes the leader to route the PUT through the legacy BlobKey path. Roll-back works the same way in reverse with no data loss because both keyspaces are namespaced and reads serve whichever resolves first.
- §6: replace the optimistic "Local-blob-only on a single node" bullet with the explicit "Leader-only durability before chunkref commit" risk + the §3.2 mitigation.

s3_admission_control.md (Gemini medium):
- §3.3.1 (new): aws-chunked transfers no longer pre-charge s3MaxObjectSizeBytes (5 GiB on default tunables, 20× over the 256 MiB cap and a guaranteed head-of-line block for every other PUT). Bootstrap reservation is now s3RaftEntryByteBudget (4 MiB) and the rest is paid-as-decoded via an awsChunkedReader progress callback that hooks into the same per-batch admission the fixed-length path uses. The legacy 5 GiB pre-charge stays available behind ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false for incident response, but ships disabled by default.
- §5: chunked progress callback moves from M4 to M1 so the HoL hazard is closed in the first shippable milestone. Adds an explicit "Rolling upgrade" subsection covering mixed-cluster safety (additive cap, no protocol change) and the new observable client behaviour (mid-stream 503 SlowDown for chunked PUTs that beat Raft drain).

No code changes; design docs only.
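A hedged Go sketch of the §3.2 semi-synchronous write described above: the local Pebble write and the PushChunkBlob pushes to followers run concurrently, and the PUT proceeds to the Raft dispatch only once chunkBlobMinReplicas copies are durable. The Peer and LocalBlobStore interfaces and the RPC's Go signature are assumptions; the degradation ladder discussed later in the thread is omitted for brevity.

```go
package s3offload

import (
	"context"
	"errors"
	"fmt"
)

var errInsufficientReplicas = errors.New("chunkblob not durable on enough replicas")

// Peer abstracts the internal gRPC client; PushChunkBlob is the RPC the
// review adds (name from the thread, Go signature assumed). The ack
// implies the follower has fsynced the blob.
type Peer interface {
	PushChunkBlob(ctx context.Context, sha [32]byte, payload []byte) error
}

// LocalBlobStore is the leader's node-local Pebble blob store.
type LocalBlobStore interface {
	Put(sha [32]byte, payload []byte) error
}

// putChunkBlob is the semi-synchronous replication step: the local write
// and the follower pushes run concurrently, and the call returns as soon
// as the blob is durable on chunkBlobMinReplicas nodes (leader included).
// Only after that may the caller dispatch the chunkref to Raft.
func putChunkBlob(ctx context.Context, local LocalBlobStore, peers []Peer,
	minReplicas int, sha [32]byte, payload []byte) error {

	acks := make(chan error, len(peers)+1) // buffered: no goroutine leaks

	go func() { acks <- local.Put(sha, payload) }()
	for _, p := range peers {
		p := p
		go func() { acks <- p.PushChunkBlob(ctx, sha, payload) }()
	}

	durable := 0
	for pending := len(peers) + 1; pending > 0; pending-- {
		select {
		case err := <-acks:
			if err == nil {
				durable++
			}
			if durable >= minReplicas {
				return nil // enough blob copies exist; safe to enter Raft
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("%w: %d of %d", errInsufficientReplicas, durable, minReplicas)
}
```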
The remaining two review items addressed in 4d0707b. Gemini high — Durability & rolling upgrade (
The auto-review on PR #637 ran for 7m13s and exited successfully, but no review comment was posted. Job log on the prior run #640 showed "No buffered inline comments", suggesting the action's post path silently no-op'd. Workflow GITHUB_TOKEN had only read on pull-requests and issues, so any fallback gh post would also have failed. Grant pull-requests: write and issues: write so a future re-enable posts comments correctly. The workflow stays in .disable state. https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
@claude review |
Claude finished @bootjp's task in 3m 14s. Claude's Review of PR #637
Both documents are well-structured and clearly motivated. The three issues Gemini flagged (deduplication contradiction, durability gap, HoL blocking) have all been addressed correctly. The following review focuses on remaining concerns I found in the current state of the docs.
| Severity | Doc | Issue |
|---|---|---|
| HIGH | Admission | §6 risk "aws-chunked overcharge" contradicts §3.3.1 (fixed in M1, not M4) |
| HIGH | Blob Offload | §3.5 GC grace period unimplementable — no mechanism to record when RC hit 0 |
| MEDIUM | Admission | §3.3 A+B semaphore interaction ambiguous — potential double-charge or gate-only semantics unclear |
| MEDIUM | Admission | §4 retry budget release protocol unspecified |
| MEDIUM | Blob Offload | M1/M2 gap — proxy-on-miss without the fetch RPC |
| MEDIUM | Blob Offload | §3.5 concurrent GET + GC delete race unaddressed |
| LOW | Admission | §3.2 dispatchAdmissionTimeout comment compares wrong drain baseline |
| LOW | Admission | §3.3.1 oversized-frame path needs explicit cap-check in reserve() spec |
| LOW | Blob Offload | SourcePeer staleness not documented as hint-only |
| LOW | Blob Offload | chunkBlobMinReplicas behaviour during cluster shrink unspecified |
The two HIGH items should block merge; the MEDIUMs are worth addressing in this doc before implementation begins. The LOWs can be fixed or deferred to implementation notes.
The PR-triggered claude-code-review workflow runs on every PR but the action exits without posting a review comment (see run 24935980219 on PR #637). The mention-triggered claude.yml works correctly (verified on PR #639). Disable the auto flow until the root cause is identified. https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
## Summary
- Rename `claude-code-review.yml` to carry a `.disable` suffix, stopping the PR-triggered auto-review workflow.
- The mention-triggered `claude.yml` stays enabled.

## Background
- The auto-review ran in [run 24935980219 (PR #637)](https://github.com/bootjp/elastickv/actions/runs/24935980219/job/73021502681?pr=637) but no review comment was posted to the PR.
- Mention-triggered runs (`@claude`) were confirmed working in [a PR #639 comment](#639 (comment)).
- Keep the auto flow disabled until the root cause is understood.

## Test plan
- [ ] Confirm the `Claude Code Review` workflow does **not** trigger on this PR
- [ ] Confirm `@claude` mentions still respond, on a separate PR
- [ ] Confirm the existing Go test, lint, and CodeQL workflows are unaffected

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ

---
_Generated by [Claude Code](https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ)_
@claude review |
Claude finished @bootjp's task in 3m 48s. Claude's Review of PR #637 (Updated Pass)
Note: No commits touched either design document after my previous review (last doc change:
| Severity | Doc | Issue | Status |
|---|---|---|---|
| HIGH | Admission | §6 "aws-chunked overcharge" contradicts M1 milestone (§3.3.1) | Open |
| HIGH | Blob Offload | §3.5 GC grace period unimplementable — no timestamp when RC hit 0 | Open |
| MEDIUM | Admission | §3.3 A+B semaphore double-charge semantics unresolved | Open |
| MEDIUM | Admission | §4 retry budget release protocol not specified | Open |
| MEDIUM | Blob Offload | M1 ships proxy-on-miss without the M2 fetch RPC | Open |
| MEDIUM | Blob Offload | §3.5 GET + GC delete race unaddressed | Open |
| LOW | Admission | §3.2 timeout comment references wrong drain baseline | Open |
| LOW | Admission | §3.3.1 oversized frame "503 immediately" requires explicit cap-check in reserve() | Open |
| LOW | Blob Offload | SourcePeer staleness not documented as hint-only | Open |
| LOW | Blob Offload | chunkBlobMinReplicas during cluster shrink undefined | Open |
The two HIGH items should block merge. The MEDIUMs are worth closing before implementation starts to avoid design ambiguities carrying into code.
s3_admission_control.md — HIGH 1: stale §6 entry "aws-chunked overcharge persists until M4" contradicted §3.3.1 / §5 milestone table which now ship pay-as-you-decode in M1. Replace with the actual residual M1 risk: a chunked PUT that previously succeeded silently can now 503 mid-stream when Raft drain falls behind.

s3_admission_control.md — MEDIUM 1: §3.3 A+B semaphore semantics were ambiguous. Pin the contract: A is a check-only headroom gate (`peekHeadroom`, no acquire); B is the actual acquire/release on the semaphore. The bytes-in-flight bound is enforced by B alone; A is a fast-fail at request entry. Includes the explicit "if bytes > capacity * unitSize, fail immediately without waiting" rule for malformed oversized frames (LOW 2 from the same review).

s3_admission_control.md — MEDIUM 2: §4 retry-budget release was underspecified. Fix the contract to hold-through-retry: per-batch slot is released exactly once on the *final* outcome of the retry chain, never between attempts. The bytes are still in pendingBatch across retries, so the budget must reflect them; release-between-retries would let a second PUT proceed while the first is still memory-resident.

s3_admission_control.md — LOW 1: dispatchAdmissionTimeout comment referenced the wrong drain baseline (33 s = full MaxInflight queue, not the 256 MiB cap which drains in ~2 s). Rewrite to make clear 30 s is the budget for a transiently stalled follower to recover, not normal drain time.

s3_raft_blob_offload.md — HIGH 2: §3.5 GC grace window was unimplementable as written — a plain RC counter at zero carries no "became eligible at T" signal so the sweeper cannot enforce the 1-hour grace. Switch to a queue-based scheme: `!s3|chunkblob-gc-queue|<commitTS-nanos>|<SHA>` written in the same Raft txn that drives RC to zero. The sweeper scans the queue range older than the grace window, re-checks RC at read time, and deletes via Raft so concurrent sweepers serialise on write-write-conflict.

s3_raft_blob_offload.md — MEDIUM 3: M1 milestone shipped proxy-on-miss but the FetchChunkBlob RPC was scheduled in M2, so M1 GET could only serve local-hit or 503. Move both FetchChunkBlob and PushChunkBlob into M1 (PushChunkBlob is already needed by §3.2 quorum write); M2 narrows to the async catch-up worker pool that does not exist on the request goroutine.

s3_raft_blob_offload.md — MEDIUM 4: §3.3 had no explicit GET-vs-GC race section. Add §3.3.1 specifying that FetchChunkBlob NOT_FOUND on a peer mid-GC is a normal fallback trigger (cycle through remaining peers in randomised order) and that a fanout-wide NOT_FOUND on an SHA whose chunkref is still present is the genuine durability failure that surfaces 500 + metric.

s3_raft_blob_offload.md — LOW 3: SourcePeer was implicitly a hint but never said so. Document explicitly that SourcePeer is best-effort, callers MUST handle NOT_FOUND from FetchChunkBlob on that peer as a normal fallback to fanout, and that treating it as a hard error would pin clients on a momentarily-bad source.

s3_raft_blob_offload.md — LOW 4: chunkBlobMinReplicas behaviour during cluster shrink / partial outage was undefined. Specify the degradation ladder: ≥ minReplicas → normal; < minReplicas but Raft quorum intact → degrade to "as many as available, never < 2" with a metric; < 2 reachable peers → 503 (leader-only durability is the regression this design exists to prevent).

No code changes; design docs only.
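For the GET-vs-GC race and the SourcePeer-as-hint contract above, a minimal sketch of the fetch fallback could look like the following. The Fetcher interface and FetchChunkBlob's Go signature are assumptions; only the ordering and the error semantics follow the text.

```go
package s3offload

import (
	"context"
	"errors"
	"math/rand"
)

// ErrBlobGone is the fanout-wide miss for a still-referenced SHA: the one
// case the design treats as a genuine durability failure (500 + metric).
var ErrBlobGone = errors.New("chunkblob missing on all peers while its chunkref still exists")

// Fetcher abstracts the FetchChunkBlob RPC; found == false maps to NOT_FOUND.
type Fetcher interface {
	FetchChunkBlob(ctx context.Context, peer string, sha [32]byte) (payload []byte, found bool, err error)
}

// fetchWithFallback: SourcePeer is only a hint, a NOT_FOUND from any single
// peer (for example one that just GC'd the blob) is a normal fallback
// trigger, and the remaining peers are tried in randomised order.
func fetchWithFallback(ctx context.Context, f Fetcher, sourcePeer string,
	peers []string, sha [32]byte) ([]byte, error) {

	order := append([]string{sourcePeer}, shuffledWithout(peers, sourcePeer)...)
	for _, peer := range order {
		payload, found, err := f.FetchChunkBlob(ctx, peer, sha)
		if err != nil {
			continue // transient peer error: keep cycling
		}
		if found {
			return payload, nil
		}
		// NOT_FOUND here is expected mid-GC; fall through to the next peer.
	}
	return nil, ErrBlobGone
}

func shuffledWithout(peers []string, skip string) []string {
	out := make([]string, 0, len(peers))
	for _, p := range peers {
		if p != skip {
			out = append(out, p)
		}
	}
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}
```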
Claude bot review (2 HIGH, 4 MEDIUM, 4 LOW = 10 items), all addressed in 7e59409. Admission control
Blob offload
Both are design docs only, no code changes. The 2 HIGH items are merge blockers; closing the 4 MEDIUM items before implementation keeps ambiguity out of the code. @claude review
Claude finished @bootjp's task in 4m 12s. Claude's Review of PR #637 (Third Pass)
All 10 issues from the previous two passes are resolved. One new HIGH was found during this read, along with three LOWs. Details below. Previous issues — all resolved ✅
New issues: HIGH —
| Severity | Doc | Issue | Status |
|---|---|---|---|
| HIGH | Blob Offload | §3.5 "single Raft txn" atomicity claim for local chunkblob + Raft queue entry is architecturally impossible; needs correct two-phase ordering and orphan-on-failure documented | Open |
| LOW | Admission | §4 redisDispatchTimeout reference should be an S3-path constant | Open |
| LOW | Admission | §3.5 metrics spec missing protocol="chunked" label dimension referenced in §6 | Open |
| LOW | Blob Offload | §3.5 "Both are Raft-replicated" should say only the queue is replicated; chunkblob deletes are node-local | Open |
The HIGH item should be fixed before merge; the three LOWs are clean-up worth doing before implementation starts.
s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the
local chunkblob AND the queue entry as a single Raft txn" — but
chunkblob is local-only Pebble per §3.1 and the queue is
Raft-replicated, so the two ops live in different storage layers
and cannot share a Raft txn. Rewrite the step to specify a
two-phase ordering:
i. Raft phase first: delete the queue entry through Raft.
Concurrent sweepers serialise here on write-write-conflict;
the queue is therefore the global "we are GC-ing this SHA"
lock.
ii. Local phase second: delete the local chunkblob from Pebble.
Document the failure mode of the inverse ordering (local-first
would orphan the queue entry on crash) and of the chosen ordering
(crash between phases leaves a local space leak — bounded, no
correctness consequence — recoverable by a periodic orphan scan).
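A sketch of what the resulting sweeper loop could look like under these assumptions: the queue entries are already read in eligibility order, and the conditional Raft-side delete (the entry still exists and RC is still 0 at the txn's read timestamp, as a later round pins down) is hidden behind an assumed RaftTxn interface.

```go
package s3offload

import (
	"context"
	"time"
)

const gcGracePeriod = time.Hour

// gcQueueEntry mirrors one !s3|chunkblob-gc-queue|<commitTS-nanos>|<SHA> key.
type gcQueueEntry struct {
	SHA        [32]byte
	EnqueuedAt time.Time // commit timestamp of the txn that drove RC to zero
}

// RaftTxn hides the conditional, Raft-replicated delete. It succeeds only
// if the queue entry still exists and the SHA's refcount is still 0 at the
// txn's read timestamp; a lost write-write conflict or a re-reference
// returns (false, nil).
type RaftTxn interface {
	DeleteQueueEntryIfUnreferenced(ctx context.Context, e gcQueueEntry) (bool, error)
}

// LocalBlobDeleter deletes from the node-local Pebble blob keyspace.
type LocalBlobDeleter interface {
	DeleteBlob(sha [32]byte) error
}

// sweepOnce runs the two-phase delete: Raft phase first (the queue entry
// is the cluster-wide claim on the SHA), local phase second. A crash
// between the phases leaves only a bounded local space leak that the
// periodic orphan scan reclaims.
func sweepOnce(ctx context.Context, txn RaftTxn, local LocalBlobDeleter, queue []gcQueueEntry) error {
	cutoff := time.Now().Add(-gcGracePeriod)
	for _, e := range queue { // queue is already sorted by eligibility-start time
		if e.EnqueuedAt.After(cutoff) {
			break // everything after this is still inside the grace window
		}
		ok, err := txn.DeleteQueueEntryIfUnreferenced(ctx, e) // phase i (Raft)
		if err != nil {
			return err
		}
		if !ok {
			continue // re-referenced or claimed by another sweeper: skip
		}
		if err := local.DeleteBlob(e.SHA); err != nil { // phase ii (local)
			return err
		}
	}
	return nil
}
```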
s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph
said "Both are Raft-replicated" referring to the queue and RC.
That phrasing implied the chunkblob deletes were Raft-replicated
too. Rewrite to explicitly distinguish: queue + RC are
Raft-replicated; local chunkblob deletes are deliberately
node-local because that is the whole point of the architecture.
s3_admission_control.md — LOW: §4 retry-budget bound formula
referenced `redisDispatchTimeout`, a Redis-path constant copy-
pasted into the S3 design. The S3 PUT path actually uses the
inbound `*http.Request` context (no S3-specific Dispatch timeout),
so the formula now reads `single_dispatch_budget` with an explicit
note that the upper bound is whatever the request context allows
at that moment.
s3_admission_control.md — LOW: §3.5 metrics spec defined only
`stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade
subsection both reference a `stage="perbatch", protocol="chunked"`
label combination for isolating chunked-PUT rejection events. Add
the `protocol="fixed-length" | "chunked"` label dimension to
`elastickv_s3_put_admission_rejections_total` and
`elastickv_s3_put_admission_wait_seconds`, with a brief paragraph
explaining why the split is operationally meaningful (chunked HoL
events vs. fixed-length client-concurrency events).
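Expressed as Prometheus collectors in Go, the metric shape this implies would be roughly the following; the metric names and label values come from the doc, while the bucket layout and the helper function are assumptions.

```go
package admission

import "github.com/prometheus/client_golang/prometheus"

// Label values follow the spec: stage is "prereserve" or "perbatch",
// protocol is "fixed-length" or "chunked".
var (
	admissionRejections = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "elastickv_s3_put_admission_rejections_total",
		Help: "PUT admission rejections (503 SlowDown) by stage and protocol.",
	}, []string{"stage", "protocol"})

	admissionWait = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "elastickv_s3_put_admission_wait_seconds",
		Help:    "Time spent waiting for admission before success or rejection.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 16), // 1 ms .. ~32 s
	}, []string{"stage", "protocol"})
)

func init() {
	prometheus.MustRegister(admissionRejections, admissionWait)
}

// Example: a chunked PUT rejected at the per-batch stage after waiting.
func recordChunkedRejection(waitSeconds float64) {
	admissionRejections.WithLabelValues("perbatch", "chunked").Inc()
	admissionWait.WithLabelValues("perbatch", "chunked").Observe(waitSeconds)
}
```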
No code changes; design docs only.
Round 4 (1 HIGH + 3 LOW), all addressed in 7475060.
Both design docs only; no code changes. @claude review
Claude finished @bootjp's task in 6m 17s. Claude's Review of PR #637 (Round 4)
Previous issues — all resolved ✅
New issues: MEDIUM —
| Severity | Doc | Issue | Status |
|---|---|---|---|
| MEDIUM | Admission | §3.3.1 bootstrap reservation: peek or acquire? Release lifecycle unspecified | Open |
| MEDIUM | Blob Offload | §3.2 degraded floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3; not acknowledged | Open |
| LOW | Admission | §3.2 s3RaftEntryByteBudget undefined; §3.3.1 64 KiB frame size incoherent with 1 MiB semaphore slot unit | Open |
| LOW | Blob Offload | §3.5 Phase (3b.i) must specify conditional delete to avoid data loss on re-reference race | Open |
No HIGH issues this pass. The two MEDIUMs are worth closing before implementation starts to avoid the admission control's charging ambiguity and the cluster-shrink durability gap from silently carrying into code. The two LOWs are clean-up worth doing in the same pass.
s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation" was ambiguous between peek and acquire. Pin it as a peek (`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition, matching admission A's contract) and rename the heading to "Bootstrap headroom check." Document why it must be a peek (an acquire would multiply per-request slot hold by `concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit with no corresponding payload, reintroducing the head-of-line hazard the design exists to prevent). s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB" was incoherent with the §3.3 semaphore's 1 MiB slot unit (a channel-backed semaphore can't acquire fractional slots). Clarify that the awsChunkedReader progress callback **buffers decoded bytes until a full s3ChunkSize is accumulated, then calls acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer flushes via one final acquire rounded up to one slot. Also adds `s3RaftEntryByteBudget` to §3.2's constant block (it was used throughout §3.3.1 but never defined) with a comment showing the derivation (s3ChunkSize × s3ChunkBatchOps). s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3 clusters. On a 5-node cluster Raft tolerates 2 simultaneous failures for the chunkref but the degraded chunkblob path (leader + 1 follower) tolerates only 1. Add an explicit note acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N` for operators who need the legacy "blob durability == Raft durability" guarantee, and clarify that the default `(N/2)+1` is sized for "match Raft quorum" not "match Raft fault tolerance" — a distinction that is invisible at N=3 and material at N≥5. s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify that the queue-entry delete is **conditional** on (a) the entry existing and (b) the RC counter still being 0 at the txn's read timestamp. An unconditional delete would silently succeed on a queue entry that a re-reference txn has just removed, then proceed to phase (3b.ii) and local-delete a chunkblob whose RC has bounced back to 1 — a correctness bug, not just a space leak. The conditional form is what makes the sweeper safe against the re-reference race. No code changes; design docs only.
Round 5 (2 MEDIUM + 2 LOW), all addressed in fabc81e.
Both design docs only; no code changes. @claude review
Claude finished @bootjp's task in 5m 20s. Claude's Review of PR #637 (Round 5)
Previous issues — all resolved ✅
New issues: MEDIUM —
| Severity | Doc | Issue | Status |
|---|---|---|---|
| MEDIUM | Blob Offload | §3.2 PushChunkBlob local-write / peer-push pipeline unspecified; p99 latency equivalence claim only holds under pipelined/parallel model | Open |
| LOW | Admission | §3.3.1 malformed-client "503 immediately" description contradicts 1 MiB accumulation: actual path is timeout 503, not immediate | Open |
| LOW | Admission | §5 dispatchAdmissionTimeout added in M2 but depended on by M1's chunked per-batch admission; M1 fallback timeout unspecified | Open |
| LOW | Blob Offload | §3.2 flow diagram step 3 "≥ chunkBlobMinReplicas peers" inconsistent with prose "chunkBlobMinReplicas - 1 followers" | Open |
| LOW | Blob Offload | §3.5 orphan scan scope doesn't explicitly cover PUT-failed-before-Dispatch (no chunkref txn ever submitted) | Open |
No HIGH issues this pass. The one MEDIUM is worth closing before M1 implementation starts, since the latency-equivalence claim is load-bearing for operator buy-in and the pipelining decision affects the RPC interface design. The four LOWs are clean-up worth batching in a single pass.
s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim
("PUT p99 ≈ legacy") was load-bearing but the
local-write/peer-push pipeline was unspecified. Sequential
ordering would silently double the per-chunk latency
(`chunkBlobMinReplicas × fsync_latency`). Pin the contract:
local Pebble write and PushChunkBlob fan out **concurrently**;
multiple followers' pushes are **fanned out in parallel**, not
sequentially. Update the flow diagram to show the pipeline
explicitly and call out that this is part of the contract, not
an optional optimization.
s3_admission_control.md — LOW: §3.3.1 malformed-client failure
mode said "we 503 immediately" but the accumulation design
(`acquire(s3ChunkSize)` only) means the immediate-503 path is
never reachable on the chunked side — the per-frame acquire is
always 1 MiB, well under the 256 MiB cap. Reword to specify the
actual path: successive 1 MiB acquires block under Raft pressure
and the PUT eventually surfaces 503 on
`dispatchAdmissionTimeout`. The "immediate 503 for oversized
request" path is fixed-length only.
s3_admission_control.md — LOW: §5 milestone table had M2 saying
"Add `dispatchAdmissionTimeout`" but M1 already ships the
chunked per-frame admission B path which is gated on it. Move
the constant into M1; M2 narrows to "add fixed-length per-batch
admission B + cleanup," with chunked already using the path
from M1.
s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3
phrasing "synchronously replicate to ≥ chunkBlobMinReplicas
peers" was inconsistent with the prose's
"chunkBlobMinReplicas - 1 followers." Resolved as part of the
MEDIUM rewrite — the diagram now reads "PushChunkBlob to
chunkBlobMinReplicas-1 followers" with parallel fan-out, matching
the prose count.
s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as
the recovery path for "sweeper crash between Phase i and ii,"
but it implicitly also covers a more common scenario — chunkblobs
written to local Pebble by §3.2 step 2 when the PUT then fails
before `coordinator.Dispatch` is called (admission 503, client
disconnect, PushChunkBlob quorum failure). In that case neither
RC nor GC queue entry exists, so the sweeper never sees the
orphan; only the orphan scan does. Make this case explicit so the
PUT-handler abort path can rely on the orphan scan rather than
needing its own best-effort local delete.
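A sketch of that orphan scan under assumed store interfaces: it only deletes node-local blobs that are old enough not to belong to an in-flight PUT and that no Raft-visible refcount or gc-queue entry points at.

```go
package s3offload

import (
	"context"
	"time"
)

const orphanMinAge = 24 * time.Hour // assumption: never touch freshly written blobs

// LocalBlobIndex lists node-local chunkblobs with their local write time.
type LocalBlobIndex interface {
	List(ctx context.Context) ([]LocalBlobInfo, error)
	Delete(sha [32]byte) error
}

type LocalBlobInfo struct {
	SHA       [32]byte
	WrittenAt time.Time
}

// RefLookup answers "does any Raft-replicated refcount or gc-queue entry
// point at this SHA" with a Raft-consistent read.
type RefLookup interface {
	Referenced(ctx context.Context, sha [32]byte) (bool, error)
}

// scanOrphans reclaims local chunkblobs that no Raft-visible state points
// at: typically blobs written by a PUT that aborted before Dispatch
// (admission 503, client disconnect, push-quorum failure), plus the rarer
// sweeper-crash-between-phases leak. Deletes are node-local only; nothing
// here goes through Raft.
func scanOrphans(ctx context.Context, blobs LocalBlobIndex, refs RefLookup) error {
	infos, err := blobs.List(ctx)
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-orphanMinAge)
	for _, b := range infos {
		if b.WrittenAt.After(cutoff) {
			continue // could belong to an in-flight PUT; leave it alone
		}
		referenced, err := refs.Referenced(ctx, b.SHA)
		if err != nil {
			return err
		}
		if !referenced {
			if err := blobs.Delete(b.SHA); err != nil {
				return err
			}
		}
	}
	return nil
}
```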
No code changes; design docs only.
Round 5 (Claude bot, 19:36 UTC, 1 MEDIUM + 4 LOW), all addressed in 1fb39ba.
Both design docs only; no code changes. @claude review
Claude finished @bootjp's task in 6m 12s. Claude's Review of PR #637 (Round 6)
Previous issues — all resolved ✅
New issues: MEDIUM —
| Severity | Doc | Issue | Status |
|---|---|---|---|
| MEDIUM | Admission | §5 M1 "acquire / release" ambiguous — fixed-length PUTs may only get admission A (peekHeadroom) in M1; actual semaphore charging ships in M2 | Open |
| LOW | Admission | §3.1 flow diagram labels A as "reserve" — contradicts the peek-only semantics in §3.3 | Open |
| LOW | Admission | §3.3 acquire early-return guard is dead code under the stated design | Open |
| LOW | Blob Offload | §3.2 mid-push PushChunkBlob failure handling unspecified; degradation ladder covers pre-flight only | Open |
No HIGH issues this pass. The MEDIUM is worth closing before M1 implementation starts — an implementor who reads M1's "acquire / release" wording as including fixed-length PUTs would ship with an unintended (though non-catastrophic) change in M1 scope. The three LOWs are clean-up worth batching in one pass.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/design/2026_04_25_proposed_s3_admission_control.md (1)
25-25: Add language specifiers to code blocks. For better syntax highlighting and linting compliance, specify language identifiers on code blocks.
📝 Suggested fixes
Line 25:
Line 25: change the bare fence to ```text before the "leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size ≈ concurrent_PUTs × MaxInflight × 4 MiB" block.
Line 108: change the bare fence to ```text before the "client ─[Content-Length]─► (A) reserve full body bytes ..." diagram.
Line 326: change the bare fence to ```text before the "elastickv_s3_put_admission_inflight_bytes gauge ..." metrics block.

Also applies to: 108-108, 326-326

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In @docs/design/2026_04_25_proposed_s3_admission_control.md at line 25, several fenced code blocks lack language identifiers; update each backtick fence to include an appropriate language specifier (e.g. ```text or ```prometheus) so syntax highlighting and linting work: change the fence before the "leader-side worst-case = ..." block to ```text, the fence before the client arrow diagram to ```text, and the fence before the metrics snippet (elastickv_s3_put_admission_inflight_bytes gauge ...) to a suitable language like ```prometheus or ```text; apply the same change to the other two occurrences noted in the review so all three code blocks include language specifiers.

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md (1)
23-23: Add language specifiers to code blocks. For better syntax highlighting and linting compliance, specify language identifiers on code blocks.
📝 Suggested fixes
Line 23:
Line 23: change the bare fence to ```text before the "HTTP PUT body ─► s3ChunkSize (1 MiB) chunks ..." block.
Line 105: change the bare fence to ```text before the "!s3|chunkref|<bucket>|<gen>|<objectKey>|<uploadID>|<partNo>|<chunkNo> ..." keyspace block.
Line 131: change the bare fence to ```text before the "client ─► HTTP PUT body ..." diagram.

🤖 Prompt for AI Agents: Verify each finding against the current code and only fix it if needed. In @docs/design/2026_04_25_proposed_s3_raft_blob_offload.md at line 23, several fenced code blocks in the document lack language specifiers; update the three affected fences so they include a language identifier (e.g. ```text) to enable proper syntax highlighting and linting. Specifically, change the fence that encloses the snippet starting "HTTP PUT body ─► s3ChunkSize (1 MiB) chunks", the fence that contains the "!s3|chunkref|<bucket>|<gen>|<objectKey>|<uploadID>|<partNo>|<chunkNo>" line, and the fence around "client ─► HTTP PUT body" to use a language tag such as ```text at the opening backticks. Ensure each opening triple-backtick is followed by the chosen specifier and leave the closing backticks unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`:
- Line 358: Typo: change the word "elegibility" to "eligibility" in the phrase
"elegibility-start" so it reads "eligibility-start" (update the token
"elegibility-start" to "eligibility-start" in the sentence "naturally sorted by
elegibility-start time, and any node can").
---
Nitpick comments:
In `@docs/design/2026_04_25_proposed_s3_admission_control.md`:
- Line 25: Several fenced code blocks lack language identifiers; update each
backtick fence to include an appropriate language specifier (e.g., ```text or
```prometheus) so syntax highlighting and linting work: change the fence before
the "leader-side worst-case = ..." block to ```text, the fence before the client
arrow diagram to ```text, and the fence before the metrics snippet
(elastickv_s3_put_admission_inflight_bytes gauge ...) to a suitable language
like ```prometheus or ```text; apply the same change to the other two
occurrences noted in the review so all three code blocks include language
specifiers.
In `@docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`:
- Line 23: Several fenced code blocks in the document lack language specifiers;
update the three affected fences so they include a language identifier (e.g.,
```text) to enable proper syntax highlighting and linting. Specifically, change
the fence that encloses the snippet starting "HTTP PUT body ─► s3ChunkSize (1
MiB) chunks", the fence that contains the
"!s3|chunkref|<bucket>|<gen>|<objectKey>|<uploadID>|<partNo>|<chunkNo>" line,
and the fence around "client ─► HTTP PUT body" to use a language tag such as
```text at the opening backticks. Ensure each opening triple-backtick is
followed by the chosen specifier and leave the closing backticks unchanged.
…ags) CodeRabbit round 1 on PR #637:
- s3_raft_blob_offload.md:358 typo "elegibility" → "eligibility" (single character).
- Add `text` language identifier to the six previously-bare fenced code blocks (the §1 worst-case formula in admission_control, the §3.3 client/admission flow diagram in admission_control, the §3.5 metrics block in admission_control, the §1 Raft-log flow in blob_offload, the §3.1 keyspace block in blob_offload, and the §3.2 PUT pipeline diagram in blob_offload). Markdownlint flags bare fences; tagging them enables both linting compliance and consistent syntax highlighting in the rendered docs.

No code changes; design docs only.
CodeRabbit review (1 actionable + 2 nitpicks), all addressed in aa418ce.
@claude review |
Claude finished @bootjp's task in 4m 33s. Claude's Review of PR #637 (Round 7)
CodeRabbit items — verified resolved ✅
Round 6 issues — all still open

The latest commit ( MEDIUM —
| Severity | Doc | Issue | Status |
|---|---|---|---|
| MEDIUM | Admission | §5 M1 "acquire / release" ambiguous — fixed-length PUTs only get peekHeadroom in M1, not actual semaphore charging | Open |
| LOW | Admission | §3.1 flow diagram labels admission A as "reserve full body bytes" — contradicts §3.3 peek-only semantics | Open |
| LOW | Admission | §3.3 acquire early-return guard unreachable at all defined call sites; "malformed client" attribution is inaccurate | Open |
| LOW | Blob Offload | §3.2 mid-push PushChunkBlob failure handling unspecified; degradation ladder covers pre-flight only | Open |
No HIGH issues this pass. The MEDIUM and the three LOWs from Round 6 are all carry-forwards — the latest commit only addressed the CodeRabbit cosmetic items (typo + code fences), which are correctly resolved. The MEDIUM is worth closing before M1 implementation starts to avoid the fixed-length admission scope ambiguity propagating into code; the three LOWs can be batched in a single pass.
Summary
As a follow-up to PR #636 (aligning Raft entries with MaxSizePerMsg = 4 MiB via s3ChunkBatchOps = 4), this PR submits two design docs. Docs only, no code included.
1. docs/design/2026_04_25_proposed_s3_admission_control.md

Problem: PR #636 fixes the per-entry memory accounting, but the aggregate (= number of concurrent PUTs × each PUT's inflight) is unbounded and scales with client parallelism.
Proposal: hard-cap, with a shared semaphore, the S3 PUT body bytes a node has accepted but not yet committed to Raft.
- Pre-charge at request entry from Content-Length
- Sub-lease per batch (4 MiB) in flushBatch; 503 SlowDown once dispatchAdmissionTimeout elapses
- 503 with Retry-After → the AWS SDK retries automatically
2. docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Problem: a 5 GiB PUT is written wholesale into the WAL, so follower catch-up and snapshot transfer are O(stored bytes) and do not scale.
Proposal: Raft carries only ChunkRef{SHA256, Size} (~100 B) plus the manifest. The chunk payload is written directly to local Pebble under the separate key namespace !s3|chunkblob|<SHA>; when a follower sees a chunkref in its applyLoop, it pulls the blob from a peer with an asynchronous fetch RPC. Opt-in via ELASTICKV_S3_BLOB_OFFLOAD=true; the legacy BlobKey path keeps running in parallel. Both docs:
- Comply with the docs/design/README.md filename and header conventions (Status: Proposed / Date: 2026-04-25)

Test plan
- Filenames follow YYYY_MM_DD_proposed_<slug>.md

For review, it would help to focus on the design direction: the admission control cap values, the blob offload content-addressing granularity, and the M0–M4 milestone allocation.
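For reference, a minimal Go sketch of the follower-side apply hook described in item 2: the apply loop records the chunkref and hands a fetch job to a background worker pool instead of blocking. The env-var gate matches the summary; hasLocal, storeRef, and the job channel are placeholder plumbing, not the real store or worker API.

```go
package s3offload

import "os"

// blobOffloadEnabled mirrors the opt-in switch named above.
func blobOffloadEnabled() bool {
	return os.Getenv("ELASTICKV_S3_BLOB_OFFLOAD") == "true"
}

// fetchJob asks the background worker pool to pull one chunkblob from a
// peer; SourcePeer is a hint only.
type fetchJob struct {
	SHA        [32]byte
	SourcePeer string
}

// applyChunkRef is the follower-side apply hook: the Raft apply loop only
// records the chunkref and, if the blob is not yet local, enqueues an
// asynchronous fetch. The apply goroutine never blocks on the network;
// catch-up is amortised by the worker pool.
func applyChunkRef(sha [32]byte, sourcePeer string,
	hasLocal func([32]byte) bool, storeRef func([32]byte) error, jobs chan<- fetchJob) error {

	if err := storeRef(sha); err != nil {
		return err
	}
	if !blobOffloadEnabled() || hasLocal(sha) {
		return nil
	}
	select {
	case jobs <- fetchJob{SHA: sha, SourcePeer: sourcePeer}:
	default:
		// Queue full: drop. The proxy-on-miss GET path and the periodic
		// catch-up scan will still converge this replica eventually.
	}
	return nil
}
```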
/gemini review
@codex review
Summary by CodeRabbit
Documentation