
docs(design): propose S3 admission control + raft blob offload#637

Merged
bootjp merged 14 commits into main from docs/s3-raft-design on Apr 25, 2026
Conversation

Owner

@bootjp bootjp commented Apr 25, 2026

Summary

As a follow-up to PR #636 (s3ChunkBatchOps = 4, aligning Raft entries with MaxSizePerMsg = 4 MiB), this PR submits two design docs. No code; docs only.

1. docs/design/2026_04_25_proposed_s3_admission_control.md

Problem: PR #636 fixes the per-entry memory accounting, but the aggregate (= concurrent PUTs × in-flight bytes per PUT) is unbounded; it grows with client parallelism.
Proposal: hard-cap, behind a shared semaphore, the S3 PUT body bytes a node has accepted but not yet committed to Raft

  • (A) pre-charge the request entry using Content-Length
  • (B) sub-lease per flushBatch batch (4 MiB) — 503 SlowDown once dispatchAdmissionTimeout elapses
  • default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
  • 503 with Retry-After → AWS SDKs retry automatically
  • staged rollout M1–M4, including metrics and a Grafana panel

2. docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

Problem: a 5 GiB PUT is written to the WAL in full, so follower catch-up and snapshot transfer are O(stored bytes) and do not scale.
Proposal: Raft carries only ChunkRef{SHA256, Size} (~100 B) plus the manifest. Chunk bodies are written directly to local Pebble under the separate key namespace !s3|chunkblob|<SHA>; when a follower sees a chunkref in applyLoop, it fetches the bytes from a peer via an asynchronous fetch RPC

  • Raft load for a 1 GiB PUT: shrinks to ~100 KiB (a 10⁴× reduction)
  • snapshot size becomes O(manifest count)
  • GC via reference count + grace period
  • opt-in via ELASTICKV_S3_BLOB_OFFLOAD=true; the legacy BlobKey path keeps running alongside
  • five-stage rollout from the M0 spike to the M4 migrator

For both docs:

Test plan

  • markdown lint clean (no textlint; verified manually)
  • filenames follow the YYYY_MM_DD_proposed_<slug>.md convention
  • links to existing designs are consistent with docs/design/README.md

Review focus: please concentrate on the design direction — the admission-control cap value, the content-addressing granularity for blob offload, and the M0–M4 milestone allocation.

/gemini review
@codex review

Summary by CodeRabbit

Documentation

  • Added design proposal for S3 PUT request admission control and load management strategy
  • Added design proposal for S3 object storage optimization approach

Two companion proposals to PR #636 (s3ChunkBatchOps=4, Raft entry size
aligned with MaxSizePerMsg). PR #636 fixes the per-entry memory
accounting; these docs cover the two follow-up axes the in-PR
discussion identified:

- 2026_04_25_proposed_s3_admission_control.md
  Hard cap on the aggregate S3 PUT body bytes a node accepts but has
  not yet committed to Raft. Per-PUT pipeline is already capped by
  MaxInflight × MaxSizePerMsg; this proposal bounds the multiplier
  from concurrent PUTs so leader-side worst-case memory has a hard
  ceiling regardless of client behaviour. Two-level admission
  (Content-Length pre-charge + per-batch slot) with HTTP 503
  SlowDown rejection.

- 2026_04_25_proposed_s3_raft_blob_offload.md
  Take large chunk payloads off the Raft log entirely. Raft replicates
  ChunkRefs (32-byte SHA256 + size) plus the manifest; chunk bytes
  travel via a side-channel gRPC fetch protocol between peers. Snapshot
  size becomes O(manifest count), follower catch-up is amortised by
  lazy fetch, and WAL growth no longer scales with stored bytes.
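For orientation, the shapes Raft would carry and the local key layout could look roughly like this in Go. The field and key names follow the description above but are assumptions, not code from the repo:

```go
package blob

import "fmt"

// ChunkRef is what Raft replicates instead of the chunk bytes (~100 B).
type ChunkRef struct {
	SHA256     [32]byte // content address of the chunk payload
	Size       int64    // payload length in bytes
	SourcePeer string   // best-effort locality hint: the node that wrote the chunk
}

// chunkBlobKey is the local, non-Raft Pebble key that holds the chunk bytes.
func chunkBlobKey(sha [32]byte) []byte {
	return []byte(fmt.Sprintf("!s3|chunkblob|%x", sha))
}
```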

Both docs are filed as proposals only; no code changes. They follow
the docs/design/README.md filename + header conventions and
cross-reference each other plus the prerequisite PRs (#593, #600,
#612, #617, #636) and the workload isolation roadmap.

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 51 minutes and 20 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 51 minutes and 20 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31f43529-21b3-4bfa-a646-01610d947ee7

📥 Commits

Reviewing files that changed from the base of the PR and between 1fb39ba and aa418ce.

📒 Files selected for processing (2)
  • docs/design/2026_04_25_proposed_s3_admission_control.md
  • docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
📝 Walkthrough

Two new design documents proposing features for S3 request handling: one defines an admission control mechanism with per-node caps on in-flight PUT body bytes using two-tier checking and semaphores; the other outlines moving S3 object chunk payloads from the Raft log into a separate chunkblob keyspace with lazy-fetch semantics on followers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Design Documents: `docs/design/2026_04_25_proposed_s3_admission_control.md`, `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md` | Two comprehensive architectural designs for S3 improvements: admission control with fixed and chunked PUT request handling, two-tier checking, semaphore-based enforcement, and standardized error semantics; blob offload separating chunkref metadata replication via Raft from non-Raft chunkblob byte storage, with follower lazy-fetch, snapshot redesign, reference-counting GC, and mixed-version gating. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


🚥 Pre-merge checks — ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main changes: two new design proposals for S3 admission control and Raft blob offload. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces two design documents aimed at improving the scalability and stability of S3 PUT operations. The first document proposes a memory-based admission control system to prevent leader memory exhaustion, while the second outlines a strategy to offload large blob payloads from the Raft log to reduce write amplification and snapshot sizes. Review feedback highlighted a significant durability risk in the blob offload path where data could be lost if a leader fails before followers fetch the out-of-band data. Concerns were also raised regarding the admission control's handling of chunked uploads, which could cause head-of-line blocking due to an oversized pre-charge against the memory budget, and a design contradiction regarding deduplication in the storage layer.

Comment on lines +306 to +311
- **Local-blob-only on a single node.** If only one node has a
given chunkblob and that node fails before peers fetch it, data
is lost. Mitigation: PUT path replicates the chunkblob to *N*
peers asynchronously before returning 200 (e.g. quorum write
outside of Raft). N defaults to 2 (one extra copy = parity with
Raft's quorum durability).
Contributor


high

There is a data-loss risk if the leader writes to local Pebble, commits the chunkref through Raft, and is then lost entirely before any follower completes its fetch. Raft's premise that quorum agreement equals durability would no longer hold for the payload itself. The proposed mitigation ("asynchronously replicate to N peers before returning 200 OK") is not enough for strict durability; consider a semi-synchronous scheme that confirms the write on at least one follower before the Raft commit. Also, since this feature is a major storage-layout change, please document whether rolling updates are possible and the zero-downtime cutover strategy (e.g. dual-write, logical replication).

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
  2. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.

Comment on lines +185 to +190
For aws-chunked transfers (`Content-Length == -1`), the request-entry
charge falls back to a conservative `s3MaxObjectSizeBytes` (5 GiB)
reservation. The downside is that one chunked PUT can monopolise the
budget; the upside is correctness without re-reading headers.
We will instrument a metric to find out empirically how large that
hit actually is before optimising.
Contributor


medium

For aws-chunked (Content-Length: -1), pre-charging s3MaxObjectSizeBytes (5 GiB) in one shot far exceeds the default s3PutAdmissionMaxInflightBytes (256 MiB). The moment a single chunked upload starts, every other PUT is rejected with 503 (head-of-line blocking). Consider bringing at least the incremental charge-per-decoded-byte planned for M4 forward into M1, or adding a dedicated buffer allowance for chunked uploads. Please also document the impact on existing operations, including rolling-update compatibility and mitigations.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.

Comment on lines +77 to +79
- No deduplication across objects. Each `(bucket, objectKey,
uploadID, partNo, chunkNo)` keeps its own key. Real S3 also does
not deduplicate at this layer.
Contributor


medium

The statement "No deduplication across objects" contradicts the content addressing via !s3|chunkblob|<contentSHA256> in §3.1 and the reference-count (RC) management in §3.5. Once SHA-256 is the key, identical byte sequences are effectively deduplicated at the storage layer. Please fix the wording so the design is internally consistent.

bootjp added 2 commits April 26, 2026 00:37
Gemini medium on PR #637 — the original "No deduplication across
objects" non-goal contradicts §3.1 (chunkblob is content-addressed
by SHA-256) and §3.5 (reference-counted GC). Two distinct objects
whose chunks happen to hash identically *will* share a chunkblob
row; that is a structural property of content addressing.

Rewrite the bullet to be precise about what we are *not* promising
without contradicting what the design does:

- Expose no dedup API / metering / billing surface.
- Reference layer (chunkref) stays per-object so DELETE / lifecycle
  still reason about objects independently.
- Authorisation enforcement remains on chunkref reads, never on
  chunkblob reads — see the new "Cross-tenant blob fetch" risk
  entry that pins the threat model.

Add a matching risk entry under §6 covering the SHA-256-guessing
attack the dedup non-goal alludes to: feasible only if a tenant can
both guess a victim's chunk SHA *and* bypass auth. We block the
second by keeping S3BlobFetch internal-only and gating GET through
chunkref ACL before the proxy-on-miss path is invoked. A future
design that exposes blob fetch on a public surface would need to
reintroduce tenant-scoped authorisation at the blob layer; this
proposal explicitly does not.

No code changes — design doc only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

Gemini's medium finding is correct; the doc contradicted itself. Fixed in bfd4a56f.

The contradiction:

  • §3.1: chunkblob keys are content-addressed by <contentSHA256>
  • §3.5: GC is managed by reference count (RC)
  • → once SHA-256 is the key, chunks with identical bytes are shared at the storage layer = de facto dedup
  • yet the non-goals said "No deduplication across objects" → internal contradiction

Fix: rewrite the bullet precisely as "dedup happens structurally, but we don't expose it", and make explicit the previously implicit premise that ACLs are enforced at the chunkref layer. The cross-tenant SHA-guessing attack that dedup raises is now formally added to the risk section, pinning why the current design keeps S3BlobFetch internal-only and routes GET through the chunkref ACL.

Changes:

  • removed the old non-goal "No deduplication across objects"
  • rewrote the non-goals as: "no dedup API / metering / billing / accounting credit is exposed to users", "the reference layer (chunkref) stays per-object", "ACLs live only on the chunkref side", with a link to the §6 risk
  • added a "Cross-tenant blob fetch via SHA-256 guessing" entry to §6 (Risks), stating that the attack requires both (a) guessing the victim's chunk SHA and (b) bypassing authorisation, and that this proposal structurally closes (b)

No code changes; this edits a single design doc.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


Round-2 review on PR #637 — Gemini high (durability, rolling
upgrade) + medium (aws-chunked head-of-line blocking).

s3_raft_blob_offload.md (Gemini high):

- §3.2: chunkblob writes now go through semi-synchronous quorum
  replication via a new PushChunkBlob RPC before the chunkref
  enters Raft. Without it, a leader crash between the chunkref
  commit and the eventual async fetch on followers would leave a
  Raft-committed manifest pointing at a chunkblob no surviving
  node has — Raft's quorum bound the chunkref but said nothing
  about the blob payload. chunkBlobMinReplicas defaults to a
  Raft-quorum-equivalent floor, restoring end-to-end durability
  parity with the legacy "every byte through Raft" path.
- §3.6: add the PushChunkBlob RPC alongside the existing
  FetchChunkBlob, with a Push-side fsync ack and a streaming
  request shape symmetric to the fetch path.
- §3.8 (new): explicit two-step rolling-upgrade strategy gated on
  cluster-wide capability advertisement (feature_s3_blob_offload
  in GetClusterOverview). PUTs only flip to the offload path
  when *every* peer advertises the capability; any disagreement
  causes the leader to route the PUT through the legacy BlobKey
  path. Roll-back works the same way in reverse with no data
  loss because both keyspaces are namespaced and reads serve
  whichever resolves first.
- §6: replace the optimistic "Local-blob-only on a single node"
  bullet with the explicit "Leader-only durability before
  chunkref commit" risk + the §3.2 mitigation.

s3_admission_control.md (Gemini medium):

- §3.3.1 (new): aws-chunked transfers no longer pre-charge
  s3MaxObjectSizeBytes (5 GiB on default tunables, 20× over the
  256 MiB cap and a guaranteed head-of-line block for every other
  PUT). Bootstrap reservation is now s3RaftEntryByteBudget
  (4 MiB) and the rest is paid-as-decoded via an awsChunkedReader
  progress callback that hooks into the same per-batch admission
  the fixed-length path uses. The legacy 5 GiB pre-charge stays
  available behind ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false
  for incident response, but ships disabled by default.
- §5: chunked progress callback moves from M4 to M1 so the HoL
  hazard is closed in the first shippable milestone. Adds an
  explicit "Rolling upgrade" subsection covering mixed-cluster
  safety (additive cap, no protocol change) and the new
  observable client behaviour (mid-stream 503 SlowDown for
  chunked PUTs that beat Raft drain).

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

Addressed the remaining two review items in 4d0707b.

Gemini high — Durability & rolling upgrade (s3_raft_blob_offload.md:324)

Finding: if the leader writes the chunkblob to local Pebble, commits the chunkref through Raft, and is then lost entirely, data is gone; the Raft quorum protected only the chunkref. "Async replication to N peers" is insufficient; a semi-synchronous scheme that confirms at least one follower before the commit is needed. A rolling-upgrade strategy was also missing.

Response:

  • §3.2 list item 3 now formally adds the step: before the chunkref Raft commit, synchronously push the chunkblob to chunkBlobMinReplicas-1 peers and confirm fsync
  • §3.6 adds the corresponding PushChunkBlob RPC (leader → follower, streaming fsync acks)
  • §3.8 (new) spells out a strict two-step rolling upgrade:
    1. deploy the new binary to every node with ELASTICKV_S3_BLOB_OFFLOAD=false, soak for 24 h
    2. flip the env leader-first, then followers; the cluster switches to the offload path only once every node advertises the capability
    • capability negotiation: a feature_s3_blob_offload flag in AdminServer.GetClusterOverview; if even one node lacks it, the leader automatically falls back to the legacy path
    • roll-back: set the env back to false. Both keyspaces are namespaced, so reads are unaffected
  • §6's "Local-blob-only on a single node" is rewritten as "closed by §3.2 step 3", promoting it from open risk to mitigated risk

Durability and Raft-quorum equivalence (= the same guarantee as the legacy "every byte through Raft" path) are now stated explicitly in §3.2, together with the PUT-latency tradeoff (one extra chunkblob fsync). A sketch of the resulting write order follows.
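A minimal Go sketch of that semi-synchronous order, reusing the illustrative ChunkRef/chunkBlobKey shapes from earlier; every other name here (store, raftLog, pushToPeers, encodeChunkRef) is an assumption, not the repo's API:

```go
package blob

import "context"

type store interface {
	SetSync(key, value []byte) error // fsynced local write (Pebble with pebble.Sync)
}

type raftLog interface {
	Commit(ctx context.Context, entry []byte) error
}

type Leader struct {
	db          store
	raft        raftLog
	minReplicas int
	// pushToPeers streams the payload to n peers and returns only after each
	// has acked an fsynced write (the PushChunkBlob RPC).
	pushToPeers func(ctx context.Context, ref ChunkRef, payload []byte, n int) error
}

func (l *Leader) PutChunk(ctx context.Context, ref ChunkRef, payload []byte) error {
	// 1. Durable local write of the chunk bytes, outside Raft.
	if err := l.db.SetSync(chunkBlobKey(ref.SHA256), payload); err != nil {
		return err
	}
	// 2. Semi-synchronous step: push to minReplicas-1 peers and wait for
	//    fsync acks, so losing the leader after step 3 cannot orphan the ref.
	if err := l.pushToPeers(ctx, ref, payload, l.minReplicas-1); err != nil {
		return err
	}
	// 3. Only now does the ~100 B chunkref enter the Raft log.
	return l.raft.Commit(ctx, encodeChunkRef(ref))
}

// encodeChunkRef stands in for the real chunkref entry serialisation.
func encodeChunkRef(ref ChunkRef) []byte { return nil }
```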

Gemini medium — aws-chunked HoL blocking (s3_admission_control.md:190)

Finding: pre-charging s3MaxObjectSizeBytes=5 GiB for Content-Length: -1 is 20× the default 256 MiB cap; a single chunked PUT head-of-line blocks every other PUT into 503s.

Response:

  • §3.3.1 (new) separates chunked admission:
    • bootstrap: pre-charge only s3RaftEntryByteBudget=4 MiB
    • pay-as-decoded: charge through per-batch admission via the awsChunkedReader progress callback (the same admission-B path fixed-length PUTs use) — see the sketch after this list
  • the old "conservative 5 GiB pre-charge" survives only behind ELASTICKV_S3_PUT_ADMISSION_CHUNKED_INCREMENTAL=false, for incident response
  • §5 milestones updated: progress-callback admission moves from M4 to M1 (closing the HoL hazard in the first shippable milestone)
  • §5 gains a new "Rolling upgrade" subsection covering mixed-cluster compatibility and the newly observable 503s (a chunked PUT losing the race against Raft drain)

Both changes are design docs only; no code.

/gemini review
@codex review



bootjp pushed a commit that referenced this pull request Apr 25, 2026
The auto-review on PR #637 ran for 7m13s and exited successfully, but
no review comment was posted. Job log on the prior run #640 showed
"No buffered inline comments", suggesting the action's post path
silently no-op'd. Workflow GITHUB_TOKEN had only read on pull-requests
and issues, so any fallback gh post would also have failed.

Grant pull-requests: write and issues: write so a future re-enable
posts comments correctly. The workflow stays in .disable state.

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 3m 14s


Claude's Review of PR #637

  • Gather context and read design documents
  • Review docs/design/2026_04_25_proposed_s3_admission_control.md
  • Review docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
  • Post comprehensive review feedback

Both documents are well-structured and clearly motivated. The three issues Gemini flagged (deduplication contradiction, durability gap, HoL blocking) have all been addressed correctly. The following review focuses on remaining concerns I found in the current state of the docs.


docs/design/2026_04_25_proposed_s3_admission_control.md

HIGH — §6 Risk: "aws-chunked overcharge" is stale

The Risks section (§6) still says:

A small chunked PUT charges 5 GiB against the budget until M4 lands. We accept this temporarily because aws-chunked traffic is rare in practice…

But §3.3.1 was updated to ship pay-as-you-decode in M1, not M4. This risk item is now incorrect and will confuse future readers who find it contradicts §3.3.1 and the updated milestone table. It should be replaced with the new failure mode: "a chunked PUT can 503 mid-stream when Raft drain falls behind" (already documented in the Rolling Upgrade section — just needs to be surfaced in §6 as the residual risk).


MEDIUM — §3.3: Admission A and B interaction on the same semaphore is ambiguous

§3.3 says both A (full Content-Length pre-charge at request entry) and B (per-batch 4 MiB sub-lease) draw from the same semaphore. But the interaction is undefined:

  • If A pre-charges all 100 MiB of a 100 MiB PUT's slots into the semaphore, and then B also tries to acquire 4 MiB slots per batch from that same semaphore, those slots are already consumed — so B blocks or fails even for a legitimately admitted request.
  • If A only checks headroom without actually holding the slots (acting as a gate), then the semaphore doesn't actually bound concurrent in-flight bytes between the check and B's first acquire.

The text says "The PUT handler acquires s3ChunkSize × s3ChunkBatchOps = 4 MiB units before reading the next 4 MiB window from the body and releases them on coordinator.Dispatch ack" — this reads like B is the only thing that actually holds the semaphore, and A is a fast-path gate. If that's the intent, it should be stated explicitly: "Admission A checks whether Content-Length bytes could fit (would not exceed the global cap) without acquiring. B does the actual acquire/release cycle per batch." Otherwise the double-charge semantics need to be spelled out.

MEDIUM — §4: Retry budget release protocol unspecified

§4 correctly flags "Admission must release its budget around retries, otherwise a long retry chain double-counts." But it doesn't say how. There are two valid approaches with different tradeoffs:

  1. Release the entire pre-charge before retrying, re-acquire after backoff. Risk: another request steals the budget between release and re-acquire, starving the retry.
  2. Hold the pre-charge across retries (no release). Risk: a stuck retry chain ties up the budget for s3TxnRetryMaxAttempts × retry_latency = up to 8 × several seconds.

This is an implementation decision that affects correctness under contention; it should be pinned in the design rather than left to the implementor.

LOW — §3.2: dispatchAdmissionTimeout comment is misleading

The constant comment says:

Sized comfortably above the Raft in-flight queue's drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s ≈ 33 s)

The 33 s figure is the drain time for the full MaxInflight × MaxSizePerMsg queue, not for the 256 MiB admission cap (which would drain in ~2 s at 1 Gbps). The timeout of 30 s is intended to tolerate a genuinely stalled follower, not a "normal drain" scenario. The comment would be more accurate as: "30 s allows a transiently stalled follower to recover before we 503; at steady-state the cap drains in ~2 s at 1 Gbps."

LOW — §3.3.1: Oversized frame path needs clarification

If the awsChunkedReader frame size ever exceeds s3RaftEntryByteBudget (a malformed client), the per-batch acquire asks for more than the cap allows and we 503 immediately.

A semaphore whose capacity is maxInflightBytes / chunkSize slots will block indefinitely (until timeout) if you try to acquire more slots than capacity — it won't return an error immediately unless the implementation explicitly checks requested > capacity before waiting. "503 immediately" requires an upfront bounds check in reserve(), not just waiting for the deadline. This should be explicit in the reserve() spec.


docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

HIGH — §3.5: No mechanism to track when RC reached 0

The sweeper deletes blobs whose "RC has been 0 for at least chunkBlobGCGracePeriod (1 hour)." But !s3|chunkref-rc|<SHA> is described as a plain counter. A counter at 0 doesn't tell you when it became 0 — the sweeper can't determine the grace period has elapsed without a timestamp.

Two common approaches:

  • Queue-based: when RC drops to 0, write a !s3|chunkblob-gc-queue|<timestamp>|<SHA> entry in the same Raft txn. Sweeper processes entries older than the grace period.
  • Tombstone-with-TTL: when RC drops to 0, write a !s3|chunkref-rc|<SHA> entry with an explicit deletion time field. Sweeper reads the field, skips entries where now < deletion_time.

Neither is sketched in the current doc. The RC + grace-period GC claim is currently unimplementable as written.


MEDIUM — M1/M2 milestone gap: proxy-on-miss without fetch protocol

M1 promises "GET path checks chunkblob locally with proxy-on-miss" but M2 delivers "Follower fetch protocol + async fetch worker pool." The proxy-on-miss path in §3.3 depends on FetchChunkBlob RPC — without the fetch protocol, M1's GET can only do local-hit or 503, not proxy. Either:

  • M1 should be "503-on-miss" (not proxy-on-miss), with proxy deferred to M2, or
  • The FetchChunkBlob RPC stub ships in M1 alongside proxy-on-miss, with the async worker pool added in M2.

The current text implies M1 ships a functional proxy, which isn't consistent with the milestone scope.

MEDIUM — §3.5: Concurrent GET and GC delete race unaddressed

The sweeper deletes chunkblob entries that have been RC=0 for the grace period. However, if a GET proxies to a peer for a chunkblob at the same time the peer's sweeper is deleting it (after its local grace period elapsed), the proxy fetch fails — the blob is still reachable (from the requester's perspective) but gone from the target peer. The design says "proxy to a peer that does have the chunk" but doesn't specify how to detect or handle this case. At a minimum, the retry-from-alternate-peer path (already specified for SHA mismatch in §3.3) should be documented as the mitigation for fetch-returning-404-from-peer.

LOW — SourcePeer staleness after leadership change

ChunkRef.SourcePeer records "the leader-locality hint" — the node that originally wrote the chunk. After a leadership change, the old leader (now a follower) may have crashed, leaving SourcePeer pointing at an unavailable node. The design says followers fall back to fanout, which handles this, but worth making the staleness expectation explicit: SourcePeer is a hint, not a guarantee, and callers must handle NOT_FOUND on FetchChunkBlob from that peer by falling back to fanout.

LOW — chunkBlobMinReplicas behaviour during cluster shrink

§3.2 sets chunkBlobMinReplicas to (N/2)+1 for N-node clusters. If the cluster shrinks (e.g. from 5 to 3 nodes after a controlled decommission), the floor calculated at PUT time may exceed the number of reachable followers. Should PushChunkBlob block until the configured minimum is reachable, or degrade to "as many as currently available, ≥ 1"? Without a specified behaviour, a single node outage during a PUT could block indefinitely. The Rolling Upgrade section handles upgrades well but cluster shrink is a separate operational case worth one sentence.


Summary

| Severity | Doc | Issue |
| --- | --- | --- |
| HIGH | Admission | §6 risk "aws-chunked overcharge" contradicts §3.3.1 (fixed in M1, not M4) |
| HIGH | Blob Offload | §3.5 GC grace period unimplementable — no mechanism to record when RC hit 0 |
| MEDIUM | Admission | §3.3 A+B semaphore interaction ambiguous — potential double-charge or gate-only semantics unclear |
| MEDIUM | Admission | §4 retry budget release protocol unspecified |
| MEDIUM | Blob Offload | M1/M2 gap — proxy-on-miss without the fetch RPC |
| MEDIUM | Blob Offload | §3.5 concurrent GET + GC delete race unaddressed |
| LOW | Admission | §3.2 dispatchAdmissionTimeout comment compares the wrong drain baseline |
| LOW | Admission | §3.3.1 oversized-frame path needs an explicit cap-check in the reserve() spec |
| LOW | Blob Offload | SourcePeer staleness not documented as hint-only |
| LOW | Blob Offload | chunkBlobMinReplicas behaviour during cluster shrink unspecified |

The two HIGH items should block merge; the MEDIUMs are worth addressing in this doc before implementation begins. The LOWs can be fixed or deferred to implementation notes.

bootjp pushed a commit that referenced this pull request Apr 25, 2026
The PR-triggered claude-code-review workflow runs on every PR but the
action exits without posting a review comment (see run 24935980219 on
PR #637). The mention-triggered claude.yml works correctly (verified on
PR #639). Disable the auto flow until the root cause is identified.

https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ
bootjp added a commit that referenced this pull request Apr 25, 2026

## Summary
- Rename `claude-code-review.yml` with a `.disable` suffix, stopping the
PR-triggered auto-review workflow.
- The mention-triggered `claude.yml` remains active.

## Background
- The auto review ran in [run 24935980219 (PR
#637)](https://github.com/bootjp/elastickv/actions/runs/24935980219/job/73021502681?pr=637)
but no review comment was posted to the PR.
- Mention-triggered (`@claude`) behaviour was confirmed working in [a PR #639
comment](#639 (comment)).
- The auto flow stays disabled until the root cause is identified.

## Test plan
- [ ] Confirm the `Claude Code Review` workflow does **not** trigger on this PR
- [ ] Confirm `@claude` mentions still respond on another PR
- [ ] Confirm existing Go test / lint / CodeQL workflows are unaffected
https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ

---
_Generated by [Claude
Code](https://claude.ai/code/session_01UCYhQp5BYJRzWEbxCanHrZ)_
@bootjp
Owner Author

bootjp commented Apr 25, 2026

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 3m 48s


Claude's Review of PR #637 (Updated Pass)

  • Gather context and read CLAUDE.md
  • Read current state of both design documents
  • Check git diff for all changes since origin/main
  • Review docs/design/2026_04_25_proposed_s3_admission_control.md
  • Review docs/design/2026_04_25_proposed_s3_raft_blob_offload.md
  • Post comprehensive review feedback

Note: No commits touched either design document after my previous review (last doc change: 4d0707b at ~15:45 UTC). The findings below reflect the current state of the files on this branch — all issues from the previous review are still open.


docs/design/2026_04_25_proposed_s3_admission_control.md

HIGH — §6 aws-chunked overcharge risk entry contradicts milestone table

Lines 337–339 still read:

A small chunked PUT charges 5 GiB against the budget until M4 lands. We accept this temporarily because aws-chunked traffic is rare in practice and the conservative cap fails safe.

But §3.3.1 and the milestone table (§5) now ship pay-as-you-decode in M1, not M4. The risk entry describes a hazard that no longer exists after M1 ships; a future reader implementing M1 will find the risk and milestone table contradicting each other and may incorrectly believe the HoL hazard was accepted as a known regression.

Replace the stale entry with the residual risk M1 actually introduces: a chunked PUT that would have silently succeeded under the old "no admission" code can now 503 mid-stream when Raft drain falls behind. That new client-visible behaviour should be in §6 and is currently only buried in the Rolling Upgrade subsection.


MEDIUM — §3.3 A+B semaphore interaction semantics are still unspecified

The design describes two things happening on the same semaphore:

  • (A): "Pre-charge Content-Length total at request entry. The release fires in a deferred handler on the request goroutine."
  • (B): "The PUT handler acquires s3ChunkSize × s3ChunkBatchOps = 4 MiB units before reading the next 4 MiB window from the body and releases them on coordinator.Dispatch ack."

These two uses conflict: if A holds Content-Length / 1 MiB slots for the lifetime of the request, and B tries to acquire 4 more slots per batch from the same fixed-capacity channel (max / s3ChunkSize = 256 slots), then B is double-counting bytes already accounted for by A. Conversely, if A only checks headroom without acquiring (a fast-path gate), then the semaphore doesn't bound aggregate bytes between the check and B's first per-batch acquire — another PUT could race in and consume the budget window A was counting on.

The coherent design is probably: A is a headroom gate (check only, no acquire); B is the actual acquire/release cycle. If that's the intent, it should be stated explicitly: "Admission A reads the current semaphore occupancy and returns 503 if Content-Length > remaining capacity; it does not acquire slots. Admission B acquires and holds 4 MiB of slots per batch."
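To make that contract concrete, a minimal Go sketch of check-only A plus acquiring B; Controller, PeekHeadroom and ReserveBatch are illustrative names, not the repo's API:

```go
package admission

import (
	"context"
	"errors"
)

var ErrSlowDown = errors.New("503 SlowDown")

// Controller holds a slot semaphore: a buffered channel with one slot per
// s3ChunkSize (1 MiB) of the 256 MiB budget.
type Controller struct {
	inflight chan struct{}
}

// PeekHeadroom is admission A: a check-only gate at request entry. It never
// acquires, so it cannot double-charge bytes admission B accounts for later;
// the price is that the check is racy, exactly as the review notes.
func (c *Controller) PeekHeadroom(contentLength int64) bool {
	free := int64(cap(c.inflight)-len(c.inflight)) << 20 // free slots × 1 MiB
	return contentLength <= free
}

// ReserveBatch is admission B: the only path that charges the semaphore.
// The slots stay held until the coordinator.Dispatch ack calls release.
func (c *Controller) ReserveBatch(ctx context.Context, slots int) (release func(), err error) {
	for i := 0; i < slots; i++ {
		select {
		case c.inflight <- struct{}{}:
		case <-ctx.Done():
			for j := 0; j < i; j++ { // roll back a partial acquire
				<-c.inflight
			}
			return nil, ErrSlowDown
		}
	}
	return func() {
		for i := 0; i < slots; i++ {
			<-c.inflight
		}
	}, nil
}
```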

MEDIUM — §4 retry budget release protocol still unspecified

The coordinator.Dispatch retries bullet says:

Admission must release its budget around retries, otherwise a long retry chain double-counts.

Two approaches have opposite tradeoffs — release-before-retry (other requests can steal the budget) vs. hold-through-retry (stalled chain locks budget for up to 8 × retry_latency). This is an implementation decision that affects correctness under contention and should be pinned in the design. A one-sentence choice here prevents implementors from independently picking the wrong approach.

LOW — §3.2 dispatchAdmissionTimeout comment references wrong drain baseline

The comment says:

Sized comfortably above the Raft in-flight queue's drain time at 1 Gbps (1024 × 4 MiB / 125 MB/s ≈ 33 s)

33 s is the drain time for the full MaxInflight × MaxSizePerMsg queue — not for the 256 MiB admission cap, which drains in ~2 s at 1 Gbps. The 30 s timeout is designed to tolerate a genuinely stalled follower, not a normal drain. The comment would be accurate as: "30 s allows a transiently stalled follower to recover; at steady-state the 256 MiB cap drains in ~2 s at 1 Gbps."

LOW — §3.3.1 oversized-frame path requires an explicit cap-check in reserve() spec

If the awsChunkedReader frame size ever exceeds s3RaftEntryByteBudget (a malformed client), the per-batch acquire asks for more than the cap allows and we 503 immediately.

A buffered-channel semaphore whose capacity is max / s3ChunkSize slots will block until dispatchAdmissionTimeout, not return immediately, when you try to acquire more slots than the channel's capacity — unless the implementation explicitly checks requested > capacity before waiting. The reserve() spec should include: "if bytes > capacity * unitSize, return ErrAdmissionExhausted immediately without waiting."
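A sketch of that guard, assuming the illustrative Controller from the earlier sketch (ErrSlowDown stands in for the doc's ErrAdmissionExhausted):

```go
// reserveBytes adds the upfront bounds check: an unsatisfiable acquire fails
// at once instead of burning the 30 s dispatchAdmissionTimeout deadline.
func (c *Controller) reserveBytes(ctx context.Context, bytes int64) (func(), error) {
	slots := int((bytes + (1 << 20) - 1) >> 20) // round up to 1 MiB slots
	if slots > cap(c.inflight) {
		return nil, ErrSlowDown // larger than total capacity: fail fast
	}
	return c.ReserveBatch(ctx, slots)
}
```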


docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

HIGH — §3.5 GC sweeper has no mechanism to determine when RC reached 0

The sweeper (§3.5 item 2) deletes blobs "whose RC has been 0 for at least chunkBlobGCGracePeriod (1 hour)." But !s3|chunkref-rc|<SHA> is a plain counter — a counter at 0 carries no timestamp. The sweeper cannot determine whether the grace period has elapsed without additional state.

This makes the grace-period GC unimplementable as written. Two standard approaches:

  • GC queue: when RC drops to 0, write !s3|chunkblob-gc-queue|<timestamp-nanos>|<SHA> in the same Raft txn. Sweeper scans entries older than chunkBlobGCGracePeriod and deletes the corresponding chunkblob.
  • Tombstone with deletion time: when RC drops to 0, overwrite the RC entry with a struct {count: 0, eligible_after: time.Now().Add(gracePeriod)}. Sweeper reads the field and skips entries where now < eligible_after.

One of these (or a similar equivalent) must be specified before this design can be implemented correctly.
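For the queue-based option, an order-preserving key encoding could look like this; the key layout follows the bullet above, the helper names are assumptions:

```go
package blob

import (
	"encoding/binary"
	"fmt"
	"time"
)

// gcQueueKey encodes the instant RC reached zero as big-endian nanos, so the
// hex key sorts chronologically and a sweeper can range-scan by age.
func gcQueueKey(becameZero time.Time, sha [32]byte) []byte {
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(becameZero.UnixNano()))
	return []byte(fmt.Sprintf("!s3|chunkblob-gc-queue|%x|%x", ts, sha))
}

// sweepUpperBound is the exclusive end key for "eligible now": every queue
// entry whose timestamp is older than now-grace sorts below it.
func sweepUpperBound(now time.Time, grace time.Duration) []byte {
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(now.Add(-grace).UnixNano()))
	return []byte(fmt.Sprintf("!s3|chunkblob-gc-queue|%x|", ts))
}
```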


MEDIUM — M1 ships proxy-on-miss but the fetch RPC is M2 scope

§5 milestone table:

  • M1: "GET path checks chunkblob locally with proxy-on-miss"
  • M2: "Follower fetch protocol + async fetch worker pool"

Proxy-on-miss depends on S3BlobFetch.FetchChunkBlob RPC (§3.6). Without that RPC, M1 GET can only serve local-hit or 503 — not proxy. Either M1 should be "503-on-miss" (not proxy), with proxy promoted to M2, or the FetchChunkBlob RPC stub explicitly ships in M1. The current text implies M1 delivers a functional proxy that cannot actually be built from M1 scope alone.

MEDIUM — §3.5 concurrent GET + GC delete race is unaddressed

A GET proxying to peer X for a chunkblob can lose a race against peer X's sweeper, which deletes the blob after its local grace period expires. The blob is still reachable via chunkref (another peer may hold it), but the proxied fetch to that specific peer returns 404. §3.3 specifies retry-from-alternate-peer on SHA mismatch; the same retry path should be explicitly documented for FetchChunkBlob returning 404 (blob GC'd on that peer), to make clear this case is handled rather than a silent data-loss path.

LOW — SourcePeer staleness should be documented as hint-only

After a leadership change, the node recorded in ChunkRef.SourcePeer may have crashed or restarted. Followers fall back to fanout (correctly), but it should be stated explicitly: "SourcePeer is a best-effort hint set at write time and may be stale; callers MUST handle NOT_FOUND from FetchChunkBlob on that peer and fall back to fanout." Without this, an implementor may treat a 404 from SourcePeer as unexpected rather than a normal fallback trigger.
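A sketch of the hint-then-fanout fetch this implies; FetchChunkBlob is the doc's RPC, everything else here is illustrative glue:

```go
package blob

import (
	"context"
	"errors"
	"math/rand"
)

var ErrBlobNotFound = errors.New("chunkblob not found on any peer")

type fetcher interface {
	FetchChunkBlob(ctx context.Context, peer string, sha [32]byte) ([]byte, error)
}

func fetchWithFallback(ctx context.Context, f fetcher, ref ChunkRef, peers []string) ([]byte, error) {
	// SourcePeer is a best-effort, write-time hint: NOT_FOUND (or any error)
	// here is a normal fallback trigger, not a hard failure.
	if b, err := f.FetchChunkBlob(ctx, ref.SourcePeer, ref.SHA256); err == nil {
		return b, nil
	}
	// Fan out over the remaining peers in randomised order; a peer that
	// GC'd its copy mid-GET simply falls through to the next candidate.
	rand.Shuffle(len(peers), func(i, j int) { peers[i], peers[j] = peers[j], peers[i] })
	for _, p := range peers {
		if p == ref.SourcePeer {
			continue
		}
		if b, err := f.FetchChunkBlob(ctx, p, ref.SHA256); err == nil {
			return b, nil
		}
	}
	// A fanout-wide miss on a still-referenced chunkref is the genuine
	// durability failure: the caller surfaces 500 + metric.
	return nil, ErrBlobNotFound
}
```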

LOW — chunkBlobMinReplicas behaviour during cluster shrink is undefined

§3.2 sets chunkBlobMinReplicas to (N/2)+1. During a controlled decommission (e.g. 5 → 3 nodes), the floor computed at write time may exceed the number of reachable live followers. Should PushChunkBlob block until the configured minimum is reachable, degrade gracefully to "as many as currently available, ≥ 1", or fail the PUT? An unanswered question here means a single node outage during a shrink window could block PUTs indefinitely. One sentence specifying the degradation behaviour (and whether it requires operator intervention) is sufficient.


Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| HIGH | Admission | §6 "aws-chunked overcharge" contradicts the M1 milestone (§3.3.1) | Open |
| HIGH | Blob Offload | §3.5 GC grace period unimplementable — no timestamp when RC hit 0 | Open |
| MEDIUM | Admission | §3.3 A+B semaphore double-charge semantics unresolved | Open |
| MEDIUM | Admission | §4 retry budget release protocol not specified | Open |
| MEDIUM | Blob Offload | M1 ships proxy-on-miss without the M2 fetch RPC | Open |
| MEDIUM | Blob Offload | §3.5 GET + GC delete race unaddressed | Open |
| LOW | Admission | §3.2 timeout comment references the wrong drain baseline | Open |
| LOW | Admission | §3.3.1 oversized frame "503 immediately" requires an explicit cap-check in reserve() | Open |
| LOW | Blob Offload | SourcePeer staleness not documented as hint-only | Open |
| LOW | Blob Offload | chunkBlobMinReplicas during cluster shrink undefined | Open |

The two HIGH items should block merge. The MEDIUMs are worth closing before implementation starts to avoid design ambiguities carrying into code.


bootjp added 2 commits April 26, 2026 03:59
s3_admission_control.md — HIGH 1: stale §6 entry "aws-chunked
overcharge persists until M4" contradicted §3.3.1 / §5 milestone
table which now ship pay-as-you-decode in M1. Replace with the
actual residual M1 risk: a chunked PUT that previously succeeded
silently can now 503 mid-stream when Raft drain falls behind.

s3_admission_control.md — MEDIUM 1: §3.3 A+B semaphore semantics
were ambiguous. Pin the contract: A is a check-only headroom gate
(`peekHeadroom`, no acquire); B is the actual acquire/release on
the semaphore. The bytes-in-flight bound is enforced by B alone;
A is a fast-fail at request entry. Includes the explicit
"if bytes > capacity * unitSize, fail immediately without waiting"
rule for malformed oversized frames (LOW 2 from the same review).

s3_admission_control.md — MEDIUM 2: §4 retry-budget release was
underspecified. Fix the contract to hold-through-retry: per-batch
slot is released exactly once on the *final* outcome of the retry
chain, never between attempts. The bytes are still in pendingBatch
across retries, so the budget must reflect them; release-between-
retries would let a second PUT proceed while the first is still
memory-resident.

s3_admission_control.md — LOW 1: dispatchAdmissionTimeout comment
referenced the wrong drain baseline (33 s = full MaxInflight queue,
not the 256 MiB cap which drains in ~2 s). Rewrite to make clear
30 s is the budget for a transiently stalled follower to recover,
not normal drain time.

s3_raft_blob_offload.md — HIGH 2: §3.5 GC grace window was
unimplementable as written — a plain RC counter at zero carries no
"became eligible at T" signal so the sweeper cannot enforce the
1-hour grace. Switch to a queue-based scheme:
`!s3|chunkblob-gc-queue|<commitTS-nanos>|<SHA>` written in the same
Raft txn that drives RC to zero. The sweeper scans the queue range
older than the grace window, re-checks RC at read time, and
deletes via Raft so concurrent sweepers serialise on
write-write-conflict.

s3_raft_blob_offload.md — MEDIUM 3: M1 milestone shipped
proxy-on-miss but the FetchChunkBlob RPC was scheduled in M2,
so M1 GET could only serve local-hit or 503. Move both
FetchChunkBlob and PushChunkBlob into M1 (PushChunkBlob is
already needed by §3.2 quorum write); M2 narrows to the async
catch-up worker pool that does not exist on the request goroutine.

s3_raft_blob_offload.md — MEDIUM 4: §3.3 had no explicit
GET-vs-GC race section. Add §3.3.1 specifying that
FetchChunkBlob NOT_FOUND on a peer mid-GC is a normal fallback
trigger (cycle through remaining peers in randomised order) and
that a fanout-wide NOT_FOUND on an SHA whose chunkref is still
present is the genuine durability failure that surfaces 500 +
metric.

s3_raft_blob_offload.md — LOW 3: SourcePeer was implicitly a hint
but never said so. Document explicitly that SourcePeer is best-
effort, callers MUST handle NOT_FOUND from FetchChunkBlob on that
peer as a normal fallback to fanout, and that treating it as a
hard error would pin clients on a momentarily-bad source.

s3_raft_blob_offload.md — LOW 4: chunkBlobMinReplicas behaviour
during cluster shrink / partial outage was undefined. Specify the
degradation ladder: ≥ minReplicas → normal; < minReplicas but
Raft quorum intact → degrade to "as many as available, never
< 2" with a metric; < 2 reachable peers → 503 (leader-only
durability is the regression this design exists to prevent).

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

All 10 items from the Claude bot review (2 HIGH, 4 MEDIUM, 4 LOW) are addressed in 7e59409.

Admission control

| Severity | Issue | Fix |
| --- | --- | --- |
| HIGH | §6 "aws-chunked overcharge persists until M4" contradicted §3.3.1 / §5 | Replaced the §6 entry: the old 5 GiB pre-charge hazard disappears in M1; the new residual M1 signal is a chunked PUT that can 503 mid-stream when Raft drain falls behind |
| MEDIUM | §3.3 admission A+B double-charge concern | Pinned a two-phase contract: A = peekHeadroom (check-only, no acquire), B = acquire/release that charges the semaphore. Only B enforces the in-flight bound |
| MEDIUM | §4 retry budget release protocol undefined | Settled on hold-through-retry: the slot is released exactly once, on the final outcome of the retry chain. The bytes stay in pendingBatch across retries, so the budget keeps reflecting them (see the sketch below) |
| LOW | §3.2 dispatchAdmissionTimeout comment conflated drain times (33 s ≠ the 256 MiB cap drain) | Rewrote as "30 s is the budget for a stalled follower to recover; the 256 MiB cap drains in ~2 s" |
| LOW | §3.3.1 "503 immediately" for oversized frames needs a reserve() spec | Added to the acquire spec: if bytes > capacity × unitSize, return ErrAdmissionExhausted immediately without waiting on ctx |
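A minimal sketch of that hold-through-retry contract; dispatch, maxAttempts and the wiring are illustrative, not the repo's API:

```go
package admission

import "context"

// dispatchWithRetry releases the per-batch budget exactly once, on the final
// outcome of the retry chain, never between attempts: the batch bytes stay
// resident in pendingBatch for the whole chain, so the budget must reflect them.
func dispatchWithRetry(ctx context.Context, release func(), dispatch func(context.Context) error, maxAttempts int) error {
	defer release()
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = dispatch(ctx); err == nil {
			return nil
		}
	}
	return err
}
```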

Blob offload

| Severity | Issue | Fix |
| --- | --- | --- |
| HIGH | §3.5's 1 h GC grace is unimplementable with a plain RC counter (no timestamp) | Queue-based scheme: a `!s3\|chunkblob-gc-queue\|<commitTS-nanos>\|<SHA>` entry is written in the same Raft txn that drives RC to zero; the sweeper scans entries older than the grace window |
| MEDIUM | M1 ships proxy-on-miss while the FetchChunkBlob RPC sat in M2 | Moved FetchChunkBlob and PushChunkBlob up to M1 (PushChunkBlob is already required in M1 by the §3.2 quorum write); M2 narrows to the async catch-up worker pool |
| MEDIUM | The §3.3 GET vs. GC delete race was unaddressed | Added §3.3.1: FetchChunkBlob NOT_FOUND is a normal fallback trigger; a fanout-wide NOT_FOUND is a genuine durability failure surfacing 500 + metric |
| LOW | SourcePeer was not stated to be hint-only | Made explicit in §3.3: a best-effort hint; callers MUST handle NOT_FOUND as a fallback |
| LOW | chunkBlobMinReplicas during cluster shrink was undefined | Pinned the degradation ladder: ≥ minReplicas → normal; < minReplicas with Raft quorum intact → "as many as available, never < 2" + metric; < 2 → 503 |

Both are design docs only; no code changes. The two HIGHs are merge blockers; closing the four MEDIUMs before implementation keeps the ambiguity out of the code.

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 4m 12s


Claude's Review of PR #637 (Third Pass)

  • Read current state of both design documents
  • Verify all 10 previously identified issues
  • Post comprehensive review feedback

All 10 issues from the previous two passes are resolved. One new HIGH was found during this read, along with three LOWs. Details below.


Previous issues — all resolved ✅

| # | Severity | Doc | Issue | Status |
| --- | --- | --- | --- | --- |
| 1 | HIGH | Admission | §6 "aws-chunked overcharge until M4" contradicted §3.3.1 | ✅ Resolved — §6 now correctly describes the residual M1 risk (mid-stream 503 for chunked PUTs when Raft drain falls behind) |
| 2 | HIGH | Blob Offload | §3.5 GC grace period unimplementable — no timestamp when RC hit 0 | ✅ Resolved — `!s3\|chunkblob-gc-queue\|<commitTS-nanos>\|<SHA>` entry added, written in the same Raft txn as the RC decrement |
| 3 | MEDIUM | Admission | §3.3 A+B semaphore double-charge / gate-only semantics unresolved | ✅ Resolved — peekHeadroom is check-only (no acquire); acquire is the only path that charges; explicitly stated |
| 4 | MEDIUM | Admission | §4 retry budget release protocol unspecified | ✅ Resolved — hold-through-retry specified, with rationale (bytes stay in pendingBatch throughout the retry window) |
| 5 | MEDIUM | Blob Offload | M1 shipped proxy-on-miss but M2 scoped FetchChunkBlob | ✅ Resolved — both RPCs explicitly moved to M1 with rationale |
| 6 | MEDIUM | Blob Offload | §3.5 concurrent GET + GC delete race unaddressed | ✅ Resolved — §3.3.1 added; NOT_FOUND from a peer is a normal fallback trigger; fanout-wide NOT_FOUND is 500 + metric |
| 7 | LOW | Admission | §3.2 timeout comment referenced the wrong drain baseline (33 s) | ✅ Resolved — comment now says "256 MiB cap drains in ~2 s at 1 Gbps; 30 s is the budget for a stalled follower" |
| 8 | LOW | Admission | §3.3.1 "503 immediately" for oversized frames needed an explicit cap-check in the spec | ✅ Resolved — acquire spec now says: if bytes > capacity × s3ChunkSize, return ErrAdmissionExhausted immediately |
| 9 | LOW | Blob Offload | SourcePeer staleness not documented as hint-only | ✅ Resolved — §3.3 now says "best-effort hint"; callers MUST handle NOT_FOUND as a fallback trigger |
| 10 | LOW | Blob Offload | chunkBlobMinReplicas behaviour during cluster shrink undefined | ✅ Resolved — three-tier degradation ladder added: ≥ configured min → normal; < min but Raft quorum intact → degrade to ≥ 2 + metric; < 2 → 503 |

New issues

HIGH — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.5 step (3b): "single Raft txn" atomicity claim is architecturally impossible

Lines 326–330:

if the RC is still 0, deletes the local !s3|chunkblob|<SHA> AND deletes the queue entry — both as a single Raft txn so the queue stays consistent with the keyspace

!s3|chunkblob|<SHA> is explicitly local-only Pebble — §3.1 states "chunkblob is not written through Raft." The queue entry !s3|chunkblob-gc-queue|… is Raft-replicated. These are two operations in two different storage layers and cannot be wrapped in one Raft transaction.

The atomicity claim would send an implementor down one of two wrong paths:

  1. Route chunkblob deletes through Raft (wrong — reintroduces WAL amplification at GC time).
  2. Silently accept non-atomic behaviour without understanding the failure mode.

The actual failure modes of the two-phase operation are:

  • Local delete succeeds, Raft queue-entry delete fails: blob is gone locally, queue entry remains. Next sweep cycle re-checks, local delete is a no-op (already absent), Raft txn retried — benign.
  • Raft queue-entry delete succeeds, local delete fails: queue entry is gone, blob is an unreachable local orphan that no future sweep will find — permanent space leak.

The safe ordering is therefore Raft commit first, local delete second: if the Raft commit fails and we retry, the blob is still present; if the local delete fails after a successful Raft commit, the orphan is a space waste but carries no correctness consequence. A periodic "orphan scan" (chunkblob keys whose SHA has RC = 0 and no queue entry) can reclaim these without urgency.

The doc should: (a) remove the "single Raft txn" claim, (b) specify the two-step ordering (queue entry via Raft → local Pebble delete), and (c) acknowledge the orphan-on-local-failure as a known bounded leak.
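A sketch of that safe ordering, reusing the illustrative chunkBlobKey from earlier; the interfaces and names are assumptions:

```go
package blob

import "context"

type raftKV interface {
	Delete(ctx context.Context, key []byte) error // Raft-replicated delete
}

type localKV interface {
	Delete(key []byte) error // node-local Pebble delete
}

// sweepOne applies the ordering argued above: Raft-replicated queue delete
// first, node-local blob delete second.
func sweepOne(ctx context.Context, raft raftKV, local localKV, sha [32]byte, queueKey []byte) error {
	// Phase 1 (Raft): remove the queue entry. Concurrent sweepers serialise
	// here; if this fails we retry later with the blob still intact.
	if err := raft.Delete(ctx, queueKey); err != nil {
		return err
	}
	// Phase 2 (local): delete this node's chunkblob copy. A crash between
	// the phases leaves only a bounded local orphan (wasted space, no
	// correctness impact), reclaimable by a periodic orphan scan.
	return local.Delete(chunkBlobKey(sha))
}
```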


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §4: redisDispatchTimeout in the S3 path

The retry-chain bound in §4 (hold-through-retry rationale) reads:

bounded by s3TxnRetryMaxAttempts × (redisDispatchTimeout + s3TxnRetryMaxBackoff)

redisDispatchTimeout is a Redis-path constant; this is the S3 admission design. Almost certainly a copy-paste error — should reference s3DispatchTimeout, s3TxnRetryTimeout, or whichever S3-specific timeout governs one Dispatch call.

LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.5: protocol="chunked" label missing from metrics spec

§6 and the Rolling Upgrade subsection both reference a stage="perbatch", protocol="chunked" label combination for isolating chunked-PUT rejection events. The formal metrics spec in §3.5 only defines stage="prereserve" | "perbatch" — the protocol label dimension is absent. Without it in the spec, the label would either be forgotten in the implementation or implemented inconsistently. Add protocol="fixed-length" | "chunked" to the elastickv_s3_put_admission_rejections_total definition.

LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.5: "Both are Raft-replicated" is misleading

Line 339:

Both are Raft-replicated, so every node arrives at the same set of sweepable SHAs and the same eligibility window.

Only the GC queue is Raft-replicated. Each node deletes its own local chunkblob copy independently — that is the whole point of the architecture. The accurate claim: "The GC queue is Raft-replicated, so every node sees the same set of sweepable SHAs and the same eligibility window. Each node independently deletes its own local copy of eligible blobs."


Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| HIGH | Blob Offload | §3.5 "single Raft txn" atomicity claim for local chunkblob + Raft queue entry is architecturally impossible; needs the correct two-phase ordering and orphan-on-failure documented | Open |
| LOW | Admission | §4 redisDispatchTimeout reference should be an S3-path constant | Open |
| LOW | Admission | §3.5 metrics spec missing the protocol="chunked" label dimension referenced in §6 | Open |
| LOW | Blob Offload | §3.5 "Both are Raft-replicated" should say only the queue is replicated; chunkblob deletes are node-local | Open |

The HIGH item should be fixed before merge; the three LOWs are clean-up worth doing before implementation starts.

bootjp added 2 commits April 26, 2026 04:23
s3_raft_blob_offload.md — HIGH: §3.5 step (3b) claimed "delete the
local chunkblob AND the queue entry as a single Raft txn" — but
chunkblob is local-only Pebble per §3.1 and the queue is
Raft-replicated, so the two ops live in different storage layers
and cannot share a Raft txn. Rewrite the step to specify a
two-phase ordering:

  i.  Raft phase first: delete the queue entry through Raft.
      Concurrent sweepers serialise here on write-write-conflict;
      the queue is therefore the global "we are GC-ing this SHA"
      lock.
  ii. Local phase second: delete the local chunkblob from Pebble.

Document the failure mode of the inverse ordering (local-first
would orphan the queue entry on crash) and of the chosen ordering
(crash between phases leaves a local space leak — bounded, no
correctness consequence — recoverable by a periodic orphan scan).

s3_raft_blob_offload.md — LOW: the same §3.5 closing paragraph
said "Both are Raft-replicated" referring to the queue and RC.
That phrasing implied the chunkblob deletes were Raft-replicated
too. Rewrite to explicitly distinguish: queue + RC are
Raft-replicated; local chunkblob deletes are deliberately
node-local because that is the whole point of the architecture.

s3_admission_control.md — LOW: §4 retry-budget bound formula
referenced `redisDispatchTimeout`, a Redis-path constant copy-
pasted into the S3 design. The S3 PUT path actually uses the
inbound `*http.Request` context (no S3-specific Dispatch timeout),
so the formula now reads `single_dispatch_budget` with an explicit
note that the upper bound is whatever the request context allows
at that moment.

s3_admission_control.md — LOW: §3.5 metrics spec defined only
`stage="prereserve" | "perbatch"` but §6 and the Rolling Upgrade
subsection both reference a `stage="perbatch", protocol="chunked"`
label combination for isolating chunked-PUT rejection events. Add
the `protocol="fixed-length" | "chunked"` label dimension to
`elastickv_s3_put_admission_rejections_total` and
`elastickv_s3_put_admission_wait_seconds`, with a brief paragraph
explaining why the split is operationally meaningful (chunked HoL
events vs. fixed-length client-concurrency events).

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

All of round 4 (1 HIGH + 3 LOW) is addressed in 7475060.

| Severity | Doc | Issue | Fix |
| --- | --- | --- | --- |
| HIGH | Blob Offload | §3.5 step (3b) "single Raft txn for local chunkblob + Raft queue entry" is impossible (different storage layers) | Specified the two-phase ordering: Raft phase first (queue delete, doubling as the global lock) → local phase second (Pebble delete). A crash between phases is a bounded local space leak with no correctness impact, reclaimed by the orphan scan |
| LOW | Blob Offload | §3.5's closing "Both are Raft-replicated" reads as if chunkblob deletes replicate too | Rewrote explicitly: queue + RC are Raft-replicated; chunkblob deletes are deliberately per-node |
| LOW | Admission | §4's retry budget bound formula misreferenced redisDispatchTimeout (a Redis-path constant) | Removed in favour of the abstract single_dispatch_budget, noting the S3 path uses the HTTP request context (there is no S3-specific Dispatch timeout) |
| LOW | Admission | §3.5's metrics spec defined only stage; the protocol="chunked" label referenced in §6 / Rolling Upgrade was undefined | Added the protocol="fixed-length" \| "chunked" label dimension to rejections_total / wait_seconds, plus why the split is operationally meaningful (HoL vs. concurrency events) |

Design docs only; no code changes.

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 6m 17s


Claude's Review of PR #637 (Round 4)

  • Gather context and read CLAUDE.md
  • Read current state of both design documents
  • Verify all 4 issues from round 3 are resolved
  • Review for new issues
  • Post comprehensive review feedback

Previous issues — all resolved ✅

| # | Severity | Doc | Issue | Status |
| --- | --- | --- | --- | --- |
| 1 | HIGH | Blob Offload | §3.5 "single Raft txn" for local chunkblob + Raft queue entry is architecturally impossible | ✅ Resolved — explicit two-phase ordering (Raft first → local Pebble second); orphan-on-crash acknowledged as a bounded space leak |
| 2 | LOW | Blob Offload | §3.5 "Both are Raft-replicated" misleading | ✅ Resolved — now explicitly states "Local chunkblob deletes are deliberately not replicated: each node deletes its own copy independently" |
| 3 | LOW | Admission | §4 retry budget bound formula used redisDispatchTimeout (a Redis-path constant) | ✅ Resolved — replaced with the abstract single_dispatch_budget; correctly notes S3 uses the HTTP request context with no S3-specific dispatch timeout |
| 4 | LOW | Admission | §3.5 metrics spec missing the protocol="chunked" label dimension | ✅ Resolved — protocol = "fixed-length" \| "chunked" added to both rejections_total and wait_seconds, with rationale for why the split is operationally necessary |

New issues

MEDIUM — docs/design/2026_04_25_proposed_s3_admission_control.md §3.3.1: "bootstrap reservation" semantics are ambiguous

§3.3 defines admission A as peek-only (no slot acquisition): "peekHeadroom … does NOT take out a reservation." But §3.3.1 introduces a "Bootstrap reservation = s3RaftEntryByteBudget (4 MiB)" at chunked PUT request entry — calling it a "reservation" implies an actual acquire.

The two interpretations have different observable effects:

  • If it is a genuine acquire: 4 MiB of slots are held from request entry until some point later. What is the release point? If held until request completion, 64 concurrent chunked PUTs exhaust the entire 256 MiB cap in bootstrap overhead alone (64 × 4 MiB = 256 MiB), blocking all fixed-length PUTs. If released at the first per-frame ack, the bootstrap effectively becomes the first per-frame charge, and the document needs to say so. Either way, the lifecycle is unspecified.
  • If it is a peek: §3.3.1 should call it a "bootstrap check" or "bootstrap peekHeadroom," not a "reservation." The distinction matters because peek is racy (another PUT can consume the checked headroom before the first per-frame acquire fires) while an acquire holds the slots.

The design should pin which path it takes and, if it is a genuine acquire, specify when the 4 MiB is released relative to the per-frame acquire/release cycle that follows.
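To make the peek-vs-acquire distinction concrete, here is a minimal Go sketch of a channel-backed byte-budget semaphore. The names peekHeadroom and acquire and the 1 MiB slot unit come from the design doc; the struct shape, the atomic inflight counter, and the rounding are illustrative assumptions, not the repo's implementation.

```go
package admission

import (
	"context"
	"errors"
	"sync/atomic"
)

const s3ChunkSize = 1 << 20 // 1 MiB slot unit

var ErrAdmissionExhausted = errors.New("admission: inflight byte budget exhausted")

type putAdmission struct {
	slots    chan struct{} // capacity = maxInflightBytes / s3ChunkSize (e.g. 256)
	inflight atomic.Int64  // bytes currently charged; read by peekHeadroom and metrics
}

// peekHeadroom is admission A: a fail-fast check at request entry that
// does NOT reserve anything. Two concurrent peeks can both pass — it is
// racy by design, which is why calling it a "reservation" misleads.
func (a *putAdmission) peekHeadroom(bytes int64) error {
	if a.inflight.Load()+bytes > int64(cap(a.slots))*s3ChunkSize {
		return ErrAdmissionExhausted
	}
	return nil
}

// acquire is admission B: it blocks until the slots are free and then
// holds them until the matching release, so it actually bounds memory.
func (a *putAdmission) acquire(ctx context.Context, bytes int64) error {
	n := int((bytes + s3ChunkSize - 1) / s3ChunkSize) // round up to whole slots
	for i := 0; i < n; i++ {
		select {
		case a.slots <- struct{}{}:
		case <-ctx.Done(): // deadline (e.g. dispatchAdmissionTimeout) or client cancel
			a.drain(i) // give back the partially acquired slots
			return ctx.Err()
		}
	}
	a.inflight.Add(int64(n) * s3ChunkSize)
	return nil
}

// release returns previously acquired bytes to the pool.
func (a *putAdmission) release(bytes int64) {
	n := int((bytes + s3ChunkSize - 1) / s3ChunkSize)
	a.inflight.Add(-int64(n) * s3ChunkSize)
	a.drain(n)
}

func (a *putAdmission) drain(n int) {
	for i := 0; i < n; i++ {
		<-a.slots
	}
}
```

Under this shape, a genuine 4 MiB bootstrap acquire would hold 4 slots per chunked PUT for its lifetime (64 concurrent PUTs × 4 slots = the full 256-slot cap, the exhaustion scenario above), whereas a peek touches no state at all.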


MEDIUM — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.2: degraded-path durability gap for N > 3 clusters is not acknowledged

§3.2 specifies that in the degraded path ("reachable peers < chunkBlobMinReplicas but ≥ Raft quorum"), the cluster degrades to "as many as currently available, but never fewer than 2" chunkblob copies (leader + 1 follower). For a 3-node cluster this matches Raft's quorum of 2 — a single node failure is tolerated. But for a 5-node cluster:

  • Raft quorum = 3. Raft normally tolerates 2 simultaneous failures.
  • Degraded chunkblob replication = 2 copies (leader + 1 follower). Tolerates only 1 simultaneous failure.

If the leader and the 1 chunkblob-holding follower both fail (a double-node failure that Raft handles for the chunkref), the 3 remaining Raft quorum nodes elect a new leader, find a committed chunkref, and discover no surviving peer holds the chunkblob. The chunkref is durable; the object data is lost. This is a weaker durability guarantee than the legacy "every byte through Raft" path, which loses data only when Raft itself loses quorum (i.e., 3 simultaneous failures for N=5).

The document frames the floor of 2 as the "strict invariant" without calling out that it provides weaker-than-Raft durability for clusters with N > 3. Operators who size their clusters for Raft's N-fault tolerance and deploy this feature in the degraded path will get silently weaker durability for blobs than for metadata.

A one-sentence acknowledgement is sufficient: "Note: the degraded floor of 2 chunkblob copies tolerates only a single node failure; for N > 3 clusters, a double-node failure during the degraded window can cause blob loss even though the chunkref survives on Raft quorum. Operators who cannot accept this tradeoff should configure chunkBlobMinReplicas = N."
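The asymmetry is plain arithmetic; a throwaway sketch (not repo code) that prints the two tolerances side by side:

```go
package main

import "fmt"

func main() {
	for _, n := range []int{3, 5, 7} {
		raftQuorum := n/2 + 1
		raftTolerates := n - raftQuorum // chunkref survives this many simultaneous failures
		blobTolerates := 1              // a degraded floor of 2 copies survives exactly 1
		fmt.Printf("N=%d: Raft tolerates %d failure(s), degraded blob floor tolerates %d\n",
			n, raftTolerates, blobTolerates)
	}
}
```

At N=3 the two columns coincide (1 vs. 1), which is why the gap is invisible in the default test topology; at N=5 and N=7 they diverge (2 vs. 1, 3 vs. 1).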


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.2/§3.3.1: s3RaftEntryByteBudget undefined and 64 KiB frame-size vs. 1 MiB slot-unit incoherence

Two related gaps in §3.3.1:

  1. Undefined constant. s3RaftEntryByteBudget (4 MiB) is used throughout §3.3.1 but is never added to the §3.2 constant block, which only defines s3PutAdmissionMaxInflightBytes and dispatchAdmissionTimeout. An implementor needs to know it equals s3ChunkSize × s3ChunkBatchOps = 1 MiB × 4 = 4 MiB. Add it to §3.2.

  2. Frame-size / slot-size impedance mismatch. §3.3.1 says each decoded frame is "typically up to 64 KiB on the wire after framing overhead," and that it "acquires a slot equal to the bytes about to flow into Pebble." But the semaphore defined in §3.3 has capacity max / s3ChunkSize = 256 slots of 1 MiB each. A 64 KiB frame cannot acquire a fractional slot from a channel-backed semaphore. Either the progress callback must accumulate frames until it has a full s3ChunkSize worth before calling acquire (in which case "each decoded chunk frame" is inaccurate), or the slot unit must be smaller (64 KiB), changing the semaphore's capacity to 4096 slots and the acquire spec. The document needs to pick one and make the slot unit and frame size coherent.

LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.5: GC sweeper Phase (3b.i) requires conditional (not unconditional) delete

Phase (3b.i) says "Delete the queue entry through a Raft txn." In the scenario where:

  1. Sweeper reads the queue at a slightly stale read-timestamp and finds a queue entry for SHA X (committed at T1)
  2. Between the sweeper's scan and the Raft txn, a re-reference txn commits at T2 > T1, atomically incrementing RC and deleting the queue entry

The sweeper's Raft txn tries to delete a key that is already absent. If the delete is unconditional (delete key regardless of current value), the txn succeeds (no-op key deletion), and Phase (3b.ii) deletes the local chunkblob — which is now live (RC = 1). Data loss.

If the delete is conditional (e.g., compare-and-delete: key must exist with the expected empty value), the txn fails on the absent key, the sweeper aborts, and the blob is safe.

The document should specify that Phase (3b.i) uses a conditional Raft delete (key must exist at current Raft state), not an unconditional one. The write-write-conflict serialisation already mentioned in the design implies conditional semantics between concurrent sweepers; this note extends that guarantee to cover the re-reference race as well.


Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| MEDIUM | Admission | §3.3.1 bootstrap reservation: peek or acquire? Release lifecycle unspecified | Open |
| MEDIUM | Blob Offload | §3.2 degraded floor of 2 chunkblob copies provides weaker-than-Raft durability for N > 3; not acknowledged | Open |
| LOW | Admission | §3.2 s3RaftEntryByteBudget undefined; §3.3.1 64 KiB frame size incoherent with 1 MiB semaphore slot unit | Open |
| LOW | Blob Offload | §3.5 Phase (3b.i) must specify conditional delete to avoid data loss on re-reference race | Open |

No HIGH issues this pass. The two MEDIUMs are worth closing before implementation starts to avoid the admission control's charging ambiguity and the cluster-shrink durability gap from silently carrying into code. The two LOWs are clean-up worth doing in the same pass.


s3_admission_control.md — MEDIUM: §3.3.1 "Bootstrap reservation"
was ambiguous between peek and acquire. Pin it as a peek
(`peekHeadroom(s3RaftEntryByteBudget)`, no slot acquisition,
matching admission A's contract) and rename the heading to
"Bootstrap headroom check." Document why it must be a peek (an
acquire would multiply per-request slot hold by
`concurrent_chunked_PUTs × 4 MiB` of bootstrap-only credit with no
corresponding payload, reintroducing the head-of-line hazard the
design exists to prevent).

s3_admission_control.md — LOW: §3.3.1 "frame size up to 64 KiB"
was incoherent with the §3.3 semaphore's 1 MiB slot unit (a
channel-backed semaphore can't acquire fractional slots). Clarify
that the awsChunkedReader progress callback **buffers decoded
bytes until a full s3ChunkSize is accumulated, then calls
acquire(s3ChunkSize)**. Worst-case extra buffer per concurrent
chunked PUT is bounded by 1 MiB; on stream EOF the partial buffer
flushes via one final acquire rounded up to one slot. Also adds
`s3RaftEntryByteBudget` to §3.2's constant block (it was used
throughout §3.3.1 but never defined) with a comment showing
the derivation (s3ChunkSize × s3ChunkBatchOps).
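A sketch of the accumulate-then-acquire callback in Go, reusing the putAdmission shape from the earlier sketch; the constants follow the doc's derivation, while the charger type and callback names are illustrative:

```go
const (
	s3ChunkBatchOps       = 4
	s3RaftEntryByteBudget = s3ChunkSize * s3ChunkBatchOps // 4 MiB — also what the bootstrap headroom check peeks
)

// chunkedCharger turns ≤64 KiB decoded frames into whole-slot charges.
type chunkedCharger struct {
	adm      *putAdmission
	buffered int64 // decoded bytes not yet charged; always < s3ChunkSize
}

// onDecoded is the awsChunkedReader progress callback: it accumulates
// decoded payload bytes and charges one full slot at a time, so a 64 KiB
// frame never has to map onto a fractional slot.
func (c *chunkedCharger) onDecoded(ctx context.Context, n int64) error {
	c.buffered += n
	for c.buffered >= s3ChunkSize {
		if err := c.adm.acquire(ctx, s3ChunkSize); err != nil {
			return err // surfaces to the handler as a mid-stream 503 SlowDown
		}
		c.buffered -= s3ChunkSize
	}
	return nil
}

// onEOF flushes the partial tail as one final whole-slot acquire, so the
// worst-case uncharged buffer per chunked PUT stays bounded by 1 MiB.
func (c *chunkedCharger) onEOF(ctx context.Context) error {
	if c.buffered == 0 {
		return nil
	}
	c.buffered = 0
	return c.adm.acquire(ctx, s3ChunkSize)
}
```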

s3_raft_blob_offload.md — MEDIUM: §3.2 degraded path floor of 2
chunkblob copies provides weaker-than-Raft durability for N > 3
clusters. On a 5-node cluster Raft tolerates 2 simultaneous
failures for the chunkref but the degraded chunkblob path
(leader + 1 follower) tolerates only 1. Add an explicit note
acknowledging the asymmetry, recommend `chunkBlobMinReplicas = N`
for operators who need the legacy "blob durability == Raft
durability" guarantee, and clarify that the default `(N/2)+1` is
sized for "match Raft quorum" not "match Raft fault tolerance" —
a distinction that is invisible at N=3 and material at N≥5.

s3_raft_blob_offload.md — LOW: §3.5 Phase (3b.i) needs to specify
that the queue-entry delete is **conditional** on (a) the entry
existing and (b) the RC counter still being 0 at the txn's read
timestamp. An unconditional delete would silently succeed on a
queue entry that a re-reference txn has just removed, then proceed
to phase (3b.ii) and local-delete a chunkblob whose RC has bounced
back to 1 — a correctness bug, not just a space leak. The
conditional form is what makes the sweeper safe against the
re-reference race.
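A sketch of the conditional form in Go. The txn interface, key helpers, and RC decoding are hypothetical stand-ins (the key prefixes here are invented; only !s3|chunkblob|<SHA> appears in the doc) — the point is the two guards in the read set, not the API.

```go
import "errors"

var errSweepLost = errors.New("sweep: queue entry re-referenced or already swept")

// RaftTxn is a hypothetical stand-in for the replicated transaction API.
type RaftTxn interface {
	Get(key string) (val []byte, ok bool) // joins the txn's read set
	Delete(key string)
	Commit() error // fails on read/write conflicts with concurrent commits
}

// Hypothetical key helpers and RC decoding, for illustration only.
func gcQueueKey(sha string) string  { return "!s3|gcqueue|" + sha }
func refCountKey(sha string) string { return "!s3|chunkrc|" + sha }
func decodeRC(v []byte) int         { return int(v[0]) } // placeholder decoding

func sweepQueueEntry(txn RaftTxn, sha string) error {
	// Guard 1: the queue entry must still exist. A re-reference txn that
	// committed after our scan deletes it, and we must observe that.
	if _, ok := txn.Get(gcQueueKey(sha)); !ok {
		return errSweepLost
	}
	// Guard 2: RC must still be 0 at this txn's read timestamp.
	if rc, ok := txn.Get(refCountKey(sha)); ok && decodeRC(rc) != 0 {
		return errSweepLost
	}
	// Because both reads are in the read set, a racing re-reference
	// commit invalidates this txn instead of letting it delete a live
	// blob — the unconditional form would no-op past the absent key.
	txn.Delete(gcQueueKey(sha))
	return txn.Commit() // phase (3b.i); the local Pebble delete is phase (3b.ii)
}
```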

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

Round 5 (2 MEDIUM + 2 LOW) — all addressed in fabc81e.

| Severity | Doc | Issue | Fix |
| --- | --- | --- | --- |
| MEDIUM | Admission | §3.3.1 "Bootstrap reservation" was ambiguous between peek and acquire | Pinned it as peek-only (peekHeadroom(s3RaftEntryByteBudget), same contract as admission A) and renamed the heading to "Bootstrap headroom check"; documented why an acquire would reintroduce HoL via concurrent_chunked × 4 MiB of bootstrap-only credit |
| MEDIUM | Blob Offload | §3.2 degraded floor is weaker than Raft for N > 3 (fault tolerance shrinks from 2 to 1 on a 5-node cluster) | Added an explicit note; chunkBlobMinReplicas = N restores the legacy "blob durability == Raft durability"; clarified that the default (N/2)+1 means "match Raft quorum", not "match Raft fault tolerance" |
| LOW | Admission | §3.3.1 64 KiB frame vs. 1 MiB slot mismatch + s3RaftEntryByteBudget undefined | Specified that the progress callback buffers a full s3ChunkSize before calling acquire(s3ChunkSize), including the partial flush at stream EOF; added s3RaftEntryByteBudget = s3ChunkSize × s3ChunkBatchOps to the §3.2 constant block |
| LOW | Blob Offload | §3.5 Phase (3b.i) with an unconditional delete is a correctness bug under the re-reference race | Specified a conditional delete (queue entry exists ∧ RC=0 at the read timestamp); on failure the sweeper aborts, and only on success does it proceed to phase (3b.ii) |

Both are design docs only; no code changes.

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 5m 20s.


Claude's Review of PR #637 (Round 5)

  • Gather context and read CLAUDE.md
  • Read current state of both design documents
  • Verify all 4 Round 4 issues are resolved
  • Review for new issues
  • Post comprehensive review feedback

Previous issues — all resolved ✅

| # | Severity | Doc | Issue | Status |
| --- | --- | --- | --- | --- |
| 1 | MEDIUM | Admission | §3.3.1 "Bootstrap reservation" peek vs. acquire ambiguity | ✅ Resolved — heading is now "Bootstrap headroom check," calls peekHeadroom(s3RaftEntryByteBudget), explicitly says "No slot is acquired" |
| 2 | MEDIUM | Blob Offload | §3.2 degraded floor of 2 provides weaker-than-Raft durability for N > 3 | ✅ Resolved — explicit "Important durability note for N > 3 clusters" section added; chunkBlobMinReplicas = N escape hatch documented; distinction between "match Raft quorum" vs. "match Raft fault tolerance" pinned |
| 3 | LOW | Admission | §3.3.1 64 KiB frame vs. 1 MiB slot incoherence; s3RaftEntryByteBudget undefined | ✅ Resolved — progress callback accumulates to s3ChunkSize before calling acquire(s3ChunkSize); partial flush at EOF specified; s3RaftEntryByteBudget = s3ChunkSize × s3ChunkBatchOps added to §3.2 constant block |
| 4 | LOW | Blob Offload | §3.5 Phase (3b.i) unconditional delete causes data loss on re-reference race | ✅ Resolved — conditional Raft txn specified (queue entry must exist AND RC=0 at txn's read timestamp); "unconditional delete would silently succeed on the now-absent key" is explicitly called out as the correctness bug it prevents |

New issues

MEDIUM — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.2: PushChunkBlob local-write / peer-push pipeline unspecified; p99 claim not justified

§3.2 step 2 says:

Leader pushes the chunkblob to chunkBlobMinReplicas - 1 followers via S3BlobFetch.PushChunkBlob and waits for each follower's "fsync ack."

The phrase "waits for each follower's fsync ack" is ambiguous:

  • If sequential (write local → fsync → push to follower 1 → fsync ack → push to follower 2 → fsync ack), per-chunk latency ≈ chunkBlobMinReplicas × fsync_latency. For chunkBlobMinReplicas = 2: ≈ 2 × 10 ms = 20 ms per chunk, vs. Raft's quorum write ≈ max(leader_fsync, follower_fsync) ≈ 10 ms. That is 2× the legacy latency.
  • If pipelined (leader streams bytes to the follower while writing to local Pebble — similar to how Raft pipelines leader append and replication), per-chunk latency ≈ max(local_fsync, peer_fsync) ≈ 10 ms — equivalent to Raft's quorum write.

The document claims:

PUT p99 is roughly equivalent to today's "every byte through Raft" latency — we are paying the same fsync cost, just to a different keyspace.

This claim is only true under the pipelined model. The sequential model silently doubles the PUT p99 per chunk. The milestone that ships PushChunkBlob (M1) would behave very differently depending on which model is implemented, and the acceptance criterion (§5: "No regression in PUT p99 latency") would catch the gap only in testing — by which point fixing the protocol is more disruptive than specifying it now.

The design should explicitly state whether step 1 (local write) and step 2 (peer push) are pipelined (bytes streamed to peer concurrently with the local Pebble write), and whether pushes to multiple followers (when chunkBlobMinReplicas > 2) are fanned out in parallel or done sequentially.


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.3.1: malformed-client "503 immediately" failure mode contradicts the 1 MiB accumulation design

§3.3.1 failure modes says:

If the awsChunkedReader frame size ever exceeds s3RaftEntryByteBudget (a malformed client whose decoded cumulative output between framing acks already exceeds 4 MiB), the first per-frame acquire asks for more than the cap allows and we 503 immediately.

But §3.3.1 also specifies:

The callback buffers decoded bytes until a full slot unit (s3ChunkSize = 1 MiB) is accumulated, then calls acquire(s3ChunkSize).

Since the callback always calls acquire(1 MiB), the acquire spec's immediate-return condition — "if bytes > capacity * s3ChunkSize" (i.e. > 256 MiB) — is never triggered: 1 MiB can never exceed 256 MiB. A malformed client whose frame is 100 MiB produces many 1 MiB acquires that block normally; it does not trigger an immediate 503. The actual behaviour for oversized frames is: callback accumulates → many 1 MiB acquires → if Raft is congested these block → eventually dispatchAdmissionTimeout fires → 503.

The "503 immediately" claim applies to fixed-length PUTs (via peekHeadroom(Content-Length > 256 MiB)), not to chunked PUTs via the accumulation path. The failure mode description should say: "a malformed chunked stream that continuously decodes bytes faster than Raft drains will eventually 503 on dispatchAdmissionTimeout" — the same path as any other Raft-stall scenario. The immediate-503 path is not available here.

LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §5: dispatchAdmissionTimeout dependency ordering across milestones

The M2 milestone says:

Per-batch admission B inside flushBatch for fixed-length PUTs. Add dispatchAdmissionTimeout. Mid-stream 503 with cleanup. (Chunked PUTs already use this path through their incremental charging.)

But M1 ships chunked per-batch admission (§3.3.1 pay-as-you-decode), which calls acquire(ctx, s3ChunkSize) — and §3.3.1 explicitly says:

the next 1 MiB acquire blocks (capped by dispatchAdmissionTimeout). Beyond that timeout, mid-stream 503.

If dispatchAdmissionTimeout only exists from M2, then M1's chunked per-batch admission has no defined timeout bound and would rely on the inbound request context's deadline (server writeTimeout / client cancel). This is probably the intent (use request context in M1; formalize the 30 s constant in M2), but it is not stated. If an operator ships M1 in isolation, chunked PUTs can block for the full HTTP server write timeout (potentially minutes) rather than 30 s. Clarify: does M1's chunked admission acquire use the request context as its upper bound, and does M2 introduce dispatchAdmissionTimeout as a tighter per-batch ceiling?
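A small sketch of the two bounds in Go (constant value and wiring are assumptions, reusing the putAdmission sketch): in an M1-only deployment the inbound request context is the only ceiling; M2's constant layers a tighter per-batch deadline on top.

```go
import (
	"context"
	"time"
)

const dispatchAdmissionTimeout = 30 * time.Second // assumed M2 value

// acquireBounded shows the M2 behaviour: each per-batch wait is bounded
// by the tighter of the request deadline and dispatchAdmissionTimeout.
// Drop the WithTimeout wrapper and you have the M1-only behaviour, where
// the wait can run to the full HTTP server write timeout.
func acquireBounded(reqCtx context.Context, adm *putAdmission, bytes int64) error {
	ctx, cancel := context.WithTimeout(reqCtx, dispatchAdmissionTimeout)
	defer cancel()
	return adm.acquire(ctx, bytes) // ctx.Err() surfaces as a 503 SlowDown
}
```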

LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.2: flow diagram step 3 count is inconsistent with the prose

The flow diagram shows:

3. *** synchronously replicate to ≥ chunkBlobMinReplicas peers ***

The prose immediately below says "Leader pushes the chunkblob to chunkBlobMinReplicas - 1 followers." The diagram's "≥ chunkBlobMinReplicas peers" implies chunkBlobMinReplicas followers plus the leader, which equals chunkBlobMinReplicas + 1 total — stronger than intended. It should read "chunkBlobMinReplicas - 1 followers" (plus the leader = chunkBlobMinReplicas total), consistent with the prose.

LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.5: orphan scan scope should explicitly cover the common PUT-failure case

The orphan scan is introduced as the recovery path for "crash between Raft phase and local phase" of the sweeper. But it implicitly covers a more common scenario: chunkblobs written to local Pebble in the PUT loop (§3.2 step 2) when the PUT then fails before coordinator.Dispatch submits the chunkref txn (e.g. admission control 503, client disconnect, PushChunkBlob quorum failure). In that case, the RC entry and GC queue entry are never written — the sweeper's queue-range scan never finds these orphans. The orphan scan (scanning !s3|chunkblob|* for SHAs with no RC entry or RC=0 and no queue entry) would reclaim them, but the design does not explicitly list this as one of the cases the orphan scan handles. An implementor reading only the sweeper section would not know their putAbort / cleanup path relies on it.


Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| MEDIUM | Blob Offload | §3.2 PushChunkBlob local-write / peer-push pipeline unspecified; p99 latency equivalence claim only holds under pipelined/parallel model | Open |
| LOW | Admission | §3.3.1 malformed-client "503 immediately" description contradicts 1 MiB accumulation: actual path is timeout 503, not immediate | Open |
| LOW | Admission | §5 dispatchAdmissionTimeout added in M2 but depended on by M1's chunked per-batch admission; M1 fallback timeout unspecified | Open |
| LOW | Blob Offload | §3.2 flow diagram step 3 "≥ chunkBlobMinReplicas peers" inconsistent with prose "chunkBlobMinReplicas - 1 followers" | Open |
| LOW | Blob Offload | §3.5 orphan scan scope doesn't explicitly cover PUT-failed-before-Dispatch (no chunkref txn ever submitted) | Open |

No HIGH issues this pass. The one MEDIUM is worth closing before M1 implementation starts, since the latency-equivalence claim is load-bearing for operator buy-in and the pipelining decision affects the RPC interface design. The four LOWs are clean-up worth batching in a single pass.


s3_raft_blob_offload.md — MEDIUM: §3.2 PushChunkBlob latency claim
("PUT p99 ≈ legacy") was load-bearing but the
local-write/peer-push pipeline was unspecified. Sequential
ordering would silently double the per-chunk latency
(`chunkBlobMinReplicas × fsync_latency`). Pin the contract:
local Pebble write and PushChunkBlob fan out **concurrently**;
multiple followers' pushes are **fanned out in parallel**, not
sequentially. Update the flow diagram to show the pipeline
explicitly and call out that this is part of the contract, not
an optional optimization.
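A sketch of the pinned contract in Go using errgroup; BlobStore and BlobPusher are hypothetical stand-ins for the local Pebble wrapper and the S3BlobFetch RPC client. It shows only the concurrency shape — ack counting and the degradation ladder sit above this call.

```go
import (
	"context"

	"golang.org/x/sync/errgroup"
)

type BlobStore interface {
	WriteAndSync(ctx context.Context, sha string, blob []byte) error // local write + fsync
}

type BlobPusher interface {
	PushChunkBlob(ctx context.Context, sha string, blob []byte) error // returns on the peer's fsync ack
}

// replicateChunk runs the local write and all follower pushes
// concurrently, so per-chunk latency ≈ max(local_fsync, slowest required
// follower fsync) — a sequential loop here would sum the fsyncs instead.
func replicateChunk(ctx context.Context, sha string, blob []byte,
	local BlobStore, followers []BlobPusher) error {

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return local.WriteAndSync(ctx, sha, blob) })
	for _, f := range followers { // chunkBlobMinReplicas-1 targets, fanned out in parallel
		f := f
		g.Go(func() error { return f.PushChunkBlob(ctx, sha, blob) })
	}
	return g.Wait()
}
```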

s3_admission_control.md — LOW: §3.3.1 malformed-client failure
mode said "we 503 immediately" but the accumulation design
(`acquire(s3ChunkSize)` only) means the immediate-503 path is
never reachable on the chunked side — the per-frame acquire is
always 1 MiB, well under the 256 MiB cap. Reword to specify the
actual path: successive 1 MiB acquires block under Raft pressure
and the PUT eventually surfaces 503 on
`dispatchAdmissionTimeout`. The "immediate 503 for oversized
request" path is fixed-length only.

s3_admission_control.md — LOW: §5 milestone table had M2 saying
"Add `dispatchAdmissionTimeout`" but M1 already ships the
chunked per-frame admission B path which is gated on it. Move
the constant into M1; M2 narrows to "add fixed-length per-batch
admission B + cleanup," with chunked already using the path
from M1.

s3_raft_blob_offload.md — LOW: the §3.2 flow-diagram step 3
phrasing "synchronously replicate to ≥ chunkBlobMinReplicas
peers" was inconsistent with the prose's
"chunkBlobMinReplicas - 1 followers." Resolved as part of the
MEDIUM rewrite — the diagram now reads "PushChunkBlob to
chunkBlobMinReplicas-1 followers" with parallel fan-out, matching
the prose count.

s3_raft_blob_offload.md — LOW: §3.5 orphan scan was framed as
the recovery path for "sweeper crash between Phase i and ii,"
but it implicitly also covers a more common scenario — chunkblobs
written to local Pebble by §3.2 step 2 when the PUT then fails
before `coordinator.Dispatch` is called (admission 503, client
disconnect, PushChunkBlob quorum failure). In that case neither
RC nor GC queue entry exists, so the sweeper never sees the
orphan; only the orphan scan does. Make this case explicit so the
PUT-handler abort path can rely on the orphan scan rather than
needing its own best-effort local delete.
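The detection criterion as a predicate, sketched in Go against a hypothetical read-only interface (RC and queue entries live in the replicated state; the key helpers are the same invented ones as in the sweeper sketch):

```go
// StateReader is a hypothetical read-only view used by the orphan scan.
type StateReader interface {
	Get(key string) (val []byte, ok bool)
}

// isOrphan reports whether a locally stored chunkblob is reclaimable.
// It covers both orphan sources named above.
func isOrphan(sha string, state StateReader) bool {
	rc, hasRC := state.Get(refCountKey(sha))
	if !hasRC {
		return true // PUT failed before the chunkref Dispatch: no RC was ever written
	}
	if decodeRC(rc) != 0 {
		return false // live reference — never touch it
	}
	_, queued := state.Get(gcQueueKey(sha))
	return !queued // RC=0 and dequeued: sweeper crashed between phases (3b.i) and (3b.ii)
}
```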

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

Round 5 (Claude bot 19:36 UTC, 1 MEDIUM + 4 LOW) — all addressed in 1fb39ba.

| Severity | Doc | Issue | Fix |
| --- | --- | --- | --- |
| MEDIUM | Blob Offload | §3.2 PushChunkBlob pipeline undefined; the p99-equivalence claim fails under a sequential model | Pinned the contract: local Pebble write + parallel PushChunkBlob fan-out run concurrently; the flow diagram now shows the pipelining; p99 ≈ max(local_fsync, slowest_quorum_follower_fsync) ≈ Raft quorum write |
| LOW | Admission | §3.3.1 chunked malformed-client "503 immediately" is unreachable (a 1 MiB acquire never trips the 256 MiB cap) | Rewrote it: under the accumulation design the immediate 503 is unreachable; the actual path is a 503 on dispatchAdmissionTimeout. Noted the immediate-503 path is fixed-length only |
| LOW | Admission | §5 put "Add dispatchAdmissionTimeout" in M2 even though M1's chunked admission needs it | Moved dispatchAdmissionTimeout to M1; narrowed M2 to fixed-length per-batch + cleanup |
| LOW | Blob Offload | §3.2 flow diagram step 3 peer count inconsistent with the prose | Resolved together with the MEDIUM rewrite — statement and diagram both now read "PushChunkBlob to chunkBlobMinReplicas-1 followers" |
| LOW | Blob Offload | §3.5 never stated that the orphan scan covers the PUT-failed-before-Dispatch case | Spelled out the detection criterion (no RC entry, or RC=0 with no queue entry) and specified that the orphan scan reclaims both sweeper-crash and PUT-abort orphans; the PUT handler's abort path needs no best-effort local delete |

Both are design docs only; no code changes.

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 6m 12s.


Claude's Review of PR #637 (Round 6)

  • Gather context and read CLAUDE.md
  • Read current state of both design documents
  • Verify all 5 Round 5 issues are resolved
  • Review for new issues
  • Post comprehensive review feedback

Previous issues — all resolved ✅

| # | Severity | Doc | Issue | Status |
| --- | --- | --- | --- | --- |
| 1 | MEDIUM | Blob Offload | §3.2 PushChunkBlob local-write / peer-push pipeline unspecified; p99 latency claim only holds under pipelined/parallel model | ✅ Resolved — flow diagram updated with "pipelined: bytes also stream out"; prose explicitly states pushes are fanned out in parallel; p99 = max(local_fsync, slowest_quorum_follower_fsync); sequential model's failure mode documented |
| 2 | LOW | Admission | §3.3.1 "503 immediately" for malformed chunked client contradicts 1 MiB accumulation | ✅ Resolved — new paragraph explicitly says the immediate-503 path "applies only to fixed-length PUTs (via peekHeadroom(Content-Length > 256 MiB))"; the chunked path "cannot trigger the immediate-503 path" and instead surfaces 503 on dispatchAdmissionTimeout |
| 3 | LOW | Admission | §5 dispatchAdmissionTimeout added in M2 but needed by M1's chunked per-batch admission | ✅ Resolved — dispatchAdmissionTimeout explicitly moved to M1 in the table; M1 now says "not in M2"; M2 narrows to fixed-length per-batch + cleanup |
| 4 | LOW | Blob Offload | §3.2 flow diagram "≥ chunkBlobMinReplicas peers" inconsistent with prose "chunkBlobMinReplicas - 1 followers" | ✅ Resolved — diagram now says "PushChunkBlob to chunkBlobMinReplicas-1 followers" consistent with prose |
| 5 | LOW | Blob Offload | §3.5 orphan scan scope did not explicitly cover PUT-failed-before-Dispatch | ✅ Resolved — §3.5 now lists both sources of orphans: sweeper crash between phases AND PUT failure before chunkref Dispatch; detection criterion covers "no RC entry at all, or RC=0 with no queue entry"; PUT handler does not need its own best-effort local-delete |

New issues

MEDIUM — docs/design/2026_04_25_proposed_s3_admission_control.md §5: M1 "acquire / release" scope is ambiguous for fixed-length PUTs

M1 says:

Add putAdmission type + per-node singleton + fixed-length Content-Length admission (peekHeadroom). Wire prepareStreamingPutBody to acquire / release. aws-chunked progress-callback admission (§3.3.1) ships in this milestone too…

M2 says:

Per-batch admission B inside flushBatch for fixed-length PUTs (chunked PUTs already use admission B as of M1). Mid-stream 503 with cleanup on the fixed-length path.

The M1 phrase "Wire prepareStreamingPutBody to acquire / release" is in direct tension with M2's statement that fixed-length per-batch acquisition ships in M2. "Acquire / release" strongly implies actual semaphore charging, but if fixed-length PUTs only get peekHeadroom (admission A) in M1, no slot is ever acquired by a fixed-length PUT until M2. An implementor reading M1 top-to-bottom could conclude that fixed-length PUTs get both A and B in M1, then find M2 adds the same thing — or, worse, ship M1 thinking fixed-length PUTs are fully bounded when they are not.

The net effect: in M1, the semaphore is charged only by chunked PUTs (via the progress callback). Fixed-length PUTs get a racy headroom check at request entry but no actual slot acquisition during the flush. The 256 MiB cap does not bound concurrent fixed-length PUT body bytes in M1. This is a meaningful gap that should be stated explicitly so operators know what M1 does and does not protect against.


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.1: Flow diagram labels A as "reserve" — contradicts peek-only semantics

The overview diagram says:

client ─[Content-Length]─► (A) reserve full body bytes

But §3.3 defines admission A explicitly as a peek-only check with no slot acquisition: "peekHeadroom … does NOT take out a reservation — the only effect is 'fail fast at request entry'… so it cannot double-count against admission B." The label "(A) reserve full body bytes" implies an actual reservation (slot acquisition) and directly contradicts the key distinction §3.3 relies on for correctness (admission A is racy by design; admission B is the only path that charges). A reader who sees the diagram before reading §3.3 will have the wrong mental model. The diagram label should say "(A) check headroom (peek only)" or similar.

LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.3: acquire's early-return guard is dead code under the stated design

The acquire spec includes:

If bytes > capacity * s3ChunkSize (a malformed client whose frame exceeds the entire budget), returns ErrAdmissionExhausted immediately without waiting

Under the stated design:

  • Fixed-length PUTs: admission A (peekHeadroom(Content-Length)) rejects any PUT with Content-Length > 256 MiB before acquire is ever called. The per-batch acquire calls use s3RaftEntryByteBudget = 4 MiB, which is never > 256 MiB.
  • Chunked PUTs: the progress callback always calls acquire(s3ChunkSize = 1 MiB), which is never > 256 MiB.

The condition bytes > capacity * s3ChunkSize = 256 MiB is therefore unreachable on both paths. The spec comment says it handles "a malformed client whose frame exceeds the entire budget" but no call site produces such a large value — the accumulation design and the fixed-length pre-check together guarantee bytes ≤ s3RaftEntryByteBudget = 4 MiB at every acquire callsite. The guard should either be removed from the spec (it's dead code) or the comment should explain the one concrete misconfiguration case that would trigger it (s3ChunkBatchOps tuned so high that s3RaftEntryByteBudget > 256 MiB).

LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.2: Mid-push PushChunkBlob failure handling unspecified

The degradation ladder handles pre-flight availability ("reachable peers < chunkBlobMinReplicas") but does not specify what happens when a peer becomes unreachable during an in-progress push — timeout mid-stream, crash after receiving partial bytes, network partition after the leader starts the RPC fan-out. Two reasonable behaviours:

  1. Count failure as "not acked," try another reachable peer. The leader's fan-out had chunkBlobMinReplicas - 1 targets; one failed; if another peer is available, try it before falling back to the degradation ladder.
  2. Apply the degradation ladder immediately. If the successful acks (including the leader) ≥ 2, proceed in degraded mode; if < 2, 503.

Neither is specified. This leaves an important durability decision to each implementor independently, which risks divergent behaviour across milestones and operators' assumptions about what "my cluster is at floor=2" actually means when one push fails. A one-paragraph "failure during push" section would close the gap.


Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| MEDIUM | Admission | §5 M1 "acquire / release" ambiguous — fixed-length PUTs may only get admission A (peekHeadroom) in M1; actual semaphore charging ships in M2 | Open |
| LOW | Admission | §3.1 flow diagram labels A as "reserve" — contradicts the peek-only semantics in §3.3 | Open |
| LOW | Admission | §3.3 acquire early-return guard is dead code under the stated design | Open |
| LOW | Blob Offload | §3.2 mid-push PushChunkBlob failure handling unspecified; degradation ladder covers pre-flight only | Open |

No HIGH issues this pass. The MEDIUM is worth closing before M1 implementation starts — an implementor who reads M1's "acquire / release" wording as including fixed-length PUTs would ship with an unintended (though non-catastrophic) change in M1 scope. The three LOWs are clean-up worth batching in one pass.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/design/2026_04_25_proposed_s3_admission_control.md (1)

25-25: Add language specifiers to code blocks.

For better syntax highlighting and linting compliance, specify language identifiers on code blocks.

📝 Suggested fixes

Line 25 — the §1 worst-case formula block (`leader-side worst-case = concurrent_PUTs × pending_entries_per_PUT × entry_size ≈ concurrent_PUTs × MaxInflight × 4 MiB`): tag the bare opening fence as `text`.

Line 108 — the §3.3 flow diagram (`client ─[Content-Length]─► (A) reserve full body bytes …`): tag the bare opening fence as `text`.

Line 326 — the §3.5 metrics block (`elastickv_s3_put_admission_inflight_bytes gauge …`): tag the bare opening fence as `text`.


Also applies to: 108-108, 326-326

docs/design/2026_04_25_proposed_s3_raft_blob_offload.md (1)

23-23: Add language specifiers to code blocks.

For better syntax highlighting and linting compliance, specify language identifiers on code blocks.

📝 Suggested fixes

Line 23 — the §1 Raft-log flow block (`HTTP PUT body ─► s3ChunkSize (1 MiB) chunks …`): tag the bare opening fence as `text`.

Line 105 — the §3.1 keyspace block (`!s3|chunkref|<bucket>|<gen>|<objectKey>|<uploadID>|<partNo>|<chunkNo> …`): tag the bare opening fence as `text`.

Line 131 — the §3.2 PUT pipeline diagram (`client ─► HTTP PUT body …`): tag the bare opening fence as `text`.


Also applies to: 105-105, 131-131

Inline comment — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md, line 358: typo — change "elegibility" to "eligibility" so "elegibility-start" reads "eligibility-start" (in the sentence "naturally sorted by eligibility-start time, and any node can").



📥 Commits

Reviewing files that changed from the base of the PR and between 1e7e9bc and 1fb39ba.

📒 Files selected for processing (2)
  • docs/design/2026_04_25_proposed_s3_admission_control.md
  • docs/design/2026_04_25_proposed_s3_raft_blob_offload.md

…ags)

CodeRabbit round 1 on PR #637:

- s3_raft_blob_offload.md:358 typo "elegibility" → "eligibility"
  (single character).
- Add `text` language identifier to the six previously-bare fenced
  code blocks (the §1 worst-case formula in admission_control,
  the §3.3 client/admission flow diagram in admission_control,
  the §3.5 metrics block in admission_control, the §1
  Raft-log flow in blob_offload, the §3.1 keyspace block in
  blob_offload, and the §3.2 PUT pipeline diagram in
  blob_offload). Markdownlint flags bare fences; tagging them
  enables both linting compliance and consistent syntax
  highlighting in the rendered docs.

No code changes; design docs only.
@bootjp
Owner Author

bootjp commented Apr 25, 2026

CodeRabbit review (1 actionable + 2 nitpicks) — all addressed in aa418ce.

  • Minor (actionable): "elegibility" → "eligibility" typo @ blob_offload.md:358
  • Nitpick × 2: added a `text` language tag to the six bare fences
    • admission: §1 worst-case formula / §3.3 flow diagram / §3.5 metrics block
    • blob_offload: §1 raft-log flow / §3.1 keyspace / §3.2 PUT pipeline

@claude review

@claude

claude Bot commented Apr 25, 2026

Claude finished @bootjp's task in 4m 33s.


Claude's Review of PR #637 (Round 7)

  • Read current state of both design documents
  • Verify Round 6 open issues (1 MEDIUM + 3 LOW) are resolved
  • Verify CodeRabbit changes (typo + code fences) are applied
  • Review for new issues
  • Post comprehensive review feedback

CodeRabbit items — verified resolved ✅

| Item | Status |
| --- | --- |
| "elegibility" → "eligibility" typo @ blob_offload.md:358 | ✅ Fixed in aa418ce |
| 6 bare code fences → `text` language tags (3 per doc) | ✅ Fixed in aa418ce |

Round 6 issues — all still open

The latest commit (aa418ce) only addressed the CodeRabbit items. All four issues from Round 6 remain unresolved in the current state of the docs.

MEDIUM — docs/design/2026_04_25_proposed_s3_admission_control.md §5: M1 "acquire / release" still ambiguous for fixed-length PUTs

The M1 milestone row still reads:

"Wire prepareStreamingPutBody to acquire / release. aws-chunked progress-callback admission (§3.3.1) ships in this milestone too…"

While M2 says:

"Per-batch admission B inside flushBatch for fixed-length PUTs (chunked PUTs already use admission B as of M1)."

The "acquire / release" in M1 is accurate only for the chunked PUT progress-callback path. For fixed-length PUTs, M1 only wires peekHeadroom (admission A — no slot acquisition); the actual semaphore charging (admission B, flushBatch per-batch acquire) ships in M2. An implementor reading M1 top-to-bottom will see "acquire / release" and ship fixed-length admission B in M1, then discover M2 redundantly adds it again.

The fix: clarify M1 to say "Wire prepareStreamingPutBody to call peekHeadroom for fixed-length PUTs (admission A only — no slot acquisition); the progress-callback acquire/release cycle for aws-chunked PUTs also ships here." Add a note that the semaphore is charged only by chunked PUTs in M1; fixed-length PUT body bytes are not semaphore-bounded until M2.


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.1: flow diagram still labels admission A as "reserve"

Line 109:

client ─[Content-Length]─► (A) reserve full body bytes

§3.3 explicitly says: "peekHeadroom is admission A. It returns ErrAdmissionExhausted without acquiring slots… it does NOT take out a reservation." The diagram label "(A) reserve full body bytes" directly contradicts this and is the most visible part of the doc for a reader skimming the overview. The label should say "(A) check headroom (peek only — no slot acquired)" or similar.


LOW — docs/design/2026_04_25_proposed_s3_admission_control.md §3.3: acquire early-return guard is unreachable at all defined call sites

The acquire spec comment (lines 191–195):

If bytes > capacity * s3ChunkSize (a malformed client whose frame exceeds the entire budget), returns ErrAdmissionExhausted immediately without waiting

§3.3.1 already documents that this path is unreachable on the chunked PUT path (line 291: "the early-return in acquire's spec is never hit on the chunked path"). But the acquire spec itself implies it's a normal reachable path, and the same unreachability holds for fixed-length PUTs: admission A's peekHeadroom(Content-Length) ensures only PUTs with Content-Length ≤ 256 MiB reach acquire, and per-batch calls use s3RaftEntryByteBudget = 4 MiB, which can never satisfy 4 MiB > 256 MiB.

The only concrete trigger is operator misconfiguration: s3ChunkBatchOps tuned so large that s3RaftEntryByteBudget = s3ChunkSize × s3ChunkBatchOps > s3PutAdmissionMaxInflightBytes. The spec comment should say this explicitly — "In normal operation this path is unreachable (both call sites pass values ≤ s3RaftEntryByteBudget); it fires only if s3RaftEntryByteBudget is misconfigured to exceed the cap itself" — rather than attributing it to "a malformed client" (which is inaccurate under the accumulation design).
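The reworded guard, sketched against the earlier putAdmission type — the comment is the substantive change, since the branch itself only fires under the misconfiguration described above:

```go
// acquireChecked wraps acquire with the early-return guard. In normal
// operation this branch is unreachable: admission A rejects any
// fixed-length PUT over the cap before acquire runs, and both call
// sites pass at most s3RaftEntryByteBudget (4 MiB). It fires only if
// s3ChunkBatchOps is misconfigured so high that the derived
// s3RaftEntryByteBudget exceeds s3PutAdmissionMaxInflightBytes itself.
func (a *putAdmission) acquireChecked(ctx context.Context, bytes int64) error {
	if bytes > int64(cap(a.slots))*s3ChunkSize {
		return ErrAdmissionExhausted // misconfiguration guard, not a malformed-client path
	}
	return a.acquire(ctx, bytes)
}
```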


LOW — docs/design/2026_04_25_proposed_s3_raft_blob_offload.md §3.2: mid-push PushChunkBlob failure handling still unspecified

The degradation ladder (lines 200–213) specifies behaviour based on pre-flight peer reachability. §3.2 step 3 says the leader fans out PushChunkBlob RPCs in parallel and waits for a quorum of fsync-acks, but does not specify what happens when a peer becomes unreachable during an in-progress push (timeout mid-stream, crash after receiving partial bytes, network partition after the RPC starts):

  • If the leader counts the mid-push failure as "not acked" and tries another peer from the remaining set, a single-peer failure during a normal push is transparent.
  • If the leader immediately applies the degradation ladder on the first mid-push failure, a transient RPC timeout could demote a healthy cluster to the degraded path unnecessarily.

Without pinning this, two implementors will independently make opposite choices and produce inconsistent durability behaviour across milestone releases. A one-paragraph "failure during push" subsection — "if a push RPC fails or times out mid-stream, count that peer as 'not acked' and, if another reachable peer exists, start a new RPC to it before applying the degradation ladder" — closes the gap.
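The first behaviour, sketched in Go on top of the BlobPusher stand-in from earlier — a mid-push failure consumes a spare peer before the caller consults the degradation ladder; peer selection and the ladder itself are deliberately out of scope:

```go
// pushWithFailover pushes to each target and, on a mid-push failure
// (timeout, partial write, partition), retries against spare peers so a
// single transient failure stays transparent. It returns the number of
// fsync-acked followers; the caller adds the leader's local copy and
// applies the degradation ladder if acked+1 < chunkBlobMinReplicas.
func pushWithFailover(ctx context.Context, sha string, blob []byte,
	targets, spares []BlobPusher) (acked int) {

	for _, t := range targets {
		if t.PushChunkBlob(ctx, sha, blob) == nil {
			acked++
			continue
		}
		for len(spares) > 0 { // failed mid-push: count as not-acked, try a spare
			s := spares[0]
			spares = spares[1:]
			if s.PushChunkBlob(ctx, sha, blob) == nil {
				acked++
				break
			}
		}
	}
	return acked
}
```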



Summary

| Severity | Doc | Issue | Status |
| --- | --- | --- | --- |
| MEDIUM | Admission | §5 M1 "acquire / release" ambiguous — fixed-length PUTs only get peekHeadroom in M1, not actual semaphore charging | Open |
| LOW | Admission | §3.1 flow diagram labels admission A as "reserve full body bytes" — contradicts §3.3 peek-only semantics | Open |
| LOW | Admission | §3.3 acquire early-return guard unreachable at all defined call sites; "malformed client" attribution is inaccurate | Open |
| LOW | Blob Offload | §3.2 mid-push PushChunkBlob failure handling unspecified; degradation ladder covers pre-flight only | Open |

No HIGH issues this pass. The MEDIUM and the three LOWs from Round 6 are all carry-forwards — the latest commit only addressed the CodeRabbit cosmetic items (typo + code fences), which are correctly resolved. The MEDIUM is worth closing before M1 implementation starts to avoid the fixed-length admission scope ambiguity propagating into code; the three LOWs can be batched in a single pass.

@bootjp bootjp merged commit 29011bb into main Apr 25, 2026
8 checks passed
@bootjp bootjp deleted the docs/s3-raft-design branch April 25, 2026 20:55