perf(s3): align Raft entry size with MaxSizePerMsg via s3ChunkBatchOps=4 #636
Conversation
Before this change, S3 PutObject built one Raft proposal per s3ChunkBatchOps (16) chunks at s3ChunkSize (1 MiB), so each Raft entry was ~16 MiB. Combined with PR #593's defaults (MaxSizePerMsg=4 MiB, MaxInflightMsgs=1024), the result was:

- etcd/raft's util.go:limitSize sends a *single* entry that exceeds MaxSizePerMsg as a solo MsgApp (this is documented behaviour: the "as an exception, if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned" branch).
- One 16 MiB entry per MsgApp means PR #593's advertised per-peer memory bound — `MaxInflight × MaxSizePerMsg = 1024 × 4 MiB = 4 GiB` — silently grew to `1024 × 16 MiB = 16 GiB / peer`, with the three-node leader-side worst case at 32 GiB.
- WAL fsync amortised 16 MiB of writes per entry, increasing tail latency on slow disks and widening follower fall-behind windows.

s3ChunkBatchOps=4 caps each Dispatch at 4 × 1 MiB = 4 MiB plus a few hundred bytes of protobuf overhead. The Raft entry now fits exactly within MaxSizePerMsg, so:

- The leader-side per-peer worst case matches the bound PR #593 documents (4 GiB / peer). For S3-heavy workloads the cluster-side worst case drops from 32 GiB to 8 GiB on a 3-node deployment.
- Each WAL fsync writes 4 MiB instead of 16 MiB; per-entry sync latency falls 4×, and the WAL group-commit path landed in PR #600 amortises the higher commit count.
- A 5 GiB PUT now produces 1280 Raft commits instead of 320 (4× growth). The follower apply loop and the read-side prefix scan (scanAllDeltaElems via s3keys.BlobKey) handle the same total bytes in 4× more keys; on-disk locality is unchanged because the underlying Pebble keys still sort by manifest order.

No reads or stored manifests are affected — s3ChunkBatchOps is purely a write-path batching parameter; manifests carry their own ChunkSizes slice that mirrors what was actually committed.

Build / vet / lint clean. S3 unit tests pass.
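The limitSize exception cited above can be illustrated with a self-contained sketch. This is a simplified model of the documented behaviour, not the actual etcd/raft code; the `entry` type and sizes are invented for the example:

```go
package main

import "fmt"

// entry models a Raft log entry; size is its encoded byte length.
type entry struct{ size uint64 }

// limitEntries mimics the documented etcd/raft limitSize exception
// (simplified, not the library's actual code): the first entry is always
// returned even when it alone exceeds maxSize, so a 16 MiB entry ships
// as a solo MsgApp instead of being rejected.
func limitEntries(ents []entry, maxSize uint64) []entry {
	if len(ents) == 0 {
		return ents
	}
	size := ents[0].size // first entry is unconditionally included
	n := 1
	for ; n < len(ents); n++ {
		size += ents[n].size
		if size > maxSize {
			break
		}
	}
	return ents[:n]
}

func main() {
	const MiB = 1 << 20
	oversized := []entry{{16 * MiB}, {16 * MiB}}
	fmt.Println(len(limitEntries(oversized, 4*MiB))) // 1: solo oversized MsgApp

	aligned := []entry{{MiB}, {MiB}, {MiB}, {MiB}, {MiB}}
	fmt.Println(len(limitEntries(aligned, 4*MiB))) // 4: batched path caps at 4 MiB
}
```

This is why the per-peer memory bound degrades silently: the limit shapes batches but never rejects an individual entry, so per-entry size is what really bounds memory.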
Follow-ups (separate PRs):

- docs/design: admission control on concurrent S3 PUT body bytes so the leader-side memory bound has a hard ceiling regardless of client concurrency.
- docs/design: Raft snapshot strategy / blob bypass so large objects do not bloat snapshot transfers when a follower falls behind.
Code Review
This pull request reduces the s3ChunkBatchOps constant from 16 to 4 to ensure that Raft entries align with the 4 MiB MaxSizePerMsg limit, preventing potential memory exhaustion on the leader. While this change correctly bounds memory for data-heavy operations, the reviewer pointed out that it also unnecessarily restricts the batch size for metadata-only operations like deletions and cleanup. Since these operations have much smaller payloads, using a batch size of 4 is inefficient; it is recommended to decouple the batch size configuration for data transfers from that of metadata operations to maintain performance.
```go
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
```
Changing s3ChunkBatchOps to 4 properly bounds the Raft entry size for putObject and uploadPart, but it simultaneously caps the batch size for metadata-only and delete operations that carry no data payload, such as deleteByPrefix (line 1947) and cleanupPartBlobsAsync (line 1893).

Delete operations have very small payloads, so they could be handled efficiently in much larger batches (e.g. 64–128) while staying within the MaxSizePerMsg (4 MiB) limit. With this change, deleting a large object or cleaning up an aborted upload issues unnecessarily many Raft proposals, which can hurt throughput and the time to complete cleanup. Consider separating the batch size for data-carrying writes from the batch size for deletes and scans.
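The arithmetic behind this suggestion checks out as a quick sketch (the 128-op metadata batch and the ~100 B key size are illustrative numbers for this review, not repo constants):

```go
package main

import "fmt"

const (
	MiB           = 1 << 20
	maxSizePerMsg = 4 * MiB // raft.Config.MaxSizePerMsg since PR #593
	s3ChunkSize   = 1 * MiB // payload per chunk mutation
)

func main() {
	// Data-carrying writes: the batch size is payload-bound,
	// so 4 chunks fill the 4 MiB budget.
	dataBatch := 4
	fmt.Println(dataBatch * s3ChunkSize) // 4194304

	// Key-only deletes: with ~100 B keys even a 128-op batch uses
	// under 1% of the budget, so capping deletes at 4 ops only
	// multiplies Raft proposal count without saving memory.
	metaBatch, keyBytes := 128, 100
	fmt.Println(metaBatch*keyBytes < maxSizePerMsg/100) // true
}
```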
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 080facc17f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
```
Reduce chunk batch size to include protobuf overhead
Setting s3ChunkBatchOps to 4 still produces Raft payloads larger than a 4 MiB MaxSizePerMsg once encoding overhead is added (the Request/mutation protobuf plus the 1-byte raft encoding prefix in marshalRaftCommand), so full-size 4×1 MiB chunk batches continue to take etcd/raft’s oversized-first-entry path instead of truly fitting the limit. In practice this means the stated “4 MiB aligned” bound in this change is not actually guaranteed for PutObject; you need headroom (or byte-based batching) rather than 4 * s3ChunkSize exactly.
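The overhead claim can be checked numerically. The sketch below applies real protobuf varint framing rules (one tag byte plus a length varint per length-delimited field) to an illustrative message shape — the repo's actual schema will differ by a few bytes, but the conclusion is robust to that:

```go
package main

import "fmt"

// varintLen returns how many bytes protobuf uses to encode n as a varint.
func varintLen(n int) int {
	l := 1
	for n >= 0x80 {
		n >>= 7
		l++
	}
	return l
}

// bytesFieldLen is the wire size of one length-delimited protobuf field:
// 1 tag byte + length varint + payload bytes.
func bytesFieldLen(payload int) int {
	return 1 + varintLen(payload) + payload
}

// entrySize models the encoded Raft entry: a 1-byte framing prefix (per the
// thread's description of marshalRaftCommand) plus one wrapped mutation per
// chunk, each carrying a key field and a 1 MiB value field. Field counts
// are illustrative, not the repo's actual schema.
func entrySize(chunks, keyLen int) int {
	const MiB = 1 << 20
	size := 1
	for i := 0; i < chunks; i++ {
		mutation := bytesFieldLen(keyLen) + bytesFieldLen(1*MiB)
		size += bytesFieldLen(mutation)
	}
	return size
}

func main() {
	const maxSizePerMsg = 4 << 20
	fmt.Println(entrySize(4, 1024) > maxSizePerMsg) // true: 4 × 1 MiB + framing overflows
	fmt.Println(entrySize(3, 1024) < maxSizePerMsg) // true: 3 × 1 MiB leaves ~1 MiB headroom
}
```

Even with zero-length keys the four payload fields alone exceed 4 MiB once each carries its tag and length varint, which is why headroom (or byte-based batching) is needed rather than `4 * s3ChunkSize` exactly.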
…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once protobuf framing overhead is counted. Each pb.Mutation carries the Op tag, Key tag + bytes, and Value length prefix; the pb.Request envelope wraps them; marshalRaftCommand prepends one byte. With a normal-length objectKey the encoded entry is ~4 MiB + 300 B; with a 1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry falls into etcd/raft's util.go:limitSize oversized-first-entry path (the documented "as an exception" branch), bypassing the `MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB = 3 MiB + a few hundred bytes` leaves ~1 MiB of headroom even for kilobyte-scale keys, so the entry stays strictly under MaxSizePerMsg and the batched-MsgApp path applies. Per-PUT Raft commit count grows ~5× from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries) but each fsync is ~5× smaller; the WAL group commit landed in PR #600 absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths (cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs, appendPartBlobKeys) onto the same 4-op batch even though they carry no chunk payload. Those ops are pure-key Dels; a 4-op batch amplifies cleanup latency for large objects (5 GiB cleanup → ~1707 batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys ~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits ~80 batches instead of ~1707, so orphaned-blob garbage collection finishes proportionally faster and does not amplify Raft load relative to the data-write path.

Updates the four cleanup callsites (adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys loop helper.
Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject batch (1 KiB objectKey, fully-populated 1 MiB chunk values), marshals via the same protobuf path the real Dispatch uses, and asserts the encoded entry plus the marshalRaftCommand 1-byte framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second assertion pins the headroom margin at >= 64 KiB so a future bump in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
Both Codex P2 and Gemini medium are addressed in 5338c05.

- Codex P2 (measured with protobuf encoding included, the entry overflows): s3ChunkBatchOps 4 → 3.
- Gemini medium (cleanup paths share the data-write batch size): cleanup batching split out to s3MetaBatchOps = 64.
Test plan
/gemini review
Code Review
This pull request optimizes S3 operation batching by adjusting s3ChunkBatchOps and introducing s3MetaBatchOps to ensure Raft entries stay within the 4 MiB limit. It adds regression tests to verify these size invariants. A bug was identified in appendPartBlobKeys where passing the pending slice by value prevents the batching logic from correctly flushing data.
```diff
 	Key: s3keys.VersionedBlobKey(bucket, generation, objectKey, uploadID, part.PartNo, chunkIndex, part.PartVersion),
 })
-if len(pending) >= s3ChunkBatchOps {
+if len(pending) >= s3MetaBatchOps {
```
The batching logic inside appendPartBlobKeys does not work correctly. Because the pending slice is passed by value, changes made by append inside the function (length updates and reallocations) are not visible to the caller's pending variable that the flush closure captures. As a result, len(pending) inside flush is always 0, per-batch flushes never run, and all keys accumulate into one giant batch. This violates the repository principle of preventing unbounded growth of collections driven by external requests to avoid OOM. Change the function signature to pass a pointer (*[]*kv.Elem[kv.OP]), or restructure the batch management.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
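The slice-header pitfall described here reproduces in a few lines. The helper names below are illustrative, with a flush threshold of 2 standing in for s3MetaBatchOps:

```go
package main

import "fmt"

// appendChunksByValue receives a copy of the slice header: appends grow
// only the local copy, so a flush closure capturing the caller's variable
// keeps seeing the stale (shorter) length and silently no-ops.
func appendChunksByValue(pending []int, chunks []int, flush func()) []int {
	for _, c := range chunks {
		pending = append(pending, c)
		if len(pending) >= 2 {
			flush()
		}
	}
	return pending
}

// appendChunksByPointer writes through the caller's variable — the shape
// of the suggested fix — so the closure observes every append and
// threshold-triggered flushes fire.
func appendChunksByPointer(pending *[]int, chunks []int, flush func()) {
	for _, c := range chunks {
		*pending = append(*pending, c)
		if len(*pending) >= 2 {
			flush()
		}
	}
}

func main() {
	chunks := []int{1, 2, 3, 4}
	pending := []int{}
	var flushSizes []int
	flush := func() {
		if len(pending) == 0 {
			return // by-value helper: caller's pending still looks empty
		}
		flushSizes = append(flushSizes, len(pending))
		pending = pending[:0]
	}

	pending = appendChunksByValue(pending, chunks, flush)
	flush()                 // caller's tail flush
	fmt.Println(flushSizes) // [4]: thresholds never fired, one giant batch

	pending, flushSizes = pending[:0], nil
	appendChunksByPointer(&pending, chunks, flush)
	flush()
	fmt.Println(flushSizes) // [2 2]: every threshold flush fired
}
```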
@claude[agent] review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5338c054c7
```diff
 readTS := s.readTS()
 readPin := s.pinReadTS(readTS)
-kvs, err := s.store.ScanAt(ctx, cursor, end, s3ChunkBatchOps, readTS)
+kvs, err := s.store.ScanAt(ctx, cursor, end, s3MetaBatchOps, readTS)
```
Keep blob cleanup scan page size independent of delete batch
deleteByPrefix now uses s3MetaBatchOps for ScanAt, so cleanup of blob prefixes scans 64 full chunk KVs per loop. In this codebase, ScanAt materializes values (e.g. store/mvcc_store.go clones Value in collectScanResults), and blob values are ~1 MiB each, so one cleanup worker can allocate/read ~64 MiB per page and up to ~1 GiB with 16 concurrent cleanup workers. This is a regression from the smaller scan page and can cause significant memory/GC and I/O spikes during abort/manifest cleanup even though the code only needs keys for deletes; consider keeping scan size small (or key-only scan) while still batching delete dispatches at 64.
LGTM ✅ Reviewed 5338c05. The changes correctly address the MaxSizePerMsg alignment issue: key fixes verified, test coverage in place, and the performance trade-offs are well-reasoned. Documentation is thorough. No issues found.
…s appends

Gemini high — the previous round split cleanup paths onto s3MetaBatchOps but missed that `appendPartBlobKeys` still took `pending []*kv.Elem[kv.OP]` by value. The flush closure inside `cleanupManifestBlobs` captures the outer `pending` slice header; appends performed inside the helper update the local header only, so flush sees length 0 every iteration and silently no-ops. After the helper returns, the caller reassigns `pending` from the helper's return value — but by then the entire part's chunks have accumulated into one slice that bypasses the s3MetaBatchOps cap entirely.

For a manifest with thousands of chunks (e.g. a 5 GiB-object cleanup with ~5120 chunks), the broken helper builds one giant batch and hands it to coordinator.Dispatch in a single call, re-introducing exactly the OOM / oversized-MsgApp risk the s3MetaBatchOps cap was meant to prevent. This is the same shape as the pre-PR-#636 behaviour s3ChunkBatchOps was tightening, just on the cleanup side.

Fix: take `pending` by pointer (`*[]*kv.Elem[kv.OP]`) and write through it. Now appends inside the helper update the variable the flush closure reads, so threshold-triggered flushes fire correctly and the slice length never exceeds s3MetaBatchOps.

Test:

- TestAppendPartBlobKeys_FlushFiresEveryS3MetaBatchOps builds a part with 2 × s3MetaBatchOps + 7 chunks, calls the helper directly with a recording flush closure, and asserts flush fired exactly twice with batch sizes [s3MetaBatchOps, s3MetaBatchOps] plus 7 trailing entries left for the caller's tail flush. The test catches the bug: the value-passing version would record 0 flushes (the closure always saw length 0) — the assertion `Equal(2, flushCalls)` would fail loudly.

Build / vet / lint clean. All S3 + cleanup tests pass.
Gemini high was right — this was a real bug. Fixed in 97b4fc58, with a regression test.

What was broken

Old signature:

```go
func (s *S3Server) appendPartBlobKeys(pending []*kv.Elem[kv.OP], ..., flush func()) ([]*kv.Elem[kv.OP], bool)
```

Caller side:

```go
flush := func() {
	if len(pending) == 0 { return } // ← reads the caller's pending
	Dispatch(pending)
	pending = pending[:0]
}
for _, part := range manifest.Parts {
	pending, ok = s.appendPartBlobKeys(pending, ..., flush) // pass by value
}
flush()
```

The problem: the helper appends to its local copy of the slice header, so the flush closure always sees the caller's stale length and never fires mid-part. In terms of a 5 GiB-object cleanup (5120 chunks), every chunk key accumulates into one batch before the caller's tail flush runs.

Fix

```go
func (s *S3Server) appendPartBlobKeys(pending *[]*kv.Elem[kv.OP], ..., flush func()) bool {
	*pending = append(*pending, &kv.Elem[kv.OP]{...})
	if len(*pending) >= s3MetaBatchOps {
		flush()
	}
}
```

Regression test

With the old (by-value) code the recording flush closure observes zero flushes, so the new test fails loudly against the broken version.

Impact

Limited to the cleanup path (the appendPartBlobKeys helper used by cleanupManifestBlobs); the data-write batching was unaffected.

Build / vet / lint clean; the three tests pass.

/gemini review
Codex Review: Didn't find any major issues. Can't wait for the next one!
## Summary

As a follow-up to PR #636 (`s3ChunkBatchOps = 4`, aligning Raft entries with `MaxSizePerMsg = 4 MiB`), this submits two design docs. Docs only, no code.

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**Problem**: PR #636 fixes per-entry memory accounting, but the aggregate (= concurrent PUT count × inflight bytes per PUT) is unbounded by client parallelism.

**Proposal**: hard-cap, via a shared semaphore, the S3 PUT body bytes a node accepts before committing to Raft.

- (A) pre-charge the request on entry using `Content-Length`
- (B) sub-lease per batch (4 MiB) in `flushBatch` — return 503 SlowDown once `dispatchAdmissionTimeout` elapses
- default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
- 503 with `Retry-After` → AWS SDKs retry automatically
- staged rollout M1–M4, including metrics / Grafana panel

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**Problem**: a 5 GiB PUT is written to the WAL in full, so follower catch-up and snapshot transfer scale O(stored bytes).

**Proposal**: Raft carries only `ChunkRef{SHA256, Size}` (~100 B) and the manifest. Chunk bodies are written directly to local Pebble under a separate key namespace `!s3|chunkblob|<SHA>`; when a follower sees a `chunkref` in applyLoop, it fetches the chunk from a peer via an async fetch RPC.

- Raft load for a 1 GiB PUT: compressed to ~100 KiB (10⁴× reduction)
- snapshot size becomes O(manifest count)
- GC via reference counting + grace period
- opt-in via `ELASTICKV_S3_BLOB_OFFLOAD=true`; the legacy `BlobKey` path runs in parallel
- five-stage rollout from M0 spike to M4 migrator

Both docs:

- follow the filename / header conventions in `docs/design/README.md` (Status: Proposed / Date: 2026-04-25)
- cross-reference the related PRs (#593, #600, #612, #617, #636) and existing designs (workload isolation, raft grpc streaming)
- call out risks and open questions explicitly

## Test plan

- [x] markdown lint clean (no textlint; verified manually)
- [x] filenames follow the `YYYY_MM_DD_proposed_<slug>.md` convention
- [x] links to existing designs are consistent with docs/design/README.md

Review focus requested: the design direction (the admission-control cap value, the content-addressing unit for blob offload, and the M0–M4 milestone split).

/gemini review
@codex review

## Summary by CodeRabbit

## Documentation
* Added design proposal for S3 PUT request admission control and load management strategy
* Added design proposal for S3 object storage optimization approach
Summary

Aligns the S3 PutObject Raft entry size with MaxSizePerMsg (4 MiB since PR #593).

Change:

- s3ChunkBatchOps = 16 → 4. One Raft entry = s3ChunkBatchOps × s3ChunkSize ≈ 4 MiB. One-line diff.

Why

etcd/raft's util.go:limitSize has a documented exception: a single entry that exceeds MaxSizePerMsg is not rejected but shipped alone in its own MsgApp ("if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned"). A 16 MiB entry sails straight through this path, so:

The 1024 × 4 MiB = 4 GiB / peer memory bound advertised in the PR #593 description assumed small entries and did not hold on the S3 path. With this PR, S3 traffic rides the normal batched-MsgApp path, so the bound is accurate for S3 as well.

Raft commit count grows 4×, but each fsync shrinks 4×, so together with PR #600's WAL group commit, end-to-end throughput should be roughly unchanged.

Test plan

- go build ./... clean
- golangci-lint run ./... 0 issues
- go test ./adapter/ -short -run 'TestS3|S3Server' pass

Follow-ups (design docs in separate PRs)

- admission control so the aggregate does not exceed 4 GiB × peers.

/gemini review
@codex review