perf(s3): align Raft entry size with MaxSizePerMsg via s3ChunkBatchOps=4 #636
Conversation
Before this change, S3 PutObject built one Raft proposal per s3ChunkBatchOps (16) chunks at s3ChunkSize (1 MiB), so each Raft entry was ~16 MiB. Combined with PR #593's defaults (MaxSizePerMsg=4 MiB, MaxInflightMsgs=1024), the result was:

- etcd/raft's util.go:limitSize sends a *single* entry that exceeds MaxSizePerMsg as a solo MsgApp (this is documented behaviour: the "as an exception, if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned" branch).
- One 16 MiB entry per MsgApp means PR #593's advertised per-peer memory bound — `MaxInflight × MaxSizePerMsg = 1024 × 4 MiB = 4 GiB` — silently grew to `1024 × 16 MiB = 16 GiB / peer`, with the three-node leader-side worst case at 32 GiB.
- WAL fsync amortised 16 MiB of writes per entry, increasing tail latency on slow disks and widening follower fall-behind windows.

s3ChunkBatchOps=4 caps each Dispatch at 4 × 1 MiB = 4 MiB plus a few hundred bytes of protobuf overhead. The Raft entry now fits exactly within MaxSizePerMsg, so:

- The leader-side per-peer worst case matches the bound PR #593 documents (4 GiB / peer). For S3-heavy workloads the cluster-side worst case drops from 32 GiB to 8 GiB on a 3-node deployment.
- Each WAL fsync writes 4 MiB instead of 16 MiB; per-entry sync latency falls 4×, and the WAL group-commit path landed in PR #600 amortises the higher commit count.
- A 5 GiB PUT now produces 1280 Raft commits instead of 320 (4× growth). The follower apply loop and the read-side prefix scan (scanAllDeltaElems via s3keys.BlobKey) handle the same total bytes in 4× more keys; on-disk locality is unchanged because the underlying Pebble keys still sort by manifest order.

No reads or stored manifests are affected — s3ChunkBatchOps is purely a write-path batching parameter; manifests carry their own ChunkSizes slice that mirrors what was actually committed.

Build / vet / lint clean. S3 unit tests pass.
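The limitSize exception cited above can be illustrated with a self-contained sketch. This is a simplified model of the documented behaviour, not the actual etcd/raft code; the `entry` type and sizes are invented for the example:

```go
package main

import "fmt"

// entry models a Raft log entry; size is its encoded byte length.
type entry struct{ size uint64 }

// limitEntries mimics the documented etcd/raft limitSize exception
// (simplified, not the library's actual code): the first entry is always
// returned even when it alone exceeds maxSize, so a 16 MiB entry ships
// as a solo MsgApp instead of being rejected.
func limitEntries(ents []entry, maxSize uint64) []entry {
	if len(ents) == 0 {
		return ents
	}
	size := ents[0].size // first entry is unconditionally included
	n := 1
	for ; n < len(ents); n++ {
		size += ents[n].size
		if size > maxSize {
			break
		}
	}
	return ents[:n]
}

func main() {
	const MiB = 1 << 20
	oversized := []entry{{16 * MiB}, {16 * MiB}}
	fmt.Println(len(limitEntries(oversized, 4*MiB))) // 1: solo oversized MsgApp

	aligned := []entry{{MiB}, {MiB}, {MiB}, {MiB}, {MiB}}
	fmt.Println(len(limitEntries(aligned, 4*MiB))) // 4: batched path caps at 4 MiB
}
```

This is why the per-peer memory bound degrades silently: the limit shapes batches but never rejects an individual entry, so per-entry size is what really bounds memory.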
Follow-ups (separate PRs):

- docs/design: admission control on concurrent S3 PUT body bytes so the leader-side memory bound has a hard ceiling regardless of client concurrency.
- docs/design: Raft snapshot strategy / blob bypass so large objects do not bloat snapshot transfers when a follower falls behind.
Code Review
This pull request reduces the s3ChunkBatchOps constant from 16 to 4 to ensure that Raft entries align with the 4 MiB MaxSizePerMsg limit, preventing potential memory exhaustion on the leader. While this change correctly bounds memory for data-heavy operations, the reviewer pointed out that it also unnecessarily restricts the batch size for metadata-only operations like deletions and cleanup. Since these operations have much smaller payloads, using a batch size of 4 is inefficient; it is recommended to decouple the batch size configuration for data transfers from that of metadata operations to maintain performance.
```go
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
```
Changing s3ChunkBatchOps to 4 properly bounds the Raft entry size for putObject and uploadPart, but it simultaneously caps the batch size for metadata-only and delete operations that carry no data payload, such as deleteByPrefix (line 1947) and cleanupPartBlobsAsync (line 1893).

Delete operations have very small payloads, so they could be handled efficiently in much larger batches (e.g. 64–128) while staying within the MaxSizePerMsg (4 MiB) limit. With this change, deleting a large object or cleaning up an aborted upload issues unnecessarily many Raft proposals, which can hurt throughput and the time to complete cleanup. Consider separating the batch size for data-carrying writes from the batch size for deletes and scans.
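The arithmetic behind this suggestion checks out as a quick sketch (the 128-op metadata batch and the ~100 B key size are illustrative numbers for this review, not repo constants):

```go
package main

import "fmt"

const (
	MiB           = 1 << 20
	maxSizePerMsg = 4 * MiB // raft.Config.MaxSizePerMsg since PR #593
	s3ChunkSize   = 1 * MiB // payload per chunk mutation
)

func main() {
	// Data-carrying writes: the batch size is payload-bound,
	// so 4 chunks fill the 4 MiB budget.
	dataBatch := 4
	fmt.Println(dataBatch * s3ChunkSize) // 4194304

	// Key-only deletes: with ~100 B keys even a 128-op batch uses
	// under 1% of the budget, so capping deletes at 4 ops only
	// multiplies Raft proposal count without saving memory.
	metaBatch, keyBytes := 128, 100
	fmt.Println(metaBatch*keyBytes < maxSizePerMsg/100) // true
}
```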
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 080facc17f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
// advertises. Per-PUT Raft commit count grows 4× (a 5 GiB PUT goes
// from 320 to 1280 entries) — absorbed by the WAL group commit
// landed in PR #600 and the smaller per-entry fsync.
s3ChunkBatchOps = 4
```
Reduce chunk batch size to include protobuf overhead
Setting s3ChunkBatchOps to 4 still produces Raft payloads larger than a 4 MiB MaxSizePerMsg once encoding overhead is added (the Request/mutation protobuf plus the 1-byte raft encoding prefix in marshalRaftCommand), so full-size 4×1 MiB chunk batches continue to take etcd/raft’s oversized-first-entry path instead of truly fitting the limit. In practice this means the stated “4 MiB aligned” bound in this change is not actually guaranteed for PutObject; you need headroom (or byte-based batching) rather than 4 * s3ChunkSize exactly.
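The overhead claim can be checked numerically. The sketch below applies real protobuf varint framing rules (one tag byte plus a length varint per length-delimited field) to an illustrative message shape — the repo's actual schema will differ by a few bytes, but the conclusion is robust to that:

```go
package main

import "fmt"

// varintLen returns how many bytes protobuf uses to encode n as a varint.
func varintLen(n int) int {
	l := 1
	for n >= 0x80 {
		n >>= 7
		l++
	}
	return l
}

// bytesFieldLen is the wire size of one length-delimited protobuf field:
// 1 tag byte + length varint + payload bytes.
func bytesFieldLen(payload int) int {
	return 1 + varintLen(payload) + payload
}

// entrySize models the encoded Raft entry: a 1-byte framing prefix (per the
// thread's description of marshalRaftCommand) plus one wrapped mutation per
// chunk, each carrying a key field and a 1 MiB value field. Field counts
// are illustrative, not the repo's actual schema.
func entrySize(chunks, keyLen int) int {
	const MiB = 1 << 20
	size := 1
	for i := 0; i < chunks; i++ {
		mutation := bytesFieldLen(keyLen) + bytesFieldLen(1*MiB)
		size += bytesFieldLen(mutation)
	}
	return size
}

func main() {
	const maxSizePerMsg = 4 << 20
	fmt.Println(entrySize(4, 1024) > maxSizePerMsg) // true: 4 × 1 MiB + framing overflows
	fmt.Println(entrySize(3, 1024) < maxSizePerMsg) // true: 3 × 1 MiB leaves ~1 MiB headroom
}
```

Even with zero-length keys the four payload fields alone exceed 4 MiB once each carries its tag and length varint, which is why headroom (or byte-based batching) is needed rather than `4 * s3ChunkSize` exactly.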
…ni medium)

Address two PR #636 review concerns:

Codex P2 — `4 × 1 MiB = 4 MiB` exactly is *over* `MaxSizePerMsg` once protobuf framing overhead is counted. Each pb.Mutation carries the Op tag, Key tag + bytes, and Value length prefix; the pb.Request envelope wraps them; marshalRaftCommand prepends one byte. With a normal-length objectKey the encoded entry is ~4 MiB + 300 B; with a 1 KiB objectKey it grows to ~4 MiB + 4 KiB. Either way the entry falls into etcd/raft's util.go:limitSize oversized-first-entry path (the documented "as an exception" branch), bypassing the `MaxInflight × MaxSizePerMsg` per-peer memory bound the prior commit was trying to enforce.

Fix: lower s3ChunkBatchOps from 4 to 3. `3 × 1 MiB = 3 MiB + a few hundred bytes` leaves ~1 MiB of headroom even for kilobyte-scale keys, so the entry stays strictly under MaxSizePerMsg and the batched-MsgApp path applies. Per-PUT Raft commit count grows ~5× from the pre-PR-#636 baseline (5 GiB PUT: 320 → ~1707 entries) but each fsync is ~5× smaller; the WAL group commit landed in PR #600 absorbs the rate.

Gemini medium — the previous commit also forced the cleanup paths (cleanupPartBlobsAsync, deleteByPrefix, cleanupManifestBlobs, appendPartBlobKeys) onto the same 4-op batch even though they carry no chunk payload. Those ops are pure-key Dels; a 4-op batch amplifies cleanup latency for large objects (5 GiB cleanup → ~1707 batches at the data-write size).

Fix: introduce s3MetaBatchOps = 64. With BlobKey-shaped keys ~100 B each, 64 keys × 100 B ≈ 6 KiB per batch, still three orders of magnitude under MaxSizePerMsg. A 5 GiB-object cleanup commits ~80 batches instead of ~1707, so orphaned-blob garbage collection finishes proportionally faster and does not amplify Raft load relative to the data-write path.

Updates the four cleanup callsites (adapter/s3.go:1879, 1900, 1933, 2206, 2227) plus the appendPartBlobKeys loop helper.
Tests:

- TestS3ChunkBatchFitsInRaftMaxSize builds a worst-case PutObject batch (1 KiB objectKey, fully-populated 1 MiB chunk values), marshals via the same protobuf path the real Dispatch uses, and asserts the encoded entry plus the marshalRaftCommand 1-byte framing prefix is strictly under MaxSizePerMsg = 4 MiB. A second assertion pins the headroom margin at >= 64 KiB so a future bump in s3ChunkBatchOps or s3ChunkSize is caught at PR time.
- TestS3MetaBatchFitsInRaftMaxSize is the equivalent invariant for s3MetaBatchOps on the cleanup path.

Build / vet / lint clean. S3 unit tests pass.
Both Codex P2 and Gemini medium are addressed in 5338c05.

- Codex P2 (measured with protobuf encoding included, the entry overflows): s3ChunkBatchOps 4 → 3.
- Gemini medium (cleanup paths share the data-write batch size): cleanup batching split out to s3MetaBatchOps = 64.
Test plan
/gemini review
Code Review
This pull request optimizes S3 operation batching by adjusting s3ChunkBatchOps and introducing s3MetaBatchOps to ensure Raft entries stay within the 4 MiB limit. It adds regression tests to verify these size invariants. A bug was identified in appendPartBlobKeys where passing the pending slice by value prevents the batching logic from correctly flushing data.
```diff
 	Key: s3keys.VersionedBlobKey(bucket, generation, objectKey, uploadID, part.PartNo, chunkIndex, part.PartVersion),
 })
-if len(pending) >= s3ChunkBatchOps {
+if len(pending) >= s3MetaBatchOps {
```
The batching logic inside appendPartBlobKeys does not work correctly. Because the pending slice is passed by value, changes made by append inside the function (length updates and reallocations) are not visible to the caller's pending variable that the flush closure captures. As a result, len(pending) inside flush is always 0, per-batch flushes never run, and all keys accumulate into one giant batch. This violates the repository principle of preventing unbounded growth of collections driven by external requests to avoid OOM. Change the function signature to pass a pointer (*[]*kv.Elem[kv.OP]), or restructure the batch management.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
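The slice-header pitfall described here reproduces in a few lines. The helper names below are illustrative, with a flush threshold of 2 standing in for s3MetaBatchOps:

```go
package main

import "fmt"

// appendChunksByValue receives a copy of the slice header: appends grow
// only the local copy, so a flush closure capturing the caller's variable
// keeps seeing the stale (shorter) length and silently no-ops.
func appendChunksByValue(pending []int, chunks []int, flush func()) []int {
	for _, c := range chunks {
		pending = append(pending, c)
		if len(pending) >= 2 {
			flush()
		}
	}
	return pending
}

// appendChunksByPointer writes through the caller's variable — the shape
// of the suggested fix — so the closure observes every append and
// threshold-triggered flushes fire.
func appendChunksByPointer(pending *[]int, chunks []int, flush func()) {
	for _, c := range chunks {
		*pending = append(*pending, c)
		if len(*pending) >= 2 {
			flush()
		}
	}
}

func main() {
	chunks := []int{1, 2, 3, 4}
	pending := []int{}
	var flushSizes []int
	flush := func() {
		if len(pending) == 0 {
			return // by-value helper: caller's pending still looks empty
		}
		flushSizes = append(flushSizes, len(pending))
		pending = pending[:0]
	}

	pending = appendChunksByValue(pending, chunks, flush)
	flush()                 // caller's tail flush
	fmt.Println(flushSizes) // [4]: thresholds never fired, one giant batch

	pending, flushSizes = pending[:0], nil
	appendChunksByPointer(&pending, chunks, flush)
	flush()
	fmt.Println(flushSizes) // [2 2]: every threshold flush fired
}
```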
@claude[agent] review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5338c054c7
```diff
 readTS := s.readTS()
 readPin := s.pinReadTS(readTS)
-kvs, err := s.store.ScanAt(ctx, cursor, end, s3ChunkBatchOps, readTS)
+kvs, err := s.store.ScanAt(ctx, cursor, end, s3MetaBatchOps, readTS)
```
Keep blob cleanup scan page size independent of delete batch
deleteByPrefix now uses s3MetaBatchOps for ScanAt, so cleanup of blob prefixes scans 64 full chunk KVs per loop. In this codebase, ScanAt materializes values (e.g. store/mvcc_store.go clones Value in collectScanResults), and blob values are ~1 MiB each, so one cleanup worker can allocate/read ~64 MiB per page and up to ~1 GiB with 16 concurrent cleanup workers. This is a regression from the smaller scan page and can cause significant memory/GC and I/O spikes during abort/manifest cleanup even though the code only needs keys for deletes; consider keeping scan size small (or key-only scan) while still batching delete dispatches at 64.
LGTM ✅ Reviewed 5338c05. The changes correctly address the MaxSizePerMsg alignment issue: key fixes verified, test coverage in place, and the performance trade-offs are well-reasoned. Documentation is thorough. No issues found.
…s appends

Gemini high — the previous round split cleanup paths onto s3MetaBatchOps but missed that `appendPartBlobKeys` still took `pending []*kv.Elem[kv.OP]` by value. The flush closure inside `cleanupManifestBlobs` captures the outer `pending` slice header; appends performed inside the helper update the local header only, so flush sees length 0 every iteration and silently no-ops. After the helper returns, the caller reassigns `pending` from the helper's return value — but by then the entire part's chunks have accumulated into one slice that bypasses the s3MetaBatchOps cap entirely.

For a manifest with thousands of chunks (e.g. a 5 GiB-object cleanup with ~5120 chunks), the broken helper builds one giant batch and hands it to coordinator.Dispatch in a single call, re-introducing exactly the OOM / oversized-MsgApp risk the s3MetaBatchOps cap was meant to prevent. This is the same shape as the pre-PR-#636 behaviour s3ChunkBatchOps was tightening, just on the cleanup side.

Fix: take `pending` by pointer (`*[]*kv.Elem[kv.OP]`) and write through it. Now appends inside the helper update the variable the flush closure reads, so threshold-triggered flushes fire correctly and the slice length never exceeds s3MetaBatchOps.

Test:

- TestAppendPartBlobKeys_FlushFiresEveryS3MetaBatchOps builds a part with 2 × s3MetaBatchOps + 7 chunks, calls the helper directly with a recording flush closure, and asserts flush fired exactly twice with batch sizes [s3MetaBatchOps, s3MetaBatchOps] plus 7 trailing entries left for the caller's tail flush. The test catches the bug: the value-passing version would record 0 flushes (the closure always saw length 0) — the assertion `Equal(2, flushCalls)` would fail loudly.

Build / vet / lint clean. All S3 + cleanup tests pass.
Gemini high was right — this was a real bug. Fixed in 97b4fc58, with a regression test.

What was broken

Old signature:

```go
func (s *S3Server) appendPartBlobKeys(pending []*kv.Elem[kv.OP], ..., flush func()) ([]*kv.Elem[kv.OP], bool)
```

Caller side:

```go
flush := func() {
	if len(pending) == 0 { return } // ← reads the caller's pending
	Dispatch(pending)
	pending = pending[:0]
}
for _, part := range manifest.Parts {
	pending, ok = s.appendPartBlobKeys(pending, ..., flush) // pass by value
}
flush()
```

The problem: the helper appends to its local copy of the slice header, so the flush closure always sees the caller's stale length and never fires mid-part. In terms of a 5 GiB-object cleanup (5120 chunks), every chunk key accumulates into one batch before the caller's tail flush runs.

Fix

```go
func (s *S3Server) appendPartBlobKeys(pending *[]*kv.Elem[kv.OP], ..., flush func()) bool {
	*pending = append(*pending, &kv.Elem[kv.OP]{...})
	if len(*pending) >= s3MetaBatchOps {
		flush()
	}
}
```

Regression test

With the old (by-value) code the recording flush closure observes zero flushes, so the new test fails loudly against the broken version.

Impact

Limited to the cleanup path (the appendPartBlobKeys helper used by cleanupManifestBlobs); the data-write batching was unaffected.

Build / vet / lint clean; the three tests pass.

/gemini review
Codex Review: Didn't find any major issues. Can't wait for the next one!
## Summary

As a follow-up to PR #636 (`s3ChunkBatchOps = 4`, aligning Raft entries with `MaxSizePerMsg = 4 MiB`), this submits two design docs. Docs only, no code.

### 1. `docs/design/2026_04_25_proposed_s3_admission_control.md`

**Problem**: PR #636 fixes per-entry memory accounting, but the aggregate (= concurrent PUT count × inflight bytes per PUT) is unbounded by client parallelism.

**Proposal**: hard-cap, via a shared semaphore, the S3 PUT body bytes a node accepts before committing to Raft.

- (A) pre-charge the request on entry using `Content-Length`
- (B) sub-lease per batch (4 MiB) in `flushBatch` — return 503 SlowDown once `dispatchAdmissionTimeout` elapses
- default 256 MiB cap (~14% of GOMEMLIMIT=1800 MiB), tunable via env
- 503 with `Retry-After` → AWS SDKs retry automatically
- staged rollout M1–M4, including metrics / Grafana panel

### 2. `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md`

**Problem**: a 5 GiB PUT is written to the WAL in full, so follower catch-up and snapshot transfer scale O(stored bytes).

**Proposal**: Raft carries only `ChunkRef{SHA256, Size}` (~100 B) and the manifest. Chunk bodies are written directly to local Pebble under a separate key namespace `!s3|chunkblob|<SHA>`; when a follower sees a `chunkref` in applyLoop, it fetches the chunk from a peer via an async fetch RPC.

- Raft load for a 1 GiB PUT: compressed to ~100 KiB (10⁴× reduction)
- snapshot size becomes O(manifest count)
- GC via reference counting + grace period
- opt-in via `ELASTICKV_S3_BLOB_OFFLOAD=true`; the legacy `BlobKey` path runs in parallel
- five-stage rollout from M0 spike to M4 migrator

Both docs:

- follow the filename / header conventions in `docs/design/README.md` (Status: Proposed / Date: 2026-04-25)
- cross-reference the related PRs (#593, #600, #612, #617, #636) and existing designs (workload isolation, raft grpc streaming)
- call out risks and open questions explicitly

## Test plan

- [x] markdown lint clean (no textlint; verified manually)
- [x] filenames follow the `YYYY_MM_DD_proposed_<slug>.md` convention
- [x] links to existing designs are consistent with docs/design/README.md

Review focus requested: the design direction (the admission-control cap value, the content-addressing unit for blob offload, and the M0–M4 milestone split).

/gemini review
@codex review

## Summary by CodeRabbit

## Documentation
* Added design proposal for S3 PUT request admission control and load management strategy
* Added design proposal for S3 object storage optimization approach
Summary

Aligns the S3 PutObject Raft entry size with MaxSizePerMsg (4 MiB since PR #593).

Change:

- s3ChunkBatchOps = 16 → 4. One Raft entry = s3ChunkBatchOps × s3ChunkSize ≈ 4 MiB. One-line diff.

Why

etcd/raft's util.go:limitSize has a documented exception: a single entry that exceeds MaxSizePerMsg is not rejected but shipped alone in its own MsgApp ("if the size of the first entry exceeds maxSize, a non-empty slice with just this entry is returned"). A 16 MiB entry sails straight through this path, so:

The 1024 × 4 MiB = 4 GiB / peer memory bound advertised in the PR #593 description assumed small entries and did not hold on the S3 path. With this PR, S3 traffic rides the normal batched-MsgApp path, so the bound is accurate for S3 as well.

Raft commit count grows 4×, but each fsync shrinks 4×, so together with PR #600's WAL group commit, end-to-end throughput should be roughly unchanged.

Test plan

- go build ./... clean
- golangci-lint run ./... 0 issues
- go test ./adapter/ -short -run 'TestS3|S3Server' pass

Follow-ups (design docs in separate PRs)

- admission control so the aggregate does not exceed 4 GiB × peers.

/gemini review
@codex review