perf(store): add ELASTICKV_FSM_SYNC_MODE for FSM apply fsync opt-out by bootjp · Pull Request #592 · bootjp/elastickv

bootjp · 2026-04-23T12:36:15Z

Root cause

Every committed raft entry triggered a pebble.Sync inside store.ApplyMutations / store.DeletePrefixAt (see store/lsm_store.go:1056, 1108), on top of the raft WAL fsync that etcd/raft already performs per Ready batch in persistReadyToWAL (internal/raftengine/etcd/wal_store.go:376).

The raft Ready loop in drainReady (internal/raftengine/etcd/engine.go:1389) already batches multiple entries per Ready, so the raft WAL fsync is fine. The hot fsync is the FSM-side b.Commit(pebble.Sync) that the apply loop (applyCommitted) hits once per entry. A prior cleanup (docs/review_todo.md section 3.4) intentionally kept ApplyMutations on pebble.Sync; this CL makes that choice tunable.

Microbenchmark (Apple M1 Max, APFS tempdir, BenchmarkApplyMutations_SyncMode):

mode	ns/op	allocs/op
sync	16292899	7
nosync	16293	8

~1000x on this platform. Real hardware fsync latency varies, but the sync/nosync ratio is consistently large on any WAL that enforces platform durability.

Durability argument

Pebble's FSM-commit fsync is redundant with the raft WAL under this codebase's crash-recovery model:

Raft WAL (etcd/raft) fsyncs every committed entry via persist.Save before Advance.
On restart, newMemoryStorage (internal/raftengine/etcd/persistence.go:352) reloads the snapshot + all WAL entries. newRawNode does not set Config.Applied, so etcdraft defaults it to snapshot.Metadata.Index.
The engine sets e.applied = maxAppliedIndex(LocalSnap) and every committed entry past the snapshot is re-delivered through CommittedEntries on the first Ready.
kv/fsm.applyCommitWithIdempotencyFallback treats an already-committed key (LatestCommitTS >= commitTS) as an idempotent retry, so replaying an entry whose effect survived the crash is safe.
FSM snapshots are fsynced (writeFSMSnapshotFile then f.Sync() in fsm_snapshot_file.go).

Therefore a crash that loses the unfsynced tail of Pebble's own WAL is recoverable: raft replays from the last fsynced FSM snapshot onwards, and the idempotent apply path re-materialises the lost state. Pebble on the FSM commit path effectively becomes a volatile cache of applied state whose durability boundary is the raft WAL.

Other pebble.Sync call sites (snapshot-batch commit, metadata-restore writes, compaction commitSnapshotBatch) are untouched: those are orthogonal durability boundaries (e.g. restore-directory swap) and are not per-proposal cost.

Env var + default

ELASTICKV_FSM_SYNC_MODE=sync (default) - current behaviour.
ELASTICKV_FSM_SYNC_MODE=nosync - b.Commit(pebble.NoSync) on the FSM hot path. Raft WAL remains the durability boundary.

Unknown values fall back to sync (fail-safe toward durability). Parsing is case-insensitive and whitespace-tolerant.

A Prometheus gauge elastickv_fsm_apply_sync_mode{mode="sync"|"nosync"} is set at NewRegistry time via store.FSMApplySyncModeLabel(), so dashboards can alert if a rolling deploy accidentally flips the durability posture.

Test plan

go test ./store/... ./monitoring/... ./kv/... ./internal/raftengine/... -count=1
Env var parsing: sync/nosync/mixed-case/whitespace/unknown then sync default
Functional equivalence of sync vs nosync on a Pebble store
Clean-shutdown reopen visibility (NoSync + Close + reopen preserves writes)
Prometheus gauge mutual exclusivity across successive SetFSMApplySyncMode calls
(Follow-up) Jepsen-style OS-level crash test for unfsynced-tail recovery - tracked in JEPSEN_TODO.md

Benchmark

go test ./store -run='^$' -bench='BenchmarkApplyMutations_SyncMode' -benchtime=2s -benchmem
BenchmarkApplyMutations_SyncMode/sync-10      141   16292899 ns/op   368 B/op   7 allocs/op
BenchmarkApplyMutations_SyncMode/nosync-10    129262     16293 ns/op   284 B/op   8 allocs/op

Previous lever documented in docs/review_todo.md section 3.4 (ApplyMutations retained pebble.Sync).
Pebble block cache default 256 MiB (perf(pebble): set explicit 256 MiB block cache with env override #588)
WAL retention / purge (fix(raft): retention GC for fsm snapshots + tunable snapshot trigger #589)
Lease read via AppliedIndex (perf(kv): serve ShardStore.GetAt reads via lease, not read-index #575)
Raft dispatcher lanes (perf(raft): partition etcd dispatcher by message type to isolate heartbeats #577)

Summary by CodeRabbit

New Features
- Configuration option for FSM commit durability mode with adjustable sync and no-sync settings
- New Prometheus metric exposing the current durability mode on the /metrics endpoint
Tests
- Unit tests validating FSM durability mode functionality
- Integration tests ensuring functional correctness across different durability modes
- Performance benchmarks evaluating FSM commit operation characteristics

Every committed raft entry triggered a pebble.Sync in ApplyMutations / DeletePrefixAt, on top of the raft WAL fsync that etcd upstream already performs per Ready batch. Under write-heavy workloads this per-proposal fsync dominates p50 latency (Apple M1 Max: 16 ms/op on an APFS tempdir vs 16 us/op with NoSync -- three orders of magnitude). Because raft re-delivers all committed entries past the last FSM snapshot on restart (Config.Applied defaults to snapshot.Metadata.Index and kv/fsm.applyCommitWithIdempotencyFallback treats already-committed writes as idempotent retries), Pebble durability on the FSM commit path is redundant with raft-WAL durability: a crash that loses the tail of the Pebble WAL is recoverable via raft-log replay from the fsynced FSM snapshot. This CL exposes that lever behind ELASTICKV_FSM_SYNC_MODE, defaulting to "sync" so production behaviour is unchanged. Operators can opt in to "nosync" to trade FSM-state crash durability (re-covered from raft) for write throughput. Other pebble.Sync call sites (snapshot commit, metadata boundary writes, compaction) are untouched: their durability contract is orthogonal to the raft log. Also adds an elastickv_fsm_apply_sync_mode gauge so dashboards alert on unexpected posture changes, wired at NewRegistry time via store.FSMApplySyncModeLabel(). Tests: env-var parsing (sync / nosync / case / whitespace / unknown fallback), functional equivalence of both modes, and clean-shutdown reopen visibility. Crash-recovery via un-fsynced WAL tail loss is an OS-level scenario tracked in JEPSEN_TODO and not a Go unit test.

coderabbitai · 2026-04-23T12:36:28Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 41 minutes and 51 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 41 minutes and 51 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b62e6cbc-70cb-497d-90fb-96995377acf0

📥 Commits

Reviewing files that changed from the base of the PR and between 88910f9 and 1236dae.

📒 Files selected for processing (3)

monitoring/pebble.go
monitoring/pebble_test.go
store/lsm_store_sync_mode_benchmark_test.go

📝 Walkthrough

Walkthrough

This PR adds environment-controlled Pebble WAL durability configuration for FSM commits through a new ELASTICKV_FSM_SYNC_MODE environment variable supporting "sync" and "nosync" modes, with corresponding Prometheus metrics to expose the active mode at startup and runtime.

Changes

Cohort / File(s)	Summary
Startup Configuration `main.go`, `monitoring/registry.go`	Adds registry method `SetFSMApplySyncMode()` to propagate resolved FSM sync mode to metrics during initialization, with nil-safety checks.
Pebble Metrics `monitoring/pebble.go`, `monitoring/pebble_test.go`	Introduces `elastickv_fsm_apply_sync_mode` Prometheus `GaugeVec` metric and `SetFSMApplySyncMode()` method to track active durability mode; includes tests for mode flipping and nil-safety.
LSM Store Durability Control `store/lsm_store.go`, `store/lsm_store_env_test.go`	Adds environment-driven FSM apply write options configuration via `ELASTICKV_FSM_SYNC_MODE`, parsing logic with fallback to `sync`, package-level `fsmApplyWriteOpts` variable, and `FSMApplySyncModeLabel()` accessor for metrics integration; updates `ApplyMutations()` and `DeletePrefixAt()` to use configured write options.
Sync Mode Testing & Benchmarking `store/lsm_store_sync_mode_test.go`, `store/lsm_store_sync_mode_benchmark_test.go`	Adds functional tests validating data visibility and durability under `NoSync` mode, plus write-heavy benchmark (`BenchmarkApplyMutations_SyncMode`) measuring per-operation latency and throughput across sync modes.

Sequence Diagram(s)

sequenceDiagram
    participant main as Startup Process
    participant store as LSM Store<br/>(init)
    participant registry as Metrics Registry
    participant pebble as Pebble Metrics
    participant metrics as /metrics Endpoint

    main->>store: Load process
    activate store
    store->>store: Parse ELASTICKV_FSM_SYNC_MODE env var<br/>(default: "sync")
    store->>store: Resolve to fsmApplyWriteOpts<br/>& fsmApplySyncModeLabel
    deactivate store

    main->>registry: NewRegistry()
    activate registry
    registry->>pebble: NewPebbleMetrics()
    activate pebble
    pebble->>pebble: Register elastickv_fsm_apply_sync_mode<br/>GaugeVec
    deactivate pebble
    deactivate registry

    main->>store: FSMApplySyncModeLabel()
    store-->>main: Return active mode label<br/>("sync" or "nosync")

    main->>registry: SetFSMApplySyncMode(activeLabel)
    activate registry
    registry->>pebble: SetFSMApplySyncMode(activeLabel)
    activate pebble
    pebble->>pebble: Set gauge[activeLabel]=1,<br/>others=0
    deactivate pebble
    deactivate registry

    metrics->>pebble: Scrape /metrics
    pebble-->>metrics: elastickv_fsm_apply_sync_mode{mode="sync"}=1

sequenceDiagram
    participant client as Client
    participant store as PebbleStore<br/>(ApplyMutations)
    participant batch as Pebble Batch
    participant wal as Pebble WAL

    client->>store: ApplyMutations(mutations)
    activate store
    store->>batch: NewBatch()
    loop Each mutation
        batch->>batch: Apply put/delete
    end
    store->>batch: Commit(fsmApplyWriteOpts)
    activate batch
    alt fsmApplyWriteOpts = Sync
        batch->>wal: Sync write (durable)
    else fsmApplyWriteOpts = NoSync
        batch->>wal: Async write (performance)
    end
    batch-->>store: ✓ Committed
    deactivate batch
    store-->>client: Success
    deactivate store

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through metrics bright, 🐰
Sync or NoSync—choosing durability's right,
Environment variables guide the way,
Pebble WAL dances, come what may,
Benchmarks measure each eager hop's delight! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main change: introducing an environment variable ELASTICKV_FSM_SYNC_MODE that allows opting out of fsync on the FSM apply path for performance.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/fsm-apply-group-commit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

store/lsm_store.go (1)
1177-1180: Add matching NoSync coverage for DeletePrefixAt.

DeletePrefixAt now uses the same tunable FSM commit option as ApplyMutations, but the new NoSync functional/reopen tests shown only exercise ApplyMutations. A small test that applies keys, calls DeletePrefixAt under pebble.NoSync, then verifies visibility after clean reopen would catch regressions where this path accidentally goes back to pebble.Sync.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store/lsm_store.go` around lines 1177 - 1180, Add a functional test that
mirrors the existing NoSync coverage for ApplyMutations but targets
DeletePrefixAt: create keys, call DeletePrefixAt using the same tunable FSM
commit option (pebble.NoSync / the fsmApplyWriteOpts variant for NoSync),
perform a clean reopen of the DB, and assert that deletions are visible after
reopen; this ensures DeletePrefixAt (the method) uses the same commit path as
ApplyMutations rather than falling back to pebble.Sync. Include references to
DeletePrefixAt, ApplyMutations, fsmApplyWriteOpts and pebble.NoSync when
locating where to add the test and to reuse the same commit-option setup.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@monitoring/pebble.go`:
- Around line 176-195: SetFSMApplySyncMode on PebbleMetrics currently accepts
any activeLabel and can create arbitrary metric labels; change it so unknown
values fallback to "sync" before touching the gauge. In the SetFSMApplySyncMode
function, validate activeLabel against "sync" and "nosync" (treat anything else
as "sync"), then zero the known labels via
m.fsmApplySyncMode.WithLabelValues("sync").Set(0) and ("nosync").Set(0) and
finally set the validated label to 1; keep the nil checks for m and
m.fsmApplySyncMode intact.

---

Nitpick comments:
In `@store/lsm_store.go`:
- Around line 1177-1180: Add a functional test that mirrors the existing NoSync
coverage for ApplyMutations but targets DeletePrefixAt: create keys, call
DeletePrefixAt using the same tunable FSM commit option (pebble.NoSync / the
fsmApplyWriteOpts variant for NoSync), perform a clean reopen of the DB, and
assert that deletions are visible after reopen; this ensures DeletePrefixAt (the
method) uses the same commit path as ApplyMutations rather than falling back to
pebble.Sync. Include references to DeletePrefixAt, ApplyMutations,
fsmApplyWriteOpts and pebble.NoSync when locating where to add the test and to
reuse the same commit-option setup.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 941ef56a-2487-4849-9307-87010ec04347

📥 Commits

Reviewing files that changed from the base of the PR and between 306eb0b and 88910f9.

📒 Files selected for processing (8)

main.go
monitoring/pebble.go
monitoring/pebble_test.go
monitoring/registry.go
store/lsm_store.go
store/lsm_store_env_test.go
store/lsm_store_sync_mode_benchmark_test.go
store/lsm_store_sync_mode_test.go

gemini-code-assist

Code Review

This pull request introduces an optional "nosync" mode for FSM commits, allowing operators to prioritize performance over local durability when Raft log replay is sufficient for recovery. The changes include environment variable resolution for ELASTICKV_FSM_SYNC_MODE, updates to the Pebble store's commit logic, and new Prometheus metrics for observability. Review feedback points out a discrepancy between the metrics update documentation and its implementation regarding label resetting, and highlights potential thread-safety concerns in tests due to the use of global variables for configuration.

gemini-code-assist · 2026-04-23T12:41:47Z

+// env var. Invoking again is safe and idempotent: the new label goes to
+// 1 and all previously-set labels go to 0.


The docstring claims that invoking this method again is idempotent and that "all previously-set labels go to 0". However, the implementation only explicitly zeros the "sync" and "nosync" labels. If a custom label was previously set (which the method allows), it will remain at 1, violating the mutual exclusivity property described. Consider updating the docstring to clarify that only the documented modes are reset, or use m.fsmApplySyncMode.Reset() if the intention is to clear all series (though Reset() may cause gaps in Prometheus scrapes).

gemini-code-assist · 2026-04-23T12:41:47Z

+func setFSMApplyWriteOptsForTest(t *testing.T, opts *pebble.WriteOptions, label string) {
+	t.Helper()
+	prevOpts := fsmApplyWriteOpts
+	prevLabel := fsmApplySyncModeLabel
+	fsmApplyWriteOpts = opts
+	fsmApplySyncModeLabel = label
+	t.Cleanup(func() {
+		fsmApplyWriteOpts = prevOpts
+		fsmApplySyncModeLabel = prevLabel
+	})
+}


The test helper setFSMApplyWriteOptsForTest modifies package-level global variables (fsmApplyWriteOpts and fsmApplySyncModeLabel). While it uses t.Cleanup to restore state, this approach is not thread-safe if tests within this package are ever run in parallel (e.g., via t.Parallel()). Although current tests do not appear to use parallelism, this global mutation is a potential source of flakiness for future tests. Consider refactoring the store to accept WriteOptions as a dependency in NewPebbleStore to avoid global state in tests. Note that per repository guidelines, wide-reaching API changes should be deferred to a dedicated follow-up pull request to manage complexity.

References

During large refactorings, limit the scope of a single pull request. Defer related but wide-reaching API changes to dedicated follow-up pull requests to manage complexity and avoid scope creep.

gosec G115 flagged uint64(i) * 2 on the benchmark's MVCC timestamp math. The loop counter is non-negative by construction, but the linter cannot prove it. Add an explicit if i < 0 guard so the conversion is provably safe and the reviewdog warning clears. Addresses PR #592 review feedback from coderabbit / reviewdog.

SetFSMApplySyncMode previously accepted any label verbatim, which broke the promised two-row gauge shape in two ways: (1) a prior unknown label was left at 1 because only sync/nosync were zeroed explicitly, and (2) the store's resolver maps unknown ELASTICKV_FSM_SYNC_MODE values to sync, so the monitoring gauge could disagree with the actual WriteOptions in use. Coerce any label outside {sync, nosync} to sync before setting, matching store.resolveFSMApplyWriteOpts's fallback. Update the docstring and add a dedicated test that exercises the coercion against a primed nosync posture. Addresses PR #592 review feedback from coderabbit and gemini.

bootjp · 2026-04-23T14:10:21Z

Thanks for the reviews. Pushed 56af1db and 1236dae addressing the actionable items:

1. gosec G115 in `lsm_store_sync_mode_benchmark_test.go` (coderabbit / reviewdog)

Added an explicit if i < 0 guard before the uint64(i) * 2 conversion in BenchmarkApplyMutations_SyncMode. The counter is non-negative by construction, but the explicit guard is what the linter needs to clear G115.

2. `SetFSMApplySyncMode` unknown-label handling (coderabbit minor + gemini medium)

Previously, an unrecognised activeLabel was accepted verbatim: it stayed pinned at 1 across subsequent calls (only "sync" / "nosync" were explicitly zeroed), and the gauge could disagree with the actual pebble.WriteOptions the store uses, since store.resolveFSMApplyWriteOpts maps unknown ELASTICKV_FSM_SYNC_MODE values to sync.

Fixed by coercing any label outside {"sync","nosync"} to "sync" before setting — same fallback the store resolver uses, so the two paths agree. Docstring updated; new test TestSetFSMApplySyncMode_UnknownLabelCoercesToSync primes the gauge with nosync, calls SetFSMApplySyncMode("batch"), and asserts sync=1 / nosync=0.

3. Global-state test helper `setFSMApplyWriteOptsForTest` (gemini medium) — deferred

Leaving this as-is for this PR. The reviewer's own "References" section on that comment cites the repo guideline: "During large refactorings, limit the scope of a single pull request. Defer related but wide-reaching API changes to dedicated follow-up pull requests to manage complexity and avoid scope creep." Changing NewPebbleStore to accept WriteOptions as a dependency is exactly that kind of wide-reaching API change — it ripples through every store construction site (main, engine bootstrap, tests, benchmarks) and is orthogonal to the durability opt-out this PR introduces. Happy to do it in a follow-up; not in this PR.

Local verification:

go test -race ./store/... ./monitoring/... — pass
golangci-lint run ./store/... ./monitoring/... — 0 issues

/gemini review
@codex review

chatgpt-codex-connector · 2026-04-23T14:16:11Z

Codex Review: Didn't find any major issues. Nice work!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gemini-code-assist

Code Review

This pull request introduces a configurable FSM apply sync mode via the ELASTICKV_FSM_SYNC_MODE environment variable, allowing operators to choose between sync and nosync for Pebble write operations. It includes updates to the metrics registry to surface the active durability posture, comprehensive unit tests for environment variable parsing and functional equivalence, and benchmarks comparing the performance of the two modes. Feedback was provided regarding the use of package-level variables for core configuration, suggesting they be moved into the pebbleStore struct to improve test safety and avoid global state mutation.

gemini-code-assist · 2026-04-23T14:16:43Z

+// fsmApplyWriteOpts is the Pebble WriteOptions value applied on the FSM
+// commit path. Resolved once from ELASTICKV_FSM_SYNC_MODE at init() and
+// then treated as read-only. Exposed as a package variable so tests can
+// swap it via setFSMApplyWriteOptsForTest; production code must not
+// mutate it after init().
+//
+// The zero (unset) state is `pebble.Sync`, preserving legacy behaviour.
+var fsmApplyWriteOpts = pebble.Sync


Using a package-level variable for core write path configuration is brittle and can lead to race conditions or flakiness in tests. While the init() logic ensures it is set once at startup, the test helpers (setFSMApplyWriteOptsForTest) mutate this global state, which is not thread-safe if tests are run in parallel.

Consider moving applyWriteOpts into the pebbleStore struct and initializing it in NewPebbleStore. This would allow for per-instance configuration and safer testing without relying on global state mutation.

bootjp · 2026-04-23T14:46:27Z

Re-raising the same item. This PR intentionally defers the NewPebbleStore(WriteOptions) dependency-injection refactor to a follow-up.

Justification (from the previous round): gemini's own reference note says "During large refactorings, limit the scope of a single pull request. Defer related but wide-reaching API changes to dedicated follow-up pull requests to manage complexity and avoid scope creep." This PR's scope is the ELASTICKV_FSM_SYNC_MODE knob + the mutual-exclusivity gauge shape; changing NewPebbleStore's signature would touch every call site across store/, kv/, adapter/, main.go, plus every test harness that constructs a store. That diff would dominate the diff, obscure the perf change, and invite merge conflicts with other in-flight work.

Concretely for test safety: setFSMApplyWriteOptsForTest already uses t.Cleanup to restore the prior value, and no test in this package uses t.Parallel(). The race window is zero under the current harness. If future parallelism lands, the helper can be swapped for struct-injected options in a dedicated refactor PR without changing the production semantics of this one.

Leaving item as-is. No new commits.

/gemini review
@codex review

gemini-code-assist

Code Review

This pull request introduces a configurable FSM apply sync mode via the ELASTICKV_FSM_SYNC_MODE environment variable, allowing operators to choose between "sync" (default) and "nosync" for Pebble write operations. The changes update ApplyMutations and DeletePrefixAt to respect the selected mode and add a Prometheus gauge to monitor the active durability posture. Feedback was provided to simplify the environment variable resolution logic by merging redundant switch cases.

gemini-code-assist · 2026-04-23T14:53:10Z

+	switch strings.ToLower(strings.TrimSpace(envVal)) {
+	case fsmSyncModeNoSync:
+		return pebble.NoSync, fsmSyncModeNoSync
+	case "", fsmSyncModeSync:
+		return pebble.Sync, fsmSyncModeSync
+	default:
+		return pebble.Sync, fsmSyncModeSync
+	}


The switch statement can be simplified by merging the redundant cases that fall back to the default sync behavior. This improves readability and maintainability by removing unnecessary code paths.

func resolveFSMApplyWriteOpts(envVal string) (*pebble.WriteOptions, string) { switch strings.ToLower(strings.TrimSpace(envVal)) { case fsmSyncModeNoSync: return pebble.NoSync, fsmSyncModeNoSync default: return pebble.Sync, fsmSyncModeSync } }

References

Remove dead or unreachable code paths to improve code clarity and maintainability.

chatgpt-codex-connector · 2026-04-23T14:57:01Z

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…re field Per the gemini review on PR #592, migrate the FSM apply write options (and its sync-mode label) from package-level globals onto the pebbleStore struct so each store instance owns its resolved ELASTICKV_FSM_SYNC_MODE state. NewPebbleStore now calls resolveFSMApplyWriteOpts directly and stores the result; the ApplyMutations and DeletePrefixAt commit paths read s.fsmApplyWriteOpts instead of the package var. - Remove the package-level fsmApplyWriteOpts / fsmApplySyncModeLabel vars and the init() assignment that populated them. - Add fsmApplyWriteOpts and fsmApplySyncModeLabel fields on pebbleStore plus a (*pebbleStore).FSMApplySyncModeLabel() accessor. - Drop setFSMApplyWriteOptsForTest / setBenchFSMApplyWriteOpts helpers; replace with newPebbleStoreWithFSMApplyWriteOptsForTest which builds a real store and overrides the fields, so tests exercise the full construction path without mutating globals. - Main wiring now reads the label off the first built shard store via a fsmApplySyncModeLabeler type assertion after buildShardGroups, skipping the gauge update if no store exposes the accessor. Env-var parsing coverage (sync / nosync / mixed-case / whitespace / unknown / empty / garbage) remains exercised directly against resolveFSMApplyWriteOpts, and the NoSync functional-equivalence and reopen-visibility tests continue to validate the commit-path behaviour through the per-store override.

bootjp · 2026-04-23T16:16:16Z

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: de873f5478

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-23T16:26:48Z

+	// via ELASTICKV_FSM_SYNC_MODE=nosync when the raft WAL's durability is
+	// considered sufficient (raft-log replay from the last FSM snapshot
+	// re-applies any entries lost from Pebble after a crash).
+	if err := b.Commit(s.fsmApplyWriteOpts); err != nil {


Keep non-Raft ApplyMutations commits synced

Switching ApplyMutations to s.fsmApplyWriteOpts makes ELASTICKV_FSM_SYNC_MODE=nosync affect callers that do not have Raft-log replay as a durability backstop. A concrete production path is distribution.EnsureCatalogSnapshot -> CatalogStore.Save -> store.ApplyMutations, which writes catalog state directly; if the process crashes before Pebble flushes, those acknowledged writes can be lost with no Raft entry to re-apply. This change should be scoped to true FSM/raft-apply call paths only, or direct callers must continue to force pebble.Sync.

Useful? React with 👍 / 👎.

Codex flagged that the "Unsupported Commands by Name" (Redis summary) and "FSM Apply Sync Mode" (Pebble internals) panels query metrics not yet on main, so during an incident operators could mistake the empty panels for a scrape failure. Append a sentence to each panel's description calling out the dependency and that the panel will populate automatically once PR #594 / PR #592 merge.

…ctor Redis/Cluster dashboards (#597) ## Summary - Add **Elastickv Pebble Internals** dashboard covering block cache hit rate/capacity, L0 pressure, compactions, memtables, FSM apply sync mode, and store write conflicts. - Add **Unsupported Commands by Name** panel to the Redis dashboard (driven by PR #594 `elastickv_redis_unsupported_commands_total`) and annotate the legacy **Errors by Command** panel so operators know the `unknown` bucket is broken down by real name in the new panel. - Fold the standalone Redis hot-path (PR #560) dashboard into a collapsed row at the bottom of the Redis summary dashboard, delete the standalone file, and retitle the summary to **Elastickv Redis** so there is a single Redis-facing dashboard. - Rename the misnamed `elastickv-cluster-summary.json` (which was already a pure DynamoDB dashboard) to `elastickv-dynamodb.json` / **Elastickv DynamoDB** with `uid=elastickv-dynamodb`, and retitle `elastickv-cluster-overview.json` to **Elastickv Cluster** now that it is the sole cluster landing page. ## Before / After file list | State | File | | --- | --- | | Added | `monitoring/grafana/dashboards/elastickv-pebble-internals.json` | | Renamed | `elastickv-cluster-summary.json` to `elastickv-dynamodb.json` | | Retitled | `elastickv-cluster-overview.json` (title now `Elastickv Cluster`) | | Modified | `elastickv-redis-summary.json` (title `Elastickv Redis`; added Unsupported Commands panel and a collapsed Hot Path row) | | Deleted | `monitoring/grafana/dashboards/elastickv-redis-hotpath.json` | | Unchanged | `monitoring/grafana/dashboards/elastickv-raft-status.json` | ## What each dashboard now covers - **Elastickv Cluster** - leader identity, cluster-wide latency/error posture, per-node Raft health, and an embedded Redis row. - **Elastickv DynamoDB** - per-operation throughput/latency and per-table error/volume breakdown for the DynamoDB-compatible API. - **Elastickv Raft Status** - unchanged: membership, leader changes, failed proposals, node state, index drift, backlog, leader contact. - **Elastickv Redis** - per-command throughput/latency/errors, unsupported-commands breakdown, secondary-write and Lua diagnostics, and a collapsible `Hot Path (legacy PR #560)` row with the 9 GET-fast-path panels preserved verbatim. - **Elastickv Pebble Internals** - block cache hit rate and size-vs-capacity, hits/misses rate, L0 sublevels/files with write-stall thresholds, compaction debt/progress/rate, memtable count/size/zombies, FSM apply sync mode, store write conflicts by prefix. ## Dead-metric warnings Two metrics referenced by panels in this PR are **not yet present in `monitoring/*.go` on `main`**: - `elastickv_redis_unsupported_commands_total` (PR #594) - the spec asked for a panel targeting this and this PR adds one; it will stay empty until #594 lands. - `elastickv_pebble_fsm_apply_sync_mode` (PR #592) - same situation; panel is in place and will populate once #592 lands. Both are noted here so reviewers can decide whether to sequence this PR after the upstream ones. ## Test plan - [x] `python3 -c "import json; json.load(open('<file>'))"` passes on every modified dashboard file. - [x] Every dashboard exposes `schemaVersion`, `panels`, `time`, `title`, `uid`. - [x] Panel IDs are unique within each dashboard (including nested hot-path panels). - [x] No overlapping gridPos cells in the modified Redis dashboard (checked with a programmatic sweep). - [x] `go build ./...` succeeds after updating the four source comments / doc page that referenced the deleted hotpath JSON by path. - [x] `go test ./...` passes, including `./monitoring/...`. - [ ] Import each of the five JSON files into Grafana (10.x, schemaVersion 41) via UI; every panel renders without errors and the Hot Path row expands/collapses cleanly.

bootjp · 2026-04-23T17:29:55Z

Follow-up on the codex P2 correctness concern raised during this PR's review (ELASTICKV_FSM_SYNC_MODE=nosync bleeding into non-raft callers).

Addressed in #600: splits ApplyMutations / DeletePrefixAt (now unconditionally pebble.Sync) from new ApplyMutationsRaft / DeletePrefixAtRaft entry points that observe s.fsmApplyWriteOpts. kv/fsm.go is now the only caller of the *Raft variants, so the nosync knob cannot weaken durability for the CatalogStore.Save / EnsureCatalogSnapshot path (or any other direct caller) that has no raft-log replay backstop.

Tests + lint green. See #600 for the before/after table and new coverage.

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread monitoring/pebble.go

gemini-code-assist Bot reviewed Apr 23, 2026

View reviewed changes

bootjp added 2 commits April 23, 2026 23:08

gemini-code-assist Bot reviewed Apr 23, 2026

View reviewed changes

bootjp enabled auto-merge April 23, 2026 16:21

bootjp merged commit bee7ead into main Apr 23, 2026
7 checks passed

bootjp deleted the perf/fsm-apply-group-commit branch April 23, 2026 16:22

chatgpt-codex-connector Bot reviewed Apr 23, 2026

View reviewed changes

bootjp mentioned this pull request Apr 23, 2026

obs(grafana): add Pebble internals + unsupported-commands panel; refactor Redis/Cluster dashboards #597

Merged

7 tasks

bootjp mentioned this pull request Apr 23, 2026

fix(store): scope FSM_SYNC_MODE=nosync to raft-apply callers only #600

Open

5 tasks

		// env var. Invoking again is safe and idempotent: the new label goes to
		// 1 and all previously-set labels go to 0.

Conversation

bootjp commented Apr 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Durability argument

Env var + default

Test plan

Benchmark

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 23, 2026

1. gosec G115 in lsm_store_sync_mode_benchmark_test.go (coderabbit / reviewdog)

2. SetFSMApplySyncMode unknown-label handling (coderabbit minor + gemini medium)

3. Global-state test helper setFSMApplyWriteOptsForTest (gemini medium) — deferred

Uh oh!

chatgpt-codex-connector Bot commented Apr 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Apr 23, 2026

Uh oh!

bootjp commented Apr 23, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

bootjp commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bootjp commented Apr 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

1. gosec G115 in `lsm_store_sync_mode_benchmark_test.go` (coderabbit / reviewdog)

2. `SetFSMApplySyncMode` unknown-label handling (coderabbit minor + gemini medium)

3. Global-state test helper `setFSMApplyWriteOptsForTest` (gemini medium) — deferred