fix(kv): make txn abort idempotent when rollback marker exists#581
fix(kv): make txn abort idempotent when rollback marker exists#581
Conversation
Production log spam: "secondary write failed" ... "write conflict" key: !txn|rb|!redis|ttl|<BullMQ stalled-check key>+<startTS> Root cause: the 2PC abort path is not idempotent. Once an abort has run to completion, the rollback marker !txn|rb|<primaryKey>+<startTS> is present at commitTS = abortTS. A second abort of the same (primaryKey, startTS) pair — from a concurrent lock-resolver race, a retry, or a dualwrite async replay — rebuilds the same Delete mutations on the already-tombstoned lock/intent keys and a duplicate Put on the rollback marker. Every one of those has a latestCommitTS = abortTS > startTS so MVCC checkConflicts returns ErrWriteConflict. The rollback marker's contract is "this txn was aborted". Its payload is a deterministic single byte (txnRollbackVersion), so multiple identical writes carry no semantic difference. The work the retry tries to do has already been done atomically in the first apply (ApplyMutations is a single pebble batch), so skipping the retry is equivalent to a second-writer-wins + idempotent apply, at no cost. Fix: probe txnRollbackKey at the top of handleAbortRequest. When it's already present return nil without enqueuing any mutations. Cheap GetAt on the hot abort path; the common case (fresh abort, marker absent) pays one extra point lookup which the pebble block cache will serve hot. Safety argument: the rollback marker appears in the store only via ApplyMutations, which writes it atomically together with the lock/intent deletes. If the marker is visible at readTS ∞, the cleanup was visible too. There is no partial-abort state where the marker exists but the locks remain. Test: TestFSMAbort_SecondAbortIsIdempotent (renamed from the prior TestFSMAbort_SecondAbortSameTimestampConflicts, whose assertion was exactly the bug this patch fixes). Pins both same-abortTS retry and later-abortTS retry (HLC-monotonic, the prod resolver-race path).
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 54 minutes and 3 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe pull request adds idempotency to the ABORT operation in the key-value store's FSM by implementing an early-exit check in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
kv/fsm.go (1)
528-564:⚠️ Potential issue | 🔴 CriticalDon’t let the rollback marker suppress outstanding secondary cleanup.
Line 541 returns nil solely because
txnRollbackKey(primaryKey, startTS)exists, but that marker is written when the current abort batch includes the primary key; it does not prove every secondary key in later abort batches was cleaned. A primary-only abort can create the marker, then a later secondary-key abort will be skipped here and leave that secondary lock/intent behind.Please only short-circuit after verifying the requested keys are already resolved, or continue cleaning matching outstanding locks while suppressing only the duplicate rollback-marker write. Add a regression like: prepare
{primary, secondary}→ abort{primary}→ abort{secondary}→ asserttxnLockKey(secondary)andtxnIntentKey(secondary)are gone.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@kv/fsm.go` around lines 528 - 564, The current early return that checks f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker only proves the primary was handled, not that all secondary keys are cleaned; instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations, ApplyMutations) but suppress only the duplicate rollback-marker Put: after calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect if the rollback marker already exists and if so do not add the txnRollbackKey Put (or strip it out of storeMuts) while still applying remaining storeMuts and returning their result; also add the regression test you described (prepare {primary, secondary} → abort {primary} → abort {secondary} → assert txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references: txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations, appendRollbackRecord, ApplyMutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@kv/fsm.go`:
- Around line 528-564: The current early return that checks
f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker
only proves the primary was handled, not that all secondary keys are cleaned;
instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations,
ApplyMutations) but suppress only the duplicate rollback-marker Put: after
calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect
if the rollback marker already exists and if so do not add the txnRollbackKey
Put (or strip it out of storeMuts) while still applying remaining storeMuts and
returning their result; also add the regression test you described (prepare
{primary, secondary} → abort {primary} → abort {secondary} → assert
txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references:
txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations,
appendRollbackRecord, ApplyMutations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b14aecdd-e92c-4157-aecf-7890bf4f58ba
📒 Files selected for processing (2)
kv/fsm.gokv/fsm_abort_test.go
There was a problem hiding this comment.
Pull request overview
This PR aims to make transaction ABORT handling idempotent in the KV FSM to prevent abort-retry races from producing MVCC write conflicts (notably on !txn|rb|... rollback-marker keys) and related log spam in production.
Changes:
- Add an early rollback-marker existence probe in
handleAbortRequestto short-circuit repeat aborts. - Update/rename the abort FSM test to assert abort retries (same and later abortTS) are no-ops.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
kv/fsm.go |
Adds rollback-marker existence check to treat repeated aborts as idempotent. |
kv/fsm_abort_test.go |
Replaces the prior “second abort conflicts” test with an idempotency test for abort retries. |
| // Idempotency short-circuit: if the rollback marker for this | ||
| // (primaryKey, startTS) already exists, a previous abort already | ||
| // completed the whole cleanup atomically (ApplyMutations writes the | ||
| // rollback marker together with the lock/intent deletes in one | ||
| // batch, so the marker's presence proves cleanup ran). Without this | ||
| // guard a retry or a concurrent second lock-resolver would re-emit | ||
| // Delete mutations on already-tombstoned lock/intent keys and a | ||
| // duplicate rollback-marker Put — all three would be rejected by | ||
| // the MVCC store as write conflicts (latestCommitTS > startTS) and | ||
| // surface in prod as "secondary write failed" log spam without | ||
| // changing any state. Rollback markers are deterministic | ||
| // ({txnRollbackVersion}) so second-writer-wins would be equivalent | ||
| // anyway; skipping the work is simpler and cheaper. | ||
| if _, err := f.store.GetAt(ctx, txnRollbackKey(meta.PrimaryKey, startTS), ^uint64(0)); err == nil { | ||
| return nil | ||
| } else if !errors.Is(err, store.ErrKeyNotFound) { | ||
| return errors.WithStack(err) | ||
| } |
There was a problem hiding this comment.
The rollback-marker presence does not prove that all lock/intent cleanup for the current abort request’s keys has already run. In particular, ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key (kv/shard_store.go:1093), which writes the rollback marker on the primary shard; subsequent per-key aborts for other keys on that same shard (e.g., resolveTxnLockForKey / lock resolver) must still delete those keys’ lock/intent entries. With this unconditional early-return, any later ABORT request targeting a non-primary key that hashes to the primary shard will no-op once the marker exists, potentially leaving orphaned locks/intents indefinitely.
Suggested fix: only short-circuit when the abort request is known to be a pure retry of a prior same-keyset abort. A minimal safe change is to gate this check on whether the request includes the primary key (abortingPrimary), or otherwise avoid returning early and instead skip only the rollback-marker Put while still running cleanup for the requested keys.
…n lock absent The previous fix short-circuited the whole abort request on rollback-marker presence, which can leave secondaries orphaned: ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key, writing the rollback marker; a later lock-resolver abort for a secondary (same primaryKey, same startTS) would then see the marker and skip that secondary lock/intent cleanup. Restructure idempotency to be per-key: - Remove the broad short-circuit in handleAbortRequest. - shouldClearAbortKey now returns false when the lock is missing (lock and intent are always written/deleted together, so lock-missing iff intent-missing; re-emitting Deletes on tombstoned keys would just trigger MVCC conflicts). - appendRollbackRecord skips the Put if the marker is already present (idempotent; skip commit-wins check on this path since commit must be absent when the marker exists). Add regression tests: SecondAbortDifferentKeysCleansRemainder reproduces the orphan scenario (fails without the fix), LockResolverRaceLeavesNoOrphan simulates the full prod flow, SameKeysIsIdempotent preserves the original retry-safety invariant.
|
|
Summary
Fix a 2PC abort-retry race that surfaces in prod as
"secondary write failed" ... "write conflict"log spam on!txn|rb|rollback-marker keys.Root cause
handleAbortRequestis not idempotent. Once a(primaryKey, startTS)pair has been aborted, the rollback marker!txn|rb|<primaryKey>+<startTS>sits atcommitTS = abortTS. A second abort of the same pair — from a concurrent lock resolver, a retry, or a dualwrite async replay — rebuilds:Deleteon the already-tombstoned lock / intent keysPuton the already-present rollback markerEvery mutation has
latestCommitTS = abortTS > startTS, so MVCCcheckConflictsrejects all three asErrWriteConflict.TestFSMAbort_SecondAbortSameTimestampConflictswas literally pinning this bug's current behaviour.Why idempotent is safe
The rollback marker payload is a deterministic single byte (
txnRollbackVersion), so multiple writes are byte-identical. All mutations in the first abort commit atomically via a singleApplyMutations→ pebble batch, so if the marker is visible the lock/intent cleanup is visible too — there is no partial-abort state.Fix
Probe
txnRollbackKeyat the top ofhandleAbortRequest. If present, returnnilwithout enqueuing any mutations. CheapGetAton the hot abort path; the common case (fresh abort, marker absent) pays one extra block-cache point lookup.Test plan
go test -race -count=1 -short ./kv/...(3.8s) greenTestFSMAbort_SecondAbortIsIdempotentpins both same-abortTS retry and later-abortTS retry (HLC-monotonic, the prod lock-resolver race path)"secondary write failed"log rate on!txn|rb|should drop to zeroRelates to the BullMQ
stalled-checktraffic class across therelationship,deliver,objectStorage, and webhook queues.Summary by CodeRabbit
Bug Fixes
Tests