fix(kv): make txn abort idempotent when rollback marker exists by bootjp · Pull Request #581 · bootjp/elastickv

bootjp · 2026-04-22T05:03:50Z

Summary

Fix a 2PC abort-retry race that surfaces in prod as "secondary write failed" ... "write conflict" log spam on !txn|rb| rollback-marker keys.

"msg":"secondary write failed","cmd":"EVALSHA",
"err":"<string>:118: key: !txn|rb|!redis|ttl|misskey.bootjp.me:queue:...:stalled-check\u0001\ufffd-\u00039\ufffd\u0000\u0000: write conflict"

Root cause

handleAbortRequest is not idempotent. Once a (primaryKey, startTS) pair has been aborted, the rollback marker !txn|rb|<primaryKey>+<startTS> sits at commitTS = abortTS. A second abort of the same pair — from a concurrent lock resolver, a retry, or a dualwrite async replay — rebuilds:

- Delete on the already-tombstoned lock / intent keys
- Put on the already-present rollback marker

Every mutation has latestCommitTS = abortTS > startTS, so MVCC checkConflicts rejects all three as ErrWriteConflict. TestFSMAbort_SecondAbortSameTimestampConflicts was literally pinning this bug's current behaviour.

Why idempotent is safe

The rollback marker payload is a deterministic single byte (txnRollbackVersion), so multiple writes are byte-identical. All mutations in the first abort commit atomically via a single ApplyMutations → pebble batch, so if the marker is visible the lock/intent cleanup is visible too — there is no partial-abort state.

Fix

Probe txnRollbackKey at the top of handleAbortRequest. If present, return nil without enqueuing any mutations. Cheap GetAt on the hot abort path; the common case (fresh abort, marker absent) pays one extra block-cache point lookup.
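A rough sketch of the probe, under stated assumptions: `toyStore`, `mutation`, `rollbackKey`, `newPreparedStore`, and this `handleAbort` signature are hypothetical stand-ins, and the real `handleAbortRequest` works on request metadata and a pebble-backed store. The sketch only shows the control flow: marker present means return nil, ErrKeyNotFound means proceed, any other error is propagated.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrKeyNotFound   = errors.New("key not found")
	ErrWriteConflict = errors.New("write conflict")
)

// toyStore is a hypothetical stand-in for the MVCC store.
type toyStore struct {
	values   map[string][]byte // live values; deleted keys are removed
	commitTS map[string]uint64 // latest commit timestamp, tombstones included
}

// GetAt ignores readTS in this toy and reports only the live value.
func (s *toyStore) GetAt(key string, _ uint64) ([]byte, error) {
	v, ok := s.values[key]
	if !ok {
		return nil, ErrKeyNotFound
	}
	return v, nil
}

type mutation struct {
	key   string
	value []byte // nil means Delete
}

// ApplyMutations applies the batch atomically after the conflict check.
func (s *toyStore) ApplyMutations(muts []mutation, startTS, commitTS uint64) error {
	for _, m := range muts {
		if ts, ok := s.commitTS[m.key]; ok && ts > startTS {
			return fmt.Errorf("key: %s: %w", m.key, ErrWriteConflict)
		}
	}
	for _, m := range muts {
		s.commitTS[m.key] = commitTS
		if m.value == nil {
			delete(s.values, m.key)
		} else {
			s.values[m.key] = m.value
		}
	}
	return nil
}

// rollbackKey uses a placeholder format, not the real key encoding.
func rollbackKey(primary string, startTS uint64) string {
	return fmt.Sprintf("rb/%s+%d", primary, startTS)
}

// handleAbort probes the marker first and skips all work when it exists.
func handleAbort(s *toyStore, primary string, startTS, abortTS uint64) error {
	marker := rollbackKey(primary, startTS)
	if _, err := s.GetAt(marker, ^uint64(0)); err == nil {
		return nil // already aborted: nothing to enqueue
	} else if !errors.Is(err, ErrKeyNotFound) {
		return err
	}
	return s.ApplyMutations([]mutation{
		{key: "lock/" + primary},
		{key: "intent/" + primary},
		{key: marker, value: []byte{1}},
	}, startTS, abortTS)
}

// newPreparedStore assumes prepare wrote the lock and intent at startTS.
func newPreparedStore(primary string, startTS uint64) *toyStore {
	return &toyStore{
		values:   map[string][]byte{"lock/" + primary: {1}, "intent/" + primary: {1}},
		commitTS: map[string]uint64{"lock/" + primary: startTS, "intent/" + primary: startTS},
	}
}

func main() {
	s := newPreparedStore("k", 10)
	fmt.Println("first abort:", handleAbort(s, "k", 10, 20))
	fmt.Println("retry, same abortTS:", handleAbort(s, "k", 10, 20))
	fmt.Println("retry, later abortTS:", handleAbort(s, "k", 10, 30))
}
```

The sketch covers only the single-key retry case that the description targets; it says nothing about requests that carry a different key set.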

Test plan

- go test -race -count=1 -short ./kv/... (3.8s) green
- TestFSMAbort_SecondAbortIsIdempotent pins both same-abortTS retry and later-abortTS retry (HLC-monotonic, the prod lock-resolver race path)
- Deploy; "secondary write failed" log rate on !txn|rb| should drop to zero

Relates to the BullMQ stalled-check traffic class across the relationship, deliver, objectStorage, and webhook queues.

Summary by CodeRabbit

Bug Fixes
- Abort operations are now idempotent. Retrying an abort request with the same or later timestamp safely returns without duplicate processing or write conflicts.
Tests
- Updated abort tests to verify idempotent retry behavior.

Production log spam: "secondary write failed" ... "write conflict" key: !txn|rb|!redis|ttl|<BullMQ stalled-check key>+<startTS>

Root cause: the 2PC abort path is not idempotent. Once an abort has run to completion, the rollback marker !txn|rb|<primaryKey>+<startTS> is present at commitTS = abortTS. A second abort of the same (primaryKey, startTS) pair — from a concurrent lock-resolver race, a retry, or a dualwrite async replay — rebuilds the same Delete mutations on the already-tombstoned lock/intent keys and a duplicate Put on the rollback marker. Every one of those has a latestCommitTS = abortTS > startTS so MVCC checkConflicts returns ErrWriteConflict.

The rollback marker's contract is "this txn was aborted". Its payload is a deterministic single byte (txnRollbackVersion), so multiple identical writes carry no semantic difference. The work the retry tries to do has already been done atomically in the first apply (ApplyMutations is a single pebble batch), so skipping the retry is equivalent to a second-writer-wins + idempotent apply, at no cost.

Fix: probe txnRollbackKey at the top of handleAbortRequest. When it's already present return nil without enqueuing any mutations. Cheap GetAt on the hot abort path; the common case (fresh abort, marker absent) pays one extra point lookup which the pebble block cache will serve hot.

Safety argument: the rollback marker appears in the store only via ApplyMutations, which writes it atomically together with the lock/intent deletes. If the marker is visible at readTS ∞, the cleanup was visible too. There is no partial-abort state where the marker exists but the locks remain.

Test: TestFSMAbort_SecondAbortIsIdempotent (renamed from the prior TestFSMAbort_SecondAbortSameTimestampConflicts, whose assertion was exactly the bug this patch fixes). Pins both same-abortTS retry and later-abortTS retry (HLC-monotonic, the prod resolver-race path).

coderabbitai · 2026-04-22T05:04:01Z

📥 Commits

Reviewing files that changed from the base of the PR and between 6457c50 and 8db6aba.

📒 Files selected for processing (2)

kv/fsm.go
kv/fsm_abort_test.go

Walkthrough

The pull request adds idempotency to the ABORT operation in the key-value store's FSM by implementing an early-exit check in handleAbortRequest. If a rollback marker already exists for the given primary key and start timestamp, the function returns immediately without reprocessing abort cleanup or mutations. Non-ErrKeyNotFound errors during this check are propagated.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| FSM Abort Idempotency `kv/fsm.go` | Added short-circuit logic to `handleAbortRequest` that queries for an existing rollback marker key; if found, returns nil immediately; if `GetAt` fails with an error other than `store.ErrKeyNotFound`, wraps and returns the error. |
| Abort Idempotency Test `kv/fsm_abort_test.go` | Renamed test to `TestFSMAbort_SecondAbortIsIdempotent` and updated assertions to verify that repeated ABORT calls with the same or later `abortTS` return nil (idempotent behavior) instead of conflicting. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(kv): make txn abort idempotent when rollback marker exists' directly summarizes the main change: adding idempotency to transaction abort when a rollback marker already exists. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

kv/fsm.go (1)
528-564: ⚠️ Potential issue | 🔴 Critical

Don’t let the rollback marker suppress outstanding secondary cleanup.

Line 541 returns nil solely because txnRollbackKey(primaryKey, startTS) exists, but that marker is written when the current abort batch includes the primary key; it does not prove every secondary key in later abort batches was cleaned. A primary-only abort can create the marker, then a later secondary-key abort will be skipped here and leave that secondary lock/intent behind.

Please only short-circuit after verifying the requested keys are already resolved, or continue cleaning matching outstanding locks while suppressing only the duplicate rollback-marker write. Add a regression like: prepare {primary, secondary} → abort {primary} → abort {secondary} → assert txnLockKey(secondary) and txnIntentKey(secondary) are gone.
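The scenario in this comment (prepare {primary, secondary} → abort {primary} → abort {secondary}) can be traced with a small model. `toyFSM`, `prepare`, and `abortWithShortCircuit` are hypothetical names, and the model assumes, as the comment states, that the marker is written when the abort batch includes the primary key. It is a sketch of the concern, not of the real FSM.

```go
package main

import "fmt"

// toyFSM models only what the comment needs: which keys still hold a lock
// and whether the rollback marker exists.
type toyFSM struct {
	primary string
	locks   map[string]bool
	marker  bool
}

// prepare locks every key of the transaction.
func prepare(primary string, keys ...string) *toyFSM {
	f := &toyFSM{primary: primary, locks: map[string]bool{}}
	for _, k := range keys {
		f.locks[k] = true
	}
	return f
}

// abortWithShortCircuit mimics the whole-request early return: once the
// marker exists, no key in the request is cleaned.
func (f *toyFSM) abortWithShortCircuit(keys ...string) {
	if f.marker {
		return
	}
	for _, k := range keys {
		delete(f.locks, k)
		if k == f.primary {
			f.marker = true
		}
	}
}

// orphanedAfterSplitAbort reports whether the secondary lock survives
// abort {primary} followed by abort {secondary}.
func orphanedAfterSplitAbort() bool {
	f := prepare("primary", "primary", "secondary")
	f.abortWithShortCircuit("primary")
	f.abortWithShortCircuit("secondary")
	return f.locks["secondary"]
}

func main() {
	fmt.Println("secondary lock orphaned:", orphanedAfterSplitAbort())
}
```

In this model the second call returns before touching the secondary key, which is the orphan the requested regression test is meant to catch.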
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/fsm.go` around lines 528 - 564, The current early return that checks
f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker
only proves the primary was handled, not that all secondary keys are cleaned;
instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations,
ApplyMutations) but suppress only the duplicate rollback-marker Put: after
calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect
if the rollback marker already exists and if so do not add the txnRollbackKey
Put (or strip it out of storeMuts) while still applying remaining storeMuts and
returning their result; also add the regression test you described (prepare
{primary, secondary} → abort {primary} → abort {secondary} → assert
txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references:
txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations,
appendRollbackRecord, ApplyMutations.

📥 Commits

Reviewing files that changed from the base of the PR and between 76ea3c4 and 6457c50.

📒 Files selected for processing (2)

kv/fsm.go
kv/fsm_abort_test.go

Copilot

Pull request overview

This PR aims to make transaction ABORT handling idempotent in the KV FSM to prevent abort-retry races from producing MVCC write conflicts (notably on !txn|rb|... rollback-marker keys) and related log spam in production.

Changes:

Add an early rollback-marker existence probe in handleAbortRequest to short-circuit repeat aborts.
Update/rename the abort FSM test to assert abort retries (same and later abortTS) are no-ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `kv/fsm.go` | Adds rollback-marker existence check to treat repeated aborts as idempotent. |
| `kv/fsm_abort_test.go` | Replaces the prior “second abort conflicts” test with an idempotency test for abort retries. |

Copilot · 2026-04-22T07:09:41Z

+	// Idempotency short-circuit: if the rollback marker for this
+	// (primaryKey, startTS) already exists, a previous abort already
+	// completed the whole cleanup atomically (ApplyMutations writes the
+	// rollback marker together with the lock/intent deletes in one
+	// batch, so the marker's presence proves cleanup ran). Without this
+	// guard a retry or a concurrent second lock-resolver would re-emit
+	// Delete mutations on already-tombstoned lock/intent keys and a
+	// duplicate rollback-marker Put — all three would be rejected by
+	// the MVCC store as write conflicts (latestCommitTS > startTS) and
+	// surface in prod as "secondary write failed" log spam without
+	// changing any state. Rollback markers are deterministic
+	// ({txnRollbackVersion}) so second-writer-wins would be equivalent
+	// anyway; skipping the work is simpler and cheaper.
+	if _, err := f.store.GetAt(ctx, txnRollbackKey(meta.PrimaryKey, startTS), ^uint64(0)); err == nil {
+		return nil
+	} else if !errors.Is(err, store.ErrKeyNotFound) {
+		return errors.WithStack(err)
+	}


The rollback-marker presence does not prove that all lock/intent cleanup for the current abort request’s keys has already run. In particular, ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key (kv/shard_store.go:1093), which writes the rollback marker on the primary shard; subsequent per-key aborts for other keys on that same shard (e.g., resolveTxnLockForKey / lock resolver) must still delete those keys’ lock/intent entries. With this unconditional early-return, any later ABORT request targeting a non-primary key that hashes to the primary shard will no-op once the marker exists, potentially leaving orphaned locks/intents indefinitely.

Suggested fix: only short-circuit when the abort request is known to be a pure retry of a prior same-keyset abort. A minimal safe change is to gate this check on whether the request includes the primary key (abortingPrimary), or otherwise avoid returning early and instead skip only the rollback-marker Put while still running cleanup for the requested keys.

…n lock absent

The previous fix short-circuited the whole abort request on rollback-marker presence, which can leave secondaries orphaned: ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key, writing the rollback marker; a later lock-resolver abort for a secondary (same primaryKey, same startTS) would then see the marker and skip that secondary lock/intent cleanup.

Restructure idempotency to be per-key:
- Remove the broad short-circuit in handleAbortRequest.
- shouldClearAbortKey now returns false when the lock is missing (lock and intent are always written/deleted together, so lock-missing iff intent-missing; re-emitting Deletes on tombstoned keys would just trigger MVCC conflicts).
- appendRollbackRecord skips the Put if the marker is already present (idempotent; skip commit-wins check on this path since commit must be absent when the marker exists).

Add regression tests: SecondAbortDifferentKeysCleansRemainder reproduces the orphan scenario (fails without the fix), LockResolverRaceLeavesNoOrphan simulates the full prod flow, SameKeysIsIdempotent preserves the original retry-safety invariant.
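The per-key restructuring described in this commit message can be sketched roughly as follows. `toyFSM`, `prepare`, `abortPerKey`, `runSplitAbortWithRetries`, and the counters are hypothetical; the two guards only loosely mirror what the message attributes to shouldClearAbortKey and appendRollbackRecord, and the assumption that the marker is written when the request includes the primary key comes from the review comments, not from the code.

```go
package main

import "fmt"

// toyFSM tracks locks, the rollback marker, and how many mutations a
// sequence of aborts would actually emit.
type toyFSM struct {
	primary     string
	locks       map[string]bool
	marker      bool
	lockDeletes int // Delete mutations emitted
	markerPuts  int // rollback-marker Put mutations emitted
}

// prepare locks every key of the transaction.
func prepare(primary string, keys ...string) *toyFSM {
	f := &toyFSM{primary: primary, locks: map[string]bool{}}
	for _, k := range keys {
		f.locks[k] = true
	}
	return f
}

// abortPerKey decides idempotency key by key instead of per request.
func (f *toyFSM) abortPerKey(keys ...string) {
	for _, k := range keys {
		// Emit a Delete only while the lock is still present.
		if f.locks[k] {
			delete(f.locks, k)
			f.lockDeletes++
		}
		// Emit the marker Put only once.
		if k == f.primary && !f.marker {
			f.marker = true
			f.markerPuts++
		}
	}
}

// runSplitAbortWithRetries replays the orphan scenario plus two retries.
func runSplitAbortWithRetries() *toyFSM {
	f := prepare("primary", "primary", "secondary")
	f.abortPerKey("primary")   // expired-primary abort writes the marker
	f.abortPerKey("secondary") // lock resolver still cleans the secondary
	f.abortPerKey("secondary") // retry: nothing left to delete
	f.abortPerKey("primary")   // retry: marker already present
	return f
}

func main() {
	f := runSplitAbortWithRetries()
	fmt.Println("remaining locks:", len(f.locks))
	fmt.Println("lock deletes:", f.lockDeletes, "marker puts:", f.markerPuts)
}
```

Because a retry emits no mutation for a key that is already clean, there is nothing for the MVCC conflict check to reject, while a request for a not-yet-cleaned key still does its work.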

bootjp · 2026-04-22T07:39:28Z

coderabbitai Bot reviewed Apr 22, 2026

bootjp requested a review from Copilot April 22, 2026 07:04

Copilot started reviewing on behalf of bootjp April 22, 2026 07:05 View session

Copilot AI reviewed Apr 22, 2026

bootjp wants to merge 2 commits into main from fix/idempotent-rollback-marker
