
fix(kv): make txn abort idempotent when rollback marker exists #581

Open
bootjp wants to merge 2 commits into main from fix/idempotent-rollback-marker

bootjp (Owner) commented Apr 22, 2026

Summary

Fix a 2PC abort-retry race that surfaces in prod as "secondary write failed" ... "write conflict" log spam on !txn|rb| rollback-marker keys.

"msg":"secondary write failed","cmd":"EVALSHA",
"err":"<string>:118: key: !txn|rb|!redis|ttl|misskey.bootjp.me:queue:...:stalled-check\u0001\ufffd-\u00039\ufffd\u0000\u0000: write conflict"

Root cause

handleAbortRequest is not idempotent. Once a (primaryKey, startTS) pair has been aborted, the rollback marker !txn|rb|<primaryKey>+<startTS> sits at commitTS = abortTS. A second abort of the same pair — from a concurrent lock resolver, a retry, or a dualwrite async replay — rebuilds:

  • Delete on the already-tombstoned lock / intent keys
  • Put on the already-present rollback marker

Every mutation targets a key whose latestCommitTS = abortTS > startTS, so MVCC checkConflicts rejects all three (the lock Delete, the intent Delete, and the marker Put) with ErrWriteConflict. TestFSMAbort_SecondAbortSameTimestampConflicts was pinning exactly this buggy behaviour.
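To make the conflict rule concrete, here is a minimal sketch of the check described above, assuming a toy in-memory store. `toyStore`, `apply`, and `abortTwice` are hypothetical names invented for illustration; only `ErrWriteConflict` and the `latestCommitTS > startTS` rule come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrWriteConflict stands in for the store's conflict error (name taken from the PR text).
var ErrWriteConflict = errors.New("write conflict")

// toyStore is a hypothetical model that tracks only the latest commit timestamp per key.
type toyStore struct {
	latestCommitTS map[string]uint64
}

// apply rejects the whole batch if any key was committed after startTS;
// otherwise it stamps every key in the batch with commitTS.
func (s *toyStore) apply(keys []string, startTS, commitTS uint64) error {
	for _, k := range keys {
		if ts, ok := s.latestCommitTS[k]; ok && ts > startTS {
			return fmt.Errorf("key: %s: %w", k, ErrWriteConflict)
		}
	}
	for _, k := range keys {
		s.latestCommitTS[k] = commitTS
	}
	return nil
}

// abortTwice replays the same abort (startTS=10, abortTS=20) against one store.
func abortTwice() (first, second error) {
	s := &toyStore{latestCommitTS: map[string]uint64{
		"lock":   10, // written by the prepare at startTS
		"intent": 10,
	}}
	keys := []string{"lock", "intent", "rollback-marker"}
	first = s.apply(keys, 10, 20)
	second = s.apply(keys, 10, 20) // retry: every key now sits at commitTS 20 > startTS 10
	return first, second
}

func main() {
	first, second := abortTwice()
	fmt.Println("first abort: ", first)
	fmt.Println("second abort:", second)
	fmt.Println("is conflict: ", errors.Is(second, ErrWriteConflict))
}
```

In this model the retry fails on the first key it checks, which mirrors why the production log names a single `!txn|rb|` key per failure.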

Why idempotent is safe

The rollback marker payload is a deterministic single byte (txnRollbackVersion), so multiple writes are byte-identical. All mutations in the first abort commit atomically via a single ApplyMutations → pebble batch, so if the marker is visible the lock/intent cleanup is visible too — there is no partial-abort state.

Fix

Probe txnRollbackKey at the top of handleAbortRequest. If the marker is present, return nil without enqueuing any mutations. This adds one cheap GetAt to the hot abort path: the common case (fresh abort, marker absent) pays one extra point lookup served from the block cache.
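The probe has three outcomes: marker found, marker absent, or an unrelated read error. A minimal sketch of that classification follows, assuming a hypothetical `getter` function type and a stand-in `errKeyNotFound`; neither is the real store API, and reviewers later in this thread question whether the marker alone is a sufficient signal.

```go
package main

import (
	"errors"
	"fmt"
)

// errKeyNotFound is a stand-in for the store's not-found error (hypothetical here).
var errKeyNotFound = errors.New("key not found")

// getter is a hypothetical, narrowed view of a point lookup used only in this sketch.
type getter func(key string) ([]byte, error)

// alreadyAborted reports whether the rollback marker exists.
// Not-found means "fresh abort"; any other error is propagated to the caller.
func alreadyAborted(get getter, markerKey string) (bool, error) {
	_, err := get(markerKey)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errKeyNotFound):
		return false, nil
	default:
		return false, fmt.Errorf("probe rollback marker: %w", err)
	}
}

// mapGetter adapts a plain map to the getter type for demonstration.
func mapGetter(m map[string][]byte) getter {
	return func(key string) ([]byte, error) {
		v, ok := m[key]
		if !ok {
			return nil, errKeyNotFound
		}
		return v, nil
	}
}

func main() {
	store := map[string][]byte{}
	get := mapGetter(store)

	done, err := alreadyAborted(get, "rb|pk|10")
	fmt.Println("before first abort:", done, err)

	store["rb|pk|10"] = []byte{1} // the first abort writes the marker
	done, err = alreadyAborted(get, "rb|pk|10")
	fmt.Println("after first abort: ", done, err)
}
```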

Test plan

  • go test -race -count=1 -short ./kv/... (3.8s) green
  • TestFSMAbort_SecondAbortIsIdempotent pins both same-abortTS retry and later-abortTS retry (HLC-monotonic, the prod lock-resolver race path)
  • Deploy; "secondary write failed" log rate on !txn|rb| should drop to zero

Relates to the BullMQ stalled-check traffic class across the relationship, deliver, objectStorage, and webhook queues.

Summary by CodeRabbit

  • Bug Fixes

    • Abort operations are now idempotent. Retrying an abort request with the same or later timestamp safely returns without duplicate processing or write conflicts.
  • Tests

    • Updated abort tests to verify idempotent retry behavior.

Production log spam:
  "secondary write failed" ... "write conflict"
  key: !txn|rb|!redis|ttl|<BullMQ stalled-check key>+<startTS>

Root cause: the 2PC abort path is not idempotent. Once an abort has
run to completion, the rollback marker !txn|rb|<primaryKey>+<startTS>
is present at commitTS = abortTS. A second abort of the same
(primaryKey, startTS) pair — from a concurrent lock-resolver race, a
retry, or a dualwrite async replay — rebuilds the same Delete
mutations on the already-tombstoned lock/intent keys and a duplicate
Put on the rollback marker. Every one of those has a latestCommitTS
= abortTS > startTS so MVCC checkConflicts returns
ErrWriteConflict.

The rollback marker's contract is "this txn was aborted". Its
payload is a deterministic single byte (txnRollbackVersion), so
multiple identical writes carry no semantic difference. The work
the retry tries to do has already been done atomically in the first
apply (ApplyMutations is a single pebble batch), so skipping the
retry is equivalent to a second-writer-wins + idempotent apply, at
no cost.

Fix: probe txnRollbackKey at the top of handleAbortRequest. When
it's already present return nil without enqueuing any mutations.
Cheap GetAt on the hot abort path; the common case (fresh abort,
marker absent) pays one extra point lookup which the pebble block
cache will serve hot.

Safety argument: the rollback marker appears in the store only via
ApplyMutations, which writes it atomically together with the
lock/intent deletes. If the marker is visible at readTS ∞, the
cleanup was visible too. There is no partial-abort state where the
marker exists but the locks remain.

Test: TestFSMAbort_SecondAbortIsIdempotent (renamed from the prior
TestFSMAbort_SecondAbortSameTimestampConflicts, whose assertion was
exactly the bug this patch fixes). Pins both same-abortTS retry and
later-abortTS retry (HLC-monotonic, the prod resolver-race path).
coderabbitai Bot commented Apr 22, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 6457c50 and 8db6aba.

📒 Files selected for processing (2)
  • kv/fsm.go
  • kv/fsm_abort_test.go
Walkthrough

The pull request adds idempotency to the ABORT operation in the key-value store's FSM by implementing an early-exit check in handleAbortRequest. If a rollback marker already exists for the given primary key and start timestamp, the function returns immediately without reprocessing abort cleanup or mutations. Non-ErrKeyNotFound errors during this check are propagated.

Changes

| Cohort / File(s) | Summary |
|---|---|
| FSM Abort Idempotency (kv/fsm.go) | Added short-circuit logic to handleAbortRequest that queries for an existing rollback marker key; if found, returns nil immediately; if GetAt fails with an error other than store.ErrKeyNotFound, wraps and returns the error. |
| Abort Idempotency Test (kv/fsm_abort_test.go) | Renamed test to TestFSMAbort_SecondAbortIsIdempotent and updated assertions to verify that repeated ABORT calls with the same or later abortTS return nil (idempotent behavior) instead of conflicting. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 An ABORT came twice to the store,
We'd clash and we'd scream and we'd roar,
But now with a peek at the marker so sleek,
We nod and we hop—idempotent encore! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(kv): make txn abort idempotent when rollback marker exists' directly summarizes the main change: adding idempotency to transaction abort when a rollback marker already exists. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kv/fsm.go (1)

528-564: ⚠️ Potential issue | 🔴 Critical

Don’t let the rollback marker suppress outstanding secondary cleanup.

Line 541 returns nil solely because txnRollbackKey(primaryKey, startTS) exists, but that marker is written when the current abort batch includes the primary key; it does not prove every secondary key in later abort batches was cleaned. A primary-only abort can create the marker, then a later secondary-key abort will be skipped here and leave that secondary lock/intent behind.

Please only short-circuit after verifying the requested keys are already resolved, or continue cleaning matching outstanding locks while suppressing only the duplicate rollback-marker write. Add a regression like: prepare {primary, secondary} → abort {primary} → abort {secondary} → assert txnLockKey(secondary) and txnIntentKey(secondary) are gone.
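The orphan scenario in this finding can be reproduced on a toy model. The sketch below assumes a hypothetical in-memory `toyFSM` that tracks one lock/intent flag per key and writes the rollback marker only when the abort batch includes the primary key, as the comment above describes. It is not the real kv FSM, and every name in it is invented for illustration.

```go
package main

import "fmt"

// toyFSM is a hypothetical model, not the real kv FSM.
type toyFSM struct {
	locks   map[string]bool // key -> lock/intent pair still present
	markers map[string]bool // "<primaryKey>+<startTS>" -> rollback marker present
}

func newToyFSM() *toyFSM {
	return &toyFSM{locks: map[string]bool{}, markers: map[string]bool{}}
}

func markerID(primary string, startTS uint64) string {
	return fmt.Sprintf("%s+%d", primary, startTS)
}

// prepare places a lock/intent pair on every key.
func (f *toyFSM) prepare(keys ...string) {
	for _, k := range keys {
		f.locks[k] = true
	}
}

// abortWithShortCircuit models the early return under review:
// if the marker exists, skip the request entirely.
func (f *toyFSM) abortWithShortCircuit(primary string, startTS uint64, keys ...string) {
	id := markerID(primary, startTS)
	if f.markers[id] {
		return // marker present: no cleanup for the requested keys
	}
	for _, k := range keys {
		delete(f.locks, k)
		if k == primary {
			f.markers[id] = true // marker is written only with the primary key
		}
	}
}

// secondaryOrphaned runs prepare {primary, secondary} -> abort {primary} -> abort {secondary}
// and reports whether the secondary lock is still present afterwards.
func secondaryOrphaned() bool {
	f := newToyFSM()
	f.prepare("primary", "secondary")
	f.abortWithShortCircuit("primary", 10, "primary")
	f.abortWithShortCircuit("primary", 10, "secondary")
	return f.locks["secondary"]
}

func main() {
	fmt.Println("secondary lock orphaned:", secondaryOrphaned())
}
```

A regression test against the real FSM would assert on txnLockKey(secondary) and txnIntentKey(secondary) instead of the toy map.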

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@kv/fsm.go` around lines 528 - 564, The current early return that checks
f.store.GetAt(txnRollbackKey(...)) must be removed because the rollback marker
only proves the primary was handled, not that all secondary keys are cleaned;
instead, keep processing mutations (uniq, buildAbortCleanupStoreMutations,
ApplyMutations) but suppress only the duplicate rollback-marker Put: after
calling buildAbortCleanupStoreMutations (or inside appendRollbackRecord), detect
if the rollback marker already exists and if so do not add the txnRollbackKey
Put (or strip it out of storeMuts) while still applying remaining storeMuts and
returning their result; also add the regression test you described (prepare
{primary, secondary} → abort {primary} → abort {secondary} → assert
txnLockKey(secondary)/txnIntentKey(secondary) are removed). Ensure references:
txnRollbackKey, f.store.GetAt, uniqueMutations, buildAbortCleanupStoreMutations,
appendRollbackRecord, ApplyMutations.
📥 Commits

Reviewing files that changed from the base of the PR and between 76ea3c4 and 6457c50.

📒 Files selected for processing (2)
  • kv/fsm.go
  • kv/fsm_abort_test.go

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to make transaction ABORT handling idempotent in the KV FSM to prevent abort-retry races from producing MVCC write conflicts (notably on !txn|rb|... rollback-marker keys) and related log spam in production.

Changes:

  • Add an early rollback-marker existence probe in handleAbortRequest to short-circuit repeat aborts.
  • Update/rename the abort FSM test to assert abort retries (same and later abortTS) are no-ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| kv/fsm.go | Adds rollback-marker existence check to treat repeated aborts as idempotent. |
| kv/fsm_abort_test.go | Replaces the prior “second abort conflicts” test with an idempotency test for abort retries. |

Comment thread kv/fsm.go Outdated
Comment on lines +528 to +545
// Idempotency short-circuit: if the rollback marker for this
// (primaryKey, startTS) already exists, a previous abort already
// completed the whole cleanup atomically (ApplyMutations writes the
// rollback marker together with the lock/intent deletes in one
// batch, so the marker's presence proves cleanup ran). Without this
// guard a retry or a concurrent second lock-resolver would re-emit
// Delete mutations on already-tombstoned lock/intent keys and a
// duplicate rollback-marker Put — all three would be rejected by
// the MVCC store as write conflicts (latestCommitTS > startTS) and
// surface in prod as "secondary write failed" log spam without
// changing any state. Rollback markers are deterministic
// ({txnRollbackVersion}) so second-writer-wins would be equivalent
// anyway; skipping the work is simpler and cheaper.
if _, err := f.store.GetAt(ctx, txnRollbackKey(meta.PrimaryKey, startTS), ^uint64(0)); err == nil {
	return nil
} else if !errors.Is(err, store.ErrKeyNotFound) {
	return errors.WithStack(err)
}
Copilot AI commented Apr 22, 2026

The rollback-marker presence does not prove that all lock/intent cleanup for the current abort request’s keys has already run. In particular, ShardStore.tryAbortExpiredPrimary issues an ABORT with only the primary key (kv/shard_store.go:1093), which writes the rollback marker on the primary shard; subsequent per-key aborts for other keys on that same shard (e.g., resolveTxnLockForKey / lock resolver) must still delete those keys’ lock/intent entries. With this unconditional early-return, any later ABORT request targeting a non-primary key that hashes to the primary shard will no-op once the marker exists, potentially leaving orphaned locks/intents indefinitely.

Suggested fix: only short-circuit when the abort request is known to be a pure retry of a prior same-keyset abort. A minimal safe change is to gate this check on whether the request includes the primary key (abortingPrimary), or otherwise avoid returning early and instead skip only the rollback-marker Put while still running cleanup for the requested keys.

…n lock absent

The previous fix short-circuited the whole abort request on rollback-marker
presence, which can leave secondaries orphaned: ShardStore.tryAbortExpiredPrimary
issues an ABORT with only the primary key, writing the rollback marker; a later
lock-resolver abort for a secondary (same primaryKey, same startTS) would then
see the marker and skip that secondary lock/intent cleanup.

Restructure idempotency to be per-key:
- Remove the broad short-circuit in handleAbortRequest.
- shouldClearAbortKey now returns false when the lock is missing (lock and
  intent are always written/deleted together, so lock-missing iff intent-missing;
  re-emitting Deletes on tombstoned keys would just trigger MVCC conflicts).
- appendRollbackRecord skips the Put if the marker is already present
  (idempotent; skip commit-wins check on this path since commit must be
  absent when the marker exists).

Add regression tests: SecondAbortDifferentKeysCleansRemainder reproduces the
orphan scenario (fails without the fix), LockResolverRaceLeavesNoOrphan
simulates the full prod flow, SameKeysIsIdempotent preserves the original
retry-safety invariant.
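The per-key restructuring described in this commit message can be sketched on a toy model. The code below assumes a hypothetical `toyFSM` that tracks one lock/intent flag per key and treats the marker as written only when the batch includes the primary key. `abortPerKey` is an invented name that folds the roles of shouldClearAbortKey and appendRollbackRecord into one function for brevity; it is not the real implementation.

```go
package main

import "fmt"

// toyFSM is a hypothetical model, not the real kv FSM.
type toyFSM struct {
	locks   map[string]bool // key -> lock/intent pair still present
	markers map[string]bool // "<primaryKey>+<startTS>" -> rollback marker present
}

func newToyFSM() *toyFSM {
	return &toyFSM{locks: map[string]bool{}, markers: map[string]bool{}}
}

// prepare places a lock/intent pair on every key.
func (f *toyFSM) prepare(keys ...string) {
	for _, k := range keys {
		f.locks[k] = true
	}
}

// abortPerKey decides idempotency key by key:
// a key whose lock is already gone emits no Delete, and the marker Put
// is skipped when the marker is already present.
func (f *toyFSM) abortPerKey(primary string, startTS uint64, keys ...string) (cleared int, wroteMarker bool) {
	includesPrimary := false
	for _, k := range keys {
		if k == primary {
			includesPrimary = true
		}
		if !f.locks[k] {
			continue // lock missing: nothing to clear for this key
		}
		delete(f.locks, k)
		cleared++
	}
	id := fmt.Sprintf("%s+%d", primary, startTS)
	if includesPrimary && !f.markers[id] {
		f.markers[id] = true
		wroteMarker = true
	}
	return cleared, wroteMarker
}

func main() {
	f := newToyFSM()
	f.prepare("primary", "secondary")
	fmt.Println(f.abortPerKey("primary", 10, "primary"))              // primary-only abort
	fmt.Println(f.abortPerKey("primary", 10, "secondary"))            // later secondary abort
	fmt.Println(f.abortPerKey("primary", 10, "primary", "secondary")) // full retry
	fmt.Println("secondary lock left:", f.locks["secondary"])
}
```

In this model a full retry emits no mutations at all, so there is nothing for a conflict check to reject, while the split primary-then-secondary sequence still clears the secondary.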

bootjp commented Apr 22, 2026
