Conversation
…add transfer-in-progress sentinel Addresses CodeRabbit and gemini feedback on PR #558. - kv/coordinator.go LeaseRead + kv/sharded_coordinator.go groupLeaseRead: mirror the dispatch-path fix by invalidating the lease only when isLeadershipLossError reports a real leadership signal. Previously a LinearizableRead deadline or transient transport error would force the remainder of the lease window onto the slow path, reproducing the same regression the write-path fix addressed. - internal/raftengine/etcd/engine.go: add errLeadershipTransferInProgress marked with raftengine.ErrLeadershipTransferInProgress, and fail Propose fast when BasicStatus().LeadTransferee != 0 so callers see an errors.Is- matchable error instead of hanging on a proposal etcd/raft silently drops during transfer. - Refresh the stale comment on refreshLeaseAfterDispatch to reflect the filtered invalidation contract. - Tests: add the hashicorp ErrLeadershipTransferInProgress marked case to kv.TestIsLeadershipLossError and pin errLeadershipTransferInProgress in the etcd sentinel test.
|
/gemini review |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 40 minutes and 34 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the errLeadershipTransferInProgress error and a fast-fail check in the etcd raft engine to prevent proposal hangs during leadership transfers. It also updates KV coordinators to only invalidate leases on actual leadership loss, avoiding performance degradation from transient errors. Review feedback identifies that ReadIndex and configuration change paths are also susceptible to similar hangs and should include the same check for consistency.
| if e.rawNode.BasicStatus().LeadTransferee != 0 { | ||
| req.done <- proposalResult{err: errors.WithStack(errLeadershipTransferInProgress)} | ||
| return | ||
| } |
There was a problem hiding this comment.
While this fast-fail check correctly prevents proposals from hanging during a leadership transfer, the same issue affects ReadIndex requests in handleRead. In etcd/raft, MsgReadIndex is also silently dropped when leadTransferee != 0, which will cause LinearizableRead (and consequently LeaseRead) to hang until the context deadline.
Since this PR specifically aims to improve the LeaseRead path, and LeaseRead relies on LinearizableRead (which calls handleRead), missing this check in handleRead (around line 1051) will leave the read path susceptible to hangs during transfers. Consider applying a similar check to handleRead and the configuration change paths (handleAddVoter, handleRemoveServer) for consistency. When modifying these, ensure that handleAddVoter synchronously publishes the configuration snapshot before advertising membership, and handleRemoveServer triggers a graceful shutdown if the node removes itself, as per repository standards.
References
- When adding a new voter to a raft cluster, synchronously publish the configuration snapshot before advertising the new membership to prevent the voter from receiving a stale snapshot.
- When a raft node processes a committed configuration change that removes itself from the cluster, it should trigger a graceful shutdown of its own raft engine.
There was a problem hiding this comment.
Code Review
This pull request introduces a new error sentinel, ErrLeadershipTransferInProgress, to explicitly signal when etcd/raft drops proposals during a leadership transfer. It also refines lease invalidation logic in the KV coordinators to only trigger on genuine leadership loss signals, preventing transient errors from forcing subsequent reads onto the slow path. Feedback suggests extending the leadership transfer check to other Raft entry points for consistency and correcting the documentation regarding proposal drops.
| // etcd/raft silently drops proposals while a leadership transfer is | ||
| // in flight (LeadTransferee != 0). Surface this as a distinct error | ||
| // so callers (lease-read invalidation, proxy retry, etc.) can | ||
| // recognise it via errors.Is(err, raftengine.ErrLeadershipTransferInProgress) | ||
| // instead of hanging on an ack that will never come. | ||
| if e.rawNode.BasicStatus().LeadTransferee != 0 { | ||
| req.done <- proposalResult{err: errors.WithStack(errLeadershipTransferInProgress)} | ||
| return | ||
| } |
There was a problem hiding this comment.
The check for LeadTransferee != 0 is a valuable addition for error mapping. To ensure consistency and prevent race conditions (as per repository guidelines for leadership-dependent operations), consider the following:
-
Consistency: Apply this check to all entry points interacting with the Raft loop (
handleRead,handleAddVoter,handleRemoveServer) after any necessary locks are acquired. This ensures consistent leadership-loss signaling and prevents operations from proceeding during a leadership transition. -
Comment Accuracy: The comment mentions "hanging on an ack that will never come". However,
etcd/raft'sRawNode.ProposereturnsErrProposalDroppedwhen a transfer is in progress, which is already handled at line 1042. The primary benefit here is the explicit mapping to the shared sentinel.
| // etcd/raft silently drops proposals while a leadership transfer is | |
| // in flight (LeadTransferee != 0). Surface this as a distinct error | |
| // so callers (lease-read invalidation, proxy retry, etc.) can | |
| // recognise it via errors.Is(err, raftengine.ErrLeadershipTransferInProgress) | |
| // instead of hanging on an ack that will never come. | |
| if e.rawNode.BasicStatus().LeadTransferee != 0 { | |
| req.done <- proposalResult{err: errors.WithStack(errLeadershipTransferInProgress)} | |
| return | |
| } | |
| // etcd/raft drops proposals while a leadership transfer is in flight | |
| // (LeadTransferee != 0). Surface this as a distinct error so callers | |
| // (lease-read invalidation, proxy retry, etc.) can recognize it via | |
| // errors.Is(err, raftengine.ErrLeadershipTransferInProgress) instead | |
| // of receiving a generic proposal-dropped error. | |
| if e.rawNode.BasicStatus().LeadTransferee != 0 { | |
| req.done <- proposalResult{err: errors.WithStack(errLeadershipTransferInProgress)} | |
| return | |
| } |
References
- To prevent race conditions in read-modify-write operations that require leadership, verify leadership status after acquiring the lock that protects the operation.
Addresses gemini feedback on PR #559. etcd/raft silently drops MsgReadIndex while LeadTransferee != 0, and RawNode.ReadIndex does not return an error on drop. Without a fast- fail, LinearizableRead/LeaseRead would block on req.done until the ctx deadline (~read timeout), reintroducing the exact hang the Propose-side check was meant to prevent but on the read path. Also refresh the handleProposal comment: etcd/raft returns ErrProposalDropped (not a hang). The value of the fast-fail is the explicit mapping to the shared ErrLeadershipTransferInProgress sentinel, not avoiding a hang.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces explicit handling for leadership transfers within the etcd raft engine by implementing the errLeadershipTransferInProgress sentinel error. This allows the engine to fast-fail proposals and read requests during an active transfer, preventing callers from blocking until a deadline. Furthermore, the lease invalidation logic in both the Coordinate and ShardGroup has been updated to only trigger upon confirmed leadership loss signals, preventing transient errors from unnecessarily forcing reads onto the slower linearizable path. I have no feedback to provide as no review comments were present.
Summary
Follow-up to #558 addressing remaining CodeRabbit + gemini feedback.
LeaseRead+ kv/sharded_coordinator.gogroupLeaseRead: mirror the dispatch-path fix — invalidate the lease only whenisLeadershipLossErrorreports a real leadership signal. Previously aLinearizableReaddeadline or transient transport error would force the remainder of the lease window onto the slow path, reproducing the same regression the write-path fix in fix(lease-read): only invalidate lease on leadership-loss errors #558 addressed.errLeadershipTransferInProgressmarked withraftengine.ErrLeadershipTransferInProgress, and fail Propose fast whenBasicStatus().LeadTransferee != 0so callers see anerrors.Is-matchable error instead of hanging on a proposal that etcd/raft silently drops during transfer.refreshLeaseAfterDispatchto reflect the filtered invalidation contract.ErrLeadershipTransferInProgressmarked case toTestIsLeadershipLossErrorand pinerrLeadershipTransferInProgressin the etcd sentinel test.Test plan
go test -race ./kv/... ./internal/raftengine/...golangci-lintclean