feat(sqs): leadership-refusal hook + flag flip (Phase 3.D PR 4-B-3b) #723
Conversation
Closes the Phase 3.D PR 4-B-2 → PR 5 chain by adding the §8
leadership-refusal safeguard and flipping htfifoCapabilityAdvertised
to true. This is the final piece of the routing+leadership-refusal
pair the §11 PR 4 contract requires before a binary is "marked
htfifo-eligible".
What changes
raftengine: leader-acquired observer (mirror of leader-loss)
- raftengine.Admin gains RegisterLeaderAcquiredCallback. Same
  contract as RegisterLeaderLossCallback (non-blocking, panic-
  contained, sentinel-pointer deregister) but fires on the
  previous!=Leader → status==Leader edge instead of on leaving
  the leader role (a minimal sketch of the contract follows this list).
- etcd backend: new leaderAcquiredCbs slice + mutex; fires from
refreshStatus on the leader-acquired edge AFTER e.isLeader is
published, so a callback that calls engine.State() observes
StateLeader.
- register / fire helpers extracted (registerLeaderCallback,
  gatherLeaderCallbacks) so the leader-loss and leader-acquired
  paths share one slot-management implementation. The dupl
  lint warning that fired on the first draft is the check that
  keeps this consolidated.
- Test coverage: leader_acquired_callback_test.go mirrors
leader_loss_callback_test.go — panic containment, empty-list
safety, deregister removal, deregister idempotence, nil-fn
safety, nil-receiver safety, identical-fn disambiguation.
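A minimal, self-contained sketch of that observer contract (illustrative toy types only, not the project's raftengine implementation): register returns a deregister handle keyed on the slot's pointer identity, and fire snapshots the callbacks under the lock, then invokes each one outside it with a per-callback recover.

package main

import (
	"fmt"
	"sync"
)

type callbackSlot struct{ fn func() }

type leaderObserver struct {
	mu    sync.Mutex
	slots []*callbackSlot
}

// Register appends fn and returns a deregister func keyed on the slot's
// pointer identity, so two registrations of an identical fn stay distinct.
func (o *leaderObserver) Register(fn func()) (deregister func()) {
	slot := &callbackSlot{fn: fn}
	o.mu.Lock()
	o.slots = append(o.slots, slot)
	o.mu.Unlock()
	return func() {
		o.mu.Lock()
		defer o.mu.Unlock()
		for i, s := range o.slots {
			if s == slot {
				o.slots = append(o.slots[:i], o.slots[i+1:]...)
				return
			}
		}
	}
}

// Fire snapshots the callbacks while holding the lock, then invokes each
// outside it, containing panics so one bad holder cannot break the rest.
func (o *leaderObserver) Fire() {
	o.mu.Lock()
	snapshot := make([]func(), len(o.slots))
	for i, s := range o.slots {
		snapshot[i] = s.fn
	}
	o.mu.Unlock()
	for _, fn := range snapshot {
		func() {
			defer func() {
				if r := recover(); r != nil {
					fmt.Println("leader-acquired callback panic contained:", r)
				}
			}()
			fn()
		}()
	}
}

func main() {
	var obs leaderObserver
	dereg := obs.Register(func() { fmt.Println("became leader") })
	obs.Fire() // became leader
	dereg()
	obs.Fire() // nothing fires; calling dereg() again is also safe
}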
main: SQS leadership-refusal hook
- main_sqs_leadership_refusal.go: installSQSLeadershipRefusal +
installSQSLeadershipRefusalAcrossGroups + partitionedGroupSet.
- On install, if the engine is currently leader of a group
hosting a partitioned queue and the binary lacks htfifo, the
hook calls TransferLeadership immediately. The
leader-acquired observer then keeps catching future
transitions for the same group.
- TransferLeadership runs in a goroutine because the
leader-acquired callback contract is non-blocking — a
synchronous admin RPC inside the callback would stall
refreshStatus.
- sqsLeadershipController is the small interface the helper
  accepts (subset of raftengine.Admin) so test doubles don't
  have to satisfy the full Admin surface; a sketch of the
  helper's shape follows this list.
- run() wires installSQSLeadershipRefusalAcrossGroups after the
coordinator is built; the composite deregister flows through
cleanup.
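A hedged sketch of the hook's shape, following the helper described above but with assumed types for the narrow controller interface (the real sqsLeadershipController and engine-state values may differ):

package refusal

import (
	"context"
	"log/slog"
)

type raftState int

const (
	stateFollower raftState = iota
	stateLeader
)

// sqsLeadershipController is the assumed subset of the engine admin surface
// the hook needs; the real interface lives beside main_sqs_leadership_refusal.go.
type sqsLeadershipController interface {
	State() raftState
	TransferLeadership(ctx context.Context) error
	RegisterLeaderAcquiredCallback(fn func()) (deregister func())
}

// installSQSLeadershipRefusal registers a refusal callback for a group that
// hosts a partitioned queue when this binary does not advertise htfifo.
func installSQSLeadershipRefusal(ctx context.Context, admin sqsLeadershipController,
	gid uint64, partitionedGroups map[uint64]bool, advertisesHTFIFO bool,
	logger *slog.Logger) func() {
	if advertisesHTFIFO || !partitionedGroups[gid] {
		return func() {} // capable binary or unpartitioned group: nothing to refuse
	}
	refuse := func() {
		// Offload the admin RPC so the non-blocking callback returns immediately.
		go func() {
			if err := admin.TransferLeadership(ctx); err != nil {
				logger.Warn("sqs: TransferLeadership failed", "group", gid, "err", err)
			}
		}()
	}
	deregister := admin.RegisterLeaderAcquiredCallback(refuse)
	if admin.State() == stateLeader {
		refuse() // already leading at install time (or won an election mid-install)
	}
	return deregister
}

Usage would mirror installSQSLeadershipRefusalAcrossGroups: call this once per shard runtime and compose the returned deregister funcs into the cleanup path.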
adapter: AdvertisesHTFIFO + flag flip
- adapter.AdvertisesHTFIFO() reports the htfifo capability flag
  so main.go can read it without touching the package-private
  constant (assumed shape sketched below).
- htfifoCapabilityAdvertised = false → true. Both the routing
wiring (PR 4-B-2 #715) and the leadership-refusal hook (this
PR) are now in the binary, so the design's "marked
htfifo-eligible" bar is met.
What's still gated
PR 5 lifts the PartitionCount > 1 dormancy gate AND wires
PollSQSHTFIFOCapability (PR 4-B-3a #721) into the CreateQueue
gate in the same commit. Until PR 5 lands, no partitioned queue
can land in production — the leadership-refusal hook is dormant
in the happy-path runtime (every binary past this PR advertises
htfifo, and the per-group early return keeps the hook out of the
hot path).
Self-review (per CLAUDE.md)
1. Data loss — control-plane only; no FSM/Pebble/retention path.
The hook calls TransferLeadership which is itself an admin
action with the same data-loss profile as a graceful manual
transfer. No issue.
2. Concurrency / distributed failures — leader-acquired callback
contract mirrors leader-loss (non-blocking, panic-contained,
sentinel-pointer deregister). refuse() offloads the actual
TransferLeadership to a goroutine so refreshStatus stays
non-blocking. Multiple goroutines calling refuse() for the
same group queue serially in raft's admin channel; the worst
case is one redundant transfer attempt, which is idempotent
on the raft side. No issue.
3. Performance — leader-acquired callbacks fire only on the
transition edge (rare event); no per-request hot path cost.
The early return on advertisesHTFIFO=true means
production-binary hosts pay zero overhead. No issue.
4. Data consistency — the hook protects against the §8
downgrade scenario: a node rolled back to a pre-htfifo
binary that still gets elected leader of a partitioned-queue
shard would otherwise read/write under the legacy keyspace
and silently corrupt the queue. The hook steps it down via
TransferLeadership before any client request lands. No issue.
5. Test coverage — 7 raftengine observer tests (mirror of
leader-loss panic / empty / deregister / idempotence / nil
guards / sentinel-pointer disambiguation) + 11 main-side
helper tests (htfifo no-op, no-partitioned-queue no-op,
startup-already-leader refuses, startup-follower waits,
per-acquisition fires, deregister propagates, transfer error
logged, nil-admin safe, partitionedGroupSet flatten / empty /
malformed).
@claude review Phase 3.D PR 4-B-3b — leadership-refusal hook (raftengine leader-acquired observer + main-side helper) and
Claude finished @bootjp's task in 3m 33s — Review: Phase 3.D PR 4-B-3b — Leadership-Refusal Hook + Flag Flip
Finding 1 — Bug:
📝 Walkthrough
Enables HT-FIFO capability advertisement in the SQS adapter, adds a leader-acquired callback API to the Raft engine, and installs an SQS leadership-refusal observer that forces non-HT-FIFO nodes hosting partitioned FIFO groups to transfer leadership.
Sequence Diagram
sequenceDiagram
participant Coord as Coordinator
participant Node as Runtime / Node
participant Eng as Raft Engine
participant SQS as SQS Refusal Hook
Coord->>Node: Start runtime (with partition map)
Node->>Eng: WithPartitionResolver(...) initialised
alt Binary advertises HT-FIFO
Note over Coord,Eng: No refusal observers installed
else Binary does NOT advertise HT-FIFO
Coord->>Eng: installSQSLeadershipRefusalAcrossGroups(...)
Eng->>Eng: RegisterLeaderAcquiredCallback(refusalFn)
alt Engine is already Leader
Eng->>SQS: refusalFn invoked immediately (sync)
activate SQS
SQS->>Eng: TransferLeadership() (async)
deactivate SQS
else Starts as Follower
Note over Eng: Wait for leader-acquire event
end
alt Later acquires leadership
Eng->>Eng: refreshStatus → detects Leader
Eng->>SQS: fire registered refusalFn
activate SQS
SQS->>Eng: TransferLeadership() (async)
deactivate SQS
end
end
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe99a436e5
// mutex so callers can fire them without holding the lock.
// Mirrors the snapshot-then-fire pattern used by the per-callback
// invoke helpers.
func gatherLeaderCallbacks(mu *sync.Mutex, cbs []leaderCallbackSlot) []func() {
Lock callback slice before taking snapshot
gatherLeaderCallbacks takes cbs by value, so both fireLeaderLossCallbacks and fireLeaderAcquiredCallbacks read e.leader*Cbs outside the mutex before mu.Lock() runs. That regresses the previous lock-then-copy behavior and introduces a real data race when a callback is registered/deregistered concurrently with refreshStatus firing callbacks (e.g., startup/shutdown churn), which can yield undefined behavior or race-detector failures. Please move the slice read under the lock (e.g., pass a pointer to the slice and dereference only after locking).
Code Review
This pull request enables the SQS HT-FIFO capability and implements a leadership-refusal safeguard by introducing a leader-acquisition callback mechanism in the Raft engine. This allows nodes to automatically transfer leadership if they are ineligible to host partitioned queues. Review feedback highlighted critical data races in the shared callback firing logic and suggested enhancing panic recovery logs with node identity and stack traces for better production debugging.
// invokeLeaderLossCallback) so a bug in one holder cannot break
// subsequent callbacks or crash the process.
func (e *Engine) fireLeaderLossCallbacks() {
	for _, fn := range gatherLeaderCallbacks(&e.leaderLossCbsMu, e.leaderLossCbs) {
This line contains a data race. The slice header of e.leaderLossCbs is read here to be passed by value to gatherLeaderCallbacks, but this read is not protected by e.leaderLossCbsMu. If another goroutine modifies the slice (e.g., via RegisterLeaderLossCallback) concurrently, it can lead to undefined behavior or a crash. The slice should be passed by pointer so that gatherLeaderCallbacks can read the header under the lock.
-	for _, fn := range gatherLeaderCallbacks(&e.leaderLossCbsMu, e.leaderLossCbs) {
+	for _, fn := range gatherLeaderCallbacks(&e.leaderLossCbsMu, &e.leaderLossCbs) {
References
- To prevent race conditions and the use of stale data, ensure that shared variables are accessed and checked while holding the appropriate lock.
func gatherLeaderCallbacks(mu *sync.Mutex, cbs []leaderCallbackSlot) []func() {
	mu.Lock()
	out := make([]func(), len(cbs))
	for i, c := range cbs {
		out[i] = c.fn
	}
	mu.Unlock()
	return out
}
To prevent data races on the slice header, this helper should accept a pointer to the slice. This allows the function to access the slice's pointer, length, and capacity safely while holding the mutex.
-func gatherLeaderCallbacks(mu *sync.Mutex, cbs []leaderCallbackSlot) []func() {
-	mu.Lock()
-	out := make([]func(), len(cbs))
-	for i, c := range cbs {
-		out[i] = c.fn
-	}
-	mu.Unlock()
-	return out
-}
+func gatherLeaderCallbacks(mu *sync.Mutex, cbs *[]leaderCallbackSlot) []func() {
+	mu.Lock()
+	defer mu.Unlock()
+	out := make([]func(), len(*cbs))
+	for i, c := range *cbs {
+		out[i] = c.fn
+	}
+	return out
+}
References
- To prevent race conditions and the use of stale data, ensure that shared variables are accessed and checked while holding the appropriate lock.
// synchronously. Same panic-containment + non-blocking contract
// as fireLeaderLossCallbacks.
func (e *Engine) fireLeaderAcquiredCallbacks() {
	for _, fn := range gatherLeaderCallbacks(&e.leaderAcquiredCbsMu, e.leaderAcquiredCbs) {
This line also has a data race. e.leaderAcquiredCbs is read without holding e.leaderAcquiredCbsMu before being passed to gatherLeaderCallbacks. It should be passed by pointer instead.
-	for _, fn := range gatherLeaderCallbacks(&e.leaderAcquiredCbsMu, e.leaderAcquiredCbs) {
+	for _, fn := range gatherLeaderCallbacks(&e.leaderAcquiredCbsMu, &e.leaderAcquiredCbs) {
References
- To prevent race conditions and the use of stale data, ensure that shared variables are accessed and checked while holding the appropriate lock.
func (e *Engine) invokeLeaderAcquiredCallback(fn func()) {
	defer func() {
		if r := recover(); r != nil {
			slog.Error("etcd raft engine: leader-acquired callback panic",
				"recover", r)
		}
	}()
	fn()
}
The panic recovery logic here is less informative than its counterpart in invokeLeaderLossCallback. It should include the node identity and a stack trace to facilitate debugging of policy hook failures in production.
func (e *Engine) invokeLeaderAcquiredCallback(fn func()) {
	defer func() {
		if r := recover(); r != nil {
			slog.Error("etcd raft engine: leader-acquired callback panicked",
				slog.String("node_id", e.localID),
				slog.Uint64("raft_node_id", e.nodeID),
				slog.Any("panic", r),
				slog.String("stack", string(debug.Stack())),
			)
		}
	}()
	fn()
}
References
- Include node identity in logs and metadata to ensure correct context and prevent incorrect metadata reuse.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 1028-1039: gatherLeaderCallbacks currently takes the callback
slice by value and locks after reading it, which races with concurrent
register/deregister; change gatherLeaderCallbacks to lock before reading the
slice (accept the slice as a pointer or otherwise take a locked reference) so
the snapshot of []leaderCallbackSlot is performed while mu is held, and update
callers fireLeaderLossCallbacks and fireLeaderAcquiredCallbacks to pass
&e.leaderLossCbs and &e.leaderAcquiredCbs (using the same mu) so the function
reads a consistent slice of leaderCallbackSlot entries (and then returns the
copied []func() for use outside the lock).
In `@main_sqs_leadership_refusal.go`:
- Around line 97-103: Register the leader-acquired callback before doing the
immediate startup leader check: call
admin.RegisterLeaderAcquiredCallback(refuse) first to subscribe to future
acquisitions, then check admin.State() == raftengine.StateLeader and call
refuse() if true; this ensures the refuse() callback isn't missed if leadership
is gained between the check and registration. Reference:
admin.RegisterLeaderAcquiredCallback, admin.State(), refuse().
📒 Files selected for processing (7)
adapter/sqs.go, internal/raftengine/engine.go, internal/raftengine/etcd/engine.go, internal/raftengine/etcd/leader_acquired_callback_test.go, main.go, main_sqs_leadership_refusal.go, main_sqs_leadership_refusal_test.go
PR #723 round 1 review caught four items. This commit lands all four (one is a real concurrency bug I introduced when refactoring the leader-callback duplication).

1) P1 (Codex) / HIGH (Gemini) — data race in gatherLeaderCallbacks
My round 1 refactor extracted gatherLeaderCallbacks(mu, cbs []leaderCallbackSlot). Passing the slice by value means the slice header (pointer, length, capacity) is dereferenced at the call site — i.e. BEFORE mu.Lock() runs inside the helper. A concurrent registerLeaderCallback on another goroutine could mutate the header while the caller is reading it, triggering -race detector failures or undefined behaviour.
Fix: gatherLeaderCallbacks now takes *[]leaderCallbackSlot (pointer to slice). The slice is dereferenced INSIDE the locked section, closing the race. Both fireLeaderLossCallbacks and fireLeaderAcquiredCallbacks were updated to pass &e.leader*Cbs. This restores the pre-refactor "lock-then-copy" semantics the original RegisterLeaderLossCallback had. The lesson: passing a slice by value to a "thread-safe" helper does not give you mutex safety, because function-argument evaluation happens before the function body locks anything (a minimal illustration follows this comment).

2) MEDIUM (Gemini) / Claude finding 1 — invokeLeaderAcquiredCallback log fields
Mirror invokeLeaderLossCallback's slog.Error fields:
slog.String("node_id", e.localID)
slog.Uint64("raft_node_id", e.nodeID)
slog.Any("panic", r)
slog.String("stack", string(debug.Stack()))
Without these, an SQS leadership-refusal hook panicking in production would leave operators with only the recovered value to grep on. Same fields the leader-loss path has, so cross-family triage is consistent.

3) Claude finding 2 — TOCTOU window in startup leader check
installSQSLeadershipRefusal previously did:
if admin.State() == StateLeader { refuse() }
return admin.RegisterLeaderAcquiredCallback(refuse)
Window: the engine could win an election between State() and RegisterLeaderAcquiredCallback returning. refreshStatus would fire fireLeaderAcquiredCallbacks before refuse is in the slice, and the hook would miss that acquisition.
Fix: post-registration State() re-check. If the install-time read returned follower but the post-registration read returns leader, refuse() fires for the in-flight election. refuse() is idempotent (TransferLeadership becomes a no-op once a transfer is already in flight), so a double-invocation across the boundary is safe.

4) Claude finding 3 — installSQSLeadershipRefusalAcrossGroups branches
New tests:
- TestInstallSQSLeadershipRefusalAcrossGroups_ComposesDeregisters: multi-group install + composite deregister fan-out.
- TestInstallSQSLeadershipRefusalAcrossGroups_NoPartitionedQueueIsNoOp: empty partition map → no per-group install runs.
- TestInstallSQSLeadershipRefusalAcrossGroups_SkipsNilRuntimes: nil-runtime / nil-engine guard in the iterator.
- TestInstallSQSLeadershipRefusalAcrossGroups_SkipsEngineWithoutObserver: type-assertion-miss log path (engine that doesn't satisfy sqsLeadershipController).
Plus TestInstallSQSLeadershipRefusal_TOCTOUWindowCovered, which pins fix #3 directly: a fake controller whose State() flips during RegisterLeaderAcquiredCallback exercises the post-registration re-check.

5) Audit per the lessons-learned discipline
The semantic change here is the gatherLeaderCallbacks signature. grep -rn confirmed only fireLeaderLossCallbacks and fireLeaderAcquiredCallbacks call it — both are local to this file, both updated together. No external callers to audit.
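The argument-evaluation point from item 1 as a standalone illustration (toy types, not the engine's): the by-value version reads the slice header at the call site before any locking; the by-pointer version dereferences it only under the lock.

package main

import "sync"

type slot struct{ fn func() }

// Racy pattern: cbs (pointer/len/cap) is copied when the caller evaluates the
// argument, before this function ever takes the lock.
func gatherByValue(mu *sync.Mutex, cbs []slot) []func() {
	mu.Lock()
	defer mu.Unlock()
	out := make([]func(), len(cbs))
	for i, c := range cbs {
		out[i] = c.fn
	}
	return out
}

// Safe pattern: only a pointer is passed; the slice header is read under the lock.
func gatherByPointer(mu *sync.Mutex, cbs *[]slot) []func() {
	mu.Lock()
	defer mu.Unlock()
	out := make([]func(), len(*cbs))
	for i, c := range *cbs {
		out[i] = c.fn
	}
	return out
}

func main() {
	var (
		mu   sync.Mutex
		cbs  []slot
		done = make(chan struct{})
	)
	go func() { // concurrent registration mutating the slice header
		for i := 0; i < 1000; i++ {
			mu.Lock()
			cbs = append(cbs, slot{fn: func() {}})
			mu.Unlock()
		}
		close(done)
	}()
	for i := 0; i < 1000; i++ {
		_ = gatherByPointer(&mu, &cbs) // swapping in gatherByValue(&mu, cbs) trips -race
	}
	<-done
}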
@claude review Round 2 — all four round 1 findings addressed:
- P1 (Codex) / HIGH (Gemini) — real data race in gatherLeaderCallbacks: slice now passed by pointer and dereferenced under the lock.
- MEDIUM (Gemini) / Claude finding 1 — invokeLeaderAcquiredCallback panic log now carries node identity and a stack trace.
- Claude finding 2 (TOCTOU) — post-registration State() re-check added.
- Claude finding 3 (branch coverage) — 4 new installSQSLeadershipRefusalAcrossGroups tests plus a TOCTOU-window test.
Lessons-learned audit: semantic change is the gatherLeaderCallbacks signature; only the two local fire helpers call it, both updated.
Claude finished @bootjp's task in 3m 7s — Review: Round 2 — Phase 3.D PR 4-B-3b
All four Round 1 findings are correctly addressed. No new issues introduced. Fix #1 —
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 634fbcf954
if err := admin.TransferLeadership(ctx); err != nil {
	logger.Warn("sqs: TransferLeadership failed",
		"group", gid, "err", err)
}
Fail closed when leadership transfer is rejected
When advertisesHTFIFO is false and this hook fires, a failed TransferLeadership (for example, no eligible transferee in a single-voter group or peers unavailable) is only logged and then ignored. In that case the node remains leader and continues serving despite the hook deciding it is unsafe to lead, which defeats the downgrade-protection guarantee and can still allow incorrect behavior on partitioned queues.
for queue, routing := range partitionMap {
	for _, groupRef := range routing.groups {
		id, err := strconv.ParseUint(groupRef, 10, 64)
Exclude single-partition routes from refusal group set
partitionedGroupSet currently marks groups from every --sqsFifoPartitionMap entry, but PartitionCount=1 is valid input and is not a partitioned queue in the HT-FIFO sense. This makes the refusal hook run for groups that only host single-partition routes, causing unnecessary leadership transfers (and possible avoidable unavailability on constrained groups) without any data-safety benefit.
CI failure on PR #723 (test (ubuntu-latest)): TestServerMapsEngineAdminMethods returned "add voter is not supported by this raft engine" because the type-assertion at internal/raftadmin/server.go:21 (any(engine).(raftengine.Admin)) silently fell back to admin=nil after Phase 3.D PR 4-B-3b extended the Admin interface with RegisterLeaderAcquiredCallback. The test stub fakeEngine never gained the method, so it stopped satisfying the interface — and every admin RPC mapped through NewServer returned Unimplemented.

This is exactly the "semantic-change-without-full-caller-audit" miss the recent lessons-learned discipline targets. PR #723's audit grep was on the production resolver helpers, not on the Admin interface itself; raftadmin's stub slipped through.

Fix: add the missing method on the stub. It returns a no-op deregister — raftadmin doesn't exercise the leader-acquired observer (PR 4-B-3b's SQS leadership-refusal lives in main_sqs_leadership_refusal.go); the stub's only obligation is to satisfy the type assertion.

go test -race ./... and golangci-lint clean.
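The failure mode in miniature (hypothetical names, not the real raftadmin types): extending an interface silently un-satisfies a stub that is only checked through a comma-ok type assertion, so the fallback path is taken without any compile error.

package main

import "fmt"

// Admin grew a method in this PR; the stub below predates the extension.
type Admin interface {
	AddVoter() error
	RegisterLeaderAcquiredCallback(fn func()) (deregister func())
}

type fakeEngine struct{}

func (fakeEngine) AddVoter() error { return nil }

func main() {
	var engine any = fakeEngine{}
	admin, ok := engine.(Admin) // no longer satisfied: admin == nil, ok == false
	fmt.Println(admin, ok)      // <nil> false, so every admin RPC falls back to "unsupported"
}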
@claude review CI failure fix (round 3):
Failure: TestServerMapsEngineAdminMethods — the raftadmin type assertion fell back to a nil admin after the Admin interface gained RegisterLeaderAcquiredCallback.
Lessons-learned miss: my round 2 audit was scoped to the production resolver helpers, not the Admin interface itself, so the raftadmin test stub slipped through.
Fix: add the missing method to the fakeEngine stub, returning a no-op deregister.
Claude finished @bootjp's task in 1m 33s — Review: Round 3 — CI Failure Fix
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/raftengine/etcd/engine.go (1)
2604-2642: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Use the published post-close state for the leader-acquired edge.
refreshStatus() rewrites the published state to StateShutdown when e.closed is set, but this new branch still keys off the pre-close local status.State. If fail()/shutdown closes the engine and the loop reaches one more refreshStatus(), previous is shutdown while status.State can still be leader, so fireLeaderAcquiredCallbacks() runs during teardown. That violates the callback contract and can spuriously trigger TransferLeadership from main_sqs_leadership_refusal.go against a closed engine.
Proposed fix
 e.mu.Lock()
 e.status = status
 if e.closed {
 	e.status.State = raftengine.StateShutdown
 }
+effectiveState := e.status.State
 // Use the voter count, NOT len(e.peers), so a 1-voter + N-learner
 // cluster correctly hits the leader-of-1 fast path: a learner does
 // not vote and is not part of the lease-read denominator. See
 // docs/design/2026_04_26_proposed_raft_learner.md §4.6.
 voterCount := e.voterCount
 e.mu.Unlock()
-e.isLeader.Store(status.State == raftengine.StateLeader)
+e.isLeader.Store(effectiveState == raftengine.StateLeader)
-if status.State == raftengine.StateLeader && voterCount <= 1 {
+if effectiveState == raftengine.StateLeader && voterCount <= 1 {
 	e.singleNodeLeaderAckMonoNs.Store(monoclock.Now().Nanos())
 } else {
 	e.singleNodeLeaderAckMonoNs.Store(0)
 }
-if status.State == raftengine.StateLeader {
+if effectiveState == raftengine.StateLeader {
 	e.leaderOnce.Do(func() { close(e.leaderReady) })
 }
-if previous != raftengine.StateLeader && status.State == raftengine.StateLeader {
+if previous != raftengine.StateLeader && effectiveState == raftengine.StateLeader {
 	e.fireLeaderAcquiredCallbacks()
 }
-if previous == raftengine.StateLeader && status.State != raftengine.StateLeader {
+if previous == raftengine.StateLeader && effectiveState != raftengine.StateLeader {
 	e.failPending(errors.WithStack(errNotLeader))
 	e.ackTracker.reset()
 	e.fireLeaderLossCallbacks()
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/raftengine/etcd/engine.go` around lines 2604 - 2642, refreshStatus() currently mutates e.status to StateShutdown when e.closed but later checks still use the local variable status, allowing a shutdown path to look like a leader and call fireLeaderAcquiredCallbacks(); fix by reading the published/post-close state (e.status) for the leader-related decisions: after unlocking (or immediately after updating e.status) capture the canonical current state (e.status.State) into a local like currentState and use that instead of status.State for e.isLeader.Store, the single-node ack store, the leaderOnce/leaderReady logic, and the previous->current comparison that gates e.fireLeaderAcquiredCallbacks().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@main_sqs_leadership_refusal.go`:
- Around line 143-153: partitionedGroupSet() currently marks every routed group
in partitionMap as partitioned; change it to only add group IDs when the route
is actually partitioned by checking routing.partitionCount > 1 before iterating
routing.groups. Keep the existing strconv.ParseUint error handling and
logger.Warn for non-uint64 groupRef, but wrap the loop that parses and sets
out[id] = true inside an if routing.partitionCount > 1 block (or otherwise skip
routes with partitionCount == 1) so ordinary FIFO groups are not included.
---
Outside diff comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 2604-2642: refreshStatus() currently mutates e.status to
StateShutdown when e.closed but later checks still use the local variable
status, allowing a shutdown path to look like a leader and call
fireLeaderAcquiredCallbacks(); fix by reading the published/post-close state
(e.status) for the leader-related decisions: after unlocking (or immediately
after updating e.status) capture the canonical current state (e.status.State)
into a local like currentState and use that instead of status.State for
e.isLeader.Store, the single-node ack store, the leaderOnce/leaderReady logic,
and the previous->current comparison that gates e.fireLeaderAcquiredCallbacks().
📒 Files selected for processing (4)
internal/raftadmin/server_test.go, internal/raftengine/etcd/engine.go, main_sqs_leadership_refusal.go, main_sqs_leadership_refusal_test.go
out := make(map[uint64]bool)
for queue, routing := range partitionMap {
	for _, groupRef := range routing.groups {
		id, err := strconv.ParseUint(groupRef, 10, 64)
		if err != nil {
			logger.Warn("sqs: leadership-refusal: skipping non-uint64 group reference (config validation bypass?)",
				"queue", queue, "group_ref", groupRef, "err", err)
			continue
		}
		out[id] = true
	}
Only include genuinely partitioned routes in partitionedGroupSet().
This currently marks every routed group as “partitioned”, even when routing.partitionCount == 1. That makes the refusal hook apply to ordinary FIFO groups too, which can force unnecessary leadership transfers on binaries that don't advertise htfifo. The set should only include routes that are actually split.
Proposed fix
 out := make(map[uint64]bool)
 for queue, routing := range partitionMap {
+	if routing.partitionCount <= 1 {
+		continue
+	}
 	for _, groupRef := range routing.groups {
 		id, err := strconv.ParseUint(groupRef, 10, 64)
 		if err != nil {
 			logger.Warn("sqs: leadership-refusal: skipping non-uint64 group reference (config validation bypass?)",
 				"queue", queue, "group_ref", groupRef, "err", err)
📝 Committable suggestion
out := make(map[uint64]bool)
for queue, routing := range partitionMap {
	if routing.partitionCount <= 1 {
		continue
	}
	for _, groupRef := range routing.groups {
		id, err := strconv.ParseUint(groupRef, 10, 64)
		if err != nil {
			logger.Warn("sqs: leadership-refusal: skipping non-uint64 group reference (config validation bypass?)",
				"queue", queue, "group_ref", groupRef, "err", err)
			continue
		}
		out[id] = true
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main_sqs_leadership_refusal.go` around lines 143 - 153, partitionedGroupSet()
currently marks every routed group in partitionMap as partitioned; change it to
only add group IDs when the route is actually partitioned by checking
routing.partitionCount > 1 before iterating routing.groups. Keep the existing
strconv.ParseUint error handling and logger.Warn for non-uint64 groupRef, but
wrap the loop that parses and sets out[id] = true inside an if
routing.partitionCount > 1 block (or otherwise skip routes with partitionCount
== 1) so ordinary FIFO groups are not included.
Summary
Phase 3.D PR 4-B-3b — closes the routing+leadership-refusal pair the §11 PR 4 contract requires before a binary is "marked htfifo-eligible". With this PR landed, every node's /sqs_health advertises htfifo and the §8 downgrade-protection safeguard is in place. PR 5 (next) lifts the PartitionCount > 1 dormancy gate and wires PollSQSHTFIFOCapability (#721) into the CreateQueue gate in the same commit.
What's added
raftengine: leader-acquired observer (mirror of leader-loss)
- raftengine.Admin gains RegisterLeaderAcquiredCallback. Same contract as RegisterLeaderLossCallback (non-blocking, panic-contained, sentinel-pointer deregister) but fires on the previous!=Leader → status==Leader edge in refreshStatus.
- registerLeaderCallback + gatherLeaderCallbacks helpers — this satisfies the dupl lint check and keeps both paths consistent in one place.
main: SQS leadership-refusal hook
- main_sqs_leadership_refusal.go: installSQSLeadershipRefusal(ctx, admin, gid, partitionedGroups, advertisesHTFIFO, logger) func() — startup check + per-acquisition observer.
- installSQSLeadershipRefusalAcrossGroups(...) — composite installer iterating every shard runtime.
- partitionedGroupSet(partitionMap, logger) — flattens --sqsFifoPartitionMap into the {gid → bool} set the hook consumes.
- sqsAdvertisesHTFIFO() — wraps adapter.AdvertisesHTFIFO().
- run() installs the composite refusal across runtimes after the coordinator is built; deregister flows through cleanup.
- TransferLeadership runs in a goroutine because the leader-acquired callback contract is non-blocking — a synchronous admin RPC inside the callback would stall refreshStatus.
- Tests: partitionedGroupSet flatten / empty-input / malformed-ref-skip.
adapter: AdvertisesHTFIFO + flag flip
- adapter.AdvertisesHTFIFO() exposes the package-private flag.
- htfifoCapabilityAdvertised = false → true. Both the routing wiring (PR 4-B-2, feat(sqs): partition resolver for HT-FIFO routing (Phase 3.D PR 4-B-2) #715) and the leadership-refusal hook (this PR) are now in the binary, so the design's "marked htfifo-eligible" bar is met.
What's still gated
PR 5 lifts the PartitionCount > 1 dormancy gate AND wires PollSQSHTFIFOCapability (#721) into the CreateQueue gate in the same commit. Until PR 5 lands, no partitioned queue can land in production — the leadership-refusal hook is dormant in the happy-path runtime (every binary past this PR advertises htfifo, and the per-group early return keeps the hook out of the hot path).
Test plan
- go test -race ./internal/raftengine/etcd/ — 7 new + existing tests pass.
- go test -race ./kv/ — existing 30+ tests pass (verified the tests that previously failed due to the misplaced interface method now pass).
- go test -race ./adapter/ + go test -race . — all pass.
- golangci-lint ./kv/... ./adapter/... ./internal/raftengine/... . — clean.
Self-review (per CLAUDE.md)
- refuse() offloads TransferLeadership to a goroutine so refreshStatus stays non-blocking. Multiple goroutines firing for the same group serialize on raft's admin channel; worst case is one redundant transfer attempt, which is idempotent on the raft side. No issue.
- The early return on advertisesHTFIFO=true means production-binary hosts pay zero overhead. No issue.