test(grpc): absorb leader churn in consistency loops without weakening the check #632

Merged
bootjp merged 2 commits into main from test/grpc-flake-retry
Apr 25, 2026

Owner

@bootjp bootjp commented Apr 25, 2026

Summary

Test_consistency_satisfy_write_after_read_sequence and Test_grpc_transaction both run a 9999-iteration Put/Get loop against a 3-node Raft cluster. On a busy CI runner Raft re-election can fire mid-loop and the in-flight RPC surfaces as rpc error: code = Unknown desc = leader not found (also etcd raft engine is not leader). Today both tests abort on the first such error even though it is purely an availability hiccup — nothing committed, so consistency cannot be violated.

Wrap every RPC in the existing retryNotLeader helper so transient leader churn is absorbed within leaderChurnRetryTimeout. The consistency assertions (assert.Equal for value reads, assert.Nil for post-delete reads) are unchanged: once the Put / Delete eventually commits, the subsequent Get must agree, and a stale read still fails the test loudly. This is the explicit constraint, "without compromising the consistency check itself", and it holds because retryNotLeader only inspects RPC error codes, never the response payload.

The fix is symmetric with how rpushEventually / lpushEventually already wrap Lua list tests for the same class of CI flake.

Test plan

  • Build / vet / golangci-lint clean.
  • Locally: go test ./adapter/ -run 'Test_grpc_transaction|Test_consistency_satisfy_write_after_read_sequence' -count=1 -timeout 600s passes (this is the same race-mode shape that was failing on https://github.com/bootjp/elastickv/actions/runs/24930560663/job/73007486310).
  • No change to the consistency invariants — only the RPC failure path is retried.

/gemini review
@codex review

Summary by CodeRabbit

  • Tests
    • Enhanced robustness of gRPC transaction tests by implementing retry logic for transient failures during leader elections, reducing flaky test failures and improving overall test reliability in distributed cluster scenarios.

… the check

Test_consistency_satisfy_write_after_read_sequence and
Test_grpc_transaction both fire 9999 Put/Get (and Delete/Get) round
trips against a 3-node Raft cluster. On a busy CI runner Raft
re-election can fire mid-loop; the in-flight RPC then surfaces as
"rpc error: code = Unknown desc = leader not found" or
"etcd raft engine is not leader". Today both tests treat this as a
hard failure and abort, even though the error is purely an
availability hiccup — nothing was committed, so consistency cannot be
violated.

Wrap every RPC in the existing retryNotLeader helper so transient
leader churn is absorbed within leaderChurnRetryTimeout. The post-RPC
assertions (assert.Equal for the value-read invariant, assert.Nil for
the post-delete invariant) are unchanged: once the Put / Delete
eventually commits, the subsequent Get must agree, and a stale read
still fails the test loudly. This is the explicit user constraint,
"without compromising the consistency check itself", and it holds
because retryNotLeader only inspects RPC error codes, never the
response payload.

The fix is symmetric with how rpushEventually / lpushEventually
already wrap Lua list tests for the same class of CI flake.

Build / vet / lint clean.
@coderabbitai

coderabbitai Bot commented Apr 25, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 8a9bab4 and 1aed386.

📒 Files selected for processing (1)
  • adapter/grpc_test.go

Walkthrough

The test functions are updated to use the test context instead of background context and wrap RPC operations with retry logic to handle transient "not leader" errors during Raft leader transitions, improving test reliability.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Resilience: adapter/grpc_test.go | Added retryNotLeader() wrapper around Put/Get/Delete RPC calls in Test_consistency_satisfy_write_after_read_sequence and Test_grpc_transaction to gracefully handle transient leader unavailability; replaced context.Background() with t.Context() for RPC execution. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A test that once would flicker and fail,
Now bounces back through the Raft's leader trail—
When "not leader" whispers its transient cry,
Our retries catch it and give it a try! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically describes the main change: wrapping consistency test loops with retry logic to handle transient leader changes without compromising test assertions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request wraps gRPC calls in the Test_consistency_satisfy_write_after_read_sequence and Test_grpc_transaction tests with a retryNotLeader helper to improve resilience against transient Raft leader re-elections. The review feedback suggests using a context that supports cancellation or deadlines instead of context.Background() within the retry closures to ensure better resource management and responsiveness during test execution.

Comment thread adapter/grpc_test.go Outdated
Comment on lines +202 to +206
err := retryNotLeader(context.Background(), func() error {
	_, perr := c.RawPut(context.Background(),
		&pb.RawPutRequest{Key: key, Value: want})
	return perr
})

medium

The closure passed to retryNotLeader uses context.Background() for the RawPut call. Following the project rule to use context.Context for managing deadlines and cancellation, it is better practice to use a context that can be cancelled or one that respects the retry helper's timeout. While this is a test, using context.Background() inside the closure prevents the RPC call from being cancelled if the retry loop times out or the test is interrupted.

References
  1. Use context.Context for managing deadlines and cancellation to ensure proper resource management and responsiveness.

Comment thread adapter/grpc_test.go Outdated
Comment on lines +225 to +229
err = retryNotLeader(context.Background(), func() error {
	var gerr error
	resp, gerr = c.RawGet(context.Background(), &pb.RawGetRequest{Key: key})
	return gerr
})

medium

Similar to the RawPut call, RawGet here uses context.Background() inside the retry closure. In alignment with the project rule to use context.Context for managing deadlines and cancellation, consider using a context that is tied to the test lifecycle or the retry helper's timeout to ensure timely cancellation of in-flight requests during churn or timeouts.

References
  1. Use context.Context for managing deadlines and cancellation to ensure proper resource management and responsiveness.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!


Replace context.Background() in
Test_consistency_satisfy_write_after_read_sequence and
Test_grpc_transaction with the per-test t.Context().
Both tests now derive a single ctx at the top of the test body and
thread it through every RPC and retryNotLeader call. Two effects:

- Test-level cancel (timeout, parent stopping the test, t.Cleanup
  draining) propagates into the in-flight RPC instead of leaking the
  goroutine through context.Background until the gRPC call eventually
  errors out on its own.
- The retry helper observes the same cancel signal, so a stalled
  shutdown does not chase 9999 iterations of "leader not found"
  retries past the point the test has been told to stop.

Behaviour-equivalent on the happy path; the consistency assertions
(assert.Equal / assert.Nil) are unchanged. Build / vet / lint clean,
both tests still pass at 9999 iterations locally.

@bootjp bootjp merged commit dbe4725 into main Apr 25, 2026
8 checks passed
@bootjp bootjp deleted the test/grpc-flake-retry branch April 25, 2026 13:34