Skip to content

test(jepsen): add ZSet safety workload with model-based checker#550

Open
bootjp wants to merge 6 commits intomainfrom
feat/jepsen-zset-safety
Open

test(jepsen): add ZSet safety workload with model-based checker#550
bootjp wants to merge 6 commits intomainfrom
feat/jepsen-zset-safety

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented Apr 19, 2026

Summary

Adds a Jepsen workload (elastickv-zset-safety-test) that verifies ZSet-specific safety properties under faults (network partitions, node kills), using a custom model-based checker. Goes beyond the simple add->read visibility check.

Properties verified:

  • Score correctness: a ZRANGE result's score for a member must equal the model's latest committed score, OR equal a score written by an operation concurrent with the read.
  • Order preservation: ZRANGE 0 -1 must be sorted by (score asc, member lex asc).
  • ZRANGEBYSCORE correctness: bounded range queries return exactly the members whose score falls in the bound, modulo concurrent mutations.
  • No phantom members: every read member must have been introduced by some successful or in-flight ZADD/ZINCRBY.
  • Atomicity: the checker treats every :ok operation as atomic; any visible inconsistency is reported.

Concurrent-ZADD handling uses an invoke/complete windowing approach. A mutation is "committed before" a read iff its :complete index is strictly less than the read's :invoke index. Mutations whose intervals overlap are "concurrent" and contribute to a per-member allowed-score set. Indeterminate (:info) mutations are treated as possibly-concurrent. ZINCRBY whose response is unknown sets :unknown-score? so the checker skips the strict score check for concurrent reads. ZREM carries the actual removed? boolean from the server reply so a no-op ZREM does not falsely mark the member as deleted.

Files

  • New: jepsen/src/elastickv/redis_zset_safety_workload.clj
  • New: jepsen/test/elastickv/redis_zset_safety_workload_test.clj (test-spec construction + checker edge cases: no-op ZREM, :info ZINCRBY, deterministic score mismatch).
  • Modified: jepsen/src/elastickv/jepsen_test.clj (entry point added).
  • Modified: .github/workflows/jepsen-test.yml (5s smoke run on every push).
  • Modified: .github/workflows/jepsen-test-scheduled.yml (150s default run every 6h).

Running locally

The workload is invoked directly via its own -main, not through jepsen-test/-main:

cd jepsen
lein run -m elastickv.redis-zset-safety-workload \
  --time-limit 60 --rate 10 --concurrency 5 \
  --ports 63791,63792,63793 --host 127.0.0.1

(elastickv.jepsen-test exposes elastickv-zset-safety-test only as a Clojure function for REPL use; CI and ad-hoc runs use the namespace's own -main.

Test plan

  • passes (Java 21).
  • Run with partition + kill faults: confirm checker emits a clear failure when a stale-leader read returns a divergent score.
    EOF
    )

Summary by CodeRabbit

  • Tests
    • Added a new safety test suite for Redis ZSet operations, validating data consistency and correctness under concurrent operations and fault conditions.
    • Extended CI/CD pipeline to automatically run ZSet safety tests in both scheduled and on-demand test workflows.

Adds a Jepsen workload that goes beyond add->read visibility and verifies
ZSet-specific safety properties under faults (network partitions, node
kills):

- score correctness: a ZRANGE result's score for a member must equal the
  model's latest committed score for that member, OR equal a score
  written by an operation that is concurrent with the read (since the
  client cannot linearise concurrent writes to the same member).
- order preservation: ZRANGE 0 -1 must be sorted by (score asc, member
  lex asc).
- ZRANGEBYSCORE correctness: bounded range queries return exactly the
  members whose score falls in the bound, modulo concurrent mutations.
- no phantom members: every read member must have been introduced by some
  successful or in-flight ZADD/ZINCRBY.

Concurrent-ZADD handling uses an invoke/complete windowing approach:
mutations whose complete index < read's invoke index are committed
before the read; mutations whose intervals overlap are concurrent and
contribute to the per-member allowed-score set. Indeterminate (:info)
mutations are treated as possibly-concurrent.

Workload entry point added to jepsen_test.clj as elastickv-zset-safety-test.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 19, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 38 minutes and 41 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 38 minutes and 41 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a177dc39-c2e2-45cc-8b55-8052a88e4000

📥 Commits

Reviewing files that changed from the base of the PR and between 9bfcc13 and 86d23dd.

📒 Files selected for processing (2)
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj
📝 Walkthrough

Walkthrough

This pull request introduces a new Redis ZSet safety Jepsen workload (elastickv.redis-zset-safety-workload) to test elastickv's Redis implementation under faults. It includes CI integration in two scheduled/on-demand workflows, the workload implementation with a model-based checker, and comprehensive unit tests validating checker behavior.

Changes

Cohort / File(s) Summary
CI Workflow Integration
.github/workflows/jepsen-test-scheduled.yml, .github/workflows/jepsen-test.yml
Added new CI steps to run the Redis ZSet safety workload with timeout controls, rate limiting, and port/host configuration alongside existing Redis workload tests.
Workload Registration
jepsen/src/elastickv/jepsen_test.clj
Added namespace require for elastickv.redis-zset-safety-workload and exposed a new elastickv-zset-safety-test function to integrate the ZSet safety workload into the CLI.
Workload Implementation
jepsen/src/elastickv/redis_zset_safety_workload.clj
Implemented new Jepsen workload namespace with client operations (:zadd, :zincrby, :zrem, :zrange-all, :zrangebyscore), a model-based checker validating ordering/completeness/score consistency, fault configuration, and CLI option parsing.
Test Coverage
jepsen/test/elastickv/redis_zset_safety_workload_test.clj
Added unit tests for workload construction, option overrides, and checker validation logic across edge cases (no-op ZREM, uncertain scores from :info operations, score mismatches).

Sequence Diagram

sequenceDiagram
    participant Client
    participant Redis
    participant Nemesis as Nemesis (Fault Injection)
    participant Checker

    Client->>Redis: ZADD/ZINCRBY/ZREM/ZRANGE operation
    Redis-->>Client: Response (or timeout/error)
    Note over Client: Record invoke/complete<br/>with type (:ok/:info)
    
    Nemesis->>Redis: Inject faults (partition/stop/etc.)
    Redis-->>Client: Degraded/failed responses
    
    Nemesis->>Redis: Heal/restore
    Redis-->>Client: Recover operations
    
    Checker->>Checker: Extract mutation windows<br/>per operation
    Checker->>Checker: Build per-member<br/>committed state
    Checker->>Checker: Validate ordering<br/>(score ↑, then member ↑)
    Checker->>Checker: Check score correctness<br/>vs committed/concurrent
    Checker->>Checker: Enforce completeness<br/>(no phantom members)
    Checker-->>Checker: Produce :valid?
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A hop through the ZSet maze!

New safety workload hops in place,
Checking scores with fuzzy grace,
When faults crash through, the checker knows,
Each member's rank and all it owes—
Redis ZSets now stand the test! 🎯

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test(jepsen): add ZSet safety workload with model-based checker' accurately summarizes the main change: introducing a new Jepsen test workload for ZSet safety with a model-based checker, which is reflected across all modified and added files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/jepsen-zset-safety

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

- jepsen-test.yml: 5s smoke run on every push, mirroring the other
  workloads.
- jepsen-test-scheduled.yml: 150s default run (overridable via
  workflow_dispatch inputs) every 6 hours.

Workload entry: elastickv.redis-zset-safety-workload (added in the
previous commit via -main).
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Jepsen workload to validate elastickv’s Redis ZSet safety properties (scores, ordering, range correctness, and phantom detection) under faults using a custom model-based checker, and wires a helper entrypoint into the Jepsen test namespace.

Changes:

  • Introduces redis_zset_safety_workload.clj with a Carmine-based client, randomized op generator, and a custom checker for ZSet safety properties.
  • Adds a new wrapper function in jepsen_test.clj intended to expose the new workload.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
jepsen/src/elastickv/redis_zset_safety_workload.clj New ZSet safety Jepsen workload with custom model/checker and CLI entrypoint.
jepsen/src/elastickv/jepsen_test.clj Adds a wrapper function and require for the new ZSet safety test.

Comment on lines +476 to +526
(defn elastickv-zset-safety-test
"Builds a Jepsen test map that drives elastickv's Redis ZSet safety
workload."
([] (elastickv-zset-safety-test {}))
([opts]
(let [nodes (or (:nodes opts) default-nodes)
redis-ports (or (:redis-ports opts)
(repeat (count nodes) (or (:redis-port opts) 6379)))
node->port (or (:node->port opts)
(cli/ports->node-map redis-ports nodes))
local? (:local opts)
db (if local?
jdb/noop
(ekdb/db {:grpc-port (or (:grpc-port opts) 50051)
:redis-port node->port
:raft-groups (:raft-groups opts)
:shard-ranges (:shard-ranges opts)}))
rate (double (or (:rate opts) 10))
time-limit (or (:time-limit opts) 60)
faults (if local?
[]
(cli/normalize-faults (or (:faults opts) [:partition :kill])))
nemesis-p (when-not local?
(combined/nemesis-package {:db db
:faults faults
:interval (or (:fault-interval opts) 40)}))
nemesis-gen (if nemesis-p
(:generator nemesis-p)
(gen/once {:type :info :f :noop}))
workload (elastickv-zset-safety-workload
(assoc opts :node->port node->port))]
(merge workload
{:name (or (:name opts) "elastickv-redis-zset-safety")
:nodes nodes
:db db
:redis-host (:redis-host opts)
:os (if local? os/noop debian/os)
:net (if local? net/noop net/iptables)
:ssh (merge {:username "vagrant"
:private-key-path "/home/vagrant/.ssh/id_rsa"
:strict-host-key-checking false}
(when local? {:dummy true})
(:ssh opts))
:remote control/ssh
:nemesis (if nemesis-p (:nemesis nemesis-p) nemesis/noop)
:final-generator nil
:concurrency (or (:concurrency opts) 5)
:generator (->> (:generator workload)
(gen/nemesis nemesis-gen)
(gen/stagger (/ rate))
(gen/time-limit time-limit))}))))
Comment on lines +251 to +254
:zrem
(let [m (:value invoke)]
{:f :zrem :member m :score nil :zrem? true
:type t :invoke-idx inv-idx :complete-idx cmp-idx})))))
Comment on lines +244 to +249
:zincrby
(let [[m _delta] (:value invoke)
s (when (and (= :ok t) (vector? (:value complete)))
(second (:value complete)))]
{:f :zincrby :member m :score (some-> s double)
:type t :invoke-idx inv-idx :complete-idx cmp-idx})
Comment on lines +271 to +274
strictly precede `read-inv-idx`. Model maps member -> {:score s} or
marks member as :deleted. Returns {:members map :ok-members set}.
Only considers :ok mutations for the authoritative model; :info
mutations are treated as uncertain (neither strictly applied nor not)."
Comment on lines +18 to 23
(defn elastickv-zset-safety-test []
(zset-safety-workload/elastickv-zset-safety-test {}))

(defn -main
[& args]
(cli/run! (cli/single-test-cmd {:test-fn elastickv-test}) args))
Three Copilot findings on PR #550:

1. :zincrby indeterminate handling. Pending or :info ZINCRBY left the
   resulting score unknown, but the checker still required the
   observed read score to be in the finite allowed-scores set. A read
   that legitimately observed an in-flight increment was flagged as a
   score mismatch (false positive).

   completed-mutation-window now sets :unknown-score? on a ZINCRBY when
   the completion is :info or pending. allowed-scores-for-member
   returns :unknown-score? when any concurrent ZINCRBY carries the
   flag, and check-zrange-all / check-zrangebyscore skip the strict
   score-membership check in that case.

2. :zrem no-op handling. ZREM of a never-added member returns 0
   server-side (no-op). The previous model treated every ZREM as a
   deletion, producing missing-member false positives and
   score-mismatch false negatives.

   invoke! already exposes the actual removed? boolean as the second
   element of the completion value. completed-mutation-window now
   threads :removed? through, and the new apply-mutation-to-state
   helper leaves state unchanged when :removed? is false.

3. model-before docstring claimed it returned {:members map :ok-members
   set}, but it returned the model map directly. Docstring rewritten to
   match the actual return value.

Adds jepsen/test/elastickv/redis_zset_safety_workload_test.clj
covering test-spec construction, the no-op ZREM edge case, the :info
ZINCRBY skip, and a positive-control score-mismatch detection. The
checker tests bypass the timeline.html sub-checker (which writes to
the test store) by invoking zset-safety-checker directly.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 19, 2026

94be1bd で Copilot 指摘を fix:

  • :zincrby indeterminate false positive: pending/:info ZINCRBY の resulting score 不明分を :unknown-score? フラグで伝播。allowed-scores-for-member が concurrent に unknown ZINCRBY を含む場合、strict score check を skip。
  • :zrem no-op false positive: invoke! が既に返している removed? boolean を mutation record に thread。:removed? false の ZREM は state を変更しない (apply-mutation-to-state ヘルパーで分岐)。
  • model-before docstring: 実装に合わせて修正。
  • PR description: lein run -m elastickv.redis-zset-safety-workload 直接呼び出し方式を明記 (CI と整合)。
  • Workload unit test: jepsen/test/elastickv/redis_zset_safety_workload_test.clj 追加。test-spec 構築、no-op ZREM, :info ZINCRBY, positive-control score mismatch を deterministic に検証。timeline.html サブチェッカーは store を要求するので bypass し、zset-safety-checker を直接呼び出し。

lein test elastickv.redis-zset-safety-workload-test 全 5 件 PASS (Java 21 環境で確認)。

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…re-mismatch

CI failure on 94be1bd surfaced a remaining checker false positive: a
read that observes a member whose only prior ops are no-op ZREMs was
classified as :score-mismatch with :allowed #{} rather than treated as
a never-existed member. allowed-scores-for-member returned
any-known? = true because (seq concurrent) was truthy even when every
concurrent op was a :zrem that didn't actually remove anything.

Fix: any-known? (renamed internally to existence-evidence?) now only
counts concurrent mutations that provide evidence the member ever
existed -- :zadd, :zincrby, or a :zrem whose :removed? boolean is true.
A concurrent no-op ZREM contributes nothing.

Adds no-op-zrem-alone-does-not-false-positive as a regression test.
All 6 workload unit tests pass under Java 21.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 20, 2026

9bfcc13 で CI 失敗原因の checker false positive を追加 fix:

  • no-op ZREM のみで score-mismatch 誤検出: CI 結果 :allowed #{} の score-mismatch が発生。any-known?(seq concurrent) truthy を返していたため、concurrent が全部 no-op ZREM でも score 判定パスに入っていた。
  • 修正: existence-evidence? に名前を改め、concurrent mutation のうち そのメンバが存在したことの証拠 だけカウント (:zadd / :zincrby / :removed? true:zrem)。no-op ZREM は存在証拠にならない。
  • regression test 追加 (no-op-zrem-alone-does-not-false-positive)。unit test 計 6 件 PASS。

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
jepsen/test/elastickv/redis_zset_safety_workload_test.clj (1)

42-89: Add a direct :zrangebyscore checker regression.

The checker has a separate bounded-range path, but these edge-case tests only exercise :zrange-all. A small missing-member or out-of-range regression would protect the advertised ZRANGEBYSCORE property.

Example test to cover bounded-range completeness
+(deftest zrangebyscore-missing-member-is-detected
+  (let [history [{:type :invoke :process 0 :f :zadd :value ["m1" 5] :index 0}
+                 {:type :ok     :process 0 :f :zadd :value ["m1" 5] :index 1}
+                 {:type :invoke :process 0 :f :zrangebyscore :value [0 10] :index 2}
+                 {:type :ok     :process 0 :f :zrangebyscore
+                  :value {:bounds [0 10] :members []}
+                  :index 3}]
+        result  (run-checker history)]
+    (is (not (:valid? result)) (str "expected range mismatch, got: " result))))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj` around lines 42 -
89, Add a new test that exercises the bounded-range code path by invoking
:zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE logic is
covered; create a test (e.g. noop-zrem-does-not-flag-bounded-zrangebyscore or
similar) that mirrors one of the existing edge-case histories but uses
:zrangebyscore (with appropriate score bounds/values) and asserts run-checker
returns :valid? or not as expected, ensuring you reference the same run-checker
invocation and operation symbols (:zadd, :zrem, :zincrby, :zrangebyscore) so the
new test hits the bounded-range branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jepsen/src/elastickv/redis_zset_safety_workload.clj`:
- Around line 386-416: The read result must be checked for duplicate members to
prevent a ZSet read returning the same member twice; update check-zrange-all
(and the similar function handling ranges around lines 430-464) to detect
duplicate member entries before per-entry validation by scanning entries and
maintaining a seen-members set, and if a member is already seen swap! the errors
atom with a {:kind :duplicate :index cmp-idx :member member :entries entries}
(or similar) error entry so duplicate-member reads are reported and rejected.
- Around line 313-318: The current logic builds applied as (->> muts (filter ...
(< (:complete-idx %) read-inv-idx)) (sort-by :complete-idx)) and then reduces
via apply-mutation-to-state, which incorrectly linearizes overlapping :ok
mutations by completion time; change this to detect overlapping committed
mutations (using their :invoke-idx and :complete-idx intervals from muts) and do
not sort/serialize ambiguous pairs by :complete-idx. Instead compute either the
set of all possible latest states for the read by merging non-overlapping
mutations deterministically and treating overlapping/conflicting writes
conservatively (e.g., allow values from any write whose interval is not ordered
before the read), and update the applied construction and reduction via
apply-mutation-to-state to use that conservative/possible-states approach; apply
the same fix at the other occurrence around lines 347-352.

---

Nitpick comments:
In `@jepsen/test/elastickv/redis_zset_safety_workload_test.clj`:
- Around line 42-89: Add a new test that exercises the bounded-range code path
by invoking :zrangebyscore instead of :zrange-all so the checker’s ZRANGEBYSCORE
logic is covered; create a test (e.g.
noop-zrem-does-not-flag-bounded-zrangebyscore or similar) that mirrors one of
the existing edge-case histories but uses :zrangebyscore (with appropriate score
bounds/values) and asserts run-checker returns :valid? or not as expected,
ensuring you reference the same run-checker invocation and operation symbols
(:zadd, :zrem, :zincrby, :zrangebyscore) so the new test hits the bounded-range
branch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e12c0d39-43a2-4d57-a8a3-f49ade75baba

📥 Commits

Reviewing files that changed from the base of the PR and between 173fbbc and 9bfcc13.

📒 Files selected for processing (5)
  • .github/workflows/jepsen-test-scheduled.yml
  • .github/workflows/jepsen-test.yml
  • jepsen/src/elastickv/jepsen_test.clj
  • jepsen/src/elastickv/redis_zset_safety_workload.clj
  • jepsen/test/elastickv/redis_zset_safety_workload_test.clj

Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj
Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj
Comment thread jepsen/src/elastickv/redis_zset_safety_workload.clj
Three Major-severity CodeRabbit findings on PR #550:

1. Duplicate-member detection (line 416): a ZSet read must return each
   member at most once. Previously, if ZRANGE returned the same member
   twice with an allowed score, the checker accepted it because sort
   and score-membership checks passed independently per entry.
   duplicate-members helper now flags :duplicate-members (and
   :duplicate-members-range for ZRANGEBYSCORE) before the per-entry
   loop.

2. Overlapping committed writes (line 318): two :ok mutations whose
   invoke/complete windows overlap have ambiguous serialization order.
   Pinning allowed-scores to a single last-wins linearization by
   :complete-idx was unsound. allowed-scores-for-member now unions all
   :zadd/:ok-:zincrby scores from committed mutations (over-approx
   that stays sound), and must-be-present? is relaxed when any pair
   of committed writes for the same member overlaps in time.

3. Pre-read :info mutations (line 328): a mutation recorded as :info
   whose completion precedes a later read's invoke may have taken
   effect server-side. Previously it was ignored by both model-before
   (:ok only) and the concurrent window (complete-idx >= read-inv-idx
   required). Now collected as pre-read-info, contributing to allowed
   scores and flipping unknown-score? for :zincrby with unknown
   resulting score.

3 new regression tests (duplicate-members-are-flagged, overlapping-
committed-zadds-allow-either-score, info-before-read-is-considered-
uncertain). Workload unit test count now 9, all PASS under Java 21.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 20, 2026

6d0b4c3 で CodeRabbit Major 3 件を fix:

  • 重複メンバー: duplicate-members ヘルパーで ZRANGE / ZRANGEBYSCORE 結果の重複を :duplicate-members / :duplicate-members-range として明示検知。sort / score-membership の個別チェックでは検知できなかったケースを閉じた。
  • 重複する committed writes の線形化: :complete-idx 順の単一直列化は 2 つの :ok mutation の window が重なる場合に unsafe。allowed-scores-for-member を「committed の全 :zadd / :ok :zincrby score の union」に広げ (sound な over-approx)、must-be-present? は committed 間に重なりがあるときリラックス。
  • read 前完了の :info: pre-read-info として抽出し、allowed-scores と unknown-score? の判定に含めた。以前は model-before (:ok のみ) にも concurrent 窓 (complete-idx >= read-inv-idx 必須) にも入らず無視されていた。

regression test 3 件追加 (duplicate-members-are-flagged / overlapping-committed-zadds-allow-either-score / info-before-read-is-considered-uncertain)。unit test 計 9 件 PASS。

/gemini review

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented Apr 20, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new Jepsen workload and model-based checker to verify Redis ZSet safety properties in elastickv. It includes a client for Redis operations, a generator for high-contention workloads, and a checker that validates score correctness, ordering, and atomicity. Feedback focuses on handling infinite scores in Redis to prevent parsing exceptions, improving the soundness of the linearizability model in allowed-scores-for-member to prevent stale reads, and optimizing the efficiency of mutation checks.

Comment on lines +71 to +80
(defn- parse-withscores
"Carmine returns a flat [member score member score ...] vector for
ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)]
preserving server-returned order (score ascending, then member)."
[flat]
(->> flat
(partition 2)
(mapv (fn [[m s]]
[(if (bytes? m) (String. ^bytes m) (str m))
(Double/parseDouble (str s))]))))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of Double/parseDouble directly on Redis score strings will fail if the score is infinite. Redis returns "inf", "+inf", or "-inf" for infinite values, but Java's Double/parseDouble expects "Infinity" or "-Infinity". This will cause a NumberFormatException during the test if infinite scores are encountered.

(defn- parse-double-safe
  "Parses a Redis score string into a Double, handling 'inf' representations."
  [s]
  (let [s-str (str/lower-case (str s))]
    (cond
      (str/includes? s-str "inf") (if (str/starts-with? s-str "-")
                                     Double/NEGATIVE_INFINITY
                                     Double/POSITIVE_INFINITY)
      :else (Double/parseDouble (str s)))))

(defn- parse-withscores
  "Carmine returns a flat [member score member score ...] vector for
  ZRANGE WITHSCORES. Convert to a sorted vector of [member (double score)]
  preserving server-returned order (score ascending, then member)."
  [flat]
  (->> flat
       (partition 2)
       (mapv (fn [[m s]]
               [(if (bytes? m) (String. ^bytes m) (str m))
                (parse-double-safe s)]))))

(let [[member delta] (:value op)
new-score (car/wcar cs (car/zincrby zset-key (double delta) member))]
(assoc op :type :ok
:value [member (Double/parseDouble (str new-score))]))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As noted in the parse-withscores feedback, Double/parseDouble should be replaced with a safe parser that handles Redis's "inf" strings.

                   :value [member (parse-double-safe new-score)]]))

Comment on lines +330 to +418
(defn- allowed-scores-for-member
"Compute the set of scores considered valid for `member` by a read
whose window is [read-inv-idx, read-cmp-idx], based on committed state
and any concurrent mutations.

Returns:
:scores - set of acceptable scores (committed + concurrent
:zadd / :ok :zincrby).
:unknown-score? - true iff any concurrent ZINCRBY's resulting score
is unknown (in-flight or :info). When set, the
caller MUST skip the strict score-membership
check to stay sound.
:must-be-present? - committed state says present and no concurrent
mutation could have removed/changed it.
:any-known? - some op claims to have touched this member."
[mutations-by-m member read-inv-idx read-cmp-idx]
(let [muts (get mutations-by-m member [])
;; :ok mutations that completed strictly before the read. They
;; may have overlapped with each other in wall-clock time, so
;; the serialization order among them is ambiguous.
committed (->> muts
(filter #(and (= :ok (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; :info mutations that completed before the read: they may or
;; may not have taken effect server-side. We must account for
;; their possible scores just like concurrent ones.
pre-read-info (->> muts
(filter #(and (= :info (:type %))
(some? (:complete-idx %))
(< (:complete-idx %) read-inv-idx))))
;; Concurrent mutations: windows overlap the read. Include both
;; :ok and :info since either may have taken effect.
concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
;; A conservative last-wins linearization for the must-be-present?
;; check only. Ambiguous when committed writes overlap each other.
committed-sorted (sort-by :complete-idx committed)
committed-state (reduce apply-mutation-to-state nil committed-sorted)
committed-overlap? (boolean
(some (fn [[a b]]
(and (not (identical? a b))
(<= (:invoke-idx a) (:complete-idx b))
(<= (:invoke-idx b) (:complete-idx a))))
(for [a committed, b committed] [a b])))
;; Union of every score that any committed / pre-read :info /
;; concurrent op could have produced. This over-approximates the
;; legitimate post-state set when writes overlap, keeping the
;; checker sound at the cost of being slightly less strict on
;; overlapping concurrent writers.
add-scores (fn [acc m]
(case (:f m)
:zadd (conj acc (:score m))
:zincrby (cond-> acc (some? (:score m)) (conj (:score m)))
:zrem acc))
scores (as-> #{} s
(reduce add-scores s committed)
(reduce add-scores s pre-read-info)
(reduce add-scores s concurrent))
unknown-score? (or
(some #(and (= :zincrby (:f %)) (:unknown-score? %))
concurrent)
(some #(and (= :zincrby (:f %)) (:unknown-score? %))
pre-read-info))
;; any-known? must only be true when something provides evidence
;; the member actually existed at some point. A no-op ZREM
;; (:removed? false) does NOT prove existence.
existence-evidence? (or (some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
committed)
(some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
pre-read-info)
(some #(case (:f %)
(:zadd :zincrby) true
:zrem (:removed? %))
concurrent))]
{:scores scores
:unknown-score? (boolean unknown-score?)
;; must-be-present? is relaxed when committed writes overlap
;; among themselves or when any :info / concurrent mutation could
;; have removed the member before the read.
:must-be-present? (boolean (and committed-state
(:present? committed-state)
(not committed-overlap?)
(empty? pre-read-info)
(empty? concurrent)))
:any-known? (boolean existence-evidence?)}))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of allowed-scores-for-member is too permissive and potentially unsound for linearizability checks.

  1. Stale Reads: It includes all successful committed scores in the scores set (line 384), even if they were strictly followed by a later committed mutation. In a linearizable system, a read should only observe scores from the "potentially latest" committed writes (those not strictly followed by another) or concurrent ones.
  2. Stale Presence: The checker does not verify if a member is allowed to be present. If a member was added and then deleted (both committed before the read), any-known? remains true, and the read will accept the member even though it should be absent.
  3. Efficiency: The committed-overlap? check (line 368) is $O(N^2)$ relative to the number of mutations for a member. While the member universe is small, this can be optimized to $O(N)$ by tracking the maximum invoke-idx of preceding mutations.
(defn- allowed-scores-for-member
  "Compute the set of scores considered valid for `member` by a read
  whose window is [read-inv-idx, read-cmp-idx], based on committed state
  and any concurrent mutations."
  [mutations-by-m member read-inv-idx read-cmp-idx]
  (let [muts (get mutations-by-m member [])
        ;; Mutations that completed strictly before the read started.
        preceding (filter #(and (= :ok (:type %))
                                (some? (:complete-idx %))
                                (< (:complete-idx %) read-inv-idx))
                          muts)
        ;; A mutation is a candidate for the 'latest' state if no other
        ;; committed mutation strictly follows it (m2.invoke > m.complete).
        max-inv (reduce max -1 (map :invoke-idx preceding))
        candidates (filter #(>= (:complete-idx %) max-inv) preceding)

        concurrent (concurrent-mutations-for-member muts read-inv-idx read-cmp-idx)
        pre-read-info (filter #(and (= :info (:type %))
                                    (some? (:complete-idx %))
                                    (< (:complete-idx %) read-inv-idx))
                              muts)

        ;; Scores allowed by committed state or uncertain ops.
        scores (into (set (keep :score candidates))
                     (concat (keep :score concurrent)
                             (keep :score pre-read-info)))

        unknown-score? (some #(and (= :zincrby (:f %)) (:unknown-score? %))
                             (concat concurrent pre-read-info))

        ;; Allowed to be present if any candidate or concurrent/info op is a write.
        can-be-present? (or (some #(#{:zadd :zincrby} (:f %)) candidates)
                            (some #(#{:zadd :zincrby} (:f %)) concurrent)
                            (some #(#{:zadd :zincrby} (:f %)) pre-read-info))

        ;; Must be present if all candidates are writes and no concurrent/info op can delete.
        must-be-present? (and (seq candidates)
                              (every? #(#{:zadd :zincrby} (:f %)) candidates)
                              (not (some #(= :zrem (:f %)) concurrent))
                              (not (some #(= :zrem (:f %)) pre-read-info)))]
    {:scores scores
     :unknown-score? (boolean unknown-score?)
     :must-be-present? must-be-present?
     :can-be-present? can-be-present?}))

Comment on lines +450 to +457
(let [{:keys [scores any-known? unknown-score?]}
(allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
(cond
(not any-known?)
(swap! errors conj {:kind :phantom
:index cmp-idx
:member member
:score score})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The phantom check should be replaced with a more robust can-be-present? check to catch both phantoms (never existed) and stale reads (present after a committed deletion).

    (doseq [[member score] entries]
      (let [{:keys [scores can-be-present? unknown-score?]}
            (allowed-scores-for-member mutations-by-m member inv-idx cmp-idx)]
        (cond
          (not can-be-present?)
          (swap! errors conj {:kind :unexpected-presence
                              :index cmp-idx
                              :member member
                              :score score})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants