
fix(bb/msm): stream-walker bucket_sums off-curve — exception-safe split-bucket combine#23740

Draft
AztecBot wants to merge 1 commit into stream-walker-impl from cb/msm-walker-combine-correctness

@AztecBot AztecBot commented May 30, 2026

Summary

Fixes a pre-existing correctness bug in the stream-walker MSM that returned
off-curve bucket_sums (and therefore wrong/garbage MSM results) on hot
buckets. A previous session (PR #23726) diagnosed the symptom as a
"non-deterministic torn-write race on bucket_sums". With a GPU-vs-@noble/curves
cross-check + per-bucket readback under headless SwiftShader, the actual root
cause is now pinned down and fixed.

Base: stream-walker-impl. No change to msm_v2.ts orchestration — only the
WGSL kernels (+ regenerated _generated/shaders.ts) and a dev cross-check
harness.

Root cause (proven, not guessed)

The walker's per-bucket partials are all correct and on-curve — the bug is
entirely downstream, in ba_walker_combine, which sums a bucket's partials with
plain affine point addition (dx = px − acc_x, then 1/dx). That formula
divides by zero whenever a running prefix sum equals ±(the next partial):

  • P == acc → a point doubling (needs the doubling slope 3x²/(2y));
  • P == −acc → an intermediate point-at-infinity.

For a hot bucket (one bucket split across many tasks → dozens of partials),
the partials are walked in the CAS-insertion order of the per-bucket linked
list, which is non-deterministic across GPU runs. In a generic order at
least one prefix hits one of those exceptional cases, so the un-guarded affine
add produced off-curve garbage whose value changed run-to-run with the CAS
order: exactly the "non-deterministic torn write" the prior session observed.
On serial SwiftShader the bad order is fixed, so it reproduced deterministically;
the linked-list-order replay confirmed exactly one dx==0 per off-curve
bucket (LL_ORDER_DxZero=1, mostly the infinity case).
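
To make the failure mode concrete, here is a minimal TypeScript sketch of the two accumulation styles. It is not the WGSL kernel: it assumes a toy curve y² = x³ + 3 over F_101 so plain numbers suffice, and every name in it (addNaive, addComplete, combineBucket, curvePoints) is hypothetical. The toy inverse maps 0 to 0, which stands in for whatever a real field inversion returns on a zero input.

```typescript
// Toy curve y^2 = x^3 + 3 over F_101 (assumption, for illustration only).
const P_MOD = 101;
const CURVE_B = 3;

type Affine = { x: number; y: number };
type Point = Affine | null; // null models the point at infinity (identity)

const mod = (a: number): number => ((a % P_MOD) + P_MOD) % P_MOD;

// Fermat inverse a^(p-2); returns 0 for a == 0, i.e. garbage rather than a trap.
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P_MOD - 2;
  while (e > 0) {
    if (e & 1) result = mod(result * base);
    base = mod(base * base);
    e >>= 1;
  }
  return result;
}

function isOnCurve(pt: Point): boolean {
  if (pt === null) return true;
  return mod(pt.y * pt.y) === mod(pt.x * pt.x * pt.x + CURVE_B);
}

// Un-guarded chord addition: silently wrong whenever dx == 0.
function addNaive(acc: Affine, p: Affine): Affine {
  const slope = mod((p.y - acc.y) * inv(p.x - acc.x));
  const x3 = mod(slope * slope - acc.x - p.x);
  return { x: x3, y: mod(slope * (acc.x - x3) - acc.y) };
}

// Exception-safe addition: handles identity, P == acc and P == -acc.
function addComplete(acc: Point, p: Point): Point {
  if (acc === null) return p;
  if (p === null) return acc;
  const dx = mod(p.x - acc.x);
  let slope: number;
  if (dx === 0) {
    if (mod(p.y + acc.y) === 0) return null; // P == -acc: sum is the identity
    slope = mod(3 * acc.x * acc.x * inv(2 * acc.y)); // P == acc: tangent slope
  } else {
    slope = mod((p.y - acc.y) * inv(dx)); // generic chord slope
  }
  const x3 = mod(slope * slope - acc.x - p.x);
  return { x: x3, y: mod(slope * (acc.x - x3) - acc.y) };
}

// Sum a bucket's partials in whatever order they arrive.
function combineBucket(partials: Point[]): Point {
  return partials.reduce<Point>((acc, p) => addComplete(acc, p), null);
}

// Brute-force enumeration of the affine points of the toy curve.
function curvePoints(): Affine[] {
  const pts: Affine[] = [];
  for (let x = 0; x < P_MOD; x++) {
    for (let y = 0; y < P_MOD; y++) {
      if (isOnCurve({ x, y })) pts.push({ x, y });
    }
  }
  return pts;
}
```

With the complete version the combined value no longer depends on the order in which partials are visited, which is the property the non-deterministic CAS order needs.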

A second, latent bug was also found and fixed: partial_dest is allocated for
the host-max thread count (streamNumThreads, 8192) but only the dispatched
threads initialise their slots; the host clears the buffer to 0, and the old
encoding read 0 as bucket id 0, linking bogus (0,0) partials into bucket
0's combine list. (It happened to land on window 0's zero-digit column, which the
reduce drops, so it was silent in the final result but corrupted bucket_sums[0].)

The fix (WGSL only, in-scope = the bucket_sums path)

  1. ba_walker_combine — exception-safe (complete) affine accumulation: detect
    dx==0, branch to the doubling slope when P==acc, and track an explicit
    identity flag for the P==−acc (infinity) case; a bucket that sums to identity
    is written as (0,0) (the reduce already marks all-zero buckets not-present).
  2. ba_stream_walker + ba_walker_partials_index — make partial_dest
    1-indexed (bucket_id + 1, 0 = empty) so the host's clear-to-0 means
    "empty" and over-allocated/un-dispatched slots can never be mistaken for bucket 0.
  3. Regenerated wgsl/_generated/shaders.ts (node src/msm_webgpu/scripts/inline-wgsl.mjs).
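
The 1-indexed encoding in item 2 can be sketched on the host side as follows. This is only an illustration of the "bucket_id + 1, 0 = empty" convention described above; the function names and the simulated buffer are assumptions, not code from the PR.

```typescript
// Hypothetical helpers; only the "+1, 0 = empty" convention comes from the PR text.
const EMPTY_SLOT = 0;

// Writer side: store bucket_id + 1 so that a valid entry is never 0.
function encodeDest(bucketId: number): number {
  return bucketId + 1;
}

// Reader side: 0 means "no partial here"; anything else maps back to a bucket id.
function decodeDest(raw: number): number | null {
  return raw === EMPTY_SLOT ? null : raw - 1;
}

// Simulate a buffer sized for the maximum thread count where only the first
// few (dispatched) threads write their slot; the rest stay cleared to 0.
function simulateWalker(allocated: number, dispatchedBuckets: number[]): Uint32Array {
  const partialDest = new Uint32Array(allocated); // zero-initialised, like the host clear
  dispatchedBuckets.forEach((bucketId, tid) => {
    partialDest[tid] = encodeDest(bucketId);
  });
  return partialDest;
}

// Collect the bucket ids that should be linked into combine lists.
function bucketsToCombine(partialDest: Uint32Array): number[] {
  const out: number[] = [];
  for (let i = 0; i < partialDest.length; i++) {
    const bucketId = decodeDest(partialDest[i]);
    if (bucketId !== null) out.push(bucketId);
  }
  return out;
}
```

For example, with 8 allocated slots and 3 dispatched threads writing buckets 0, 5 and 0, the reader sees exactly those three entries. Under a 0-indexed encoding the five untouched slots would also have decoded as bucket 0.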

Proof — repeated-green cross-check (headless SwiftShader, GPU vs @noble/curves)

dev/msm-webgpu/msm-correctness.* runs the full MsmV2 pipeline (incl. the
stream-walker) and checks, per run: final MSM == noble, every bucket_sums
on-curve, every per-window sum on-curve. Run as a sweep over seeds × reps
(re-running re-traverses the CAS list, so any surviving non-determinism shows up).

| Input distribution | Configs | bucket_sums on-curve | full MSM == noble |
| --- | --- | --- | --- |
| Realistic random (Pᵢ = rᵢ·G) | 8 seeds × logn{8,10} × 3 reps = 48 | 48/48 | 48/48 |
| Hot buckets (64-scalar pool, ~16–64 pts/bucket) | 3 seeds × logn{10,12} × 2 reps = 12 | 12/12 | 12/12 |
| AP points (the harness that originally exposed the bug) | 8 seeds × logn{8,10} × 2 reps = 32 | 32/32 | 30/32 (2 hit the reduce, see scope note) |

Before the fix the same harness returned off-curve results at every size
(logn 8–16) with the mismatch set changing run-to-run. Full logs:
https://gist.github.com/AztecBot/96a1697838df66bf688f51906fe8e814

Every bucket_sums value is on-curve (off=0) in every configuration tested
— including the AP-points harness that originally exposed the bug and the
hot-bucket stress — and results are bit-identical across repeated runs (the
non-determinism is gone).

Scope note (out of scope: the shared affine reduce)

The same "affine add assumes no dx==0" pattern also exists, by explicit
design, in the shared ba_reduce_level_bench kernel (its header: "Point-equality
(P=±Q) handling is omitted — the algorithm assumes uniformly-random inputs with
no point collisions"). For realistic random/SRS-like points it never triggers
(100% green above). It can trigger only under deliberately structured inputs
(an arithmetic-progression point set, or ≤8 distinct buckets per window): in
those cases bucket_sums is still 100% correct (this PR), but the reduce can
emit one off-curve window. That is a separate, pre-existing limitation of the
reduce (used by the V2 pair-tree path too), outside this PR's stream-walker
bucket_sums scope. It can be fixed with the same complete-addition pattern,
branchlessly, by selecting 2·y_d as the batched denominator for doubling
candidates — happy to do that as a follow-up if wanted.
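
As a rough sketch of what that follow-up could look like (not code from this PR or from ba_reduce_level_bench), the snippet below batches pairwise affine additions with a single inversion via Montgomery's trick and selects 2·y as the denominator for doubling candidates. The toy field F_101, the curve y² = x³ + 3, and all function names are assumptions; the ternaries stand in for the branchless selects a shader would use.

```typescript
// Toy field F_101, curve y^2 = x^3 + 3 (assumed for illustration only).
const Q_MOD = 101;
type Aff = { x: number; y: number };

const fmod = (a: number): number => ((a % Q_MOD) + Q_MOD) % Q_MOD;

// Single field inversion via Fermat's little theorem.
function finv(a: number): number {
  let result = 1;
  let base = fmod(a);
  let e = Q_MOD - 2;
  while (e > 0) {
    if (e & 1) result = fmod(result * base);
    base = fmod(base * base);
    e >>= 1;
  }
  return result;
}

// Montgomery's trick: invert many non-zero values with one finv call.
function batchInverse(ds: number[]): number[] {
  const prefix: number[] = [];
  let acc = 1;
  for (const d of ds) {
    prefix.push(acc); // product of all earlier denominators
    acc = fmod(acc * d);
  }
  let invAcc = finv(acc);
  const out = new Array<number>(ds.length);
  for (let i = ds.length - 1; i >= 0; i--) {
    out[i] = fmod(invAcc * prefix[i]);
    invAcc = fmod(invAcc * ds[i]);
  }
  return out;
}

// Add each pair (p, q); null marks a pair that sums to the identity.
function batchAddPairs(pairs: [Aff, Aff][]): (Aff | null)[] {
  const nums: number[] = [];
  const dens: number[] = [];
  const isInf: boolean[] = [];
  for (const [p, q] of pairs) {
    const dx = fmod(q.x - p.x);
    const sameX = dx === 0;
    const opposite = sameX && fmod(p.y + q.y) === 0;
    isInf.push(opposite);
    // Select tangent numerator/denominator for doubling candidates.
    nums.push(sameX ? fmod(3 * p.x * p.x) : fmod(q.y - p.y));
    // Placeholder 1 keeps the batch product non-zero for identity results.
    dens.push(opposite ? 1 : sameX ? fmod(2 * p.y) : dx);
  }
  const invs = batchInverse(dens);
  return pairs.map(([p, q], i) => {
    if (isInf[i]) return null;
    const s = fmod(nums[i] * invs[i]);
    const x3 = fmod(s * s - p.x - q.x);
    return { x: x3, y: fmod(s * (p.x - x3) - p.y) };
  });
}
```

The identity case still needs its own flag, as in the ba_walker_combine fix, because no choice of denominator turns P + (−P) into an affine point.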

AztecBot added the claudebox label May 30, 2026