fix(bb/msm): stream-walker bucket_sums off-curve — exception-safe split-bucket combine#23740
Draft
AztecBot wants to merge 1 commit into
Draft
fix(bb/msm): stream-walker bucket_sums off-curve — exception-safe split-bucket combine#23740AztecBot wants to merge 1 commit into
AztecBot wants to merge 1 commit into
Conversation
…it-bucket combine
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a pre-existing correctness bug in the stream-walker MSM that returned
off-curve
bucket_sums(and therefore wrong/garbage MSM results) on hotbuckets. A previous session (PR #23726) diagnosed the symptom as a
"non-deterministic torn-write race on
bucket_sums". With a GPU-vs-@noble/curvescross-check + per-bucket readback under headless SwiftShader, the actual root
cause is now pinned down and fixed.
Base:
stream-walker-impl. No change tomsm_v2.tsorchestration — only theWGSL kernels (+ regenerated
_generated/shaders.ts) and a dev cross-checkharness.
Root cause (proven, not guessed)
The walker's per-bucket partials are all correct and on-curve — the bug is
entirely downstream, in
ba_walker_combine, which sums a bucket's partials withplain affine point addition (
dx = px − acc_x, then1/dx). That formuladivides by zero whenever a running prefix sum equals ±(the next partial):
P == acc→ a point doubling (needs the doubling slope3x²/2y);P == −acc→ an intermediate point-at-infinity.For a hot bucket (one bucket split across many tasks → dozens of partials),
the partials are walked in the CAS-insertion order of the per-bucket linked
list, which is non-deterministic across GPU runs. In a generic order at
least one prefix hits one of those exceptional cases, so the un-guarded affine
add produced off-curve garbage whose value changed run-to-run with the CAS
order — exactly the "non-deterministic torn write" the prior session observed.
On serial SwiftShader the bad order is fixed, so it reproduced deterministically;
the linked-list-order replay confirmed exactly one
dx==0per off-curvebucket (
LL_ORDER_DxZero=1, mostly the infinity case).A second, latent bug was also found and fixed:
partial_destis allocated forthe host-max thread count (
streamNumThreads, 8192) but only the dispatchedthreads initialise their slots; the host clears the buffer to 0, and the old
encoding read
0as bucket id 0, linking bogus(0,0)partials into bucket0's combine list. (It happened to land on window 0's zero-digit column, which the
reduce drops, so it was silent in the final result but corrupted
bucket_sums[0].)The fix (WGSL only, in-scope = the bucket_sums path)
ba_walker_combine— exception-safe (complete) affine accumulation: detectdx==0, branch to the doubling slope whenP==acc, and track an explicitidentity flag for the
P==−acc(infinity) case; a bucket that sums to identityis written as
(0,0)(the reduce already marks all-zero buckets not-present).ba_stream_walker+ba_walker_partials_index— makepartial_dest1-indexed (
bucket_id + 1,0 = empty) so the host's clear-to-0 means"empty" and over-allocated/un-dispatched slots can never be mistaken for bucket 0.
wgsl/_generated/shaders.ts(node src/msm_webgpu/scripts/inline-wgsl.mjs).Proof — repeated-green cross-check (headless SwiftShader, GPU vs @noble/curves)
dev/msm-webgpu/msm-correctness.*runs the fullMsmV2pipeline (incl. thestream-walker) and checks, per run: final MSM == noble, every
bucket_sumson-curve, every per-window sum on-curve. Run as a sweep over seeds × reps
(re-running re-traverses the CAS list, so any surviving non-determinism shows up).
bucket_sumson-curveBefore the fix the same harness returned off-curve results at every size
(logn 8–16) with the mismatch set changing run-to-run. Full logs:
https://gist.github.com/AztecBot/96a1697838df66bf688f51906fe8e814
Every
bucket_sumsvalue is on-curve (off=0) in every configuration tested— including the AP-points harness that originally exposed the bug and the
hot-bucket stress — and results are bit-identical across repeated runs (the
non-determinism is gone).
Scope note (out of scope: the shared affine reduce)
The same "affine add assumes no
dx==0" pattern also exists, by explicitdesign, in the shared
ba_reduce_level_benchkernel (its header: "Point-equality(P=±Q) handling is omitted — the algorithm assumes uniformly-random inputs with
no point collisions"). For realistic random/SRS-like points it never triggers
(100% green above). It can trigger only under deliberately structured inputs
(an arithmetic-progression point set, or ≤8 distinct buckets per window): in
those cases
bucket_sumsis still 100% correct (this PR), but the reduce canemit one off-curve window. That is a separate, pre-existing limitation of the
reduce (used by the V2 pair-tree path too), outside this PR's stream-walker
bucket_sums scope. It can be fixed with the same complete-addition pattern,
branchlessly, by selecting
2·y_das the batched denominator for doublingcandidates — happy to do that as a follow-up if wanted.