perf(bb/msm): sweepable stream-walker inversion-amortization knobs (S/TPB/NumThreads) #23735

Draft
AztecBot wants to merge 3 commits into stream-walker-impl from cb/walker-inv-amortize-S
AztecBot commented May 30, 2026

Axis A (speed): cut the stream-walker's field-inversion share

Status: knob infrastructure + GPU-only test harness landed and SwiftShader-correctness-proven. Real-hardware BrowserStack timing pending a free seat (account has been 2/2 busy throughout this session).

Rationale (from PR #23732, real Apple M2 hardware)

The safegcd field inversion (fr_inv_by_loop_pk, Bernstein-Yang) is ~47% of the ba_stream_walker accumulate kernel. The walker does exactly one inversion per batch of S affine adds (Montgomery's trick), so the inversion's per-add share is |inv|/S. Raising S directly amortizes it:

| S | inversion share (model, from 47% @ S=8) |
| --- | --- |
| 8 (current) | ~47% |
| 16 | ~31% |
| 32 | ~18% |
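
The figures in the table are consistent with a simple two-term cost model, sketched below. This is an assumption made for illustration, not code from the PR: each batch is taken to cost one inversion plus S units of non-inversion per-add work, calibrated so that the inversion share is 47% at S=8. The name `inversionShare` and the constants are hypothetical.

```typescript
// Hypothetical model: batch cost = one inversion + S units of non-inversion add work.
// Calibrated so the inversion share is 47% at S = 8 (the figure quoted from PR #23732).
const BASE_S = 8;
const BASE_SHARE = 0.47;

// Inversion cost, expressed in units of one add's non-inversion work.
const INV_COST = (BASE_SHARE * BASE_S) / (1 - BASE_SHARE);

// Fraction of a batch's time spent in the single inversion, for batch size s.
function inversionShare(s: number): number {
  return INV_COST / (INV_COST + s);
}

for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ~${Math.round(inversionShare(s) * 100)}%`);
}
```

Under this model the share falls more slowly than 1/S, because the non-inversion work per batch grows with S while the inversion cost stays fixed.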

Prior threads chasing memory-load wins (dx-cache) got only ~1–2% because the kernel is compute-bound on the inversion. Amortizing the inversion is the high-leverage lever.

What this change does

STREAM_S (=8), WALKER_TPB (=64) and STREAM_T (NUM_THREADS, =8192) were hardcoded at the MsmV2.create() call site even though shader_manager, ensureScratch, the planner (cumsum/partition_task), the walker, and walker_combine are already fully parameterized by them. This PR un-hardcodes them into MsmConfig:

  • walkerS — one inversion per S adds (default 8)
  • walkerTpb — walker workgroup size (default 64)
  • walkerNumThreads — walker NUM_THREADS (default 8192)

exposed as dev-page URL knobs (?walkers=16&walkertpb=32&walkert=4096). No shader/buffer/bind-layout changes — {} config is byte-for-byte the old behaviour.
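
As a rough sketch of how the URL knobs could map onto the three config fields: the query parameter names (`walkers`, `walkertpb`, `walkert`) and field names (`walkerS`, `walkerTpb`, `walkerNumThreads`) are taken from the description above, while the `WalkerKnobs` interface and the `parseWalkerKnobs` and `parsePositiveInt` helpers are hypothetical and not necessarily how the dev page does it. Unset or invalid knobs are simply omitted, so an empty query yields `{}` and the defaults apply.

```typescript
// Hypothetical subset of MsmConfig covering only the three walker knobs.
interface WalkerKnobs {
  walkerS?: number;
  walkerTpb?: number;
  walkerNumThreads?: number;
}

// Accept only positive integers; anything else is treated as "not set".
function parsePositiveInt(raw: string | null): number | undefined {
  if (raw === null) return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : undefined;
}

// Build a knob object from a query string such as "?walkers=16&walkertpb=32".
function parseWalkerKnobs(query: string): WalkerKnobs {
  const params = new URLSearchParams(query);
  const knobs: WalkerKnobs = {};
  const s = parsePositiveInt(params.get("walkers"));
  const tpb = parsePositiveInt(params.get("walkertpb"));
  const t = parsePositiveInt(params.get("walkert"));
  if (s !== undefined) knobs.walkerS = s;
  if (tpb !== undefined) knobs.walkerTpb = tpb;
  if (t !== undefined) knobs.walkerNumThreads = t;
  return knobs;
}

console.log(parseWalkerKnobs("?walkers=16&walkertpb=32&walkert=4096"));
```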

A new GPU-only autorun msm-walker-bench runs MsmV2 without the (stubbed) bb.js WASM: a @noble/curves correctness check at a small size, then timed reps at the bench size, honoring the walker knobs and ?warmup=N. run-browserstack.mjs gains --query to forward the sweep knobs.
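
A minimal sketch of the summary step for the timed reps, assuming the harness collects one wall-time sample per rep in milliseconds. The helper names `median` and `formatMedian` are hypothetical. A correctness-only run with zero reps has no samples, so the formatter falls back to a placeholder instead of printing NaN.

```typescript
// Median of a list of timings; returns NaN when there are no samples.
function median(samples: number[]): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Format the median for a summary line, guarding the zero-rep case.
function formatMedian(samples: number[]): string {
  const m = median(samples);
  return Number.isNaN(m) ? "n/a" : `${m.toFixed(2)} ms`;
}

console.log(formatMedian([12.5, 11.0, 13.25]));
console.log(formatMedian([]));
```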

Why this avoids a memory regression

  • Workgroup memory: pref_scratch = TPB·S·2 vec4. Pairing larger S with smaller walkerTpb keeps it ≤16 KB (Mali) / ≤32 KB (Apple/Adreno). S=16/TPB=32 = same 16 KB as S=8/TPB=64.
  • Device memory: walkerPartials (+ linked-list node buffers) scale as NUM_THREADS·S. Lowering walkerNumThreads ∝ 1/S holds NUM_THREADS·S — and peak GPU footprint — constant.
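
The two budgets above can be checked with a few lines of arithmetic. This sketch assumes a vec4 occupies 16 bytes (four 32-bit lanes), which matches the 16 KB quoted for S=8/TPB=64. The helper names and the baseline constants are illustrative, not identifiers from the codebase.

```typescript
// Assumption: one vec4 is four 32-bit lanes, i.e. 16 bytes.
const VEC4_BYTES = 16;

// Workgroup memory for pref_scratch = TPB * S * 2 vec4.
function prefScratchBytes(tpb: number, s: number): number {
  return tpb * s * 2 * VEC4_BYTES;
}

// Device-side partials scale as NUM_THREADS * S; report relative to the
// prior default (NUM_THREADS = 8192, S = 8).
function partialsScale(numThreads: number, s: number): number {
  return (numThreads * s) / (8192 * 8);
}

const configs = [
  { s: 8, tpb: 64, t: 8192 },  // prior default
  { s: 16, tpb: 32, t: 8192 }, // same workgroup memory, larger partials
  { s: 16, tpb: 64, t: 4096 }, // same partials, larger workgroup memory
];

for (const c of configs) {
  const kb = prefScratchBytes(c.tpb, c.s) / 1024;
  console.log(`S=${c.s} TPB=${c.tpb} T=${c.t}: ${kb} KB, partials ${partialsScale(c.t, c.s)}x`);
}
```

Holding both budgets at baseline simultaneously would require halving TPB and NUM_THREADS together when S doubles.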

Correctness (local, SwiftShader headless + @noble/curves, logn=10)

| config | pref_scratch | device partials (∝ T·S) | noble |
| --- | --- | --- | --- |
| S=8 / TPB=64 / T=8192 (baseline = prior default) | 16 KB | | green (prior) |
| S=16 / TPB=32 / T=8192 | 16 KB (unchanged) | | MATCH ✓ |
| S=16 / T=4096 (memory-neutral) | 32 KB | 1× (unchanged) | (running) |

The S=16/TPB=32 walker compiled, bound, dispatched and produced a noble-correct result with no WGSL/validation errors — the parameterization is structurally sound.

Remaining for "done" (strict bar)

Real-hardware BrowserStack timing of S=8 baseline vs S=16 on ≥1 Apple, ≥1 Adreno, ≥1 Mali, showing a significant time win with no memory regression. The harness (msm-walker-bench + run-browserstack.mjs --query) is ready; gated only on a free BrowserStack seat (2 shared seats, busy all session). Apple gives per-kernel stream_walker timestamps; Adreno/Mali use GPU wall time (Android Chrome lacks timestamp-query).

🤖 Generated with Claude Code

AztecBot added the ci-barretenberg (Run all barretenberg/cpp checks.) and claudebox (Owned by claudebox. it can push to this PR.) labels May 30, 2026
AztecBot added 2 commits May 30, 2026 02:44
…-browserstack --query passthrough

WASM oracle is a stub, so msm-bench/msm-cross-check (which click Run -> boot WASM) can't benchmark on a clean device. msm-walker-bench runs MsmV2 GPU-only: a noble correctness check at a small size then timed reps at the bench size, honoring the walker S/TPB/NumThreads knobs from the URL, posting median wall + per-phase GPU times. run-browserstack gains --query to forward the sweep knobs to the page URL.
… NaN)

Correctness-only runs use reps=0 (the noble check is the signal); guard the
median formatting so the summary line stays clean.