perf(bb/msm): stream-walker accumulate — cache forward-pass dx (drop inverse-pass point reloads) #23732
Draft
AztecBot wants to merge 3 commits into
… logn=8 Adds a msm-noble-check autorun mode to the msm-webgpu dev page that cross-checks the WebGPU result against the in-process noble (JS bigint) reference MSM, with no dependency on the bb WASM-MT oracle (which needs cross-origin isolation and a built barretenberg-threads.wasm). This lets the walker be correctness-gated under a plain SwiftShader headless browser at small logn. Lowers LOGN_MIN 10 -> 8 so logn=8 is reachable.
… reloads The stream-walker accumulate kernel's batched-inversion inner loop recomputed each slot's dx = (p_rx - p_lx) in the inverse pass purely to advance the running inverse, re-loading both x-coordinates through the l0_index -> point_x double indirection and re-running fr_sub for every active slot, every iteration. The forward prefix pass already computes exactly those dx values. Cache the S forward-pass dx values in a private array and reuse them in the inverse pass. This removes, per inner-loop iteration, (S-1) x (2 load_pt_x + 1 fr_sub_f8) — i.e. up to 14 storage-buffer point reads and 7 field subtractions at S=8 — trading them for S private dx slots. Identical arithmetic; cross-check GREEN vs noble at logn=8 and logn=10 under SwiftShader.
Times the WebGPU MSM and reports the per-pass GPU breakdown — including the stream_walker accumulate kernel — plus the pool's allocated GPU bytes, with no WASM-MT oracle. Runs on any WebGPU device behind BrowserStack without a built bb wasm. Posts /progress rows so the BrowserStack driver can detect the runId and track liveness. Use &profile=1 for timestamp-query per-pass timings.
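A minimal sketch of how a bench page might choose between per-pass and wall-clock timing, given that `timestamp-query` is an optional WebGPU feature. The names `FeatureSetLike`, `profileRequested`, and `pickTimingMode` are hypothetical and not taken from the dev page; only the `profile=1` flag and the `timestamp-query` feature name come from the commit message above.

```typescript
// Minimal structural type so the sketch runs outside a browser. In a browser,
// GPUAdapter.features (a set-like GPUSupportedFeatures) satisfies it.
interface FeatureSetLike {
  has(name: string): boolean;
}

type TimingMode = "per-pass" | "wall";

// Hypothetical helper: detects the `profile=1` query flag in a search string
// such as location.search. How the real page parses it is an assumption.
function profileRequested(search: string): boolean {
  return /(?:^|[?&])profile=1(?:&|$)/.test(search);
}

// Per-pass timings need both the opt-in flag and device support;
// otherwise fall back to wall-clock timing.
function pickTimingMode(features: FeatureSetLike, search: string): TimingMode {
  return profileRequested(search) && features.has("timestamp-query")
    ? "per-pass"
    : "wall";
}

// Browser usage (sketch, not executed here):
//   const adapter = await navigator.gpu.requestAdapter();
//   if (adapter && pickTimingMode(adapter.features, location.search) === "per-pass") {
//     const device = await adapter.requestDevice({
//       requiredFeatures: ["timestamp-query"],
//     });
//   }
```

Falling back to wall time instead of failing keeps the same page usable on devices whose browser does not expose the feature.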
This was referenced May 30, 2026
perf(bb/msm): stream-walker — amortize the field inversion via large S + private pref_scratch
#23736
Draft
Goal
Make the memory-light stream-walker BN254 MSM accumulator faster, focused first on the ~28 ms accumulate kernel, without regressing its memory footprint or correctness. Approach: small, individually validated changes, each its own commit. Base branch: `stream-walker-impl`.

Change 1: cache forward-pass `dx`, drop inverse-pass point reloads (`ba_stream_walker`)

The batched-inversion inner loop has three phases per iteration over the `S=8` slots: forward prefix product, one safegcd inversion, and inverse pass + backward affine-add peel.

The inverse pass recomputed each slot's `dx = (p_rx − p_lx)` purely to advance the running inverse (`inv = inv·dx_k`). That recompute re-loaded both x-coordinates through the `l0_index → point_x` double indirection and re-ran `fr_sub_f8` for every active slot, every iteration, although the forward pass already computed exactly those `dx` values one phase earlier.

This change caches the `S` forward-pass `dx` values in a private array and reuses them.

Removed per iteration (S=8): `(S−1)·(2 load_pt_x + 1 fr_sub_f8)` = up to 14 storage-buffer point reads + 7 field subtractions, traded for `S` private `dx` slots. Arithmetic is bit-for-bit identical. The `l0_index → point_x` chain is a dependent (latency-bound) load on mobile GPUs, so cutting it from the inner loop is the target.

Memory budget unchanged. `dx_slot` is per-thread private/register state: not workgroup-shared (the 16 KB Mali / 32 KB Apple·Adreno `pref_scratch` cap is untouched) and not a device buffer (the ≤100 MB budget is untouched). Confirmed: pool GPU bytes are identical baseline vs optimized on every device measured.

Change 0: COI-free harness (dev page)

The local box has no GPU, and the bb WASM-MT oracle is unavailable (it needs cross-origin isolation + a built `barretenberg-threads.wasm`). Added two autorun modes that need neither:

- `msm-noble-check`: cross-checks WebGPU vs the in-process noble (JS bigint) reference MSM. Correctness gate under headless SwiftShader.
- `msm-bench-gpu`: times the WebGPU MSM, reporting the per-pass GPU breakdown (incl. the `stream_walker` accumulate kernel via `timestamp-query`) + pool GPU bytes. No WASM, runs on any WebGPU device behind BrowserStack.

Lowered dev `LOGN_MIN` 10 → 8 so logn=8 is reachable.

Correctness: VERIFIED ✅
SwiftShader (software WebGPU, headless Chromium), WebGPU walker vs noble reference, both baseline and optimized:
BrowserStack timings: real devices, logn=17, 5 reps (`msm-bench-gpu`)

Apple M2 · macOS Sequoia · Chrome 148: per-kernel A/B ✅

[Table: `stream_walker` (accumulate) median; `stream_walker` mean (5 reps)]

Small but consistent: every optimized rep landed below baseline's max, with lower variance (opt 66.2–67.0 ms vs baseline 66.1–68.0 ms). Modest on Apple's bandwidth-rich TBDR GPU, where the removed point-loads were largely L1/L2 cache hits and the safegcd inversion dominates the kernel.
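For reference, one way to reduce a handful of reps per configuration to median, mean, and range figures, and to check the "every optimized rep below baseline's max" condition, is sketched below. The helper names (`summarize`, `compareAB`) are hypothetical and not part of the bench page, and the values in the usage lines are synthetic placeholders, not measurements from this PR.

```typescript
// Hypothetical helpers for summarizing per-rep timings in ms; not from the PR.
interface RepSummary {
  median: number;
  mean: number;
  min: number;
  max: number;
}

function summarize(reps: number[]): RepSummary {
  if (reps.length === 0) throw new Error("need at least one rep");
  const s = [...reps].sort((a, b) => a - b); // sort a copy, numerically
  const mid = Math.floor(s.length / 2);
  const median = s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
  const mean = s.reduce((acc, x) => acc + x, 0) / s.length;
  return { median, mean, min: s[0], max: s[s.length - 1] };
}

interface ABResult {
  medianDeltaPct: number; // negative means the optimized build is faster
  meanDeltaPct: number;
  allOptBelowBaselineMax: boolean;
}

function compareAB(baselineReps: number[], optimizedReps: number[]): ABResult {
  const b = summarize(baselineReps);
  const o = summarize(optimizedReps);
  return {
    medianDeltaPct: (100 * (o.median - b.median)) / b.median,
    meanDeltaPct: (100 * (o.mean - b.mean)) / b.mean,
    allOptBelowBaselineMax: optimizedReps.every((x) => x < b.max),
  };
}

// Usage with synthetic placeholder values (NOT measurements from this PR):
const demo = compareAB([100, 102, 104], [99, 100, 101]);
console.log(demo.medianDeltaPct.toFixed(2), demo.allOptBelowBaselineMax);
```

With only five reps, the median is less sensitive to a single slow rep than the mean, which is one reason to report both.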
Adreno (Snapdragon 8 Elite) · Galaxy S25 Ultra · Android 15 · Chrome 145: wall A/B

Same direction as Apple, slightly larger on this bandwidth-limited mobile GPU. Caveat: Android Chrome does not expose the `timestamp-query` feature, so per-pass isolation is unavailable: only wall time, whose ~±5% rep-to-rep variance is too coarse to tightly resolve a single-kernel change. The consistent mean/median direction is suggestive, not conclusive.

Mali (Pixel 9 Pro XL): not separately A/B'd
Same Android-Chrome `timestamp-query` limitation (no per-kernel isolation), and the device was occupied by other agents throughout the window (the shared 2-seat BrowserStack pool). Memory-neutrality holds by construction (no device buffer added). A wall A/B can be run later when a Mali seat frees.

Summary
A small, safe, memory-neutral first step on the accumulate kernel: ~1–2% faster on both Apple and Adreno, identical pool bytes, correctness GREEN at logn=8/10. The win is bounded on Apple because the safegcd inversion (~47% of the kernel) dominates and Apple's caches hide the redundant reads; the larger relative move on Adreno fits the "dependent-load on a bandwidth-limited GPU" thesis. Next levers (separate, individually-benchmarked commits): a forward-pass dead-store removal, idle-anchor point hoisting, and a GPU partials-reduction path.
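The cached-`dx` idea from Change 1 can be illustrated on the CPU with a minimal TypeScript/BigInt sketch. It is not the WGSL kernel: the slot layout, the names `batchInvertDx`, `loadPtX`, and `dxSlot`, and the Fermat inversion (standing in for safegcd) are all assumptions made for illustration, and it assumes every `dx` is nonzero. The counters only make the removed work visible.

```typescript
// Assumed field: the BN254 base-field prime (the sketch works for any prime).
const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;

const mod = (a: bigint): bigint => ((a % P) + P) % P;
const sub = (a: bigint, b: bigint): bigint => mod(a - b);
const mul = (a: bigint, b: bigint): bigint => mod(a * b);

// One modular inversion via Fermat's little theorem; a simple stand-in for
// the kernel's safegcd inversion, which this sketch does not reproduce.
function inv(a: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  let e = P - 2n;
  while (e > 0n) {
    if ((e & 1n) === 1n) result = mul(result, base);
    base = mul(base, base);
    e >>= 1n;
  }
  return result;
}

interface BatchResult { invDx: bigint[]; loads: number; subs: number }

// Slot k pairs the points at l0Index[2k] (left) and l0Index[2k+1] (right).
// This layout is hypothetical; it only mimics the double indirection.
function batchInvertDx(pointX: bigint[], l0Index: number[], S: number, cacheDx: boolean): BatchResult {
  let loads = 0;
  let subs = 0;
  const loadPtX = (j: number): bigint => { loads++; return pointX[l0Index[j]]; };
  const computeDx = (k: number): bigint => { subs++; return sub(loadPtX(2 * k + 1), loadPtX(2 * k)); };

  // Forward pass: prefix[k] = dx_0 * ... * dx_k; optionally remember each dx_k.
  const prefix: bigint[] = [];
  const dxSlot: bigint[] = [];
  let acc = 1n;
  for (let k = 0; k < S; k++) {
    const dx = computeDx(k);
    if (cacheDx) dxSlot.push(dx);
    acc = mul(acc, dx);
    prefix.push(acc);
  }

  // Single inversion of the full product.
  let running = inv(acc);

  // Inverse pass: peel 1/dx_k from the back, then advance running by dx_k.
  const invDx: bigint[] = new Array<bigint>(S).fill(0n);
  for (let k = S - 1; k >= 0; k--) {
    invDx[k] = k > 0 ? mul(running, prefix[k - 1]) : running;
    if (k > 0) running = mul(running, cacheDx ? dxSlot[k] : computeDx(k));
  }
  return { invDx, loads, subs };
}

// Demo input: 16 distinct x-coordinates reached through a permuted index table.
const S = 8;
const pointX = Array.from({ length: 2 * S }, (_, j) => BigInt(3 * j + 5));
const l0Index = Array.from({ length: 2 * S }, (_, j) => (5 * j) % (2 * S));
const baseline = batchInvertDx(pointX, l0Index, S, false);
const cached = batchInvertDx(pointX, l0Index, S, true);
console.log(baseline.loads - cached.loads, baseline.subs - cached.subs); // 14 7
```

Both variants return the same inverses; only the number of loads and subtractions differs, matching the `(S−1)·(2 load_pt_x + 1 fr_sub_f8)` count at `S=8`.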
Commits
test(bb/msm): COI-free noble cross-check autorun + dev floor → logn=8
perf(bb/msm): walker — cache forward-pass dx, drop inverse-pass point reloads
test(bb/msm): COI-free GPU-only bench autorun (msm-bench-gpu)