perf(bb/msm): stream-walker accumulate — cache forward-pass dx (drop inverse-pass point reloads)#23732

Draft
AztecBot wants to merge 3 commits into stream-walker-impl from cb/msm-walker-dx-cache

AztecBot commented May 30, 2026

Goal

Make the memory-light stream-walker BN254 MSM accumulator faster — focused first on the ~28 ms accumulate kernel — without regressing its memory footprint or correctness. Approach: small, individually-validated changes, each its own commit. Base branch: stream-walker-impl.

Change 1 — cache forward-pass dx, drop inverse-pass point reloads (ba_stream_walker)

The batched-inversion inner loop has three phases per iteration over the S=8 slots: forward prefix product, one safegcd inversion, inverse pass + backward affine-add peel.

The inverse pass recomputed each slot's dx = (p_rx − p_lx) purely to advance the running inverse (inv = inv·dx_k). That recompute re-loaded both x-coordinates through the l0_index → point_x double indirection and re-ran fr_sub_f8 for every active slot, every iteration — although the forward pass already computed exactly those dx one phase earlier.

This change caches the S forward-pass dx values in a private array (dx_slot) and reuses them in the inverse pass.

Removed per iteration (S=8): (S−1)·(2 load_pt_x + 1 fr_sub_f8) = up to 14 storage-buffer point reads + 7 field subtractions, traded for S private dx slots. Arithmetic is bit-for-bit identical. The l0_index → point_x chain is a dependent (latency-bound) load on mobile GPUs, so cutting it from the inner loop is the target.
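
For reviewers, the restructuring can be sketched in Python over plain integers. This is an illustration only, not the shader: `load_pt_x` and `fr_sub` borrow the names used above, but their signatures, the `Counters` helper, the test data, the choice of the BN254 base-field modulus, and `pow(x, -1, P)` standing in for the safegcd inversion are all assumptions made for the sketch. It also leaves out the affine-add peel and active-slot masking.

```python
# Assumed field: BN254 base-field modulus (the shader's limb layout is not modelled).
P = 21888242871839275222246405745257275088696311157297823662689037894645226208583

class Counters:
    """Hypothetical tally of storage loads and field subtractions."""
    def __init__(self):
        self.loads = 0
        self.subs = 0

def load_pt_x(point_x, l0_index, ref, c):
    # Models the l0_index -> point_x double indirection as one counted load.
    c.loads += 1
    return point_x[l0_index[ref]]

def fr_sub(a, b, c):
    c.subs += 1
    return (a - b) % P

def batch_inverse(point_x, l0_index, pairs, cache_dx):
    """Invert dx_k = x_right - x_left for every slot with a single inversion."""
    c = Counters()
    prefix, dx_slot = [], []
    acc = 1
    # Forward pass: prefix[k] = dx_0 * ... * dx_{k-1}; dx_slot[k] is the cached dx.
    for left, right in pairs:
        dx = fr_sub(load_pt_x(point_x, l0_index, right, c),
                    load_pt_x(point_x, l0_index, left, c), c)
        dx_slot.append(dx)
        prefix.append(acc)
        acc = acc * dx % P
    inv = pow(acc, -1, P)  # stand-in for the one safegcd inversion
    out = [0] * len(pairs)
    # Inverse pass: peel one slot at a time, then advance inv by that slot's dx.
    for k in reversed(range(len(pairs))):
        out[k] = inv * prefix[k] % P
        if k > 0:  # slot 0 never needs to advance the running inverse
            if cache_dx:
                dx = dx_slot[k]  # optimized: reuse the forward-pass value
            else:
                left, right = pairs[k]  # baseline: reload both x and re-subtract
                dx = fr_sub(load_pt_x(point_x, l0_index, right, c),
                            load_pt_x(point_x, l0_index, left, c), c)
            inv = inv * dx % P
    return out, c

S = 8
point_x = [(i * i + 7) % P for i in range(2 * S)]   # distinct x-coordinates
l0_index = list(reversed(range(2 * S)))             # arbitrary indirection table
pairs = [(2 * k, 2 * k + 1) for k in range(S)]      # (left, right) refs per slot

inv_base, c_base = batch_inverse(point_x, l0_index, pairs, cache_dx=False)
inv_opt, c_opt = batch_inverse(point_x, l0_index, pairs, cache_dx=True)
print(c_base.loads - c_opt.loads, c_base.subs - c_opt.subs)
```

With S=8 the cached variant issues 14 fewer loads and 7 fewer subtractions while returning the same inverses, matching the tally above.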

Memory budget unchanged. dx_slot is per-thread private/register state — not workgroup-shared (the 16 KB Mali / 32 KB Apple·Adreno pref_scratch cap is untouched) and not a device buffer (≤100 MB budget untouched). Confirmed: pool GPU bytes are identical baseline vs optimized on every device measured.

Change 0 — COI-free harness (dev page)

Local box has no GPU and the bb WASM-MT oracle is unavailable (needs cross-origin isolation + a built barretenberg-threads.wasm). Added two autorun modes that need neither:

  • msm-noble-check — cross-checks WebGPU vs the in-process noble (JS bigint) reference MSM. Correctness gate under headless SwiftShader.
  • msm-bench-gpu — times the WebGPU MSM, reporting the per-pass GPU breakdown (incl. the stream_walker accumulate kernel via timestamp-query) + pool GPU bytes. No WASM, runs on any WebGPU device behind BrowserStack.
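
The shape of the msm-noble-check gate, an optimized MSM compared against an independent slow reference, can be sketched as follows. This is a Python stand-in, not the dev-page code: the real check compares the WebGPU result with noble in JS. Every function name here (`add`, `mul`, `msm_reference`, `msm_bucket`), the window size, and the use of a bucket method as the "implementation under test" are assumptions for illustration. The constants are the standard BN254 G1 parameters (y² = x³ + 3, generator (1, 2)).

```python
import random

P = 21888242871839275222246405745257275088696311157297823662689037894645226208583  # base field
R = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # group order
G = (1, 2)  # BN254 G1 generator

def add(a, b):
    """Affine point addition; None represents the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None  # a + (-a)
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope, uses 1/dx
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def mul(k, pt):
    """Double-and-add scalar multiplication."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def msm_reference(scalars, points):
    """Slow reference: sum of k_i * P_i, one scalar multiplication per term."""
    acc = None
    for k, pt in zip(scalars, points):
        acc = add(acc, mul(k, pt))
    return acc

def msm_bucket(scalars, points, c=4, bits=254):
    """Windowed bucket MSM, standing in for the implementation under test."""
    total = None
    for w in reversed(range(0, bits, c)):
        for _ in range(c):
            total = add(total, total)  # shift the accumulated result by one window
        buckets = [None] * (1 << c)
        for k, pt in zip(scalars, points):
            digit = (k >> w) & ((1 << c) - 1)
            if digit:
                buckets[digit] = add(buckets[digit], pt)
        running = window = None
        for b in reversed(buckets[1:]):  # running sums give sum(d * bucket[d])
            running = add(running, b)
            window = add(window, running)
        total = add(total, window)
    return total

random.seed(1)
points = [mul(random.randrange(1, R), G) for _ in range(8)]
scalars = [random.randrange(R) for _ in range(8)]
result = msm_bucket(scalars, points)
agree = result == msm_reference(scalars, points)
print("agree" if agree else "MISMATCH")
```

The value of the gate comes from the two code paths sharing as little as possible, which is why the reference here avoids windows and buckets entirely.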

Lowered dev LOGN_MIN 10 → 8 so logn=8 is reachable.

Correctness — VERIFIED ✅

SwiftShader (software WebGPU, headless Chromium), WebGPU walker vs noble reference, both baseline and optimized:

| logn | n | baseline | optimized |
| --- | --- | --- | --- |
| 8 | 256 | agree ✅ | agree ✅ |
| 10 | 1024 | agree ✅ | agree ✅ |

BrowserStack timings — real devices, logn=17, 5 reps (msm-bench-gpu)

Apple M2 · macOS Sequoia · Chrome 148 — per-kernel A/B ✅

| metric | baseline | optimized (dx-cache) | Δ |
| --- | --- | --- | --- |
| stream_walker (accumulate) median | 67.57 ms | 66.72 ms | −0.85 ms (−1.3%) |
| stream_walker mean (5 reps) | 67.19 ms | 66.69 ms | −0.50 ms (−0.7%) |
| wall median | 88.6 ms | 87.9 ms | −0.7 ms |
| pool GPU bytes | 71,712,496 | 71,712,496 | 0 (memory-neutral) |

Small but consistent: every optimized rep landed below the baseline's maximum, with a narrower spread (optimized 66.2–67.0 ms vs baseline 66.1–68.0 ms). The gain is modest on Apple's bandwidth-rich TBDR GPU, where the removed point loads were largely L1/L2 cache hits and the safegcd inversion dominates the kernel.
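
The Δ column is plain arithmetic on the table values and can be re-derived in a few lines. The helper name `delta` is hypothetical, and the inputs are the rounded stream_walker figures from the table above, not the raw reps.

```python
def delta(baseline_ms: float, optimized_ms: float) -> tuple[float, float]:
    """Return (absolute delta in ms, relative delta in % of baseline)."""
    abs_ms = optimized_ms - baseline_ms
    return abs_ms, 100.0 * abs_ms / baseline_ms

# Rounded values copied from the Apple M2 table.
rows = {
    "stream_walker median": (67.57, 66.72),
    "stream_walker mean": (67.19, 66.69),
}

for name, (base, opt) in rows.items():
    abs_ms, pct = delta(base, opt)
    print(f"{name}: {abs_ms:+.2f} ms ({pct:+.1f}%)")
```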

Adreno (Snapdragon 8 Elite) · Galaxy S25 Ultra · Android 15 · Chrome 145 — wall A/B

| metric | baseline | optimized | Δ |
| --- | --- | --- | --- |
| wall median | 523.4 ms | 516.4 ms | −7.0 ms (−1.3%) |
| wall mean (5 reps) | 527.2 ms | 514.8 ms | −12.3 ms (−2.3%) |
| pool GPU bytes | 71,712,496 | 71,712,496 | 0 (memory-neutral) |

Same direction as Apple, slightly larger on this bandwidth-limited mobile GPU. Caveat: Android Chrome does not expose the timestamp-query feature, so per-pass isolation is unavailable — only wall time, whose ~±5% rep-to-rep variance is too coarse to tightly resolve a single-kernel change. The consistent mean/median direction is suggestive, not conclusive.
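
To make the "too coarse" caveat concrete, here is a back-of-envelope check. It assumes, loudly, that the ~±5% rep-to-rep variation can be treated as one standard deviation of independent samples, which the measurements above do not establish. The function name and the two-sigma criterion are likewise choices made for this sketch.

```python
import math

def diff_std_error_pct(rel_sd_pct: float, reps: int) -> float:
    """Standard error (in %) of the difference between two independent means,
    each averaged over `reps` samples with relative std dev `rel_sd_pct`."""
    return math.sqrt(2.0) * rel_sd_pct / math.sqrt(reps)

# ASSUMPTION: sd = 5% per rep; 5 reps for baseline and 5 for optimized.
noise_pct = diff_std_error_pct(5.0, 5)

# Relative improvements reported in the Adreno table (median, mean).
observed_pct = [1.3, 2.3]

# Crude criterion: call a delta resolved only if it exceeds two standard errors.
resolved = any(d > 2.0 * noise_pct for d in observed_pct)
print(f"noise ~{noise_pct:.1f}%, observed {observed_pct}, resolved: {resolved}")
```

Under that assumption both observed deltas sit inside one standard error of the comparison, which is consistent with reading the Adreno result as suggestive only.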

Mali (Pixel 9 Pro XL) — not separately A/B'd

Same Android-Chrome timestamp-query limitation (no per-kernel isolation), and the device was occupied by other agents throughout the window (the shared 2-seat BrowserStack pool). Memory-neutrality holds by construction (no device buffer added). A wall A/B can be run later when a Mali seat frees.

Summary

A small, safe, memory-neutral first step on the accumulate kernel: about 0.7–1.3% faster on the Apple M2 (per-kernel timing) and 1.3–2.3% faster on Adreno (wall time only), identical pool bytes, correctness GREEN at logn=8/10. The win is bounded on Apple because the safegcd inversion (~47% of the kernel) dominates and Apple's caches hide the redundant reads; the larger relative move on Adreno fits the "dependent-load on a bandwidth-limited GPU" thesis, though wall-time noise leaves that unconfirmed. Next levers (separate, individually-benchmarked commits): a forward-pass dead-store removal, idle-anchor point hoisting, and a GPU partials-reduction path.

Commits

  1. test(bb/msm): COI-free noble cross-check autorun + dev floor → logn=8
  2. perf(bb/msm): walker — cache forward-pass dx, drop inverse-pass point reloads
  3. test(bb/msm): COI-free GPU-only bench autorun (msm-bench-gpu)

AztecBot added 2 commits May 30, 2026 00:34
… logn=8

Adds a msm-noble-check autorun mode to the msm-webgpu dev page that
cross-checks the WebGPU result against the in-process noble (JS bigint)
reference MSM, with no dependency on the bb WASM-MT oracle (which needs
cross-origin isolation and a built barretenberg-threads.wasm). This lets
the walker be correctness-gated under a plain SwiftShader headless
browser at small logn. Lowers LOGN_MIN 10 -> 8 so logn=8 is reachable.
… reloads

The stream-walker accumulate kernel's batched-inversion inner loop
recomputed each slot's dx = (p_rx - p_lx) in the inverse pass purely to
advance the running inverse, re-loading both x-coordinates through the
l0_index -> point_x double indirection and re-running fr_sub for every
active slot, every iteration. The forward prefix pass already computes
exactly those dx values.

Cache the S forward-pass dx values in a private array and reuse them in
the inverse pass. This removes, per inner-loop iteration, (S-1) x
(2 load_pt_x + 1 fr_sub_f8) — i.e. up to 14 storage-buffer point reads
and 7 field subtractions at S=8 — trading them for S private dx slots.
Identical arithmetic; cross-check GREEN vs noble at logn=8 and logn=10
under SwiftShader.
AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") May 30, 2026
Times the WebGPU MSM and reports the per-pass GPU breakdown — including
the stream_walker accumulate kernel — plus the pool's allocated GPU
bytes, with no WASM-MT oracle. Runs on any WebGPU device behind
BrowserStack without a built bb wasm. Posts /progress rows so the
BrowserStack driver can detect the runId and track liveness. Use
&profile=1 for timestamp-query per-pass timings.