perf(bb/msm): stream-walker accumulate — cache forward-pass dx (drop inverse-pass point reloads) by AztecBot · Pull Request #23732 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:34:50Z

Goal

Make the memory-light stream-walker BN254 MSM accumulator faster — focused first on the ~28 ms accumulate kernel — without regressing its memory footprint or correctness. Approach: small, individually-validated changes, each its own commit. Base branch: stream-walker-impl.

Change 1 — cache forward-pass `dx`, drop inverse-pass point reloads (`ba_stream_walker`)

The batched-inversion inner loop has three phases per iteration over the S=8 slots: forward prefix product, one safegcd inversion, inverse pass + backward affine-add peel.

The inverse pass recomputed each slot's dx = (p_rx − p_lx) purely to advance the running inverse (inv = inv·dx_k). That recompute re-loaded both x-coordinates through the l0_index → point_x double indirection and re-ran fr_sub_f8 for every active slot, every iteration — although the forward pass already computed exactly those dx one phase earlier.

This change caches the S forward-pass dx values in a per-thread private array (dx_slot) and reuses them in the inverse pass.

Removed per iteration (S=8): (S−1)·(2 load_pt_x + 1 fr_sub_f8) = up to 14 storage-buffer point reads + 7 field subtractions, traded for S private dx slots. Arithmetic is bit-for-bit identical. The l0_index → point_x chain is a dependent (latency-bound) load on mobile GPUs, so cutting it from the inner loop is the target.
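The trade described above can be sketched in plain Python. This is a minimal model, not the shader: `SlotModel`, `batch_inverse`, the demo data, and the stand-in prime `P` are all hypothetical, and Fermat inversion is swapped in for the kernel's safegcd inversion. It only illustrates that reusing the forward-pass dx values yields the same inverses while dropping (S−1) recomputations.

```python
P = 2**61 - 1  # stand-in prime modulus (assumption); the real kernel works over a BN254 field


class SlotModel:
    """Hypothetical model of the l0_index -> point_x double indirection, with op counters."""

    def __init__(self, l0_index, point_x):
        self.l0_index, self.point_x = l0_index, point_x
        self.loads = 0  # counts load_pt_x calls
        self.subs = 0   # counts field subtractions

    def load_pt_x(self, ref):
        self.loads += 1
        return self.point_x[self.l0_index[ref]]

    def dx(self, left, right):
        self.subs += 1
        return (self.load_pt_x(right) - self.load_pt_x(left)) % P


def batch_inverse(model, pairs, cache_dx):
    """Return 1/dx for every (left, right) pair using a single modular inversion."""
    n = len(pairs)
    dx_slot = [0] * n  # private per-thread cache of forward-pass dx
    prefix = [0] * n   # prefix[k] = dx_0 * ... * dx_{k-1}
    acc = 1
    # Phase 1: forward prefix product.
    for k, (left, right) in enumerate(pairs):
        dx_slot[k] = model.dx(left, right)
        prefix[k] = acc
        acc = acc * dx_slot[k] % P
    # Phase 2: one inversion (Fermat here; the kernel uses safegcd).
    inv = pow(acc, P - 2, P)
    # Phase 3: inverse pass, peeling slots from the back.
    out = [0] * n
    for k in range(n - 1, -1, -1):
        out[k] = inv * prefix[k] % P
        if k > 0:  # the running inverse is advanced S-1 times
            dx = dx_slot[k] if cache_dx else model.dx(*pairs[k])
            inv = inv * dx % P
    return out


S = 8
point_x = [(7 * i * i + 3) % P for i in range(2 * S)]  # distinct x-coordinates
l0_index = list(reversed(range(2 * S)))                # arbitrary indirection table
pairs = [(2 * k, 2 * k + 1) for k in range(S)]

m0, m1 = SlotModel(l0_index, point_x), SlotModel(l0_index, point_x)
baseline = batch_inverse(m0, pairs, cache_dx=False)
cached = batch_inverse(m1, pairs, cache_dx=True)
print(baseline == cached)                         # True
print(m0.loads - m1.loads, m0.subs - m1.subs)     # 14 7
```

In this model the cached variant saves exactly 14 loads and 7 subtractions at S=8, matching the count stated above, at the cost of the S-entry `dx_slot` list.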

Memory budget unchanged. dx_slot is per-thread private/register state — not workgroup-shared (the 16 KB Mali / 32 KB Apple·Adreno pref_scratch cap is untouched) and not a device buffer (≤100 MB budget untouched). Confirmed: pool GPU bytes are identical baseline vs optimized on every device measured.

Change 0 — COI-free harness (dev page)

The local box has no GPU, and the bb WASM-MT oracle is unavailable there (it needs cross-origin isolation plus a built barretenberg-threads.wasm). Added two autorun modes that need neither:

msm-noble-check — cross-checks WebGPU vs the in-process noble (JS bigint) reference MSM. Correctness gate under headless SwiftShader.
msm-bench-gpu — times the WebGPU MSM, reporting the per-pass GPU breakdown (incl. the stream_walker accumulate kernel via timestamp-query) + pool GPU bytes. No WASM, runs on any WebGPU device behind BrowserStack.

Lowered dev LOGN_MIN 10 → 8 so logn=8 is reachable.

Correctness — VERIFIED ✅

SwiftShader (software WebGPU, headless Chromium), WebGPU walker vs noble reference, both baseline and optimized:

| logn | n | baseline | optimized |
|---|---|---|---|
| 8 | 256 | agree ✅ | agree ✅ |
| 10 | 1024 | agree ✅ | agree ✅ |
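The shape of such a cross-check can be sketched as follows. This is an illustration only: it runs over a toy curve y² = x³ + 3 on a stand-in prime field (not BN254), the function names are hypothetical, and a windowed bucket MSM stands in for the WebGPU walker while a naive double-and-add sum stands in for the noble reference.

```python
import random

P = 2**61 - 1   # stand-in prime field (assumption); NOT the BN254 modulus
G = (1, 2)      # lies on y^2 = x^3 + 3, since 2^2 == 1^3 + 3
INF = None      # point at infinity


def inv(a):
    return pow(a % P, P - 2, P)  # Fermat inversion; valid because P is prime


def add(p, q):
    """Affine addition; the chord slope divides by dx = x2 - x1."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return INF                        # q == -p
        lam = 3 * x1 * x1 * inv(2 * y1) % P   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P    # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def scalar_mul(k, p):
    acc = INF
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc


def msm_reference(scalars, points):
    """Naive reference: sum of k_i * P_i."""
    acc = INF
    for k, p in zip(scalars, points):
        acc = add(acc, scalar_mul(k, p))
    return acc


def msm_bucketed(scalars, points, c=4, bits=16):
    """Windowed bucket MSM; stand-in for the implementation under test."""
    result = INF
    for w in reversed(range(0, bits, c)):
        for _ in range(c):
            result = add(result, result)      # shift accumulator by one window
        buckets = [INF] * (1 << c)
        for k, p in zip(scalars, points):
            d = (k >> w) & ((1 << c) - 1)
            if d:
                buckets[d] = add(buckets[d], p)
        running = window_sum = INF
        for d in range((1 << c) - 1, 0, -1):  # running-sum trick: sum of d * bucket[d]
            running = add(running, buckets[d])
            window_sum = add(window_sum, running)
        result = add(result, window_sum)
    return result


def cross_check(logn, seed=0):
    rng = random.Random(seed)
    points, p = [], G
    for _ in range(1 << logn):
        points.append(p)
        p = add(p, G)
    scalars = [rng.randrange(1 << 16) for _ in points]
    return msm_reference(scalars, points) == msm_bucketed(scalars, points)


for logn in (8, 10):
    print(logn, 1 << logn, "agree" if cross_check(logn) else "MISMATCH")
```

Because both paths return a canonical affine point, agreement is a plain equality test, which is what makes a slow but simple reference usable as a correctness gate at small logn.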

BrowserStack timings — real devices, logn=17, 5 reps (`msm-bench-gpu`)

Apple M2 · macOS Sequoia · Chrome 148 — per-kernel A/B ✅

| metric | baseline | optimized (dx-cache) | Δ |
|---|---|---|---|
| `stream_walker` (accumulate) median | 67.57 ms | 66.72 ms | −0.85 ms (−1.3%) |
| `stream_walker` mean (5 reps) | 67.19 ms | 66.69 ms | −0.50 ms (−0.7%) |
| wall median | 88.6 ms | 87.9 ms | −0.7 ms |
| pool GPU bytes | 71,712,496 | 71,712,496 | 0 (memory-neutral) |

Small but consistent: every optimized rep landed below the baseline's maximum, with a narrower spread (optimized 66.2–67.0 ms vs baseline 66.1–68.0 ms). The gain is modest on Apple's bandwidth-rich TBDR GPU, where the removed point loads were largely L1/L2 cache hits and the safegcd inversion dominates the kernel.
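For reference, the Δ column of the Apple table follows from the reported medians and means by simple arithmetic. A minimal sketch, where `delta` is a hypothetical helper and the inputs are the values from the table:

```python
def delta(baseline_ms, optimized_ms):
    """Return (absolute change in ms, relative change in %) of optimized vs baseline."""
    diff = optimized_ms - baseline_ms
    return round(diff, 2), round(100.0 * diff / baseline_ms, 1)


print(delta(67.57, 66.72))  # stream_walker median: (-0.85, -1.3)
print(delta(67.19, 66.69))  # stream_walker mean:   (-0.5, -0.7)
```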

Adreno (Snapdragon 8 Elite) · Galaxy S25 Ultra · Android 15 · Chrome 145 — wall A/B

| metric | baseline | optimized | Δ |
|---|---|---|---|
| wall median | 523.4 ms | 516.4 ms | −7.0 ms (−1.3%) |
| wall mean (5 reps) | 527.2 ms | 514.8 ms | −12.3 ms (−2.3%) |
| pool GPU bytes | 71,712,496 | 71,712,496 | 0 (memory-neutral) |

Same direction as on Apple, and slightly larger on this bandwidth-limited mobile GPU. Caveat: Android Chrome does not expose the timestamp-query feature, so per-pass isolation is unavailable and only wall time can be compared. Its ~±5% rep-to-rep variance is too coarse to tightly resolve a single-kernel change, so the consistent mean/median direction is suggestive, not conclusive.
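To make the caveat concrete, the observed wall-time change can be compared against the stated noise band. This is a rough sketch that treats the ~±5% figure as a flat band around the baseline median (an assumption, not a statistical test); `within_noise` is a hypothetical helper.

```python
def within_noise(baseline_ms, optimized_ms, rel_noise):
    """Compare the observed improvement with a flat +/- rel_noise band around baseline."""
    gain_ms = round(baseline_ms - optimized_ms, 2)
    band_ms = round(rel_noise * baseline_ms, 2)
    return gain_ms, band_ms, gain_ms <= band_ms


# Adreno wall medians from the table, with the stated ~5% rep-to-rep variance.
gain, band, inside = within_noise(523.4, 516.4, 0.05)
print(gain, band, inside)  # 7.0 26.17 True
```

A 7 ms gain inside a roughly 26 ms band is why the Adreno result is read as directional only, pending per-kernel timestamps.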

Mali (Pixel 9 Pro XL) — not separately A/B'd

Same Android-Chrome timestamp-query limitation (no per-kernel isolation), and the device was occupied by other agents throughout the window (the shared 2-seat BrowserStack pool). Memory-neutrality holds by construction (no device buffer added). A wall A/B can be run later when a Mali seat frees.

Summary

A small, safe, memory-neutral first step on the accumulate kernel: ~1–2% faster on both Apple and Adreno, identical pool bytes, correctness GREEN at logn=8/10. The win is bounded on Apple because the safegcd inversion (~47% of the kernel) dominates and Apple's caches hide the redundant reads; the larger relative move on Adreno fits the "dependent-load on a bandwidth-limited GPU" thesis. Next levers (separate, individually-benchmarked commits): a forward-pass dead-store removal, idle-anchor point hoisting, and a GPU partials-reduction path.

Commits

test(bb/msm): COI-free noble cross-check autorun + dev floor → logn=8
perf(bb/msm): walker — cache forward-pass dx, drop inverse-pass point reloads
test(bb/msm): COI-free GPU-only bench autorun (msm-bench-gpu)

… logn=8 Adds an msm-noble-check autorun mode to the msm-webgpu dev page that cross-checks the WebGPU result against the in-process noble (JS bigint) reference MSM, with no dependency on the bb WASM-MT oracle (which needs cross-origin isolation and a built barretenberg-threads.wasm). This lets the walker be correctness-gated under a plain SwiftShader headless browser at small logn. Lowers LOGN_MIN 10 -> 8 so logn=8 is reachable.

… reloads The stream-walker accumulate kernel's batched-inversion inner loop recomputed each slot's dx = (p_rx - p_lx) in the inverse pass purely to advance the running inverse, re-loading both x-coordinates through the l0_index -> point_x double indirection and re-running fr_sub for every active slot, every iteration. The forward prefix pass already computes exactly those dx values. Cache the S forward-pass dx values in a private array and reuse them in the inverse pass. This removes, per inner-loop iteration, (S-1) x (2 load_pt_x + 1 fr_sub_f8) — i.e. up to 14 storage-buffer point reads and 7 field subtractions at S=8 — trading them for S private dx slots. Identical arithmetic; cross-check GREEN vs noble at logn=8 and logn=10 under SwiftShader.

Times the WebGPU MSM and reports the per-pass GPU breakdown — including the stream_walker accumulate kernel — plus the pool's allocated GPU bytes, with no WASM-MT oracle. Runs on any WebGPU device behind BrowserStack without a built bb wasm. Posts /progress rows so the BrowserStack driver can detect the runId and track liveness. Use &profile=1 for timestamp-query per-pass timings.

AztecBot added 2 commits May 30, 2026 00:34

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 30, 2026

AztecBot wants to merge 3 commits into stream-walker-impl from cb/msm-walker-dx-cache
