perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal by AztecBot · Pull Request #23727 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:07:21Z

TL;DR

Investigated the most promising lever for the WebGPU stream-walker accumulator (the slow kernel on stream-walker-impl). The textbook lever, raising the batched-inversion slot count S to amortise the field inversion over more adds, does not work on real GPUs: the accumulate kernel is occupancy/register-bound, not inversion-bound. Measured on a real Apple M2, S=16 regresses the accumulate phase by about 22% (67.5 → 82.5 ms; about +24% on total GPU time) and doubles the partials buffer. S=8 is the time-and-memory optimum and stays the default.

This PR lands (1) the parameterisation + device-aware clamping to make S/TPB autotunable, (2) reusable measurement infrastructure (WASM-free GPU profiling autorun, SwiftShader CPU-Vulkan correctness driver, BrowserStack stream_s passthrough), and (3) the measured evidence that redirects the optimization roadmap away from inversion-amortization and toward occupancy/bandwidth.

Why this lever, and what the data says

In the walker inner loop one safegcd inversion (~332 Fq muls) is shared across S affine adds, so per-add inversion cost is |I|/S ≈ 41 muls at S=8 — algebraically the dominant term (everything else is ~5 muls/add). The natural hypothesis: raise S (within the per-device 16/32 KB workgroup-storage budget) to ~halve it.
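The amortisation argument can be written down as a small cost model. This is only a sketch of the algebra quoted above (332 muls per inversion, roughly 5 muls per add for everything else); the function name `mulsPerAdd` and the constants' names are illustrative and are not taken from msm_v2.ts. It models multiplication counts only and says nothing about occupancy.

```typescript
// Fq multiplications per safegcd inversion, as quoted in the PR text.
const INVERSION_MULS = 332;
// Rough cost of everything except the inversion ("~5 muls/add" in the PR text).
const OTHER_MULS_PER_ADD = 5;

// Modelled Fq multiplications per affine add when one inversion is shared by s adds.
// Hypothetical helper for illustration only.
function mulsPerAdd(s: number): number {
  if (!Number.isInteger(s) || s < 1) {
    throw new RangeError("S must be a positive integer");
  }
  return INVERSION_MULS / s + OTHER_MULS_PER_ADD;
}

for (const s of [4, 8, 16]) {
  console.log(`S=${s}: ${mulsPerAdd(s).toFixed(2)} modelled muls/add`);
}
```

On paper the model favours a larger S, which is why the hypothesis looked attractive before it was measured.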

I measured it instead of assuming it. The new WASM-free gpu-bench autorun reports per-phase GPU timestamps; I ran S=8 vs S=16 on the same BrowserStack device.

macOS Sequoia · Apple M2 · Chrome 148 · n = 2¹⁷ · 5 reps (GPU timestamp-query):

| config | wall | stream_walker (accumulate) | walker_combine | reduce | total GPU | partials buf |
|---|---|---|---|---|---|---|
| S=8 (default) | 88.9 ms | 67.5 ms | 5.96 ms | 8.53 ms | 85.9 ms | 8 MB |
| S=16 | 110.0 ms | 82.5 ms | 11.7 ms | 8.52 ms | 106.9 ms | 16 MB |
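The per-phase relative change from S=8 to S=16 follows directly from the table. A minimal sketch, with the timings copied from the rows above; the object and function names (`s8`, `s16`, `percentChange`) are illustrative, not part of the harness.

```typescript
// Per-phase timings in ms, copied from the table (Apple M2, n = 2^17, 5 reps).
const s8 = { wall: 88.9, streamWalker: 67.5, walkerCombine: 5.96, reduce: 8.53, totalGpu: 85.9 };
const s16 = { wall: 110.0, streamWalker: 82.5, walkerCombine: 11.7, reduce: 8.52, totalGpu: 106.9 };

type Phase = keyof typeof s8;

// Relative change from `before` to `after`, in percent.
function percentChange(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

for (const phase of Object.keys(s8) as Phase[]) {
  const delta = percentChange(s8[phase], s16[phase]);
  console.log(`${phase}: ${delta >= 0 ? "+" : ""}${delta.toFixed(1)}%`);
}
```

With these numbers the accumulate phase grows by about 22%, wall and total GPU time by about 24%, walker_combine roughly doubles, and reduce is unchanged within noise.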

Raising S made it worse on both axes — time and memory. The accumulate kernel carries an affine accumulator + cursor state per slot in private registers and a 2×vec4 prefix slot in workgroup scratch (TPB×S×32 bytes); a larger S loses occupancy faster than it saves inversions, and doubles the device-side partials buffer + the walker_combine scan. So the inversion-amortization lever is exhausted in the upward direction; the real bottleneck is occupancy/bandwidth (uncoalesced SRS reads, triple point re-load per add, register pressure) — exactly the deferred wins in the design plan (coalesced-task layout, register reduction).

S=8 is also the largest value that fits the 16 KB workgroup-storage limit of Mali Bifrost at TPB=64 (64 × 8 × 32 B = 16 KB exactly), so the measured optimum and the hard mobile constraint agree. The default stays at 8, so no device regresses.

Changes

msm_v2.ts — new MsmConfig.streamS / walkerTpb. STREAM_S resolves to 8 by default; config.streamS overrides it, clamped to what the device's maxComputeWorkgroupStorageSize allows. Removes the two hardcoded constants. The shared MsmV2Pool already grows its streamS-sized buffers, so warm-path sizing stays consistent at any S.
dev/msm-webgpu/main.ts — WASM-free autorun=gpu-vs-noble (correctness vs @noble/curves, for SwiftShader CI where the threaded WASM oracle is a placeholder) and autorun=gpu-bench (per-phase GPU timing over reps); stream_s / walker_tpb URL knobs; LOGN_MIN lowered to 8 for small-vector runs.
dev/msm-webgpu/swiftshader-crosscheck.mjs — headless SwiftShader (CPU-Vulkan) driver.
dev/msm-webgpu/scripts/run-browserstack.mjs — --stream-s passthrough; configurable external-worker-id wait (MSM_EXTERNAL_ID_WAIT_MS) so the tunnel survives BrowserStack seat queueing.
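The device-aware clamp described for msm_v2.ts can be sketched as follows. This is not the code in the PR: `walkerScratchBytes` and `clampStreamS` are hypothetical names, the 32-byte slot size comes from the TPB×S×32 figure in the description, and halving until the value fits is an assumption that suits power-of-two S values.

```typescript
// Bytes of workgroup scratch per (thread, slot): one 2 x vec4 prefix slot.
const PREFIX_SLOT_BYTES = 32;

// Workgroup scratch needed by the walker for a given TPB and S (TPB x S x 32 bytes).
function walkerScratchBytes(tpb: number, s: number): number {
  return tpb * s * PREFIX_SLOT_BYTES;
}

// Hypothetical clamp: halve the requested S until the scratch fits the device limit.
// Assumption: S is normally a power of two, so halving keeps it one.
function clampStreamS(requested: number, tpb: number, maxWorkgroupStorage: number): number {
  let s = Math.max(1, Math.floor(requested));
  while (s > 1 && walkerScratchBytes(tpb, s) > maxWorkgroupStorage) {
    s = Math.floor(s / 2);
  }
  return s;
}

// A 16 KB device at TPB=64 clamps a requested S=16 down to 8; a 32 KB device keeps 16.
console.log(clampStreamS(16, 64, 16 * 1024), clampStreamS(16, 64, 32 * 1024));
```

In a browser the limit would come from `device.limits.maxComputeWorkgroupStorageSize` on the `GPUDevice`; the literals above stand in for that value so the sketch runs without a GPU.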

Correctness — SwiftShader (software Vulkan, headless), GREEN

WebGPU output cross-checked against the @noble/curves/bn254 reference:

| logn (n) | S=4 | S=8 | S=16 |
|---|---|---|---|
| 8 (256) | - | ✅ | ✅ |
| 10 (1024) | ✅ | ✅ | ✅ (+autotuned default ✅) |

All S values agree with the reference at both sizes — the parameterisation is correct.

Memory (device algorithm buffers, governs the ≤100 MB budget)

Dominated by partials_buf = 2·NUM_THREADS·S slots × 64 B (+ task_cuts ∝ S). S=8 → ~8.5 MB of S-dependent buffers (negligible vs the shared SRS/l0_index); S=16 → ~17 MB. So of the two values timed, S=8 is the memory choice too; a smaller S (e.g. 4 → ~4.3 MB) is the only direction that could improve memory, and is the natural next experiment now that the harness exists.
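The partials_buf term scales linearly in S, which a short calculation makes concrete. This sketch models only the partials_buf formula above and leaves out task_cuts. NUM_THREADS is not stated in the description; the value 8192 used here is an assumption back-derived from the 8 MB figure reported for S=8, and `partialsBufBytes` is an illustrative name.

```typescript
// Bytes per partials slot, from the formula in the PR text.
const PARTIAL_SLOT_BYTES = 64;

// partials_buf size in bytes: 2 * NUM_THREADS * S slots of 64 B each.
function partialsBufBytes(numThreads: number, s: number): number {
  return 2 * numThreads * s * PARTIAL_SLOT_BYTES;
}

// ASSUMPTION: 8192 threads, chosen so that S=8 reproduces the reported 8 MB buffer.
const ASSUMED_NUM_THREADS = 8192;
const MIB = 1024 * 1024;

for (const s of [4, 8, 16]) {
  const mib = partialsBufBytes(ASSUMED_NUM_THREADS, s) / MIB;
  console.log(`S=${s}: partials_buf = ${mib} MiB`);
}
```

Under that assumption the buffer is 4, 8 and 16 MiB for S = 4, 8 and 16; the slightly larger totals quoted above include the S-proportional task_cuts.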

Status / blockers

Parameterise S/TPB; device-aware clamp; safe S=8 default
Measurement harness (gpu-vs-noble, gpu-bench, SwiftShader + BrowserStack drivers)
SwiftShader correctness logn=8,10 (S=4,8,16) — PASS
BrowserStack Apple M2: S=8 vs S=16 accumulate-phase comparison — done; S=8 wins
BrowserStack Adreno (S25 Ultra) / Mali (Pixel 9 Pro XL) and the S=4 point: blocked on BrowserStack capacity (2 seats shared across ~10 concurrent agents; both seats stayed saturated for >20 min of polling, and I did not preempt other agents' jobs). The harness and the --stream-s / --target knobs are in place to run these as soon as seats free up.

Net for the project: the inversion-batch lever is a dead end upward; S=8 is the time+memory optimum on Apple, and matches the Mali constraint. The accumulate kernel needs occupancy/bandwidth optimization (coalesced reads, fewer point reloads, lower register footprint), which the new gpu-bench per-phase harness now makes directly measurable.

…clamped knob BrowserStack on Apple M2 (n=2^17) shows the accumulate kernel is occupancy-bound, not inversion-bound: S=16 regresses the accumulate phase 67.5->82.5 ms (+22%) and doubles the partials buffer. Default S stays 8 (also the largest value fitting 16 KB Mali at TPB=64); config.streamS remains an override clamped to the device workgroup-storage limit for autotuning.

perf(bb/msm): device-autotuned batched-inversion slots (S) in the Web…

1e9550d

…GPU stream-walker

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) May 30, 2026

AztecBot added 2 commits May 30, 2026 00:09

feat(bb/msm): WASM-free gpu-bench autorun for per-phase GPU timing (S…

51041fd

… sweep)

AztecBot changed the title ~~perf(bb/msm): device-autotuned batched-inversion slots (S) in the WebGPU stream-walker~~ perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal May 30, 2026

AztecBot wants to merge 3 commits into stream-walker-impl from cb/msm-opt/bucket-stream-1cc5
