Skip to content

perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal#23727

Draft
AztecBot wants to merge 3 commits into
stream-walker-implfrom
cb/msm-opt/bucket-stream-1cc5
Draft

perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal#23727
AztecBot wants to merge 3 commits into
stream-walker-implfrom
cb/msm-opt/bucket-stream-1cc5

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

TL;DR

Investigated the highest-leverage lever for the WebGPU stream-walker accumulator (the slow kernel on stream-walker-impl). The textbook lever — raise the batched-inversion slot count S to amortise the field inversion over more adds — does not work on real GPUs: the accumulate kernel is occupancy/register-bound, not inversion-bound. Measured on a real Apple M2, S=16 regresses the accumulate phase by +24% and doubles the partials buffer. S=8 is the time-and-memory optimum and stays the default.

This PR lands (1) the parameterisation + device-aware clamping to make S/TPB autotunable, (2) reusable measurement infrastructure (WASM-free GPU profiling autorun, SwiftShader CPU-Vulkan correctness driver, BrowserStack stream_s passthrough), and (3) the measured evidence that redirects the optimization roadmap away from inversion-amortization and toward occupancy/bandwidth.

Why this lever, and what the data says

In the walker inner loop one safegcd inversion (~332 Fq muls) is shared across S affine adds, so per-add inversion cost is |I|/S ≈ 41 muls at S=8 — algebraically the dominant term (everything else is ~5 muls/add). The natural hypothesis: raise S (within the per-device 16/32 KB workgroup-storage budget) to ~halve it.

I measured it instead of assuming it. New WASM-free gpu-bench autorun reports per-phase GPU timestamps; ran S=8 vs S=16 on the same BrowserStack device.

macOS Sequoia · Apple M2 · Chrome 148 · n = 2¹⁷ · 5 reps (GPU timestamp-query):

config wall stream_walker (accumulate) walker_combine reduce total GPU partials buf
S=8 (default) 88.9 ms 67.5 ms 5.96 ms 8.53 ms 85.9 ms 8 MB
S=16 110.0 ms 82.5 ms 11.7 ms 8.52 ms 106.9 ms 16 MB

Raising S made it worse on both axes — time and memory. The accumulate kernel carries an affine accumulator + cursor state per slot in private registers and a 2×vec4 prefix slot in workgroup scratch (TPB×S×32 bytes); a larger S loses occupancy faster than it saves inversions, and doubles the device-side partials buffer + the walker_combine scan. So the inversion-amortization lever is exhausted in the upward direction; the real bottleneck is occupancy/bandwidth (uncoalesced SRS reads, triple point re-load per add, register pressure) — exactly the deferred wins in the design plan (coalesced-task layout, register reduction).

S=8 is also the only value that fits 16 KB Mali Bifrost at TPB=64, so the measured optimum and the hard mobile constraint agree. Default kept at 8 → zero regression on any device.

Changes

  • msm_v2.ts — new MsmConfig.streamS / walkerTpb. STREAM_S resolves to 8 by default; config.streamS overrides it, clamped to what the device's maxComputeWorkgroupStorageSize allows. Removes the two hardcoded constants. The shared MsmV2Pool already grows its streamS-sized buffers, so warm-path sizing stays consistent at any S.
  • dev/msm-webgpu/main.ts — WASM-free autorun=gpu-vs-noble (correctness vs @noble/curves, for SwiftShader CI where the threaded WASM oracle is a placeholder) and autorun=gpu-bench (per-phase GPU timing over reps); stream_s / walker_tpb URL knobs; LOGN_MIN lowered to 8 for small-vector runs.
  • dev/msm-webgpu/swiftshader-crosscheck.mjs — headless SwiftShader (CPU-Vulkan) driver.
  • dev/msm-webgpu/scripts/run-browserstack.mjs--stream-s passthrough; configurable external-worker-id wait (MSM_EXTERNAL_ID_WAIT_MS) so the tunnel survives BrowserStack seat queueing.

Correctness — SwiftShader (software Vulkan, headless), GREEN

WebGPU output cross-checked against the @noble/curves/bn254 reference:

logn (n) S=4 S=8 S=16
8 (256)
10 (1024) ✅ (+autotuned default ✅)

All S values agree with the reference at both sizes — the parameterisation is correct.

Memory (device algorithm buffers, governs the ≤100 MB budget)

Dominated by partials_buf = 2·NUM_THREADS·S slots × 64 B (+ task_cuts ∝ S). S=8 → ~8.5 MB of S-dependent buffers (negligible vs the shared SRS/l0_index); S=16 → ~17 MB. So S=8 is the memory choice too; a smaller S (e.g. 4 → ~4.3 MB) is the only direction that could improve memory, and is the natural next experiment now that the harness exists.

Status / blockers

  • Parameterise S/TPB; device-aware clamp; safe S=8 default
  • Measurement harness (gpu-vs-noble, gpu-bench, SwiftShader + BrowserStack drivers)
  • SwiftShader correctness logn=8,10 (S=4,8,16) — PASS
  • BrowserStack Apple M2: S=8 vs S=16 accumulate-phase comparison — done; S=8 wins
  • BrowserStack Adreno (S25 Ultra) / Mali (Pixel 9 Pro XL) and the S=4 point — blocked on BrowserStack capacity (2 seats shared across ~10 concurrent agents; both seats were saturated for >20 min of polling, never preempted others' jobs). The harness + --stream-s / --target knobs are in place to run these the moment seats free.

Net for the project: the inversion-batch lever is a dead end upward; S=8 is the time+memory optimum on Apple, and matches the Mali constraint. The accumulate kernel needs occupancy/bandwidth optimization (coalesced reads, fewer point reloads, lower register footprint), which the new gpu-bench per-phase harness now makes directly measurable.

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
AztecBot added 2 commits May 30, 2026 00:09
…clamped knob

BrowserStack on Apple M2 (n=2^17) shows the accumulate kernel is
occupancy-bound, not inversion-bound: S=16 regresses the accumulate phase
67.5->82.5 ms (+24%) and doubles the partials buffer. Default S stays 8
(also the only value fitting 16 KB Mali at TPB=64); config.streamS remains
an override clamped to the device workgroup-storage limit for autotuning.
@AztecBot AztecBot changed the title perf(bb/msm): device-autotuned batched-inversion slots (S) in the WebGPU stream-walker perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant