perf(bb/msm): stream-walker S/TPB knobs + measurement harness — accumulate is occupancy-bound, S=8 is optimal#23727
Draft
AztecBot wants to merge 3 commits into
…GPU stream-walker
…clamped knob
BrowserStack on Apple M2 (n=2^17) shows the accumulate kernel is occupancy-bound, not inversion-bound: S=16 regresses the accumulate phase 67.5->82.5 ms (+24%) and doubles the partials buffer. Default S stays 8 (also the only value fitting 16 KB Mali at TPB=64); config.streamS remains an override clamped to the device workgroup-storage limit for autotuning.
TL;DR
Investigated the highest-leverage lever for the WebGPU stream-walker accumulator (the slow kernel on stream-walker-impl). The textbook lever (raise the batched-inversion slot count S to amortise the field inversion over more adds) does not work on real GPUs: the accumulate kernel is occupancy/register-bound, not inversion-bound. Measured on a real Apple M2, S=16 regresses the accumulate phase by +24% and doubles the partials buffer. S=8 is the time-and-memory optimum and stays the default.
This PR lands (1) the parameterisation + device-aware clamping to make S/TPB autotunable, (2) reusable measurement infrastructure (WASM-free GPU profiling autorun, SwiftShader CPU-Vulkan correctness driver, BrowserStack stream_s passthrough), and (3) the measured evidence that redirects the optimization roadmap away from inversion-amortization and toward occupancy/bandwidth.
Why this lever, and what the data says
In the walker inner loop, one safegcd inversion (~332 Fq muls) is shared across S affine adds, so the per-add inversion cost is |I|/S ≈ 41 muls at S=8. Algebraically this is the dominant term (everything else is ~5 muls/add). The natural hypothesis: raise S (within the per-device 16/32 KB workgroup-storage budget) to roughly halve it.
I measured it instead of assuming it. The new WASM-free gpu-bench autorun reports per-phase GPU timestamps; I ran S=8 vs S=16 on the same BrowserStack device.
Test setup: macOS Sequoia · Apple M2 · Chrome 148 · n = 2¹⁷ · 5 reps (GPU timestamp-query).
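To put numbers on the amortisation hypothesis, here is a minimal sketch of the per-add inversion cost |I|/S for a few slot counts. The ~332-mul inversion cost comes from the description; the function name is illustrative only and not part of the PR. It models the algebraic cost alone and says nothing about occupancy, which is what the measurement turned out to be dominated by.

```typescript
// Per-add inversion cost when one inversion is shared across S affine adds.
// INVERSION_MULS (~332 Fq muls) is the figure quoted in the description;
// inversionMulsPerAdd is a hypothetical helper for illustration.
const INVERSION_MULS = 332;

function inversionMulsPerAdd(s: number): number {
  if (!Number.isInteger(s) || s <= 0) {
    throw new RangeError("S must be a positive integer");
  }
  return INVERSION_MULS / s;
}

// Compare the candidate slot counts discussed in this PR.
for (const s of [4, 8, 16]) {
  console.log(`S=${s}: ~${inversionMulsPerAdd(s).toFixed(1)} inversion muls per add`);
}
```

On paper, going from S=8 to S=16 saves about 21 muls per add, which is why the lever looked attractive before it was measured.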
Raising S made it worse on both axes: time and memory. The accumulate kernel carries an affine accumulator + cursor state per slot in private registers and a 2×vec4 prefix slot in workgroup scratch (TPB×S×32 bytes); a larger S loses occupancy faster than it saves inversions, and doubles the device-side partials buffer + the walker_combine scan. So the inversion-amortization lever is exhausted in the upward direction; the real bottleneck is occupancy/bandwidth (uncoalesced SRS reads, triple point re-load per add, register pressure), which are exactly the deferred wins in the design plan (coalesced-task layout, register reduction).
S=8 is also the largest value that fits 16 KB Mali Bifrost at TPB=64 (64×8×32 bytes = 16 KB), so the measured optimum and the hard mobile constraint agree. The default stays at 8, so no device regresses.
Changes
msm_v2.ts: new MsmConfig.streamS / walkerTpb. STREAM_S resolves to 8 by default; config.streamS overrides it, clamped to what the device's maxComputeWorkgroupStorageSize allows. Removes the two hardcoded constants. The shared MsmV2Pool already grows its streamS-sized buffers, so warm-path sizing stays consistent at any S.
dev/msm-webgpu/main.ts: WASM-free autorun=gpu-vs-noble (correctness vs @noble/curves, for SwiftShader CI where the threaded WASM oracle is a placeholder) and autorun=gpu-bench (per-phase GPU timing over reps); stream_s / walker_tpb URL knobs; LOGN_MIN lowered to 8 for small-vector runs.
dev/msm-webgpu/swiftshader-crosscheck.mjs: headless SwiftShader (CPU-Vulkan) driver.
dev/msm-webgpu/scripts/run-browserstack.mjs: --stream-s passthrough; configurable external-worker-id wait (MSM_EXTERNAL_ID_WAIT_MS) so the tunnel survives BrowserStack seat queueing.
Correctness: SwiftShader (software Vulkan, headless), GREEN
WebGPU output was cross-checked against the @noble/curves/bn254 reference. All S values agree with the reference at both sizes, so the parameterisation is correct.
Memory (device algorithm buffers, governs the ≤100 MB budget)
Dominated by partials_buf = 2·NUM_THREADS·S slots × 64 B (+ task_cuts ∝ S). S=8 → ~8.5 MB of S-dependent buffers (negligible vs the shared SRS/l0_index); S=16 → ~17 MB. So S=8 is the memory choice too; a smaller S (e.g. 4 → ~4.3 MB) is the only direction that could improve memory, and is the natural next experiment now that the harness exists.
Status / blockers
S=4 point: blocked on BrowserStack capacity (2 seats shared across ~10 concurrent agents; both seats were saturated for >20 min of polling, and other agents' jobs were not preempted). The harness + --stream-s / --target knobs are in place to run these the moment seats free.
Net for the project: the inversion-batch lever is a dead end upward; S=8 is the time+memory optimum on Apple, and matches the Mali constraint. The accumulate kernel needs occupancy/bandwidth optimization (coalesced reads, fewer point reloads, lower register footprint), which the new gpu-bench per-phase harness now makes directly measurable.
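For readers who want to see how a device-aware clamp of S could work, here is a minimal sketch built only on the TPB×S×32-byte workgroup-scratch formula quoted above. The function name, its parameters, and the assumption that the prefix scratch is the only workgroup-storage consumer are hypothetical; the actual logic in msm_v2.ts may differ (for example, it may reserve extra scratch or restrict S to powers of two).

```typescript
// Hypothetical sketch, not the PR's code: clamp a requested slot count S so
// that the walker's workgroup scratch (TPB * S * 32 bytes) fits the device limit.
// In a browser the limit would come from
// device.limits.maxComputeWorkgroupStorageSize (a standard WebGPU limit).
const BYTES_PER_SLOT = 32; // one 2×vec4 prefix slot, per the description

function clampStreamS(
  requestedS: number,
  tpb: number,
  maxWorkgroupStorageBytes: number,
): number {
  // Largest S whose scratch footprint still fits in workgroup storage.
  const maxS = Math.floor(maxWorkgroupStorageBytes / (tpb * BYTES_PER_SLOT));
  if (maxS < 1) {
    throw new RangeError("workgroup storage too small for even one slot");
  }
  return Math.max(1, Math.min(requestedS, maxS));
}

console.log(clampStreamS(16, 64, 16 * 1024)); // 16 KB device: clamps 16 down to 8
console.log(clampStreamS(16, 64, 32 * 1024)); // 32 KB device: 16 fits
console.log(clampStreamS(8, 64, 16 * 1024));  // default S=8 is unchanged
```

With TPB=64, a 16 KB device caps S at 8 and a 32 KB device allows 16, which matches the constraint described for Mali Bifrost.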