feat(bb/msm): benchmark-gated WebGPU MSM harness — SwiftShader correctness + Adreno baseline + memory attribution#23723
Draft
AztecBot wants to merge 9 commits into
…ness

- msm-bench autorun runs GPU-only (skips the unavailable threaded WASM oracle); Run/Run×5 enable on SRS-ready in gpu_only mode.
- Report per-MSM working-set memory (shared scratch + instance buffers, excluding the SRS pool) via MsmV2Pool/MsmV2.statsBytes(); included in bench results as `mem` and logged as [mem].
- Add scripts/swiftshader-bench.mjs for headless GPU-only timing validation.
Add MsmV2Pool.statsBreakdown() returning a labeled per-buffer byte map and surface the top-6 scratch buffers in the [mem] log + bench results. Makes the working-set total attributable to individual buffers (e.g. the fixed-size walkerPartials vs the n/B-scaled bucket buffers) on real BrowserStack devices.
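A minimal sketch of how a labeled per-buffer byte map could be reduced to a top-6 `[mem]` line. The `Breakdown` shape, `topBuffers`, `formatMiB`, and `formatMemLine` are hypothetical names for illustration; the actual return type of `statsBreakdown()` and the real log format are not shown in this PR text.

```typescript
// Hypothetical shape: buffer label -> size in bytes.
type Breakdown = Record<string, number>;

// Return the k largest buffers, biggest first.
function topBuffers(breakdown: Breakdown, k = 6): Array<[string, number]> {
  return Object.entries(breakdown)
    .sort((a, b) => b[1] - a[1])
    .slice(0, k);
}

// Format a byte count as MiB with two decimals.
function formatMiB(bytes: number): string {
  return (bytes / (1024 * 1024)).toFixed(2) + " MiB";
}

// Build one log line: total working set plus the k largest contributors.
function formatMemLine(breakdown: Breakdown, k = 6): string {
  const total = Object.values(breakdown).reduce((sum, b) => sum + b, 0);
  const parts = topBuffers(breakdown, k).map(
    ([label, bytes]) => `${label}=${formatMiB(bytes)}`
  );
  return `[mem] total=${formatMiB(total)} ${parts.join(" ")}`;
}
```

Sorting by size before truncating is what makes the total attributable: a fixed-size buffer shows up at the top at small n and drops down the list as the n-scaled buffers grow.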
The Run handler clears $log on each click, so the bench rep loop's attempt to parse '[gpu] returned in X ms' from new log lines always missed and recorded wallMs=0. Expose the dispatch->readback wall as __lastGpuWallMs and read it directly. This is the trustworthy timing on GPUs with unreliable timestamp-queries (Adreno reported garbage per-pass deltas while the real dispatch wall was ~491ms at 2^17).
…+ thread right-sizing

Make the stream-walker's occupancy and memory levers runtime-selectable so a single device session can sweep the curve, and add the two correctness-proven wins:

- pref_scratch placement knob (workgroup|device|private). 'private' moves the per-thread prefix scratch to module-scope var<private>, freeing the 16 KB var<workgroup> allocation (the swarm's #1 occupancy limiter) with no extra GPU buffer and no peak-memory regression. 'device' is also wired but needs an 11th storage binding (exceeds the common 10-buffer adapter ceiling).
- streamS (S) knob: S=4 halves every S-scaled scratch buffer (walkerPartials 8.5->4.25 MiB, walkerNodes, taskCuts); algo memory drops 11.8->6.8 MiB at logN=10.
- walkerTpb knob: TPB=128 for 32 KB GPUs (arch-aware occupancy).
- streamThreads knob: right-sizes the STREAM_T-scaled scratch at small n.

Harness forwards ?swalk= ?wtpb= ?pref= ?sthreads=; crosscheck script takes an extra query string. SwiftShader+noble cross-check GREEN at logN=10 for default, swalk=4, pref=private, pref=private&swalk=4(&wtpb=128), sthreads=4096.
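The S-scaling above can be checked with a little arithmetic. This sketch uses the `soaBuf(2 · STREAM_T · S)` element count quoted in the PR description; `walkerPartialsElems` and `scaleByS` are hypothetical helpers, and the default S=8 is an assumption inferred from the `workgroup-8-64-0` sweep example rather than something the text states outright.

```typescript
// Element count of walkerPartials, per the PR description: 2 * STREAM_T * S.
function walkerPartialsElems(streamT: number, s: number): number {
  return 2 * streamT * s;
}

// Scale a measured buffer size to another S, assuming the bytes per element
// do not change (an assumption; the element layout is not given in the text).
function scaleByS(measuredMiB: number, fromS: number, toS: number): number {
  return measuredMiB * (toS / fromS);
}

// Assumed default S = 8, with STREAM_T = 8192 as quoted in the description.
const elemsAtS8 = walkerPartialsElems(8192, 8); // 131072 elements
const elemsAtS4 = walkerPartialsElems(8192, 4); // 65536 elements
const predictedMiB = scaleByS(8.5, 8, 4);       // 4.25, matching the reported drop
```

Under that assumption the reported 8.5 -> 4.25 MiB is exactly the linear scaling, and the remaining part of the 11.8 -> 6.8 MiB drop would come from the other S-scaled buffers (walkerNodes, taskCuts).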
…r pair

The accumulate loop ran three passes (forward prefix, inverse, backward peel) that each re-read the slot's x-coords from point_x and re-subtracted dx. The operands are invariant across the three passes of one iteration (the cursor only advances at the end of the peel), so the forward pass now caches dx and both x-operands in private arrays; the inverse pass reuses cached dx and the peel reuses the cached x-coords, reading only the y-coords. This cuts point_x traffic from 3× to 1× per x-coord (~1/3 of SRS reads) at zero extra GPU memory. Gated by walkerDxCache (default true; ?dxcache=0 for A/B).

SwiftShader+noble cross-check GREEN at logN=8 and 10 for dxcache on/off, default and the private+S=4+TPB=128 composite.
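The three-pass structure described above is the usual batch-inversion pattern. The sketch below is a CPU-side illustration only, not the shader: it uses a toy prime (10007) so plain numbers suffice, whereas the real code works on BN254 field elements in WGSL. `batchSlopes`, `Pair`, and the `xReads` counter are hypothetical names, and the peel here only produces slopes, while the real peel also needs the cached x-coords to finish the point addition.

```typescript
// Toy prime field; products stay far below 2^53, so numbers are exact.
const P = 10007;
const mod = (a: number): number => ((a % P) + P) % P;
const mul = (a: number, b: number): number => mod(a * b);

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mul(result, base);
    base = mul(base, base);
    e >>= 1;
  }
  return result;
}

interface Pair { x1: number; y1: number; x2: number; y2: number; }

// Slopes (y2 - y1) / (x2 - x1) for all pairs using a single field inversion.
// Requires x1 != x2 (mod P) for every pair.
function batchSlopes(pairs: Pair[]): { slopes: number[]; xReads: number } {
  const n = pairs.length;
  const dx: number[] = new Array(n);     // cached differences
  const prefix: number[] = new Array(n); // prefix[i] = dx[0] * ... * dx[i]
  let xReads = 0;                        // x-coordinate reads, for comparison

  // Pass 1 (forward prefix): the only pass that reads x-coordinates.
  let acc = 1;
  for (let i = 0; i < n; i++) {
    dx[i] = mod(pairs[i].x2 - pairs[i].x1);
    xReads += 2;
    acc = mul(acc, dx[i]);
    prefix[i] = acc;
  }

  // Pass 2 (inverse): one inversion of the running product.
  let invAcc = inv(acc);

  // Pass 3 (backward peel): uses cached dx and prefix, reads only y.
  const slopes: number[] = new Array(n);
  for (let i = n - 1; i >= 0; i--) {
    const invDx = i === 0 ? invAcc : mul(invAcc, prefix[i - 1]);
    slopes[i] = mul(mod(pairs[i].y2 - pairs[i].y1), invDx);
    invAcc = mul(invAcc, dx[i]); // strip dx[i] from the running inverse
  }
  return { slopes, xReads };
}
```

Without the cache, each of the three passes would recompute `x2 - x1` from memory, tripling the x-coordinate reads; with it, `xReads` stays at two per pair.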
- run-browserstack.mjs --knobs forwards arbitrary walker query params (pref/swalk/wtpb/dxcache/sthreads) into the autorun URL so one seat can benchmark any walker config.
- msm-bench autorun now posts /progress rows (start, srs-wait, warmup, per-rep) so the runner detects the runId and keeps its stall watchdog fed through the slow SRS+build phase on real devices; previously it posted only a terminal /results row, tripping the 120s first-progress watchdog at logN=17.
- swiftshader-bench.mjs takes an extra query-string arg for local A/B.
The dev page downloaded the entire 2^20-point CRS (~32 MB compressed) from the external CDN on every boot, before the autorun could start. Through a slow tunnel on a real mobile device this stalled boot indefinitely, so the BrowserStack bench autorun never reached its first /progress post and the runner's watchdog killed the worker. Boot now loads 1<<min(LOGN_MAX, max(10,logN)) points from ?logn (or ?srsn=), cutting the logN=17 download to ~4 MB. SwiftShader+noble cross-check still GREEN.
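The clamp in that commit can be written as a one-line helper. This is a sketch: `srsPointCount` is a hypothetical name, `LOGN_MAX` is passed in as a parameter because its value is not stated here, and the value 20 used below is an assumption based on the 2^20-point CRS.

```typescript
// Number of SRS points to load: 1 << min(LOGN_MAX, max(10, logN)).
// The floor of 10 and the formula come from the commit message;
// lognMax is a parameter because its actual value is not given in the text.
function srsPointCount(logN: number, lognMax: number): number {
  return 1 << Math.min(lognMax, Math.max(10, logN));
}

// Assumed LOGN_MAX = 20, matching the 2^20-point CRS.
const ASSUMED_LOGN_MAX = 20;
const pointsAt17 = srsPointCount(17, ASSUMED_LOGN_MAX); // 131072
const pointsAt8 = srsPointCount(8, ASSUMED_LOGN_MAX);   // 1024 (floor of 2^10)
const fraction = pointsAt17 / (1 << ASSUMED_LOGN_MAX);  // 0.125
```

Under that assumption, logN=17 loads one eighth of the full CRS, which is consistent with the download shrinking from ~32 MB to ~4 MB.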
…ce session

Adds an autorun mode that sweeps a list of walker configs (?sweep=pref-S-TPB-dx, e.g. workgroup-8-64-0,private-4-128-1) in a single page load. Each config tears down and rebuilds MsmV2 + the pool so its time and (deterministic) memory accounting are correct for that config; without the teardown the pool grows but never shrinks. Lets one contended BrowserStack device allocation map a whole parameter curve instead of one config per seat.

SwiftShader-verified: two-config sweep reports walkerPartials 8.5 vs 4.25 MiB and algo 11.8 vs 6.8 MiB correctly.
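A sketch of how the `pref-S-TPB-dx` sweep string could be parsed into per-config objects. `parseSweep` and the `WalkerConfig` interface are hypothetical; the field names borrow the knob names from the commits above, and the mapping of the fourth field to the dx cache flag is an assumption based on the `?dxcache=0` toggle.

```typescript
type PrefScratch = "workgroup" | "device" | "private";

// Hypothetical config shape, using the knob names from the commit messages.
interface WalkerConfig {
  pref: PrefScratch;
  streamS: number;
  walkerTpb: number;
  walkerDxCache: boolean;
}

const PREFS: string[] = ["workgroup", "device", "private"];

// Parse a positive decimal integer, rejecting anything else.
function parsePositiveInt(text: string, what: string): number {
  if (!/^\d+$/.test(text)) throw new Error(`bad ${what}: ${text}`);
  const value = parseInt(text, 10);
  if (value <= 0) throw new Error(`bad ${what}: ${text}`);
  return value;
}

// Parse e.g. "workgroup-8-64-0,private-4-128-1" into a list of configs.
function parseSweep(value: string): WalkerConfig[] {
  return value
    .split(",")
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const parts = entry.split("-");
      if (parts.length !== 4) throw new Error(`bad sweep entry: ${entry}`);
      const [pref, s, tpb, dx] = parts;
      if (PREFS.indexOf(pref) < 0) throw new Error(`bad pref: ${pref}`);
      if (dx !== "0" && dx !== "1") throw new Error(`bad dx flag: ${dx}`);
      return {
        pref: pref as PrefScratch,
        streamS: parsePositiveInt(s, "S"),
        walkerTpb: parsePositiveInt(tpb, "TPB"),
        walkerDxCache: dx === "1",
      };
    });
}
```

Validating each field up front matters in this setting: a typo in one entry should fail before the sweep starts rather than partway through a contended device session.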
WebGPU MSM optimization — measured foundation (WIP)
Working branch off stream-walker-impl. Goal: a memory- and time-optimal BN254 WebGPU MSM for laptop + mobile GPUs (Apple TBDR, Adreno, Mali), within the 32 KB / 16 KB workgroup-shared-memory and ≤100 MB total-memory constraints up to n = 2²⁰.

This session was benchmark-gated: get a real WebGPU correctness path + a real device perf/memory path working first, record baselines, and only then change the algorithm. No unmeasured claims, so this PR ships the validated harness, the baseline measurements, and the analysis that should drive the next change, rather than a speculative rewrite.
1. Harness enablement (the blocker this box had)
The box has no GPU. Nothing here could validate WebGPU before. Fixed:
- `dev/msm-webgpu/scripts/swiftshader-crosscheck.mjs`: the pre-existing `test-msm-swiftshader.mjs` couldn't get a WebGPU adapter on Linux. The working setup is Playwright Chromium with the SwiftShader Vulkan ICD.
- `?ref=noble` cross-check mode. The threaded-WASM oracle in the harness (`barretenberg.wasm.gz`) is a 213-byte placeholder here, so GPU-vs-WASM is unavailable. The new mode cross-checks the WebGPU result against the in-page `@noble/curves` reference at the actual logN and skips WASM, which makes it a valid GPU-less validation path.
- `LOGN_MIN` 10 → 8 so 2⁸ is genuinely testable (was clamped to 2¹⁰).
- `?gpu_only=1` (auto for the `msm-bench` autorun): skips the unavailable WASM path so timing runs don't depend on it; Run/Run×5 enable on SRS-ready.
- `MsmV2Pool.statsBreakdown()`: turns one opaque "working-set" number into a labeled per-buffer map, surfaced in the `[mem]` log and bench results.
- The Run handler clears `$log` on click, so the bench rep loop always parsed `wallMs = 0`. Now the dispatch→readback wall is read from `__lastGpuWallMs`. This is the trustworthy timing on GPUs with broken timestamp-queries (see Adreno finding below).

2. Correctness: GREEN (stream-walker path, this branch)
SwiftShader (software Vulkan), WebGPU cross-checked against noble:
(SwiftShader wall times are software-rasterizer numbers — correctness only, never perf.)
3. Baseline — real device (Samsung Galaxy S25 Ultra · Adreno, Chrome 145 Mobile)
Stream-walker accumulate path, n = 2¹⁷:
[gpu] returned in)

4. Memory attribution (key finding)
`statsBreakdown()` shows the working set is not the "~9 MB memory-light" figure the plan implies: it's ~60 MiB at 2¹⁷.

`walkerPartials` ≈ 8.5 MiB is fixed at every n: it's `soaBuf(2 · STREAM_T · S)` with `STREAM_T = 8192` hardcoded (`msm_v2.ts:1614`), independent of n. At small n it's massively over-provisioned (it is the entire scratch at logN=10); at 2¹⁷ it's ~14% of scratch.

5. Where the next change should go (analysis, not yet implemented)
Ordered by leverage; each must clear SwiftShader correctness + a BrowserStack number before landing:
- (`src/msm_webgpu/MSM_DESIGN_ANALYSIS.md` §6.4.)
- Right-size `STREAM_T` for small n: frees most of the fixed 8.5 MiB `walkerPartials` below ~2¹⁴. Neutral at 2¹⁷ and above. Low risk, but the indirect-dispatch and taskCuts/walkerNodes sizing must track it.

6. Blocked / honest status
`?autorun=msm-bench&logn=17`).

Reproduce locally (no GPU):