feat(bb/msm): benchmark-gated WebGPU MSM harness — SwiftShader correctness + Adreno baseline + memory attribution by AztecBot · Pull Request #23723 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:01:11Z

WebGPU MSM optimization — measured foundation (WIP)

Working branch off stream-walker-impl. Goal: a memory- and time-optimal BN254 WebGPU MSM for laptop + mobile GPUs (Apple TBDR, Adreno, Mali), within the 32 KB / 16 KB workgroup-shared-memory and ≤100 MB total-memory constraints up to n = 2²⁰.

This session was benchmark-gated: get a real WebGPU correctness path + a real device perf/memory path working first, record baselines, and only then change the algorithm. No unmeasured claims — so this PR ships the validated harness + the baseline measurements + the analysis that should drive the next change, rather than a speculative rewrite.

1. Harness enablement (the blocker this box had)

The box has no GPU. Nothing here could validate WebGPU before. Fixed:

- SwiftShader headless correctness (dev/msm-webgpu/scripts/swiftshader-crosscheck.mjs). The pre-existing test-msm-swiftshader.mjs couldn't get a WebGPU adapter on Linux. Working incantation (Playwright Chromium + SwiftShader Vulkan ICD):
```
VK_ICD_FILENAMES=.../vk_swiftshader_icd.json
--enable-unsafe-webgpu --enable-features=WebGPU,Vulkan
--use-vulkan=swiftshader --use-webgpu-adapter=swiftshader
```
- ?ref=noble cross-check mode. The threaded-WASM oracle in the harness (barretenberg.wasm.gz) is a 213-byte placeholder here, so GPU-vs-WASM is unavailable. The new mode cross-checks the WebGPU result against the in-page @noble/curves reference at the actual logN and skips WASM, which gives a valid correctness check on a box with no GPU. LOGN_MIN is lowered from 10 to 8 so that 2⁸ is genuinely testable (it was previously clamped to 2¹⁰).
- GPU-only bench mode (?gpu_only=1, automatic for the msm-bench autorun): skips the unavailable WASM path so timing runs don't depend on it; the Run and Run×5 controls are enabled once the SRS is ready.
- Per-buffer GPU-memory attribution (MsmV2Pool.statsBreakdown()): turns one opaque "working-set" number into a labeled per-buffer map, surfaced in the [mem] log and bench results.
- Bench wall-time fix: the Run handler clears $log on click, so the bench rep loop always parsed wallMs = 0. The dispatch→readback wall is now read from __lastGpuWallMs. This is the trustworthy timing on GPUs with broken timestamp-queries (see the Adreno finding below).
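The SwiftShader launch settings listed above can be kept in one place. This is a minimal sketch, assuming a launcher that accepts an `args` array and an `env` map; the helper name `swiftShaderLaunchOptions` and its return shape are hypothetical, and only the flag strings and the `VK_ICD_FILENAMES` variable come from this PR.

```typescript
// Hypothetical helper: gathers the SwiftShader launch settings quoted above.
// The function name and the LaunchOptions shape are illustrative assumptions.
interface LaunchOptions {
  args: string[];
  env: Record<string, string>;
}

function swiftShaderLaunchOptions(icdJsonPath: string): LaunchOptions {
  return {
    args: [
      '--enable-unsafe-webgpu',
      '--enable-features=WebGPU,Vulkan',
      '--use-vulkan=swiftshader',
      '--use-webgpu-adapter=swiftshader',
    ],
    // Tells the Vulkan loader which ICD manifest to use (here: SwiftShader's).
    env: { VK_ICD_FILENAMES: icdJsonPath },
  };
}
```

The caller supplies the path to vk_swiftshader_icd.json, since it depends on where the Chromium build is installed.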

2. Correctness — GREEN (stream-walker path, this branch)

SwiftShader (software Vulkan), WebGPU cross-checked against noble:

| logN | n | WebGPU vs noble |
| --- | --- | --- |
| 8 | 256 | ✅ agree |
| 10 | 1024 | ✅ agree |

(SwiftShader wall times are software-rasterizer numbers — correctness only, never perf.)

3. Baseline — real device (Samsung Galaxy S25 Ultra · Adreno, Chrome 145 Mobile)

Stream-walker accumulate path, n = 2¹⁷:

| metric | value | notes |
| --- | --- | --- |
| GPU wall (dispatch→readback) | ~491 ms | the reliable number (`[gpu] returned in`) |
| per-pass timestamp sum | ~25 000 ms | garbage: Adreno timestamp-queries are unreliable; do not use per-phase deltas on Adreno |
| GPU working set (excl. SRS) | 60.4 MiB | shared scratch + instance buffers |
| SRS pool | 8.0 MiB | n·64 B |
| total GPU buffers | 68.4 MiB | under the 100 MB budget at 2¹⁷ |
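The memory rows above follow from simple arithmetic. The sketch below recomputes them from the measured working set; the function names are hypothetical, and reading the "100 MB" budget as 10⁸ bytes is an assumption (the PR does not say whether it means MB or MiB).

```typescript
const MIB = 1024 * 1024;

// SRS pool size: n = 2^logN points at 64 bytes each (the "n·64 B" row).
function srsPoolBytes(logN: number): number {
  return Math.pow(2, logN) * 64;
}

// Total GPU buffers = measured working set + computed SRS pool, in MiB.
function totalGpuMiB(workingSetMiB: number, logN: number): number {
  return workingSetMiB + srsPoolBytes(logN) / MIB;
}

// Budget check. Assumption: "100 MB" means 100 * 10^6 bytes.
function withinBudget(totalMiB: number, budgetBytes: number = 100e6): boolean {
  return totalMiB * MIB <= budgetBytes;
}

const srsMiB = srsPoolBytes(17) / MIB;    // 8 MiB at n = 2^17
const totalMiB = totalGpuMiB(60.4, 17);   // about 68.4 MiB
const fits = withinBudget(totalMiB);      // true
```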

4. Memory attribution (key finding)

statsBreakdown() shows the working set is not the "~9 MB memory-light" figure the plan implies — it's ~60 MiB at 2¹⁷:

- walkerPartials ≈ 8.5 MiB is fixed at every n: it's soaBuf(2 · STREAM_T · S) with STREAM_T = 8192 hardcoded (msm_v2.ts:1614), independent of n. At small n it's massively over-provisioned (it is the entire scratch at logN=10); at 2¹⁷ it's ~14% of scratch.
- The rest (~52 MiB at 2¹⁷) is the n/B-scaled Pippenger buffers: bucket sums, scalar decomposition outputs, reduction buffers. These scale toward the 100 MB ceiling by 2²⁰ and are the real budget pressure.
- The legacy V2 pair-tree / old stream-accum buffers are already 4-byte stubs (plan §14 step 9), so no easy dead-buffer win remains.
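To make the attribution above concrete, here is a small sketch that ranks a labeled buffer map and computes each buffer's share. It assumes a plain label-to-size map in MiB; the real return shape of statsBreakdown() is not shown in this PR, and `nOverBScaledBuffers` is a hypothetical label for the remainder (60.4 − 8.5 MiB).

```typescript
// Illustrative input only: two entries derived from the figures above.
const breakdownMiB: Record<string, number> = {
  walkerPartials: 8.5,        // fixed size, independent of n
  nOverBScaledBuffers: 51.9,  // hypothetical label for everything else
};

interface BufferShare {
  label: string;
  mib: number;
  share: number; // fraction of the working set, 0..1
}

// Returns the buffers sorted largest first, each with its share of the total.
function sharesBySize(breakdown: Record<string, number>): BufferShare[] {
  const labels = Object.keys(breakdown);
  let total = 0;
  for (const label of labels) total += breakdown[label];
  return labels
    .map((label) => ({ label, mib: breakdown[label], share: breakdown[label] / total }))
    .sort((a, b) => b.mib - a.mib);
}

const ranked = sharesBySize(breakdownMiB);
// ranked[1] is walkerPartials with a share of about 0.14 at 2^17
```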

5. Where the next change should go (analysis, not yet implemented)

Ordered by leverage; each must clear SwiftShader correctness + a BrowserStack number before landing:

- GLV endomorphism: the standout memory and time lever. φ(P)=(βx,y) gives s = s₁+λs₂ with 128-bit halves ⇒ effective windows T ≈ 9 vs 17 at c=15 ⇒ roughly halves both the bucket-accumulation work and the n·T- and B-scaled accumulation-column memory. Was removed from the bb.js port; re-wiring on the warm SRS path is greenfield but mathematically small. (Details: src/msm_webgpu/MSM_DESIGN_ANALYSIS.md §6.4.)
- Right-size STREAM_T for small n: frees most of the fixed 8.5 MiB walkerPartials below ~2¹⁴. Neutral at 2¹⁷ and above. Low risk but needs the indirect-dispatch + taskCuts/walkerNodes sizing to track it.
- GPU partials-reduction kernel + warp-coalesced task layout (plan's deferred wins).
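The window counts quoted for GLV can be checked with one line of arithmetic. This sketch assumes the usual rule that a b-bit scalar split into c-bit windows needs ⌈b/c⌉ windows; the 254-bit figure is the BN254 scalar-field size, and `windowCount` is a hypothetical helper name.

```typescript
// Number of c-bit windows needed to cover a scalar of the given bit length.
function windowCount(scalarBits: number, c: number): number {
  return Math.ceil(scalarBits / c);
}

const fullScalarWindows = windowCount(254, 15); // 17 windows without GLV
const glvHalfWindows = windowCount(128, 15);    // 9 windows per 128-bit half
```

This only reproduces the "T ≈ 9 vs 17 at c=15" figures; how much work and memory that saves depends on the kernel layout and still has to be measured.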

6. Blocked / honest status

Only the Adreno baseline was captured this session. BrowserStack has 2 seats shared across ~10 agents; both were occupied when I went for a second device, and the rule is to wait, never preempt. Apple + Mali baselines are pending a free seat — the harness + breakdown are ready to capture them in one run each (?autorun=msm-bench&logn=17).
No algorithm change is included precisely because I can't yet validate one's perf/memory at scale on this contended path, and the brief forbids unmeasured claims.

Reproduce locally (no GPU):

cd barretenberg/ts && yarn dev:msm-webgpu   # :5173
VK_ICD_FILENAMES=/opt/ms-playwright/chromium-1148/chrome-linux/vk_swiftshader_icd.json \
CHROMIUM_PATH=/opt/ms-playwright/chromium-1148/chrome-linux/chrome \
  node dev/msm-webgpu/scripts/swiftshader-crosscheck.mjs 10   # → "WebGPU and noble agree"

…check mode

…ness

- msm-bench autorun runs GPU-only (skips the unavailable threaded WASM oracle); Run/Run×5 enable on SRS-ready in gpu_only mode.
- Report per-MSM working-set memory (shared scratch + instance buffers, excluding the SRS pool) via MsmV2Pool/MsmV2.statsBytes(); included in bench results as `mem` and logged as [mem].
- Add scripts/swiftshader-bench.mjs for headless GPU-only timing validation.

Add MsmV2Pool.statsBreakdown() returning a labeled per-buffer byte map and surface the top-6 scratch buffers in the [mem] log + bench results. Makes the working-set total attributable to individual buffers (e.g. the fixed-size walkerPartials vs the n/B-scaled bucket buffers) on real BrowserStack devices.

The Run handler clears $log on each click, so the bench rep loop's attempt to parse '[gpu] returned in X ms' from new log lines always missed and recorded wallMs=0. Expose the dispatch->readback wall as __lastGpuWallMs and read it directly. This is the trustworthy timing on GPUs with unreliable timestamp-queries (Adreno reported garbage per-pass deltas while the real dispatch wall was ~491ms at 2^17).
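The failure mode described above is easy to reproduce in isolation. The sketch below is an assumption-laden model, not the harness code: `parseWallMs` and the `timing` holder are hypothetical, and only the `[gpu] returned in X ms` line format and the idea of reading a recorded value come from this commit message.

```typescript
// Parses the most recent "[gpu] returned in X ms" line; returns 0 if none exists.
function parseWallMs(logLines: string[]): number {
  for (let i = logLines.length - 1; i >= 0; i--) {
    const m = /\[gpu\] returned in ([\d.]+) ?ms/.exec(logLines[i]);
    if (m) return parseFloat(m[1]);
  }
  return 0;
}

// Hypothetical stand-in for the page-level __lastGpuWallMs value.
const timing = { lastGpuWallMs: 0 };

// Called where the dispatch→readback wall is measured.
function recordWall(ms: number): void {
  timing.lastGpuWallMs = ms;
}

// If the log was cleared before the rep loop looks at it, parsing finds nothing.
const fromClearedLog = parseWallMs([]);   // 0
recordWall(491);
const fromRecorder = timing.lastGpuWallMs; // 491
```

Reading the recorded value removes the dependency on log contents and on when the log is cleared.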

…+ thread right-sizing

Make the stream-walker's occupancy and memory levers runtime-selectable so a single device session can sweep the curve, and add the two correctness-proven wins:

- pref_scratch placement knob (workgroup|device|private). 'private' moves the per-thread prefix scratch to module-scope var<private>, freeing the 16 KB var<workgroup> allocation (the swarm's #1 occupancy limiter) with no extra GPU buffer and no peak-memory regression. 'device' is also wired but needs an 11th storage binding (exceeds the common 10-buffer adapter ceiling).
- streamS (S) knob: S=4 halves every S-scaled scratch buffer (walkerPartials 8.5->4.25 MiB, walkerNodes, taskCuts); algo memory 11.8->6.8 MiB at logN=10.
- walkerTpb knob: TPB=128 for 32 KB GPUs (arch-aware occupancy).
- streamThreads knob: right-sizes the STREAM_T-scaled scratch at small n.

Harness forwards ?swalk= ?wtpb= ?pref= ?sthreads=; crosscheck script takes an extra query string. SwiftShader+noble cross-check GREEN at logN=10 for default, swalk=4, pref=private, pref=private&swalk=4(&wtpb=128), sthreads=4096.
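The two totals reported for the S knob allow a rough split of algo memory into an S-scaled part and a fixed part. This is an inference, not a measurement: it assumes that halving S halves exactly the S-scaled buffers and changes nothing else, and `splitByHalving` is a hypothetical helper.

```typescript
// If total(S) = fixed + sScaled and halving S halves only sScaled,
// then totalBefore - totalAfter = sScaled / 2.
function splitByHalving(
  totalBefore: number,
  totalAfter: number,
): { sScaled: number; fixed: number } {
  const sScaled = 2 * (totalBefore - totalAfter);
  return { sScaled, fixed: totalBefore - sScaled };
}

// Figures from the commit message: 11.8 MiB -> 6.8 MiB at logN=10.
const split = splitByHalving(11.8, 6.8);
// split.sScaled is about 10 MiB, split.fixed is about 1.8 MiB
```

Under that assumption the S-scaled buffers total about 10 MiB at logN=10, which is consistent with walkerPartials alone being 8.5 MiB.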

…r pair

The accumulate loop ran three passes (forward prefix, inverse, backward peel) that each re-read the slot's x-coords from point_x and re-subtracted dx. The operands are invariant across the three passes of one iteration (the cursor only advances at the end of the peel), so the forward pass now caches dx and both x-operands in private arrays; the inverse pass reuses cached dx and the peel reuses the cached x-coords, reading only the y-coords. This cuts point_x traffic from 3× to 1× per x-coord (~1/3 of SRS reads) at zero extra GPU memory. Gated by walkerDxCache (default true; ?dxcache=0 for A/B). SwiftShader+noble cross-check GREEN at logN=8 and 10 for dxcache on/off, default and the private+S=4+TPB=128 composite.
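The caching idea can be illustrated on the CPU with a batched inversion of the dx values. This is a simplified sketch, not the shader: it uses a toy 16-bit prime instead of the BN254 base field, it models only two passes that touch dx (forward and peel) rather than the three described above, and every name in it is hypothetical. It assumes all dx values are non-zero.

```typescript
// Toy prime field standing in for the BN254 base field (assumption).
const P = 65521;
const mod = (a: number): number => ((a % P) + P) % P;

function modPow(base: number, exp: number): number {
  let result = 1;
  let b = mod(base);
  let e = exp;
  while (e > 0) {
    if (e % 2 === 1) result = (result * b) % P;
    b = (b * b) % P;
    e = Math.floor(e / 2);
  }
  return result;
}

// Inverts dx_k = x2_k - x1_k for every pair using one field inversion.
// `pairs` holds [index of x1, index of x2] into pointX.
// Returns the inverses and how many times pointX was read.
function batchInvertDx(
  pointX: number[],
  pairs: Array<[number, number]>,
  cacheDx: boolean,
): { inverses: number[]; reads: number } {
  let reads = 0;
  const readX = (i: number): number => { reads++; return pointX[i]; };
  const dxOf = (k: number): number => mod(readX(pairs[k][1]) - readX(pairs[k][0]));

  const cached: number[] = [];
  const prefix: number[] = [];
  let acc = 1;
  // Forward pass: prefix[k] = dx_0 * ... * dx_{k-1}.
  for (let k = 0; k < pairs.length; k++) {
    const dx = dxOf(k);
    if (cacheDx) cached.push(dx);
    prefix.push(acc);
    acc = (acc * dx) % P;
  }
  // Inverse pass: a single inversion of the full product (Fermat).
  let inv = modPow(acc, P - 2);
  // Backward peel: recover 1/dx_k from last to first.
  const inverses: number[] = new Array(pairs.length);
  for (let k = pairs.length - 1; k >= 0; k--) {
    const dx = cacheDx ? cached[k] : dxOf(k); // re-reads pointX only when not cached
    inverses[k] = (inv * prefix[k]) % P;
    inv = (inv * dx) % P;
  }
  return { inverses, reads };
}
```

In this two-pass model, caching halves the reads of the x array; the commit's 3× to 1× figure refers to the real three-pass kernel.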

- run-browserstack.mjs --knobs forwards arbitrary walker query params (pref/swalk/wtpb/dxcache/sthreads) into the autorun URL so one seat can benchmark any walker config.
- msm-bench autorun now posts /progress rows (start, srs-wait, warmup, per-rep) so the runner detects the runId and keeps its stall watchdog fed through the slow SRS+build phase on real devices; previously it posted only a terminal /results row, tripping the 120s first-progress watchdog at logN=17.
- swiftshader-bench.mjs takes an extra query-string arg for local A/B.
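The watchdog behaviour described above can be modelled in a few lines. This is a generic sketch, assuming a watchdog that trips when no progress row has arrived within a timeout; the class name, its methods, and the explicit-clock design are hypothetical and are not taken from run-browserstack.mjs.

```typescript
// Hypothetical stall watchdog: stalled when no progress arrived within timeoutMs.
class StallWatchdog {
  private timeoutMs: number;
  private lastFedMs: number;

  constructor(timeoutMs: number, startMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastFedMs = startMs;
  }

  // Call for every /progress row (start, srs-wait, warmup, per-rep).
  feed(nowMs: number): void {
    this.lastFedMs = nowMs;
  }

  isStalled(nowMs: number): boolean {
    return nowMs - this.lastFedMs > this.timeoutMs;
  }
}

// With only a terminal result row, nothing feeds the watchdog during a long build.
const silent = new StallWatchdog(120000, 0);
const silentStalled = silent.isStalled(121000); // true

// With intermediate progress rows, the same run stays alive.
const fed = new StallWatchdog(120000, 0);
fed.feed(100000);
const fedStalled = fed.isStalled(121000); // false
```

Timestamps are passed in explicitly so the behaviour can be checked without real timers.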

The dev page downloaded the entire 2^20-point CRS (~32 MB compressed) from the external CDN on every boot, before the autorun could start. Through a slow tunnel on a real mobile device this stalled boot indefinitely, so the BrowserStack bench autorun never reached its first /progress post and the runner's watchdog killed the worker. Boot now loads 1<<min(LOGN_MAX, max(10,logN)) points from ?logn (or ?srsn=), cutting the logN=17 download to ~4 MB. SwiftShader+noble cross-check still GREEN.
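The boot-time point count is the clamp quoted above. The sketch below evaluates it for a few inputs; `srsPointsToLoad` is a hypothetical name, and LOGN_MAX = 20 is an assumption based on the 2^20-point CRS mentioned in the message.

```typescript
// Assumption: LOGN_MAX corresponds to the full 2^20-point CRS.
const LOGN_MAX = 20;

// Points to load at boot: 1 << min(LOGN_MAX, max(10, logN)).
function srsPointsToLoad(logN: number): number {
  return 1 << Math.min(LOGN_MAX, Math.max(10, logN));
}

const at17 = srsPointsToLoad(17); // 131072 points instead of 1048576
const at8 = srsPointsToLoad(8);   // 1024 (floor of 2^10 applies)
const at25 = srsPointsToLoad(25); // 1048576 (capped at 2^LOGN_MAX)
```

At logN=17 this loads one eighth of the full CRS, which matches the drop from ~32 MB to ~4 MB reported above.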

…ce session

Adds an autorun mode that sweeps a list of walker configs (?sweep=pref-S-TPB-dx, e.g. workgroup-8-64-0,private-4-128-1) in a single page load. Each config tears down and rebuilds MsmV2 + the pool so its time and (deterministic) memory accounting are correct for that config; otherwise the pool grows but never shrinks. Lets one contended BrowserStack device allocation map a whole parameter curve instead of one config per seat. SwiftShader-verified: two-config sweep reports walkerPartials 8.5 vs 4.25 MiB and algo 11.8 vs 6.8 MiB correctly.
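A parser for the sweep string could look like the sketch below. It assumes the `pref-S-TPB-dx` field order given above and that dx is encoded as 0 or 1; the `WalkerConfig` type, its field names, and `parseSweep` are hypothetical and not taken from the harness.

```typescript
type PrefPlacement = 'workgroup' | 'device' | 'private';

interface WalkerConfig {
  pref: PrefPlacement;
  streamS: number;
  walkerTpb: number;
  dxCache: boolean;
}

const PLACEMENTS: PrefPlacement[] = ['workgroup', 'device', 'private'];

// Parses e.g. "workgroup-8-64-0,private-4-128-1" into one config per entry.
function parseSweep(sweep: string): WalkerConfig[] {
  return sweep.split(',').map((entry) => {
    const parts = entry.split('-');
    if (parts.length !== 4) throw new Error(`bad sweep entry: ${entry}`);
    const idx = PLACEMENTS.indexOf(parts[0] as PrefPlacement);
    if (idx < 0) throw new Error(`unknown pref placement: ${parts[0]}`);
    const streamS = parseInt(parts[1], 10);
    const walkerTpb = parseInt(parts[2], 10);
    if (isNaN(streamS) || isNaN(walkerTpb)) throw new Error(`bad number in: ${entry}`);
    return { pref: PLACEMENTS[idx], streamS, walkerTpb, dxCache: parts[3] === '1' };
  });
}

const configs = parseSweep('workgroup-8-64-0,private-4-128-1');
// configs[1] is { pref: 'private', streamS: 4, walkerTpb: 128, dxCache: true }
```

Rejecting malformed entries up front avoids spending a contended device session on a config that was mistyped.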

feat(bb/msm): GPU-less SwiftShader correctness harness + noble cross-check mode (5cac1ed)

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) May 30, 2026

AztecBot added 3 commits May 30, 2026 00:13

AztecBot changed the title ~~feat(bb/msm): GPU-less SwiftShader correctness harness + noble cross-check mode~~ feat(bb/msm): benchmark-gated WebGPU MSM harness — SwiftShader correctness + Adreno baseline + memory attribution May 30, 2026

AztecBot added 5 commits May 30, 2026 01:49

AztecBot wants to merge 9 commits into stream-walker-impl from cb/msm-opt-bucket-scan-ad1d
