feat(bb/msm): benchmark-gated WebGPU MSM harness — SwiftShader correctness + Adreno baseline + memory attribution#23723

Draft
AztecBot wants to merge 9 commits into stream-walker-impl from cb/msm-opt-bucket-scan-ad1d

@AztecBot AztecBot commented May 30, 2026

WebGPU MSM optimization — measured foundation (WIP)

Working branch off stream-walker-impl. Goal: a memory- and time-optimal BN254 WebGPU MSM for laptop + mobile GPUs (Apple TBDR, Adreno, Mali), within the 32 KB / 16 KB workgroup-shared-memory and ≤100 MB total-memory constraints up to n = 2²⁰.

This session was benchmark-gated: get a real WebGPU correctness path + a real device perf/memory path working first, record baselines, and only then change the algorithm. No unmeasured claims — so this PR ships the validated harness + the baseline measurements + the analysis that should drive the next change, rather than a speculative rewrite.

1. Harness enablement (the blocker this box had)

The box has no GPU, so nothing here could validate WebGPU before. Fixed:

  • SwiftShader headless correctness (dev/msm-webgpu/scripts/swiftshader-crosscheck.mjs). The pre-existing test-msm-swiftshader.mjs couldn't get a WebGPU adapter on Linux. Working incantation (Playwright Chromium + SwiftShader Vulkan ICD):
    VK_ICD_FILENAMES=.../vk_swiftshader_icd.json
    --enable-unsafe-webgpu --enable-features=WebGPU,Vulkan
    --use-vulkan=swiftshader --use-webgpu-adapter=swiftshader
    
  • ?ref=noble cross-check mode. The threaded-WASM oracle in the harness (barretenberg.wasm.gz) is a 213-byte placeholder here, so a GPU-vs-WASM comparison is unavailable. The new mode cross-checks the WebGPU result against the in-page @noble/curves reference at the actual logN and skips WASM, which gives a valid correctness check without a GPU or the WASM oracle. LOGN_MIN is lowered from 10 to 8 so that 2⁸ is genuinely testable (it was previously clamped to 2¹⁰).
  • GPU-only bench mode (?gpu_only=1, automatic for the msm-bench autorun): skips the unavailable WASM path so timing runs don't depend on it; the Run and Run×5 buttons are enabled once the SRS is ready.
  • Per-buffer GPU-memory attribution (MsmV2Pool.statsBreakdown()): turns one opaque "working-set" number into a labeled per-buffer map, surfaced in the [mem] log and bench results.
  • Bench wall-time fix: the Run handler clears $log on click, so the bench rep loop always parsed wallMs = 0. Now the dispatch→readback wall is read from __lastGpuWallMs. This is the trustworthy timing on GPUs with broken timestamp-queries (see Adreno finding below).
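The wall-time bug in the last bullet can be illustrated with a simplified sketch. The `[gpu] returned in X ms` log format and the `__lastGpuWallMs` name come from this PR; the `BenchState` shape, the function names, and the sample values are hypothetical and do not mirror the harness code.

```typescript
// Hypothetical stand-in for the page state visible to the bench rep loop.
interface BenchState {
  log: string;              // text content of the $log element
  __lastGpuWallMs?: number; // dispatch->readback wall exposed by the fix
}

const WALL_RE = /\[gpu\] returned in ([\d.]+) ?ms/;

// Old approach: scrape the wall time out of the log text.
// Returns 0 when the line is missing, e.g. after the Run handler cleared the log.
function wallFromLog(state: BenchState): number {
  const m = WALL_RE.exec(state.log);
  return m ? Number(m[1]) : 0;
}

// Fixed approach: read the exposed number directly; 0 only if no run finished yet.
function wallFromState(state: BenchState): number {
  return state.__lastGpuWallMs ?? 0;
}

// A run finished with a 491 ms wall, then a click on Run cleared the log.
const afterClear: BenchState = { log: "", __lastGpuWallMs: 491 };
const scraped = wallFromLog(afterClear);  // the log line is gone
const direct = wallFromState(afterClear); // still available
```

Reading a number instead of scraping text also removes the dependency on log formatting, which is why it stays usable on devices where timestamp-queries cannot be trusted.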

2. Correctness — GREEN (stream-walker path, this branch)

SwiftShader (software Vulkan), WebGPU cross-checked against noble:

| logN | n | WebGPU vs noble |
|------|------|-----------------|
| 8 | 256 | ✅ agree |
| 10 | 1024 | ✅ agree |

(SwiftShader wall times are software-rasterizer numbers — correctness only, never perf.)

3. Baseline — real device (Samsung Galaxy S25 Ultra · Adreno, Chrome 145 Mobile)

Stream-walker accumulate path, n = 2¹⁷:

| metric | value | notes |
|--------|-------|-------|
| GPU wall (dispatch→readback) | ~491 ms | the reliable number ([gpu] returned in) |
| per-pass timestamp sum | ~25 000 ms | garbage: Adreno timestamp-queries are unreliable; do not use per-phase deltas on Adreno |
| GPU working set (excl. SRS) | 60.4 MiB | shared scratch + instance buffers |
| SRS pool | 8.0 MiB | n·64 B |
| total GPU buffers | 68.4 MiB | under the 100 MB budget at 2¹⁷ |

4. Memory attribution (key finding)

statsBreakdown() shows the working set is not the "~9 MB memory-light" figure the plan implies — it's ~60 MiB at 2¹⁷:

  • walkerPartials ≈ 8.5 MiB is fixed at every n — it's soaBuf(2 · STREAM_T · S) with STREAM_T = 8192 hardcoded (msm_v2.ts:1614), independent of n. At small n it's massively over-provisioned (it is the entire scratch at logN=10); at 2¹⁷ it's ~14% of scratch.
  • The rest (~52 MiB at 2¹⁷) is the n/B-scaled Pippenger buffers: bucket sums, scalar decomposition outputs, reduction buffers. These scale toward the 100 MB ceiling by 2²⁰ and are the real budget pressure.
  • The legacy V2 pair-tree / old stream-accum buffers are already 4-byte stubs (plan §14 step 9) — no easy dead-buffer win remains.
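The arithmetic behind these figures can be checked with a short sketch. The measured values are copied from this PR; the constant and function names are hypothetical, and the 2²⁰ line only evaluates the n·64 B SRS formula rather than reporting a measurement.

```typescript
const MIB = 1024 * 1024;

// SRS pool size per the baseline table: n * 64 bytes, with n = 2^logN.
function srsBytes(logN: number): number {
  return Math.pow(2, logN) * 64;
}

// Measured at n = 2^17 on the Adreno baseline (values copied from the PR text).
const workingSetMiB = 60.4;    // GPU working set, excluding the SRS pool
const walkerPartialsMiB = 8.5; // fixed-size buffer, independent of n

const srsMiB17 = srsBytes(17) / MIB;                   // exactly 8
const totalMiB17 = workingSetMiB + srsMiB17;           // about 68.4
const walkerShare = walkerPartialsMiB / workingSetMiB; // about 0.14
const scaledMiB17 = workingSetMiB - walkerPartialsMiB; // about 51.9

// SRS term alone at the n = 2^20 target (pure arithmetic, not a measurement).
const srsMiB20 = srsBytes(20) / MIB;                   // exactly 64
```

If the 100 MB budget is read as covering the SRS pool, as the baseline table does, the SRS term alone takes 64 MiB at 2²⁰. How the n/B-scaled buffers grow at that size is not measured in this PR.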

5. Where the next change should go (analysis, not yet implemented)

Ordered by leverage; each must clear SwiftShader correctness + a BrowserStack number before landing:

  1. GLV endomorphism — the standout memory and time lever. φ(P)=(βx,y) gives s = s₁+λs₂ with 128-bit halves ⇒ effective windows T ≈ 9 vs 17 at c=15 ⇒ roughly halves both the bucket-accumulation work and the n·T- and B-scaled accumulation-column memory. Was removed from the bb.js port; re-wiring on the warm SRS path is greenfield but mathematically small. (Details: src/msm_webgpu/MSM_DESIGN_ANALYSIS.md §6.4.)
  2. Right-size STREAM_T for small n: frees most of the fixed 8.5 MiB walkerPartials below ~2¹⁴ and is neutral at 2¹⁷ and above. Low risk, but the indirect-dispatch and taskCuts/walkerNodes sizing must track it.
  3. GPU partials-reduction kernel + warp-coalesced task layout (plan's deferred wins).
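The window counts quoted for the GLV item follow from a ceiling division, sketched below. The 254-bit scalar width is the BN254 scalar field size; `numWindows` is a hypothetical helper, not a function from the codebase.

```typescript
// Number of c-bit windows needed to cover a scalar of the given bit length.
function numWindows(scalarBits: number, c: number): number {
  return Math.ceil(scalarBits / c);
}

const c = 15;
const fullWindows = numWindows(254, c); // full-width BN254 scalar
const glvWindows = numWindows(128, c);  // one 128-bit half after the GLV split
const windowRatio = glvWindows / fullWindows;
```

This gives 17 and 9 windows, a ratio of about 0.53. The sketch counts windows only: a GLV split also produces two half-scalars per input point, so the net effect on per-point accumulation work depends on how the design in MSM_DESIGN_ANALYSIS.md §6.4 handles the doubled term count.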

6. Blocked / honest status

  • Only the Adreno baseline was captured this session. BrowserStack has 2 seats shared across ~10 agents; both were occupied when I went for a second device, and the rule is to wait, never preempt. Apple + Mali baselines are pending a free seat — the harness + breakdown are ready to capture them in one run each (?autorun=msm-bench&logn=17).
  • No algorithm change is included precisely because I can't yet validate one's perf/memory at scale on this contended path, and the brief forbids unmeasured claims.

Reproduce locally (no GPU):

cd barretenberg/ts && yarn dev:msm-webgpu   # :5173
VK_ICD_FILENAMES=/opt/ms-playwright/chromium-1148/chrome-linux/vk_swiftshader_icd.json \
CHROMIUM_PATH=/opt/ms-playwright/chromium-1148/chrome-linux/chrome \
  node dev/msm-webgpu/scripts/swiftshader-crosscheck.mjs 10   # → "WebGPU and noble agree"

AztecBot added the claudebox label ("Owned by claudebox. It can push to this PR.") on May 30, 2026
AztecBot added 3 commits May 30, 2026 00:13
…ness

- msm-bench autorun runs GPU-only (skips the unavailable threaded WASM
  oracle); Run/Run×5 enable on SRS-ready in gpu_only mode.
- Report per-MSM working-set memory (shared scratch + instance buffers,
  excluding the SRS pool) via MsmV2Pool/MsmV2.statsBytes(); included in
  bench results as `mem` and logged as [mem].
- Add scripts/swiftshader-bench.mjs for headless GPU-only timing validation.

Add MsmV2Pool.statsBreakdown() returning a labeled per-buffer byte map and
surface the top-6 scratch buffers in the [mem] log + bench results. Makes the
working-set total attributable to individual buffers (e.g. the fixed-size
walkerPartials vs the n/B-scaled bucket buffers) on real BrowserStack devices.

The Run handler clears $log on each click, so the bench rep loop's attempt
to parse '[gpu] returned in X ms' from new log lines always missed and
recorded wallMs=0. Expose the dispatch->readback wall as __lastGpuWallMs and
read it directly. This is the trustworthy timing on GPUs with unreliable
timestamp-queries (Adreno reported garbage per-pass deltas while the real
dispatch wall was ~491ms at 2^17).
AztecBot changed the title from "feat(bb/msm): GPU-less SwiftShader correctness harness + noble cross-check mode" to the current title on May 30, 2026
AztecBot added 5 commits May 30, 2026 01:49
…+ thread right-sizing

Make the stream-walker's occupancy and memory levers runtime-selectable so a
single device session can sweep the curve, and add the two correctness-proven
wins:

- pref_scratch placement knob (workgroup|device|private). 'private' moves the
  per-thread prefix scratch to module-scope var<private>, freeing the 16 KB
  var<workgroup> allocation (the swarm's #1 occupancy limiter) with no extra
  GPU buffer and no peak-memory regression. 'device' is also wired but needs
  an 11th storage binding (exceeds the common 10-buffer adapter ceiling).
- streamS (S) knob: S=4 halves every S-scaled scratch buffer (walkerPartials
  8.5->4.25 MiB, walkerNodes, taskCuts) — algo memory 11.8->6.8 MiB at logN=10.
- walkerTpb knob: TPB=128 for 32 KB GPUs (arch-aware occupancy).
- streamThreads knob: right-sizes the STREAM_T-scaled scratch at small n.

Harness forwards ?swalk= ?wtpb= ?pref= ?sthreads=; crosscheck script takes an
extra query string. SwiftShader+noble cross-check GREEN at logN=10 for default,
swalk=4, pref=private, pref=private&swalk=4(&wtpb=128), sthreads=4096.

…r pair

The accumulate loop ran three passes (forward prefix, inverse, backward peel)
that each re-read the slot's x-coords from point_x and re-subtracted dx. The
operands are invariant across the three passes of one iteration (the cursor
only advances at the end of the peel), so the forward pass now caches dx and
both x-operands in private arrays; the inverse pass reuses cached dx and the
peel reuses the cached x-coords, reading only the y-coords. This cuts point_x
traffic from 3× to 1× per x-coord (~1/3 of SRS reads) at zero extra GPU memory.

Gated by walkerDxCache (default true; ?dxcache=0 for A/B). SwiftShader+noble
cross-check GREEN at logN=8 and 10 for dxcache on/off, default and the
private+S=4+TPB=128 composite.
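The forward prefix, inverse, and backward peel passes described in this commit appear to follow the usual batched-inversion pattern. The sketch below is a generic CPU illustration of that pattern with a cached dx array, not the kernel's code: the toy prime, function names, and read counter are all assumptions, and the point addition that would consume the inverses and the y-coords is omitted. It assumes every pair has distinct x-coords, so no dx is zero.

```typescript
const P = 65521; // toy prime so products fit in a double; NOT the BN254 base field

const mod = (a: number): number => ((a % P) + P) % P;
const mul = (a: number, b: number): number => (a * b) % P;

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mul(result, base);
    base = mul(base, base);
    e >>= 1;
  }
  return result;
}

// Inverts every dx_i = x2_i - x1_i with a single field inversion.
// xReads counts loads from the (hypothetical) x-coordinate arrays.
function batchInvDx(x1: number[], x2: number[]): { invDx: number[]; xReads: number } {
  const n = x1.length;
  const dx = new Array<number>(n);     // cached differences
  const prefix = new Array<number>(n); // prefix[i] = dx_0 * ... * dx_i
  let xReads = 0;
  let acc = 1;
  // Forward pass: the only place the x-coords are loaded.
  for (let i = 0; i < n; i++) {
    dx[i] = mod(x2[i] - x1[i]);
    xReads += 2;
    acc = mul(acc, dx[i]);
    prefix[i] = acc;
  }
  // Inverse pass: one inversion of the running product.
  let accInv = inv(acc);
  // Backward peel: recover each 1/dx_i from cached values, with no x reloads.
  const invDx = new Array<number>(n);
  for (let i = n - 1; i >= 0; i--) {
    invDx[i] = i === 0 ? accInv : mul(accInv, prefix[i - 1]);
    accInv = mul(accInv, dx[i]);
  }
  return { invDx, xReads };
}

const demo = batchInvDx([1, 5, 10], [4, 9, 20]); // dx = [3, 4, 10]
```

In this sketch each x-coord is loaded once, in the forward pass. A variant that recomputed dx in every pass would load each x-coord three times, which is the traffic the commit removes.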
- run-browserstack.mjs --knobs forwards arbitrary walker query params
  (pref/swalk/wtpb/dxcache/sthreads) into the autorun URL so one seat can
  benchmark any walker config.
- msm-bench autorun now posts /progress rows (start, srs-wait, warmup, per-rep)
  so the runner detects the runId and keeps its stall watchdog fed through the
  slow SRS+build phase on real devices; previously it posted only a terminal
  /results row, tripping the 120s first-progress watchdog at logN=17.
- swiftshader-bench.mjs takes an extra query-string arg for local A/B.

The dev page downloaded the entire 2^20-point CRS (~32 MB compressed) from the
external CDN on every boot, before the autorun could start. Through a slow
tunnel on a real mobile device this stalled boot indefinitely, so the
BrowserStack bench autorun never reached its first /progress post and the
runner's watchdog killed the worker. Boot now loads 1<<min(LOGN_MAX,
max(10,logN)) points from ?logn (or ?srsn=), cutting the logN=17 download to
~4 MB. SwiftShader+noble cross-check still GREEN.
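The boot-time clamp in this commit can be written out as a small helper. The expression is taken from the commit message; the function name is hypothetical, and LOGN_MAX = 20 is an assumption based on the 2^20-point CRS mentioned above.

```typescript
// Number of SRS points to load at boot: 1 << min(LOGN_MAX, max(10, logN)).
function srsPointsToLoad(logN: number, lognMax: number): number {
  return 1 << Math.min(lognMax, Math.max(10, logN));
}

// Assumption: LOGN_MAX = 20, matching the 2^20-point CRS.
const LOGN_MAX = 20;
const fullPoints = srsPointsToLoad(20, LOGN_MAX);  // whole CRS
const benchPoints = srsPointsToLoad(17, LOGN_MAX); // the logN=17 bench
const smallPoints = srsPointsToLoad(8, LOGN_MAX);  // clamped up to 2^10
const downloadFraction = benchPoints / fullPoints;
```

At logN=17 the fraction is 1/8, consistent with the drop from ~32 MB to ~4 MB reported in the commit.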
…ce session

Adds an autorun mode that sweeps a list of walker configs (?sweep=pref-S-TPB-dx,
e.g. workgroup-8-64-0,private-4-128-1) in a single page load. Each config tears
down and rebuilds MsmV2 + the pool so its time and (deterministic) memory
accounting are correct for that config — the pool grows but never shrinks
otherwise. Lets one contended BrowserStack device allocation map a whole
parameter curve instead of one config per seat. SwiftShader-verified: two-config
sweep reports walkerPartials 8.5 vs 4.25 MiB and algo 11.8 vs 6.8 MiB correctly.
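A minimal sketch of how a ?sweep= value in the pref-S-TPB-dx form could be parsed into per-config knobs. The field order and the example string come from the commit message; the `WalkerConfig` interface, the `parseSweep` name, and the validation rules are hypothetical and may differ from the harness.

```typescript
type PrefPlacement = "workgroup" | "device" | "private";

// Hypothetical config shape; field names echo the knobs named in this PR.
interface WalkerConfig {
  pref: PrefPlacement; // pref_scratch placement
  streamS: number;     // S
  walkerTpb: number;   // threads per workgroup
  dxCache: boolean;    // walkerDxCache
}

// Parses a comma-separated list of pref-S-TPB-dx entries.
function parseSweep(spec: string): WalkerConfig[] {
  return spec
    .split(",")
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const parts = entry.split("-");
      if (parts.length !== 4) throw new Error("bad sweep entry: " + entry);
      const [pref, s, tpb, dx] = parts;
      if (pref !== "workgroup" && pref !== "device" && pref !== "private") {
        throw new Error("bad pref placement: " + pref);
      }
      const streamS = Number(s);
      const walkerTpb = Number(tpb);
      if (!Number.isInteger(streamS) || !Number.isInteger(walkerTpb)) {
        throw new Error("S and TPB must be integers: " + entry);
      }
      if (dx !== "0" && dx !== "1") throw new Error("dx must be 0 or 1: " + entry);
      return { pref, streamS, walkerTpb, dxCache: dx === "1" };
    });
}

const configs = parseSweep("workgroup-8-64-0,private-4-128-1");
```

Rejecting malformed entries up front keeps a typo from silently consuming part of a contended device session.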