perf(bb/msm): stream-walker occupancy lever — private pref + TPB96 (Apple M2 −7% GPU) by AztecBot · Pull Request #23724 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T00:03:37Z

Summary

Self-review + rework of the hybrid stream-walker. The original PR shipped two
levers (dx-cache + an S/TPB Pareto knob) that together moved the needle only
~1–2%. A measured A/B ladder on Apple M2 (one BrowserStack session per
ladder, so no cross-session noise) shows the real bottleneck is occupancy,
and pins down a config that is −9% on the dominant stream_walker phase and
−7% on full-MSM GPU time, cross-checked correct.

The headline change: move the per-thread prefix scratch out of workgroup memory
and into var<private>. That removes the workgroup-storage cap that throttled
resident workgroups; TPB is then raised to its occupancy sweet spot (96).

Measured A/B (Apple M2, Chrome 148, logn=17, S=8, reps=8)

stream_walker is the dominant GPU phase; the rest (preprocess ≈3.4,
walker_combine ≈6, reduce ≈8.5 ms) is stable across variants.

| variant | stream_walker (ms) | GPU total (ms) | Δ walker |
|---|---|---|---|
| baseline (3-pass, dx-cache, workgroup pref, TPB 64) | 66.8 | 85.3 | — |
| 3-pass, dx-cache, private pref, TPB 128 | 61.3 | 79.6 | −8.2% |
| 3-pass, dx-cache, private pref, TPB 96 (new default) | 60.8 | 79.2 | −9.0% |
| 3-pass, no dx-cache, private pref, TPB 96 | 61.2 | 79.6 | −8.4% |
| 3-pass, dx-cache, private pref, TPB 160 | 63.9 | 82.2 | −4.3% |
| fused inverse+peel, private pref, TPB 128 | 63.1 | 81.2 | −5.4% |
| fused inverse+peel, workgroup pref, TPB 64 | 68.6 | 87.0 | +2.8% (worse) |

Direction and magnitude reproduced across 3 independent sessions.
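
The Δ walker column is a relative change against the baseline row. A minimal sketch of that arithmetic, with the timings hard-coded from the table; the `Row` type and `deltaPct` helper are illustrative names, not part of the PR. Because the inputs are the rounded table values, a recomputed entry can differ from a reported one in the last digit.

```typescript
// Timings in ms, copied from the A/B table (Apple M2, logn=17, S=8, reps=8).
interface Row {
  variant: string;
  walkerMs: number;
  gpuTotalMs: number;
}

const baseline: Row = { variant: "workgroup pref, TPB 64", walkerMs: 66.8, gpuTotalMs: 85.3 };
const newDefault: Row = { variant: "private pref, TPB 96", walkerMs: 60.8, gpuTotalMs: 79.2 };

// Relative change in percent; negative means faster than the baseline.
function deltaPct(value: number, base: number): number {
  return ((value - base) / base) * 100;
}

const walkerDelta = deltaPct(newDefault.walkerMs, baseline.walkerMs);
const totalDelta = deltaPct(newDefault.gpuTotalMs, baseline.gpuTotalMs);

console.log(`stream_walker: ${walkerDelta.toFixed(1)}%`);
console.log(`GPU total:     ${totalDelta.toFixed(1)}%`);
```

These two values correspond to the headline figures in the Summary (roughly −9% on stream_walker and −7% on full-MSM GPU time).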

What changed and why

- Occupancy lever (the win): prefMem: 'private' + TPB 96. pref_scratch is
  per-thread with no cross-thread sharing, so it does not need to live in
  workgroup memory. Moving it to var<private> takes it out from under the
  maxComputeWorkgroupStorageSize cap (the original reason TPB was pinned at 64
  for Mali's 16 KB), which lets TPB rise. TPB 96 is the sweet spot: 64 leaves
  occupancy unused, while ≥160 becomes register-bound and gives back about half
  the gain (−4.3% vs −9.0%). This is the whole improvement. Output is
  numerically identical (cross-checked).
- Fusion of the inverse + peel passes: measured loss, dropped as default.
  Merging the two backward passes removes the inv_dx round-trip through
  pref_scratch and the dx_cache registers (deriving dx from the affine
  add's own loads), so on paper it is fewer instructions and fewer registers.
  But it raises peak live-register pressure inside the merged loop, and on an
  occupancy-bound kernel that lowers the resident-workgroup count, making it
  slower at every measured point on Apple. Kept behind walkerFused (default
  off) for architectures with different trade-offs; it is reported here as a
  negative result.
- dx-cache stays on (a marginal ~0.4 ms gain at TPB 96).
- S framing dropped. S is a memory knob, not a speed knob: the walker is
  occupancy/memory-bound and flat across S (consistent with prior threads). The
  design doc is rewritten around the occupancy finding.
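
To make the address-space change concrete, here is a hedged sketch of how a shader generator might emit the `pref_scratch` declaration for either setting. The function name, the `array<u32, 8>` slot layout (chosen only so that a slot totals 32 B) and the array shapes are assumptions for illustration; the PR's actual WGSL is not shown in this description.

```typescript
type PrefMem = "workgroup" | "private";

// Hypothetical generator: returns a WGSL module-scope declaration for pref_scratch.
function prefScratchDecl(prefMem: PrefMem, tpb: number, s: number): string {
  const slot = "array<u32, 8>"; // assumed 32-byte slot (8 x 4 B)
  if (prefMem === "workgroup") {
    // One shared array covering every thread in the workgroup: tpb * s slots,
    // all of which count against maxComputeWorkgroupStorageSize.
    return `var<workgroup> pref_scratch: array<${slot}, ${tpb * s}>;`;
  }
  // Each invocation gets its own s slots; nothing is charged to workgroup storage.
  return `var<private> pref_scratch: array<${slot}, ${s}>;`;
}

console.log(prefScratchDecl("workgroup", 64, 8));
console.log(prefScratchDecl("private", 96, 8));
```

In the private variant the declared size no longer depends on TPB, which is why the workgroup-storage limit stops constraining the workgroup size.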

Correctness — GREEN

GPU-vs-Noble (CPU pippenger) under SwiftShader at logn=8 and 10, across the full
knob matrix: S∈{2,4,8}, TPB∈{64,96,128,160,256}, dx-cache on/off, fused on/off,
pref workgroup/private.

[noble-check] logN=8  PASS (WebGPU matches Noble)
[noble-check] logN=10 PASS (WebGPU matches Noble)
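
The size of that knob matrix follows directly from the listed values. A small sketch that enumerates it; the `KnobConfig` shape and its field names are hypothetical and only mirror the knobs named above.

```typescript
interface KnobConfig {
  s: number;
  tpb: number;
  cacheDx: boolean;
  fused: boolean;
  prefMem: "workgroup" | "private";
}

// Cartesian product of every knob value listed in the correctness section.
function knobMatrix(): KnobConfig[] {
  const configs: KnobConfig[] = [];
  for (const s of [2, 4, 8]) {
    for (const tpb of [64, 96, 128, 160, 256]) {
      for (const cacheDx of [true, false]) {
        for (const fused of [true, false]) {
          for (const prefMem of ["workgroup", "private"] as const) {
            configs.push({ s, tpb, cacheDx, fused, prefMem });
          }
        }
      }
    }
  }
  return configs;
}

const matrix = knobMatrix();
console.log(matrix.length); // 3 * 5 * 2 * 2 * 2 = 120
```

If every combination is run at both logn=8 and logn=10, that is 240 GPU-vs-Noble comparisons.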

Knobs (`MsmConfig` / URL)

walkerPrefMem (?prefmem=, default private), walkerTpb (?wtpb=, default
96), walkerCacheDx (?cachedx=, default on), walkerFused (?fused=, default
off), walkerS (?ws=, default 8 — memory only). The bench autorun
?autorun=msm-gpu-bench&fusedab=1 runs the whole A/B ladder in one session.
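
As an illustration of how the URL parameters map onto the config fields, here is a hedged sketch of a query-string parser. `WalkerKnobs` is a hypothetical subset of `MsmConfig`, `knobsFromQuery` is an invented helper, and the 0/1 encoding for `fused` is assumed by analogy with `?cachedx=0/1`; only the parameter names and defaults are taken from the list above.

```typescript
type PrefMem = "workgroup" | "private";

interface WalkerKnobs {
  walkerPrefMem: PrefMem;
  walkerTpb: number;
  walkerCacheDx: boolean;
  walkerFused: boolean;
  walkerS: number;
}

// Defaults as listed in the PR description.
const DEFAULTS: WalkerKnobs = {
  walkerPrefMem: "private",
  walkerTpb: 96,
  walkerCacheDx: true,
  walkerFused: false,
  walkerS: 8,
};

// Hypothetical parser: unknown or malformed values fall back to the defaults.
function knobsFromQuery(query: string): WalkerKnobs {
  const params = new URLSearchParams(query);

  const flag = (key: string, fallback: boolean): boolean => {
    const v = params.get(key);
    return v === null ? fallback : v === "1";
  };
  const positiveInt = (key: string, fallback: number): number => {
    const v = params.get(key);
    const n = v === null ? NaN : Number.parseInt(v, 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };

  const raw = params.get("prefmem");
  const walkerPrefMem: PrefMem =
    raw === "workgroup" ? "workgroup" : raw === "private" ? "private" : DEFAULTS.walkerPrefMem;

  return {
    walkerPrefMem,
    walkerTpb: positiveInt("wtpb", DEFAULTS.walkerTpb),
    walkerCacheDx: flag("cachedx", DEFAULTS.walkerCacheDx),
    walkerFused: flag("fused", DEFAULTS.walkerFused),
    walkerS: positiveInt("ws", DEFAULTS.walkerS),
  };
}

console.log(knobsFromQuery("?prefmem=workgroup&wtpb=64&fused=1"));
```

Falling back to the defaults on bad input keeps a mistyped bench URL from silently running an unintended configuration with NaN sizes.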

Honest status / not yet done

- Apple M2 is proven (above), across 3 sessions.
- Mobile harness fixed and verified, but the BrowserStack Android devices
  still won't execute the page. I built the production-bundle harness the prior
  revision proposed (dev/msm-webgpu/vite.preview.config.ts: vite build +
  results middleware on configurePreviewServer); the dev page now loads in a
  handful of requests instead of hundreds of unbundled modules. I verified the
  full path end-to-end from a real headless browser through the Cloudflare
  tunnel: page GET + POST /results → HTTP 200, row written, cross-check PASS.
  Pointed at the S25 Ultra (Adreno), the worker ran ~13 min and produced
  zero rows, not even the boot heartbeat that posts before any GPU/SRS work,
  i.e. main.ts never executes on the device. This reproduces the prior
  revision's mobile symptom on a different device (S25/Adreno vs
  Pixel/Mali) and a different harness (bundle vs dev server), so the blocker is
  the BrowserStack mobile-Chrome environment, not page size, and the worker API
  exposes no device console to debug it further from here. The private
  pref change is expected to help mobile more (it removes Mali's 16 KB ceiling
  outright), but no on-device mobile numbers have been obtained yet.
- Bigger lever not pursued here: coalescing the data-dependent SRS gather
  (pre-permuting points into l0 order) would cut the dependent-load chain, at the
  cost of an extra pass + buffer. It is a candidate for a follow-up if more than
  ~7% is needed.

Commits

feat: walker fused inverse+peel pass (kept as a knob; measured slower)
feat: prefmem=private occupancy lever + in-session A/B ladder
perf: default walker to private-pref + TPB 96; fusion off
docs: rewrite the design doc around the measured occupancy lever
feat: bundled vite-preview harness for mobile BrowserStack

Cache the per-slot dx = x_r - x_l from the forward prefix and reuse it to chain the running inverse, instead of reloading and recomputing the slot points in the inverse pass. Removes ~1/3 of the stream-walker's x-coordinate SRS reads; numerically identical output. Toggle via MsmConfig.walkerCacheDx (URL ?cachedx=0/1), default on. Cross-check GREEN at logn=8,10 vs Noble under SwiftShader, both cachedx=on and cachedx=off.

Make the stream-walker's batched-inversion slots S and workgroup size TPB configurable (MsmConfig.walkerS/walkerTpb, URL ?ws=/?wtpb=). S is the primary memory<->time Pareto knob: pref_scratch = TPB*S*32 B of workgroup memory and per-thread register state both scale with S, so smaller S lifts occupancy at the cost of more inversions. Defaults: S = 8, TPB = 64 (unchanged).

Add WASM-free GPU benchmark autorun (msm-gpu-bench): gates on SRS only, times the WebGPU MSM and reports the per-phase GPU breakdown (stream_walker, walker_combine, ...). Sweeps the S knob in one device session (?sweep=8,4,2) so a single BrowserStack seat maps the whole Pareto time curve. Emits /progress rows for the BS runner's runId-detection + stall watchdog.

run-browserstack.mjs: add --query passthrough for the new autorun params.

Cross-check GREEN at logn=8,10 vs Noble under SwiftShader across S in {2,4,8,16} and TPB in {32,64,128}.
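
The workgroup-memory formula in this commit message makes the original TPB ceiling easy to check. A sketch of the arithmetic, assuming a 16 KB workgroup-storage limit (the Mali figure quoted in the PR description); the function and constant names are illustrative.

```typescript
const SLOT_BYTES = 32; // per-slot size from pref_scratch = TPB*S*32 B
const LIMIT_BYTES = 16 * 1024; // assumed workgroup-storage limit (16 KB)

// Workgroup-memory footprint of pref_scratch when it lives in var<workgroup>.
function prefScratchBytes(tpb: number, s: number): number {
  return tpb * s * SLOT_BYTES;
}

for (const tpb of [64, 96, 128]) {
  const bytes = prefScratchBytes(tpb, 8);
  console.log(`TPB ${tpb}, S 8: ${bytes} B, fits in 16 KB: ${bytes <= LIMIT_BYTES}`);
}
```

At S = 8, TPB 64 lands exactly on the 16 KB limit, so any larger workgroup needs either a smaller S or a scratch buffer that does not live in workgroup memory.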

feat(bb/msm): hybrid stream-walker — occupancy + memory-traffic optimization (bb0bf3c)

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) May 30, 2026

AztecBot added 8 commits May 30, 2026 00:09

chore(bb/msm): gpu-bench SRS-wait heartbeat for BrowserStack watchdog (3fea83d)
docs(bb/msm): hybrid stream-walker design + memory Pareto front (2a899f4)
perf(bb/msm): fuse walker inverse+peel — drop dx_cache regs, halve pref traffic (ee10722)
feat(bb/msm): walker prefmem=private occupancy lever + in-session A/B ladder (f7f7674)
perf(bb/msm): default walker to private-pref + TPB96 (Apple M2 -7% GPU); fusion off (874fd21)
docs(bb/msm): rewrite around the measured occupancy lever (S + fusion are dead-ends) (672a628)

AztecBot changed the title ~~feat(bb/msm): hybrid stream-walker — occupancy + memory-traffic optimization~~ perf(bb/msm): stream-walker occupancy lever — private pref + TPB96 (Apple M2 −7% GPU) May 30, 2026

feat(bb/msm): bundled vite-preview harness for mobile BrowserStack (unblocks Adreno/Mali) (e0839f5)

AztecBot wants to merge 10 commits into stream-walker-impl from cb/msm-opt/hybrid-walker
