perf(bb/msm): stream-walker occupancy lever — private pref + TPB96 (Apple M2 −7% GPU)#23724

Draft
AztecBot wants to merge 10 commits into stream-walker-impl from cb/msm-opt/hybrid-walker

@AztecBot AztecBot commented May 30, 2026

Summary

Self-review + rework of the hybrid stream-walker. The original PR shipped two
levers (dx-cache + an S/TPB Pareto knob) that together moved the needle only
~1–2%. A measured A/B ladder on Apple M2 (one BrowserStack session per
ladder, so no cross-session noise) shows the real bottleneck is occupancy,
and pins down a config that is −9% on the dominant stream_walker phase and
−7% on full-MSM GPU time, cross-checked correct.

The headline change: move the per-thread prefix scratch off workgroup memory
(var<private>), which removes the workgroup-storage cap that throttled
resident workgroups, then raise TPB to its occupancy sweet spot (96).

Measured A/B (Apple M2, Chrome 148, logn=17, S=8, reps=8)

stream_walker is the dominant GPU phase; the rest (preprocess ≈3.4,
walker_combine ≈6, reduce ≈8.5 ms) is stable across variants.

| variant | stream_walker (ms) | GPU total (ms) | Δ walker |
|---|---|---|---|
| baseline (3-pass, dx-cache, workgroup pref, TPB 64) | 66.8 | 85.3 | |
| 3-pass, dx-cache, private pref, TPB 128 | 61.3 | 79.6 | −8.2% |
| 3-pass, dx-cache, private pref, TPB 96 (new default) | 60.8 | 79.2 | −9.0% |
| 3-pass, no dx-cache, private pref, TPB 96 | 61.2 | 79.6 | −8.4% |
| 3-pass, dx-cache, private pref, TPB 160 | 63.9 | 82.2 | −4.3% |
| fused inverse+peel, private pref, TPB 128 | 63.1 | 81.2 | −5.4% |
| fused inverse+peel, workgroup pref, TPB 64 | 68.6 | 87.0 | +2.8% (worse) |

Direction and magnitude reproduced across 3 independent sessions.
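
For readers checking the headline figures, the Δ column can be recomputed from the table as the relative change against the baseline row. This is a minimal sketch; `deltaPercent` is a hypothetical helper name, not something from the PR.

```typescript
// Relative change of a variant against the baseline, in percent.
// Negative means the variant is faster (less GPU time).
function deltaPercent(baselineMs: number, variantMs: number): number {
  return ((variantMs - baselineMs) / baselineMs) * 100;
}

// Numbers copied from the A/B table (baseline vs. private pref, TPB 96).
const walkerDelta = deltaPercent(66.8, 60.8); // stream_walker phase
const totalDelta = deltaPercent(85.3, 79.2);  // full-MSM GPU time

console.log(walkerDelta.toFixed(1)); // "-9.0"
console.log(totalDelta.toFixed(1));  // "-7.2"
```

The two results correspond to the −9% (stream_walker) and −7% (full MSM) figures quoted in the summary.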

What changed and why

  • Occupancy lever — prefMem: 'private' + TPB 96 (the win). pref_scratch
    is per-thread with no cross-thread sharing, so it does not need to be in
    workgroup memory. Moving it to var<private> frees the
    maxComputeWorkgroupStorageSize cap (the original reason TPB was pinned at 64
    for Mali's 16 KB), letting TPB rise. TPB 96 is the sweet spot — 64 leaves
    occupancy unused, ≥160 turns register-bound and regresses. This is the whole
    improvement. Numerically identical (cross-checked).

  • Fusion of the inverse + peel passes — measured loss, dropped as default.
    Merging the two backward passes removes the inv_dx round-trip through
    pref_scratch and the dx_cache registers (deriving dx from the affine
    add's own loads), so on paper it is fewer instructions and fewer registers.
    But it raises peak live-register pressure inside the merged loop, and on an
    occupancy-bound kernel that lowers resident-workgroup count — slower at every
    point on Apple. Kept behind walkerFused (default off) for architectures with
    different trade-offs; honestly reported as a negative result here.

  • dx-cache stays on (marginal ~0.4 ms positive at TPB 96).

  • S framing dropped. S is a memory knob, not a speed knob — the walker is
    occupancy/memory-bound and flat across S (consistent with prior threads). The
    design doc is rewritten around the occupancy finding.
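
The workgroup-storage arithmetic behind the occupancy lever can be checked directly. The sketch below assumes the sizing given in the commit notes further down (pref_scratch = TPB*S*32 B when it lives in workgroup memory); the function names are hypothetical.

```typescript
// Assumption: 32 bytes per scratch slot, per the "TPB*S*32 B" figure
// in the commit notes.
const SLOT_BYTES = 32;

// Workgroup memory consumed by pref_scratch when it is NOT private.
function prefScratchBytes(tpb: number, s: number): number {
  return tpb * s * SLOT_BYTES;
}

// Largest TPB whose workgroup-resident scratch still fits under a cap.
function maxTpbUnderCap(capBytes: number, s: number): number {
  return Math.floor(capBytes / (s * SLOT_BYTES));
}

const MALI_CAP_BYTES = 16 * 1024; // the 16 KB limit quoted above

console.log(prefScratchBytes(64, 8));           // 16384, exactly at the cap
console.log(prefScratchBytes(96, 8));           // 24576, over the cap
console.log(maxTpbUnderCap(MALI_CAP_BYTES, 8)); // 64
```

With S=8, TPB 64 is the largest size that fits in 16 KB of workgroup memory, which is consistent with the earlier pin. Once the scratch is var<private>, this term no longer counts against the cap. In a browser, the real cap would be read from `device.limits.maxComputeWorkgroupStorageSize` rather than hard-coded.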

Correctness — GREEN

GPU-vs-Noble (CPU pippenger) under SwiftShader at logn=8 and 10, across the full
knob matrix: S∈{2,4,8}, TPB∈{64,96,128,160,256}, dx-cache on/off, fused on/off,
pref workgroup/private.

[noble-check] logN=8  PASS (WebGPU matches Noble)
[noble-check] logN=10 PASS (WebGPU matches Noble)
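
To give a sense of the size of that matrix, here is a hedged sketch that enumerates the combinations listed above. The `KnobCombo` shape and `knobMatrix` name are illustrative assumptions; the actual cross-check driver is not shown in this PR description.

```typescript
// Hypothetical record for one cross-check configuration.
interface KnobCombo {
  logn: number;
  s: number;
  tpb: number;
  cacheDx: boolean;
  fused: boolean;
  prefMem: "workgroup" | "private";
}

// Cartesian product of every knob value named in the correctness section.
function knobMatrix(): KnobCombo[] {
  const combos: KnobCombo[] = [];
  for (const logn of [8, 10]) {
    for (const s of [2, 4, 8]) {
      for (const tpb of [64, 96, 128, 160, 256]) {
        for (const cacheDx of [true, false]) {
          for (const fused of [false, true]) {
            for (const prefMem of ["workgroup", "private"] as const) {
              combos.push({ logn, s, tpb, cacheDx, fused, prefMem });
            }
          }
        }
      }
    }
  }
  return combos;
}

console.log(knobMatrix().length); // 240 (2 sizes x 3 x 5 x 2 x 2 x 2)
```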

Knobs (MsmConfig / URL)

walkerPrefMem (?prefmem=, default private), walkerTpb (?wtpb=, default
96), walkerCacheDx (?cachedx=, default on), walkerFused (?fused=, default
off), walkerS (?ws=, default 8 — memory only). The bench autorun
?autorun=msm-gpu-bench&fusedab=1 runs the whole A/B ladder in one session.
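
As an illustration of how these URL knobs could map onto config fields, a minimal parsing sketch follows. The parameter names and defaults are taken from the list above; the value encodings (`0`/`1` for the boolean flags, `private`/`workgroup` for prefmem) and the `parseWalkerKnobs` helper are assumptions, and the real MsmConfig parsing may differ.

```typescript
type PrefMem = "private" | "workgroup";

// Field names follow the knob list above; the interface itself is hypothetical.
interface WalkerKnobs {
  walkerPrefMem: PrefMem;
  walkerTpb: number;
  walkerCacheDx: boolean;
  walkerFused: boolean;
  walkerS: number;
}

const DEFAULTS: WalkerKnobs = {
  walkerPrefMem: "private",
  walkerTpb: 96,
  walkerCacheDx: true,
  walkerFused: false,
  walkerS: 8,
};

function parseWalkerKnobs(query: string): WalkerKnobs {
  const q = new URLSearchParams(query);

  // Positive integer or fall back to the default.
  const int = (key: string, fallback: number): number => {
    const raw = q.get(key);
    if (raw === null) return fallback;
    const n = Number.parseInt(raw, 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };

  // Assumed encoding: "1" is on, anything else is off.
  const flag = (key: string, fallback: boolean): boolean => {
    const raw = q.get(key);
    return raw === null ? fallback : raw === "1";
  };

  const pref = q.get("prefmem");
  return {
    walkerPrefMem:
      pref === "private" || pref === "workgroup" ? pref : DEFAULTS.walkerPrefMem,
    walkerTpb: int("wtpb", DEFAULTS.walkerTpb),
    walkerCacheDx: flag("cachedx", DEFAULTS.walkerCacheDx),
    walkerFused: flag("fused", DEFAULTS.walkerFused),
    walkerS: int("ws", DEFAULTS.walkerS),
  };
}

// Reproduces the old baseline row of the A/B table.
console.log(parseWalkerKnobs("?prefmem=workgroup&wtpb=64"));
```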

Honest status / not yet done

  • Apple M2 is proven (above), across 3 sessions.
  • Mobile harness fixed and verified — but the BrowserStack Android devices
    still won't execute the page. I built the production-bundle harness the prior
    revision proposed (dev/msm-webgpu/vite.preview.config.ts: vite build +
    results middleware on configurePreviewServer); the dev page now loads in a
    handful of requests instead of hundreds of unbundled modules. I verified the
    full path end-to-end from a real headless browser through the Cloudflare
    tunnel — page GET + POST /results → HTTP 200, row written, cross-check PASS.
    When the S25 Ultra (Adreno) was pointed at it, the worker ran ~13 min and produced
    zero rows — not even the boot heartbeat that posts before any GPU/SRS work,
    i.e. main.ts never executes on the device. This reproduces the prior
    revision's mobile symptom across a different device (S25/Adreno vs
    Pixel/Mali) and a different harness (bundle vs dev server), so the blocker is
    the BrowserStack mobile-Chrome environment, not page size — and there is no
    device console via the worker API to debug it further from here. The private
    pref change is expected to help mobile more (it removes Mali's 16 KB ceiling
    outright), but on-device mobile numbers have not yet been obtained.
  • Bigger lever not pursued here: coalescing the data-dependent SRS gather
    (pre-permuting points into l0 order) would cut the dependent-load chain, at the
    cost of an extra pass + buffer — a candidate for a follow-up if more than ~7%
    is needed.

Commits

  1. feat: walker fused inverse+peel pass (kept as a knob; measured slower)
  2. feat: prefmem=private occupancy lever + in-session A/B ladder
  3. perf: default walker to private-pref + TPB 96; fusion off
  4. docs: rewrite the design doc around the measured occupancy lever
  5. feat: bundled vite-preview harness for mobile BrowserStack

@AztecBot AztecBot added the claudebox label (Owned by claudebox. it can push to this PR.) May 30, 2026
AztecBot added 8 commits May 30, 2026 00:09
Cache the per-slot dx = x_r - x_l from the forward prefix and reuse it to
chain the running inverse, instead of reloading and recomputing the slot
points in the inverse pass. Removes ~1/3 of the stream-walker's
x-coordinate SRS reads; numerically identical output. Toggle via
MsmConfig.walkerCacheDx (URL ?cachedx=0/1), default on.

Cross-check GREEN at logn=8,10 vs Noble under SwiftShader, both
cachedx=on and cachedx=off.

Make the stream-walker's batched-inversion slots S and workgroup size TPB
configurable (MsmConfig.walkerS/walkerTpb, URL ?ws=/?wtpb=). S is the
primary memory<->time Pareto knob: pref_scratch = TPB*S*32 B of workgroup
memory and per-thread register state both scale with S, so smaller S lifts
occupancy at the cost of more inversions. Default 8 / 64 (unchanged).
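
The "more inversions" trade-off comes from batched inversion: a batch of S values shares one field inversion. Below is a generic sketch of that technique (Montgomery's trick) over a toy prime, with a forward prefix pass, a single inversion, and a backward peel pass. It is not the WGSL kernel; the modulus, names, and the `inversionsNeeded` helper are assumptions for illustration only.

```typescript
// Toy modulus; the real kernel works over a much larger curve base field.
const P = 101;

const mulMod = (a: number, b: number): number => (a * b) % P;

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function inv(a: number): number {
  let result = 1;
  let base = a % P;
  let e = P - 2;
  while (e > 0) {
    if (e % 2 === 1) result = mulMod(result, base);
    base = mulMod(base, base);
    e = Math.floor(e / 2);
  }
  return result;
}

// Invert every (nonzero) element of xs using a single call to inv().
function batchInverse(xs: number[]): number[] {
  const prefix: number[] = [];
  let acc = 1;
  for (const x of xs) {
    prefix.push(acc);       // forward pass: product of all elements before x
    acc = mulMod(acc, x);
  }
  let running = inv(acc);   // the one inversion for the whole batch
  const out = new Array<number>(xs.length);
  for (let i = xs.length - 1; i >= 0; i--) {
    out[i] = mulMod(running, prefix[i]); // peel: inverse of xs[i]
    running = mulMod(running, xs[i]);    // drop xs[i] from the running inverse
  }
  return out;
}

// Inversions needed to process n values in batches of s slots.
function inversionsNeeded(n: number, s: number): number {
  return Math.ceil(n / s);
}

console.log(batchInverse([2, 3, 5]).map((v, i) => mulMod(v, [2, 3, 5][i]))); // [1, 1, 1]
console.log(inversionsNeeded(64, 8), inversionsNeeded(64, 2)); // 8 32
```

Under these assumptions, dropping S from 8 to 2 quadruples the inversion count for the same number of values, which is the cost side of the knob described above.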

Add WASM-free GPU benchmark autorun (msm-gpu-bench): gates on SRS only,
times the WebGPU MSM and reports the per-phase GPU breakdown (stream_walker,
walker_combine, ...). Sweeps the S knob in one device session (?sweep=8,4,2)
so a single BrowserStack seat maps the whole Pareto time curve. Emits
/progress rows for the BS runner's runId-detection + stall watchdog.

run-browserstack.mjs: add --query passthrough for the new autorun params.

Cross-check GREEN at logn=8,10 vs Noble under SwiftShader across
S in {2,4,8,16} and TPB in {32,64,128}.
@AztecBot AztecBot changed the title from "feat(bb/msm): hybrid stream-walker — occupancy + memory-traffic optimization" to "perf(bb/msm): stream-walker occupancy lever — private pref + TPB96 (Apple M2 −7% GPU)" May 30, 2026