perf(bb/msm): configurable stream-walker S to amortize the dominant field inversion #23734

Draft
AztecBot wants to merge 11 commits into stream-walker-impl from cb/walker-large-s-inversion

@AztecBot

Goal (Axis A — stream-walker SPEED, focus: inversion cost)

Measured fact (real hardware, PR #23732): the safegcd field inversion is ~47% of the stream-walker accumulate kernel on Apple M2. The kernel does exactly one inversion per batch of S affine adds, so inversion-cost-per-add = |inversion| / S. Raising S directly amortizes the single largest cost (S=8 → ~47%, S=16 → ~31%, S=32 → ~18%). This is the highest-leverage speed lever and composes with the device/private pref_scratch work from #23726.
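
The amortization figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the branch: `inversionShare` is a hypothetical helper, and it assumes the non-inversion cost per add stays constant as S changes.

```typescript
// Share of kernel time spent in the inversion at batch size `s`, given the
// share `shareAtRef` measured at batch size `sRef`.
// Assumption: the per-add cost other than the inversion does not change with S.
function inversionShare(shareAtRef: number, sRef: number, s: number): number {
  const inv = shareAtRef * (sRef / s); // amortized inversion cost per add
  const rest = 1 - shareAtRef;         // non-inversion cost per add
  return inv / (inv + rest);
}

// Start from the measured ~47% share at S = 8 and project to larger S.
for (const s of [8, 16, 32]) {
  const pct = (inversionShare(0.47, 8, s) * 100).toFixed(0);
  console.log(`S=${s}: inversion ≈ ${pct}% of the kernel`);
}
```

The model ignores any cost that grows with S (for example register pressure), so it is an upper bound on the benefit rather than a prediction of measured time.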

This PR makes S a first-class, sweepable knob so the per-arch knee can be mapped on real hardware, then baked as the arch-aware default.

What this PR does (so far)

  1. STREAM_S is now configurable via MsmConfig.streamS instead of being hard-coded to 8 inside MsmV2.create(). All dependent pipelines (cumsum, emit, partition-task, walker, walker-combine) and every ∝ S buffer (walkerPref, walkerPartials, accBuf, taskCuts, …) already size off the shared streamS, so the knob flows end-to-end with no other code change. At streamS = 8 the behavior is byte-identical to baseline (the change is inert by default).
  2. ?ss=N URL knob wired through both dev harnesses (index.html / main.ts and msm-correctness.ts) → MsmConfig.streamS. A single browser/device session can now sweep S = 8,12,16,20,24,32,… with no rebuild (the WGSL {{ s }} is rendered at runtime via mustache; _generated/shaders.ts keeps the placeholder).
  3. BrowserStack runner passthrough: run-browserstack.mjs --ss N (and --wgi N) so the msm-bench autorun can map the inversion-amortization curve on a real device. On Apple, __lastPhaseMs.stream_walker isolates the accumulate kernel directly; on Adreno/Mali (no timestamp-query) we use GPU wall time.
  4. Builds on #23726 (perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128), specifically its portable var<private> pref_scratch + TPB=128. The device-buffer-at-binding-11 variant is not portable: SwiftShader/Dawn and most Adreno/Mali adapters cap at 10 storage buffers per stage (confirmed below).
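
As an illustration of the ?ss=N knob in item 2, the sketch below shows one way a harness could read the parameter and fall back to the inert default. `parseStreamS` and `DEFAULT_STREAM_S` are hypothetical names; only `streamS` and `?ss` come from this PR, and the real wiring in main.ts / msm-correctness.ts may differ.

```typescript
// Hypothetical helper: read ?ss=N from a query string and fall back to the
// baseline S = 8 when the parameter is absent or not a positive integer.
const DEFAULT_STREAM_S = 8;

function parseStreamS(search: string): number {
  const raw = new URLSearchParams(search).get("ss");
  if (raw === null) return DEFAULT_STREAM_S;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : DEFAULT_STREAM_S;
}

// Only `streamS` is named in the PR; the surrounding object shape is assumed.
const config = { streamS: parseStreamS("?ss=16") };
console.log(config.streamS);
```

Falling back to 8 on malformed input keeps a mistyped URL on the byte-identical baseline path instead of failing the session.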

Why private (not device-buffer) pref_scratch

The walker already uses 10 storage bindings (0–9) + 1 uniform. A device pref_scratch would be an 11th storage buffer, which Dawn/SwiftShader rejects (storage buffers (11) … exceeds the maximum per-stage limit (10)) — and the WebGPU baseline for many mobile GPUs is ≤10 too. So the portable way to free workgroup memory for larger S is var<private> (no extra binding) + register acc. That is what #23726 landed and what this PR raises S on top of.
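
The binding-count argument can be expressed as a simple check against the adapter limit. This is a sketch under assumptions: `fitsStorageLimit` is a hypothetical helper and the limits object is a hand-built stand-in; in a browser the value would be read from `GPUAdapter.limits.maxStorageBuffersPerShaderStage`.

```typescript
// Minimal shape of the limits object. In a browser this would come from
// GPUAdapter.limits; here it is constructed by hand for illustration.
interface StorageLimits {
  maxStorageBuffersPerShaderStage: number;
}

// True when a pipeline that needs `storageBindings` storage buffers in one
// shader stage fits within the adapter's per-stage limit.
function fitsStorageLimit(limits: StorageLimits, storageBindings: number): boolean {
  return storageBindings <= limits.maxStorageBuffersPerShaderStage;
}

const capAt10: StorageLimits = { maxStorageBuffersPerShaderStage: 10 };
console.log(fitsStorageLimit(capAt10, 10)); // walker as described: 10 storage bindings
console.log(fitsStorageLimit(capAt10, 11)); // with a device pref_scratch added
```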

Status — numbers PENDING (not done yet)

This is the enabling change. The "done" bar (significant real-hardware time win on ≥1 Apple + ≥1 Adreno + ≥1 Mali, memory not regressed, cross-checked) is not met yet. Next steps on this branch:

  • BrowserStack msm-bench S-sweep on macos (Apple M2): stream_walker phase + wall at S=8,12,16,20,24,32; find the knee where register pressure (acc_x/acc_y ∝ S) cancels the inversion amortization.
  • Repeat on s25-ultra (Adreno) and pixel-9-pro-xl (Mali), wall-clock timing.
  • Real-GPU correctness via autorun=msm-cross-check (noble) at the winning S.
  • If register pressure caps the win early, Stage 2: move acc into the device scratch buffer (reuse the single binding-11 buffer used for pref where supported, or a packed layout) so S can grow without spilling registers.
  • Bake the per-arch optimal S as an arch-aware default.

Known blocker (pre-existing): SwiftShader correctness harness

The local msm-correctness SwiftShader harness (added in #23726) currently FAILs even on the base stream-walker-impl branch (GPU result off-curve at logn=8,10), so it is not a usable local gate right now — this is independent of this PR's change (the original "cross-check GREEN" claim, commit 735d5aecf26, predates this harness and used index.html's real-SRS noble check). Likely a SwiftShader software-rasterizer miscompile of the heavy safegcd/Montgomery WGSL, since the same kernels produce correct results on real GPUs. Correctness for perf claims will therefore be gated on real-hardware autorun=msm-cross-check (noble), which is the stronger signal anyway. Driver fixes (TS_ROOT, PW_EXECUTABLE) are included so the harness at least runs headless here.

Memory

streamS raises only the ∝ S device buffers (walkerPref = 2·S·NumThreads·16 B, walkerPartials, accBuf). At NumThreads=8192, walkerPref is 256 KB·S: ~2 MB at S=8, ~4 MB at S=16, ~8 MB at S=32, well within the 100 MB budget to n=2²⁰. statsBytes() (?mem=17,20) is used to confirm no peak-memory regression at the chosen S.
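
The walkerPref size follows directly from the formula above. A minimal sketch (`walkerPrefBytes` is a hypothetical helper, not a function from the branch):

```typescript
const BYTES_PER_SLOT = 16; // the 16 B factor in the formula above

// walkerPref = 2 · S · NumThreads · 16 B
function walkerPrefBytes(streamS: number, numThreads: number): number {
  return 2 * streamS * numThreads * BYTES_PER_SLOT;
}

const MIB = 1024 * 1024;
for (const s of [8, 16, 32]) {
  console.log(`S=${s}: walkerPref = ${walkerPrefBytes(s, 8192) / MIB} MiB`);
}
```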


Created by claudebox · group: aztec

AztecBot added the ci-barretenberg (Run all barretenberg/cpp checks) and claudebox (Owned by claudebox; it can push to this PR) labels May 30, 2026
@AztecBot

Status update — tooling complete, real-hardware numbers blocked on BrowserStack seats

Landed + validated on this branch:

  • MsmConfig.streamS / ?ss=N knob — stream-walker S was hard-coded at 8; now sweepable end-to-end (inert at S=8).
  • Single-session autorun=msm-s-sweep (?slist=8,12,16,20,24,32): loads the SRS once, rebuilds only the MsmV2 pipelines per S (pool kept), and times the stream_walker phase (Apple timestamp-query) + GPU wall per S. One worker maps the whole inversion-amortization curve — important given the 2-seat limit.
  • BrowserStack runner passthrough: run-browserstack.mjs --ss/--slist/--wgi.
  • Pre-flight (local SwiftShader): S=16 and S=32 compile and run with no WGSL/validation errors; walker scratch scales as analyzed (statsBytes 12.3 → 22.3 → 42.3 MiB for S=8/16/32 at the probe size; device pref = 256 KB·S → +2/4/8 MB). tsgo typecheck clean.
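
The shape of a single-session sweep can be sketched as follows. All three functions are hypothetical illustrations: `timeOnce` stands in for "rebuild the pipelines at S, run once, return milliseconds", and the actual msm-s-sweep autorun may be structured differently.

```typescript
// Parse a ?slist=8,12,16 style value into positive integers, dropping junk.
function parseSlist(slist: string): number[] {
  return slist
    .split(",")
    .map((t) => Number(t.trim()))
    .filter((n) => Number.isInteger(n) && n > 0);
}

// Median of a non-empty sample set (mean of the two middle values if even).
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// For each S, take `reps` timings and keep the median.
// `timeOnce` is an assumed callback, not an API from the branch.
function sweep(
  slist: number[],
  reps: number,
  timeOnce: (s: number) => number,
): Map<number, number> {
  const out = new Map<number, number>();
  for (const s of slist) {
    const samples: number[] = [];
    for (let i = 0; i < reps; i++) samples.push(timeOnce(s));
    out.set(s, median(samples));
  }
  return out;
}
```

Reporting medians rather than means keeps a single slow repetition (for example a first-run shader compile) from skewing a point on the curve.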

Sweep command (run per family when a seat frees):

node dev/msm-webgpu/scripts/run-browserstack.mjs --target macos          --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32
node dev/msm-webgpu/scripts/run-browserstack.mjs --target s25-ultra       --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32   # Adreno, wall-clock
node dev/msm-webgpu/scripts/run-browserstack.mjs --target pixel-9-pro-xl  --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32   # Mali, wall-clock

Then real-GPU correctness at the winning S via --autorun msm-cross-check.

Blocker: BrowserStack is at 2/2 running (two long-lived Galaxy S25 sessions from other agents) with a 3-deep queue. Per the shared-seat rule I will not preempt; retrying for a free seat. No fabricated numbers — the "done" bar (significant win on Apple+Adreno+Mali, memory not regressed, cross-checked) is not met until these runs land.



@AztecBot

Real-hardware data #1 — Apple M2 (macOS Sequoia, Chrome 148), logn=17, reps=4 (medians)

Single-session msm-s-sweep on BrowserStack. Per-phase via timestamp-query (Apple supports it).

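| S | stream_walker (accumulate kernel) ms | walker_combine ms | GPU total ms | wall ms |
|----|-------|-------|-------|-------|
| 8 | 63.05 | 5.83 | 81.66 | 84.63 |
| 12 | 57.21 | 9.18 | 79.36 | 82.11 |
| 16 | 54.46 | 11.93 | 79.50 | 82.34 |
| 20 | 51.31 | 13.96 | 78.58 | 81.37 |
| 24 | 50.20 | 18.35 | 80.74 | 83.95 |
| 32 | 48.69 | 23.46 | 85.07 | 88.42 |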

The inversion-amortization hypothesis is confirmed. The accumulate kernel (stream_walker, ~47% inversion on M2) drops monotonically with S, in the direction |inv|/S predicts: −18.6% at S=20, −22.8% at S=32 vs S=8.
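
To see how closely the measured kernel times track a pure amortization model, the sketch below compares the two. It assumes the inversion is 47% of the kernel at S=8 and that nothing else changes with S; `predictedRatio` and `pctChange` are hypothetical helpers, and the timings are the stream_walker medians from the table.

```typescript
// stream_walker medians (ms) from the Apple M2 sweep above.
const measuredKernelMs: Record<number, number> = {
  8: 63.05, 12: 57.21, 16: 54.46, 20: 51.31, 24: 50.20, 32: 48.69,
};

// Kernel time relative to S=8 if only the inversion (share f at S=8) is
// amortized and every other per-add cost stays constant (an assumption).
function predictedRatio(f: number, s: number): number {
  return f * (8 / s) + (1 - f);
}

function pctChange(ratio: number): number {
  return (ratio - 1) * 100;
}

for (const s of [12, 16, 20, 24, 32]) {
  const measured = pctChange(measuredKernelMs[s] / measuredKernelMs[8]);
  const predicted = pctChange(predictedRatio(0.47, s));
  console.log(`S=${s}: measured ${measured.toFixed(1)}%, model ${predicted.toFixed(1)}%`);
}
```

Under these assumptions the model gives a larger drop (about −28% at S=20 and −35% at S=32) than was measured, which would be consistent with some per-add cost growing with S, such as the register pressure mentioned earlier. The sketch cannot distinguish between possible causes.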

But there's a competing cost the operator's framing didn't account for: walker_combine grows ∝ S (5.83 → 23.46 ms, +300%). Higher S splits each thread's bucket stream into more tasks → more boundary partials (2·NumThreads·S slots) → more combine/index work. This claws back most of the kernel win.

Net end-to-end on Apple: best at S≈20 → −3.8% GPU total (78.58 vs 81.66 ms; wall 81.37 vs 84.63 ms, −3.9%). Real and repeatable, but modest, below the "significant" bar.
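
The trade-off between the kernel saving and the combine growth can be read off the table programmatically. A minimal sketch using the medians above (the `Row` type and helper names are illustrative only):

```typescript
interface Row {
  s: number;
  kernelMs: number;   // stream_walker
  combineMs: number;  // walker_combine
  gpuTotalMs: number;
  wallMs: number;
}

// Medians from the Apple M2 sweep (logn=17).
const rows: Row[] = [
  { s: 8,  kernelMs: 63.05, combineMs: 5.83,  gpuTotalMs: 81.66, wallMs: 84.63 },
  { s: 12, kernelMs: 57.21, combineMs: 9.18,  gpuTotalMs: 79.36, wallMs: 82.11 },
  { s: 16, kernelMs: 54.46, combineMs: 11.93, gpuTotalMs: 79.50, wallMs: 82.34 },
  { s: 20, kernelMs: 51.31, combineMs: 13.96, gpuTotalMs: 78.58, wallMs: 81.37 },
  { s: 24, kernelMs: 50.20, combineMs: 18.35, gpuTotalMs: 80.74, wallMs: 83.95 },
  { s: 32, kernelMs: 48.69, combineMs: 23.46, gpuTotalMs: 85.07, wallMs: 88.42 },
];

const pct = (now: number, base: number): number => ((now - base) / base) * 100;

const baseline = rows[0];
// Row with the lowest GPU total across the sweep.
const best = rows.reduce((a, b) => (b.gpuTotalMs < a.gpuTotalMs ? b : a));

console.log(`best S by GPU total: ${best.s}`);
console.log(`GPU total vs S=8: ${pct(best.gpuTotalMs, baseline.gpuTotalMs).toFixed(1)}%`);
console.log(`kernel saved ${(baseline.kernelMs - best.kernelMs).toFixed(2)} ms, ` +
            `combine added ${(best.combineMs - baseline.combineMs).toFixed(2)} ms`);
```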

Takeaway / revised plan

The inversion lever works (kernel −19%), but a significant end-to-end win needs Stage 2: stop walker_combine from scaling with S. The index/combine passes touch all 2·NumThreads·S partial slots though the number of real (boundary) partials is far smaller and only weakly S-dependent. Options: compact partials before combine, or cap task-splitting granularity so partial count is decoupled from S. With combine cost flattened, the S≈20–24 kernel saving (−18 to −20%) would pass through to the total.
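
The "compact partials before combine" option amounts to a stream compaction over the partial slots. The CPU-side sketch below illustrates the prefix-sum-then-scatter pattern a GPU pass would typically use. It is not the planned implementation: `compactPartials` is hypothetical, and the sentinel value chosen for `NO_BUCKET` is an assumption.

```typescript
const NO_BUCKET = 0xffffffff; // assumed sentinel value for an empty slot

// Returns the indices of occupied slots, in their original order.
// Step 1: exclusive prefix sum over "slot is occupied" flags.
// Step 2: scatter each occupied slot index to its output offset.
function compactPartials(bucketIds: Uint32Array): Uint32Array {
  const n = bucketIds.length;
  const offsets = new Uint32Array(n + 1);
  for (let i = 0; i < n; i++) {
    offsets[i + 1] = offsets[i] + (bucketIds[i] !== NO_BUCKET ? 1 : 0);
  }
  const live = new Uint32Array(offsets[n]);
  for (let i = 0; i < n; i++) {
    if (bucketIds[i] !== NO_BUCKET) live[offsets[i]] = i;
  }
  return live;
}

const slots = Uint32Array.of(NO_BUCKET, 5, NO_BUCKET, NO_BUCKET, 9, 2);
console.log(Array.from(compactPartials(slots)));
```

With a compacted index list, later passes would iterate over the number of real partials rather than all 2·NumThreads·S slots, at the price of one extra scan pass.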

Next

  • Adreno (s25-ultra) + Mali (pixel-9-pro-xl) sweeps (wall-clock; mobile safegcd inversion is relatively pricier, so the net knee may be deeper there — the operator's mobile target). Seat-contended; running as seats free.
  • Stage 2 combine-decoupling, then re-sweep.

No fabrication: this is one device, net win modest; the strict bar is not met yet.



@AztecBot

Mobile (Adreno/Mali) — blocked by BrowserStack tunnel infra (not code)

The single-session sweep runs cleanly on macOS (real Apple data above) and locally under SwiftShader (validated end-to-end, incl. the new ?synth=1 no-SRS path). But every BrowserStack Android attempt produced 0 bytes from the device: the page never loads and never POSTs back. Four independent attempts, each using the same mechanism that works on macOS, all made no progress:

  1. default (2²⁰ SRS) — 0 rows in 15 min
  2. ?srsmax=17 (8× smaller SRS) — 0 rows → not download size
  3. ?synth=1 (no SRS at all) — 0 rows → not SRS
  4. cloudflared --protocol http2 (TCP, not QUIC) — 0 rows → not QUIC/UDP

Conclusion: BrowserStack real Android devices cannot reach the cloudflare trycloudflare.com quick-tunnel that run-browserstack.mjs uses (desktop datacenters can). Mobile needs BrowserStack Local, whose access key isn't available to me client-side (held server-side by the MCP), so I can't stand it up. This is an infra gap in the harness, independent of the walker change. (Each attempt deleted its worker; no seats left held.)

Consolidated status

Confirmed on real Apple M2 (the operator's #1 lever works): raising S amortizes the safegcd inversion — accumulate kernel stream_walker −18.6% @ S=20, −22.8% @ S=32.

New finding the framing missed: walker_combine grows ∝ S (finer task-splitting → more boundary partials, 2·NumThreads·S slots), clawing the kernel win back. Net GPU total best at S≈20: −3.8%, real but below the "significant" bar.

Tooling delivered & validated (not just designed): streamS/?ss knob, single-session msm-s-sweep (one seat = whole curve), ?synth=1 (no-SRS, for tunnel/CDN-blocked devices), runner passthrough, SwiftShader driver fixes. Memory scales as analyzed (device pref = 256 KB·S; 2/4/8 MB for S=8/16/32 ≪ 100 MB budget).

What "significant" needs (Stage 2)

Decouple walker_combine cost from S so the S≈20 kernel saving (−18%) passes through to the total: compact the partials (most of the 2·NumThreads·S slots are empty NO_BUCKET) before the index/combine passes, or cap task-splitting granularity so partial count is ~independent of S. Then re-sweep. Mobile coverage additionally needs BrowserStack Local wired into the runner.

Honest bottom line: hypothesis confirmed at the kernel level on real Apple hardware; end-to-end win currently modest (−3.8%) and Apple-only (mobile blocked by tunnel infra). The strict bar is not met; no fabricated numbers.


