perf(bb/msm): configurable stream-walker S to amortize the dominant field inversion by AztecBot · Pull Request #23734 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T02:25:07Z

Goal (Axis A — stream-walker SPEED, focus: inversion cost)

Measured fact (real hardware, PR #23732): the safegcd field inversion is ~47% of the stream-walker accumulate kernel on Apple M2. The kernel does exactly one inversion per batch of S affine adds, so inversion-cost-per-add = |inversion| / S. Raising S directly amortizes the single largest cost (S=8 → ~47%, S=16 → ~31%, S=32 → ~18%). This is the highest-leverage speed lever and composes with the device/private pref_scratch work from #23726.
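The S=16 and S=32 figures follow from a simple cost model. A minimal sketch, assuming the per-add cost stays constant as S grows (it ignores register pressure) and anchoring on the measured 47% at S=8; `inversionShare` is a hypothetical helper, not code from this PR:

```typescript
// Hypothetical helper (not from the PR): fraction of kernel time spent in the
// single per-batch inversion, assuming the per-add cost is independent of S.
function inversionShare(s: number, baseS = 8, baseShare = 0.47): number {
  // Normalize one batch at baseS to cost 1:
  // inversion = baseShare, each affine add = (1 - baseShare) / baseS.
  const inversion = baseShare;
  const perAdd = (1 - baseShare) / baseS;
  return inversion / (inversion + s * perAdd);
}

for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ${(inversionShare(s) * 100).toFixed(1)}%`);
}
// prints S=8: 47.0%, S=16: 30.7%, S=32: 18.1%
```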

This PR makes S a first-class, sweepable knob so the per-arch knee can be mapped on real hardware, then baked as the arch-aware default.

What this PR does (so far)

STREAM_S is now configurable via MsmConfig.streamS instead of being hard-coded to 8 inside MsmV2.create(). All dependent pipelines (cumsum, emit, partition-task, walker, walker-combine) and every ∝ S buffer (walkerPref, walkerPartials, accBuf, taskCuts, …) already size off the shared streamS, so the knob flows end-to-end with no other code change. At streamS = 8 the behavior is byte-identical to baseline (the change is inert by default).
?ss=N URL knob wired through both dev harnesses (index.html / main.ts and msm-correctness.ts) → MsmConfig.streamS. A single browser/device session can now sweep S = 8,12,16,20,24,32,… with no rebuild (the WGSL {{ s }} is rendered at runtime via mustache; _generated/shaders.ts keeps the placeholder).
BrowserStack runner passthrough: run-browserstack.mjs --ss N (and --wgi N) so the msm-bench autorun can map the inversion-amortization curve on a real device. On Apple, __lastPhaseMs.stream_walker isolates the accumulate kernel directly; on Adreno/Mali (no timestamp-query) we use GPU wall time.
Builds on #23726 (perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128), specifically its portable var<private> pref_scratch + TPB=128. The device-buffer-at-binding-11 variant is not portable: SwiftShader/Dawn and most Adreno/Mali adapters cap at 10 storage buffers per stage (confirmed below).
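The ?ss=N → MsmConfig.streamS wiring could look roughly like the sketch below. Only `?ss` and `streamS` come from this description; `parseStreamS`, `MsmConfigSketch` and `DEFAULT_STREAM_S` are illustrative names, and the fallback policy for bad input is an assumption:

```typescript
// Hypothetical sketch, not the harness code from this PR.
const DEFAULT_STREAM_S = 8; // baseline value; inert by default

interface MsmConfigSketch {
  streamS: number;
}

function parseStreamS(search: string): MsmConfigSketch {
  // URLSearchParams accepts a query string with or without the leading "?".
  const raw = new URLSearchParams(search).get("ss");
  const n = raw === null ? NaN : Number.parseInt(raw, 10);
  // Assumed policy: missing, non-numeric or non-positive values fall back to 8.
  return { streamS: Number.isInteger(n) && n > 0 ? n : DEFAULT_STREAM_S };
}

console.log(parseStreamS("?ss=16").streamS); // 16
console.log(parseStreamS("").streamS); // 8
```

In a browser harness the argument would typically be `window.location.search`.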

Why private (not device-buffer) pref_scratch

The walker already uses 10 storage bindings (0–9) + 1 uniform. A device pref_scratch would be an 11th storage buffer, which Dawn/SwiftShader rejects (storage buffers (11) … exceeds the maximum per-stage limit (10)) — and the WebGPU baseline for many mobile GPUs is ≤10 too. So the portable way to free workgroup memory for larger S is var<private> (no extra binding) + register acc. That is what #23726 landed and what this PR raises S on top of.
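The portability constraint can be expressed as a pre-flight check against the adapter limits. A small sketch: `maxStorageBuffersPerShaderStage` is the WebGPU limit name, while `canUseDevicePrefScratch` and the default of 10 existing bindings are illustrative, taken from the binding count stated above:

```typescript
// Hypothetical pre-flight check, not code from the PR.
interface LimitsSketch {
  maxStorageBuffersPerShaderStage: number;
}

function canUseDevicePrefScratch(
  limits: LimitsSketch,
  walkerStorageBindings = 10, // bindings 0-9, per the description above
): boolean {
  // A device pref_scratch adds one more storage buffer to the walker stage.
  return walkerStorageBindings + 1 <= limits.maxStorageBuffersPerShaderStage;
}

console.log(canUseDevicePrefScratch({ maxStorageBuffersPerShaderStage: 10 })); // false
console.log(canUseDevicePrefScratch({ maxStorageBuffersPerShaderStage: 16 })); // true
```

In a real session the argument would be the `limits` object of the requested adapter or device.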

Status — numbers PENDING (not done yet)

This is the enabling change. The "done" bar (significant real-hardware time win on ≥1 Apple + ≥1 Adreno + ≥1 Mali, memory not regressed, cross-checked) is not met yet. Next steps on this branch:

BrowserStack msm-bench S-sweep on macos (Apple M2): stream_walker phase + wall at S=8,12,16,20,24,32; find the knee where register pressure (acc_x/acc_y ∝ S) cancels the inversion amortization.
Repeat on s25-ultra (Adreno) and pixel-9-pro-xl (Mali), wall-clock timing.
Real-GPU correctness via autorun=msm-cross-check (noble) at the winning S.
If register pressure caps the win early, Stage 2: move acc into the device scratch buffer (reuse the single binding-11 buffer used for pref where supported, or a packed layout) so S can grow without spilling registers.
Bake the per-arch optimal S as an arch-aware default.

Known blocker (pre-existing): SwiftShader correctness harness

The local msm-correctness SwiftShader harness (added in #23726) currently FAILs even on the base stream-walker-impl branch (GPU result off-curve at logn=8,10), so it is not a usable local gate right now — this is independent of this PR's change (the original "cross-check GREEN" claim, commit 735d5aecf26, predates this harness and used index.html's real-SRS noble check). Likely a SwiftShader software-rasterizer miscompile of the heavy safegcd/Montgomery WGSL, since the same kernels produce correct results on real GPUs. Correctness for perf claims will therefore be gated on real-hardware autorun=msm-cross-check (noble), which is the stronger signal anyway. Driver fixes (TS_ROOT, PW_EXECUTABLE) are included so the harness at least runs headless here.

Memory

streamS raises only the ∝ S device buffers (walkerPref = 2·S·NumThreads·16 B, walkerPartials, accBuf). At NumThreads=8192, walkerPref is 256 KB·S: ~4 MB at S=16 and ~8 MB at S=32, well within the 100 MB budget up to n=2²⁰. statsBytes() (?mem=17,20) is used to confirm no peak-memory regression at the chosen S.
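The walkerPref sizing is direct arithmetic on the formula above. A minimal sketch (`walkerPrefBytes` is a hypothetical helper; the other ∝ S buffers are not modeled because their formulas are not given here):

```typescript
// Hypothetical helper: walkerPref = 2 * S * NumThreads * 16 bytes.
function walkerPrefBytes(s: number, numThreads = 8192): number {
  return 2 * s * numThreads * 16;
}

const MIB = 1024 * 1024;
for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ${walkerPrefBytes(s) / MIB} MiB`);
}
// prints S=8: 2 MiB, S=16: 4 MiB, S=32: 8 MiB
```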

Created by claudebox · group: aztec

AztecBot · 2026-05-30T02:41:52Z

Status update — tooling complete, real-hardware numbers blocked on BrowserStack seats

Landed + validated on this branch:

MsmConfig.streamS / ?ss=N knob — stream-walker S was hard-coded at 8; now sweepable end-to-end (inert at S=8).
Single-session autorun=msm-s-sweep (?slist=8,12,16,20,24,32): loads the SRS once, rebuilds only the MsmV2 pipelines per S (pool kept), and times the stream_walker phase (Apple timestamp-query) + GPU wall per S. One worker maps the whole inversion-amortization curve — important given the 2-seat limit.
BrowserStack runner passthrough: run-browserstack.mjs --ss/--slist/--wgi.
Pre-flight (local SwiftShader): S=16 and S=32 compile and run with no WGSL/validation errors; walker scratch scales as analyzed (statsBytes 12.3 → 22.3 → 42.3 MiB for S=8/16/32 at the probe size; device pref = 256 KB·S → +2/4/8 MB). tsgo typecheck clean.
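The shape of the single-session sweep (parse ?slist, time each S several times, report the median) can be sketched as follows. All names are hypothetical, and `timeOnce` stands in for "rebuild the pipelines at S and time one run", which is not reproduced here:

```typescript
// Hypothetical sketch of the sweep bookkeeping, not the autorun code itself.
function parseSlist(raw: string): number[] {
  return raw
    .split(",")
    .map((t) => Number.parseInt(t.trim(), 10))
    .filter((n) => Number.isInteger(n) && n > 0);
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// timeOnce(s) is an injected stand-in that returns one timing sample in ms.
function sweep(slist: number[], reps: number, timeOnce: (s: number) => number) {
  return slist.map((s) => {
    const samples = Array.from({ length: reps }, () => timeOnce(s));
    return { s, medianMs: median(samples) };
  });
}

console.log(sweep(parseSlist("8,12,16"), 3, (s) => 100 / s));
```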

Sweep command (run per family when a seat frees):

node dev/msm-webgpu/scripts/run-browserstack.mjs --target macos          --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32
node dev/msm-webgpu/scripts/run-browserstack.mjs --target s25-ultra       --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32   # Adreno, wall-clock
node dev/msm-webgpu/scripts/run-browserstack.mjs --target pixel-9-pro-xl  --autorun msm-s-sweep --n 17 --reps 6 --slist 8,12,16,20,24,32   # Mali, wall-clock

Then real-GPU correctness at the winning S via --autorun msm-cross-check.

Blocker: BrowserStack is at 2/2 running (two long-lived Galaxy S25 sessions from other agents) with a 3-deep queue. Per the shared-seat rule I will not preempt; retrying for a free seat. No fabricated numbers — the "done" bar (significant win on Apple+Adreno+Mali, memory not regressed, cross-checked) is not met until these runs land.

Created by claudebox · group: aztec

AztecBot · 2026-05-30T03:57:54Z

Real-hardware data #1 — Apple M2 (macOS Sequoia, Chrome 148), logn=17, reps=4 (medians)

Single-session msm-s-sweep on BrowserStack. Per-phase via timestamp-query (Apple supports it).

| S | `stream_walker` (accumulate kernel), ms | `walker_combine`, ms | GPU total, ms | wall, ms |
|---|---|---|---|---|
| 8 | 63.05 | 5.83 | 81.66 | 84.63 |
| 12 | 57.21 | 9.18 | 79.36 | 82.11 |
| 16 | 54.46 | 11.93 | 79.50 | 82.34 |
| 20 | 51.31 | 13.96 | 78.58 | 81.37 |
| 24 | 50.20 | 18.35 | 80.74 | 83.95 |
| 32 | 48.69 | 23.46 | 85.07 | 88.42 |

The inversion-amortization hypothesis is confirmed. The accumulate kernel (stream_walker, ~47% inversion on M2) drops monotonically with S, as the |inv|/S amortization predicts: −18.6% at S=20, −22.8% at S=32 vs S=8.

But there's a competing cost the operator's framing didn't account for: walker_combine grows ∝ S (5.83 → 23.46 ms, +300%). Higher S splits each thread's bucket stream into more tasks → more boundary partials (2·NumThreads·S slots) → more combine/index work. This claws back most of the kernel win.

Net end-to-end on Apple: best at S≈20, with GPU total down 3.8% (78.58 vs 81.66 ms) and wall down 3.9% (81.37 vs 84.63 ms). Real and repeatable, but modest and below the "significant" bar.
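The percentages quoted above can be re-derived from the table. A short sketch using the measured medians verbatim (variable names are illustrative):

```typescript
// Medians from the Apple M2 sweep table (ms).
const base = { s: 8, walker: 63.05, combine: 5.83, gpuTotal: 81.66, wall: 84.63 };
const rows = [
  base,
  { s: 12, walker: 57.21, combine: 9.18, gpuTotal: 79.36, wall: 82.11 },
  { s: 16, walker: 54.46, combine: 11.93, gpuTotal: 79.5, wall: 82.34 },
  { s: 20, walker: 51.31, combine: 13.96, gpuTotal: 78.58, wall: 81.37 },
  { s: 24, walker: 50.2, combine: 18.35, gpuTotal: 80.74, wall: 83.95 },
  { s: 32, walker: 48.69, combine: 23.46, gpuTotal: 85.07, wall: 88.42 },
];

// Relative change versus the S=8 baseline, in percent.
const pct = (v: number, b: number): number => ((v - b) / b) * 100;

// S with the lowest GPU total.
const best = rows.reduce((a, r) => (r.gpuTotal < a.gpuTotal ? r : a));

for (const r of rows) {
  console.log(
    `S=${r.s} kernel ${pct(r.walker, base.walker).toFixed(1)}% ` +
      `combine ${pct(r.combine, base.combine).toFixed(0)}% ` +
      `total ${pct(r.gpuTotal, base.gpuTotal).toFixed(1)}%`,
  );
}
console.log(`best S by GPU total: ${best.s}`); // 20
```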

Takeaway / revised plan

The inversion lever works (kernel −19%), but a significant end-to-end win needs Stage 2: stop walker_combine from scaling with S. The index/combine passes touch all 2·NumThreads·S partial slots, even though the number of real (boundary) partials is far smaller and only weakly S-dependent. Options: compact partials before combine, or cap task-splitting granularity so partial count is decoupled from S. With combine cost flattened, the S≈20–24 kernel saving (−18 to −20%) would pass through to the total.

Mobile (Adreno/Mali) — blocked by BrowserStack tunnel infra (not code)

The single-session sweep runs cleanly on macOS (real Apple data above) and locally under SwiftShader (validated end-to-end, incl. the new ?synth=1 no-SRS path). But every BrowserStack Android attempt produced 0 bytes from the device: the page never loads and never POSTs back. Four attempts, each using the same mechanism that works on macOS, all made zero progress:

default (2²⁰ SRS) — 0 rows in 15 min
?srsmax=17 (8× smaller SRS) — 0 rows → not download size
?synth=1 (no SRS at all) — 0 rows → not SRS
cloudflared --protocol http2 (TCP, not QUIC) — 0 rows → not QUIC/UDP

Conclusion: BrowserStack real Android devices cannot reach the cloudflare trycloudflare.com quick-tunnel that run-browserstack.mjs uses (desktop datacenters can). Mobile needs BrowserStack Local, whose access key isn't available to me client-side (held server-side by the MCP), so I can't stand it up. This is an infra gap in the harness, independent of the walker change. (Each attempt deleted its worker; no seats left held.)

Consolidated status

Confirmed on real Apple M2 (the operator's #1 lever works): raising S amortizes the safegcd inversion — accumulate kernel stream_walker −18.6% @ S=20, −22.8% @ S=32.

New finding the framing missed: walker_combine grows ∝ S (finer task-splitting → more boundary partials, 2·NumThreads·S slots), clawing the kernel win back. Net GPU total is best at S≈20: −3.8%, real but below the "significant" bar.

Tooling delivered & validated (not just designed): streamS/?ss knob, single-session msm-s-sweep (one seat = whole curve), ?synth=1 (no-SRS, for tunnel/CDN-blocked devices), runner passthrough, SwiftShader driver fixes. Memory scales as analyzed (device pref = 256 KB·S; 2/4/8 MB for S=8/16/32 ≪ 100 MB budget).

What "significant" needs (Stage 2)

Decouple walker_combine cost from S so the S≈20 kernel saving (−18.6%) passes through to the total: compact the partials (most of the 2·NumThreads·S slots are empty NO_BUCKET) before the index/combine passes, or cap task-splitting granularity so partial count is ~independent of S. Then re-sweep. Mobile coverage additionally needs BrowserStack Local wired into the runner.
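The compaction option can be illustrated on the CPU. This is only a sketch of the idea, not the planned GPU pass: the sentinel value of `NO_BUCKET` and the per-slot bucket-id layout are assumptions, and a GPU version would typically derive the write offsets from a prefix sum rather than a sequential counter:

```typescript
// Assumed sentinel for an empty partial slot (the real value is not given here).
const NO_BUCKET = 0xffffffff;

// Return the slot indices of live partials so later passes touch only those.
function compactPartials(bucketIds: Uint32Array): Uint32Array {
  // Pass 1: count live slots to size the output.
  let live = 0;
  for (let i = 0; i < bucketIds.length; i++) {
    if (bucketIds[i] !== NO_BUCKET) live++;
  }
  // Pass 2: write the index of each live slot, preserving order.
  const out = new Uint32Array(live);
  let w = 0;
  for (let i = 0; i < bucketIds.length; i++) {
    if (bucketIds[i] !== NO_BUCKET) out[w++] = i;
  }
  return out;
}

const slots = Uint32Array.of(NO_BUCKET, 5, NO_BUCKET, NO_BUCKET, 7, NO_BUCKET);
console.log(Array.from(compactPartials(slots))); // [1, 4]
```

With this shape, the combine work scales with the number of live partials rather than with the 2·NumThreads·S slot count.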

Honest bottom line: hypothesis confirmed at the kernel level on real Apple hardware; end-to-end win currently modest (−3.8%) and Apple-only (mobile blocked by tunnel infra). The strict bar is not met; no fabricated numbers.

Created by claudebox · group: aztec

AztecBot added 4 commits May 30, 2026 02:14

7918354 perf(bb/msm): device-memory coalesced pref_scratch for stream-walker (KNOB 1), TPB 64→128

7d84eb4 update PR #23726

092545f feat(bb/msm): configurable stream-walker S (streamS knob) for inversion-amortization sweep

82b2a54 test(bb/msm): forward streamS (ss) + wgi knobs through BrowserStack runner for S sweep

AztecBot added the ci-barretenberg (run all barretenberg/cpp checks) and claudebox (owned by claudebox; it can push to this PR) labels May 30, 2026

AztecBot added 2 commits May 30, 2026 02:39

86d1062 test(bb/msm): single-session msm-s-sweep autorun maps inversion-amortization curve on one device

ec5b057 test(bb/msm): forward slist to BrowserStack runner for single-session S sweep

AztecBot added 4 commits May 30, 2026 03:58

e37c555 test(bb/msm): robust msm-s-sweep (descending S build order, progress events)

3601fdf test(bb/msm): ?srsmax cap so mobile sweeps download only needed SRS points

1ee9473 test(bb/msm): ?synth=1 sweep path (no SRS) for SRS-CDN-blocked devices (mobile)

957a10c test(bb/msm): --synth + --srsmax passthrough in BrowserStack runner

8d45aa3 test(bb/msm): force cloudflared http2 (mobile proxies block QUIC)

AztecBot wants to merge 11 commits into stream-walker-impl from cb/walker-large-s-inversion
