perf(bb/msm): stream-walker S-sweep — runtime walker_s knob + 3-device measurement (result: S=8 is already the knee) by AztecBot · Pull Request #23738 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T02:46:44Z

TL;DR — the assigned lever does not pay off, proven on real hardware

The operator hypothesis (Axis A, lever #1): the stream-walker's batched-inversion slot count S=8 is too small; raising S amortises the ~47%-of-kernel field inversion (inversion-cost-per-add = |inversion|/S) and should be a large speed win.
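For readers unfamiliar with the amortisation being tested: the sketch below shows the generic batched-inversion technique (Montgomery's trick) on plain BigInt field elements. It is an illustration only, not the walker's WGSL kernel; `modPow`, `modInv`, and `batchInverse` are hypothetical names, and the `prefix` array is assumed to play the role that pref_scratch plays in the walker. One inversion is shared by S values at the price of roughly 3 extra multiplications per value, which is where the |inversion|/S figure comes from.

```typescript
// Modular exponentiation by squaring: base^exp mod m.
function modPow(base: bigint, exp: bigint, m: bigint): bigint {
  let result = 1n;
  let b = ((base % m) + m) % m;
  let e = exp;
  while (e > 0n) {
    if ((e & 1n) === 1n) result = (result * b) % m;
    b = (b * b) % m;
    e >>= 1n;
  }
  return result;
}

// One field inversion via Fermat's little theorem (p prime, a != 0 mod p).
function modInv(a: bigint, p: bigint): bigint {
  return modPow(a, p - 2n, p);
}

// Montgomery's trick: invert every element of `values` with a single modInv.
// All values must be non-zero mod p.
function batchInverse(values: bigint[], p: bigint): bigint[] {
  const n = values.length;
  const prefix = new Array<bigint>(n);
  let acc = 1n;
  for (let i = 0; i < n; i++) {
    prefix[i] = acc; // product of values[0..i-1]
    acc = (acc * values[i]) % p;
  }
  let inv = modInv(acc, p); // the only inversion in the whole batch
  const out = new Array<bigint>(n);
  for (let i = n - 1; i >= 0; i--) {
    out[i] = (inv * prefix[i]) % p; // inverse of values[i]
    inv = (inv * values[i]) % p; // now the inverse of the product values[0..i-1]
  }
  return out;
}
```

Note that the prefix array grows linearly with the batch size, which is the per-thread state that the results below show becoming the limiting factor.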

I made S a runtime knob and swept it on real Apple, Adreno, and Mali GPUs via BrowserStack, with a per-S @noble/curves cross-check. The result is unambiguous and reproducible across all three architectures:

S=8 is already the global optimum. Every other S measured is slower, in both directions: lowering S under-amortises the inversion, and raising S costs more than it saves (monotonically so on Apple and Adreno; Mali is flat to within ~1% between S=12 and S=16). There is no speedup available from S-tuning, so this branch does not clear the "significant win" bar. The walker holds its accumulators (acc_x/acc_y) and pref_scratch in per-thread private memory, so a larger S grows per-thread private state (~33·S u32), and the register/occupancy loss outruns the inversion amortisation. This is a clean negative result that redirects the effort (see "What would actually win").

Real-hardware data (logn=16, median wall ms, BrowserStack)

Apple M2 (macOS Sequoia / Chrome 148, timestamp-query available — gpu ms shown; noble cross-check PASS at every S):

S	2	3	4	5	6	8	12	16	20	24	32
wall ms	72.9	67.2	60.6	57.6	55.1	50.0	51.5	52.5	55.0	57.3	62.5
speed vs S=8	.69×	.75×	.83×	.87×	.91×	1.00×	.97×	.95×	.91×	.87×	.80×
peak MiB	39.9	41.1	42.4	43.6	44.9	47.4	52.4	57.4	62.4	67.4	77.4

Adreno (Samsung S25 Ultra / Snapdragon 8 Elite, Android Chrome — wall time only; WebGPU timestamp-query returns garbage ~12–91 s, confirming the known Adreno issue):

S	4	6	8	12	16
wall ms	333.6	270.2	240.6	1505.4	1766.3
speed vs S=8	.72×	.89×	1.00×	.16×	.14×

Mali (Pixel 9 Pro XL / Tensor G4, Android Chrome — wall time; noble cross-check PASS at every S):

S	4	6	8	12	16
wall ms	304.8	287.2	276.6	288.5	286.3
speed vs S=8	.91×	.96×	1.00×	.96×	.97×

Reading the three curves:

Every arch's optimum is S=8. Below 8 the inversion isn't amortised enough; above 8 private-state pressure dominates.
Memory is monotonic in S (~1.25 MiB per unit S at the fixed streamNumThreads=8192), so any S>8 is both slower and larger — strictly dominated.
Adreno hits a register-spill cliff for S>8 (6× slower at S=12) — the clearest evidence the walker is register/occupancy-bound, not inversion-throughput-bound, at S≥8.
Mali is flattest (only ~3–4% loss at S=12–16): more register headroom, but still no win.
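The derived figures in the reading above can be re-checked from the Apple M2 table with a few lines of arithmetic. The sketch below is not part of the harness; the helper names (`speedVsBaseline`, `bestS`, `mibPerUnitS`) are hypothetical, and the ratio row is assumed to be baseline wall time divided by wall time at each S.

```typescript
// Apple M2 rows copied from the table above (logn=16, median wall ms).
const appleM2 = {
  s: [2, 3, 4, 5, 6, 8, 12, 16, 20, 24, 32],
  wallMs: [72.9, 67.2, 60.6, 57.6, 55.1, 50.0, 51.5, 52.5, 55.0, 57.3, 62.5],
  peakMiB: [39.9, 41.1, 42.4, 43.6, 44.9, 47.4, 52.4, 57.4, 62.4, 67.4, 77.4],
};

// Relative speed at each S: baseline wall time / wall time (1.0 at the baseline).
function speedVsBaseline(s: number[], wallMs: number[], baselineS: number): number[] {
  const base = wallMs[s.indexOf(baselineS)];
  return wallMs.map((ms) => base / ms);
}

// The S with the lowest median wall time.
function bestS(s: number[], wallMs: number[]): number {
  let best = 0;
  for (let i = 1; i < wallMs.length; i++) {
    if (wallMs[i] < wallMs[best]) best = i;
  }
  return s[best];
}

// Average peak-memory growth per unit S between the smallest and largest S.
function mibPerUnitS(s: number[], peakMiB: number[]): number {
  const last = s.length - 1;
  return (peakMiB[last] - peakMiB[0]) / (s[last] - s[0]);
}

console.log(bestS(appleM2.s, appleM2.wallMs)); // 8
console.log(mibPerUnitS(appleM2.s, appleM2.peakMiB).toFixed(2)); // "1.25"
```

The ratios recomputed from the rounded medians can differ from the table in the last digit, presumably because the table was built from unrounded timings.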

What would actually win (evidence-backed next step)

The S-cliff proves the binding constraint is private-memory pressure, not the inversion's arithmetic cost per se. So the real unlock is to move the walker's acc_x/acc_y and pref_scratch out of per-thread private memory into a device storage buffer (the V2 pair-tree pipeline and the older ba_stream_accum already keep accumulators device-side). Then large S would amortise the inversion without spilling registers, and the S>8 curve should flip from "cliff" to "win" — especially on Adreno. Blocker: the walker already binds 10 storage buffers (the SwiftShader/Mali/Adreno floor), so a device scratch buffer needs a freed binding first (e.g. interleave sorted_bucket_list+sorted_count_list, or pack acc+pref into one buffer). That refactor — not S-tuning — is where the inversion-amortisation win lives. The other operator lever (drop/trim Montgomery form for the kernel's modest mul/reduce mix) attacks the per-mul cost orthogonally and also composes.
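The binding-budget blocker can be stated as simple bookkeeping. The sketch below is hypothetical and not code from this branch: `StorageLimits` mirrors only the `maxStorageBuffersPerShaderStage` field of WebGPU's `GPUSupportedLimits`, the value 10 is the floor quoted above, and the helper names are invented for illustration.

```typescript
// Minimal stand-in for the one GPUSupportedLimits field this sketch needs.
interface StorageLimits {
  maxStorageBuffersPerShaderStage: number;
}

// How many more storage buffers a shader stage could bind.
function freeStorageBindings(limits: StorageLimits, bound: number): number {
  return Math.max(0, limits.maxStorageBuffersPerShaderStage - bound);
}

// Merging a group of k buffers into one buffer frees k - 1 bindings.
function boundAfterMerging(bound: number, mergedGroupSizes: number[]): number {
  const freed = mergedGroupSizes.reduce((sum, k) => sum + Math.max(0, k - 1), 0);
  return bound - freed;
}

// Figures from the text: 10 bindings in use against a floor of 10.
const floorLimits: StorageLimits = { maxStorageBuffersPerShaderStage: 10 };
const walkerBound = 10;

console.log(freeStorageBindings(floorLimits, walkerBound)); // 0

// Interleaving sorted_bucket_list + sorted_count_list merges 2 buffers into 1.
const afterInterleave = boundAfterMerging(walkerBound, [2]);
console.log(freeStorageBindings(floorLimits, afterInterleave)); // 1
```

In a browser the limits object would come from the adapter or device (for example `adapter.limits`), so the same check could gate the device-scratch path at runtime.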

What landed here (reusable infrastructure)

msm_v2.ts — MsmConfig.walkerS / walkerTpb runtime knobs (default 8 / 128); all S-dependent buffer sizing already keyed off m.streamS.
dev/msm-webgpu/main.ts — autorun=msm-walker-sweep: one BrowserStack seat maps the whole S curve — per-S median/min wall, GPU ms (Apple), pool.statsBytes() peak memory, and an optional ?verify=1 once-computed noble cross-check applied to every S. Fresh pool per S (the pool's realloc keys on bTotal/numThreads, not streamS). Plus a getSearchParams() that re-expands the query when BrowserStack's Android intent truncates the URL at the first & (pass & as %26).
run-browserstack.mjs — --walker-s-list, --verify, --no-coi.
msm-correctness.ts — forwards walker_s/walker_tpb.
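The query re-expansion mentioned for dev/msm-webgpu/main.ts can be sketched as follows. This is an assumption about the approach, not the actual getSearchParams() from the branch: `expandSearchParams` is a hypothetical name, and it simply turns every literal `%26` back into `&` before parsing, which would be wrong for a value that legitimately contains an encoded ampersand.

```typescript
// Re-expand a query string whose separators were passed as %26 so that an
// Android intent does not cut the URL at the first unescaped '&'.
function expandSearchParams(search: string): URLSearchParams {
  const raw = search.startsWith("?") ? search.slice(1) : search;
  // Restore the separators, then let URLSearchParams do the normal parsing.
  return new URLSearchParams(raw.replace(/%26/gi, "&"));
}

// What the page would receive when the sweep is launched with escaped separators.
const params = expandSearchParams("?autorun=msm-walker-sweep%26logn=16%26verify=1");
console.log(params.get("autorun")); // "msm-walker-sweep"
console.log(params.get("logn")); // "16"
console.log(params.get("verify")); // "1"
```

Without the re-expansion, a standard parse would yield a single `autorun` key whose value swallows the remaining parameters.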

Notes / caveats

Local SwiftShader cannot validate this pipeline: it silently no-ops dispatchWorkgroupsIndirect and miscomputes the 13-bit-limb f32 Montgomery/safegcd math, producing off-curve results. This was verified both on pristine stream-walker-impl and on PR #23726 ("perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128"). Correctness was therefore gated on real hardware at logn=16 via ?verify=1.
Adreno cross-check FAILED at every S including baseline S=8 — a pre-existing pipeline correctness issue on Adreno (most likely f32-Montgomery precision), independent of this change (S=8 here is byte-identical to baseline). Worth a separate look; the timing conclusion (S=8 optimal) stands regardless.

Base: stream-walker-impl. Draft: this proves a negative on the assigned lever and ships the harness; it is not a merge candidate as a perf win.

…s sweep Android intents truncate the worker URL at the first unescaped '&', so multi-param sweep query strings arrived with only autorun set (logN defaulted to 17, which OOM'd the Adreno device). Pass the query with '&' encoded as %26 and re-expand it page-side in getSearchParams().

AztecBot added 3 commits May 30, 2026 00:07

perf(bb/msm): device-memory coalesced pref_scratch for stream-walker …

ea23835

…(KNOB 1), TPB 64→128

update PR #23726

a4040e8

perf(bb/msm): stream-walker S-sweep — amortise the field inversion (r…

a324bb1

…untime walker_s knob + single-session real-HW sweep)

AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026

AztecBot changed the title ~~perf(bb/msm): stream-walker S-sweep — amortise the field inversion (runtime walker_s knob + single-session real-HW sweep)~~ perf(bb/msm): stream-walker S-sweep — runtime walker_s knob + 3-device measurement (result: S=8 is already the knee) May 30, 2026

AztecBot wants to merge 4 commits into stream-walker-impl from cb/msm-walker-bigS-inversion