perf(bb/msm): sweepable stream-walker inversion-amortization knobs (S/TPB/NumThreads) by AztecBot · Pull Request #23735 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T02:37:06Z

Axis A (speed): cut the stream-walker's field-inversion share

Status: the knob infrastructure and GPU-only test harness have landed and pass correctness checks under SwiftShader. Real-hardware BrowserStack timing is pending a free seat (both of the account's 2 seats have been busy throughout this session).

Rationale (from PR #23732, real Apple M2 hardware)

The safegcd field inversion (fr_inv_by_loop_pk, Bernstein-Yang) accounts for ~47% of the ba_stream_walker accumulate kernel's time. The walker does exactly one inversion per batch of S affine adds (Montgomery's trick), so the inversion's per-add cost is |inv|/S. Raising S directly amortizes it:

S	inversion share (model, from 47% @ S=8)
8 (current)	~47%
16	~31%
32	~18%
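The table's figures can be reproduced with a small model. This is a sketch under one assumption that the PR text implies but does not state outright: the non-inversion work per add stays constant as S changes, and only the inversion cost per add scales as 1/S. The function name and parameters are illustrative.

```typescript
// Modeled inversion share of the walker kernel for a given batch size S.
// baseShare is the measured inversion fraction at baseS (0.47 at S=8 per the PR).
function inversionShare(s: number, baseShare = 0.47, baseS = 8): number {
  // Inversion cost per add, in units of total per-add time at baseS.
  const inv = baseShare * (baseS / s);
  // Everything else per add; assumed independent of S (modeling assumption).
  const rest = 1 - baseShare;
  return inv / (inv + rest);
}

for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ~${Math.round(inversionShare(s) * 100)}%`);
}
```

The model ignores any second-order cost of larger batches (register pressure, occupancy), which is what the real-hardware sweep is meant to measure.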

Earlier attempts at memory-load wins (dx-cache) gained only ~1–2% because the kernel is compute-bound on the inversion. Amortizing the inversion is the higher-leverage change.
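For readers unfamiliar with the batching referred to above, here is a minimal sketch of Montgomery's trick: S inverses from one inversion plus a few multiplications per element. It is not the walker's WGSL code. The small prime P, the Fermat-based inv, and all function names are stand-ins chosen so the example runs with plain numbers.

```typescript
const P = 65537; // small prime stand-in for the real field (assumption for illustration)

let inversionCount = 0; // counts calls to the expensive inversion

function mulmod(a: number, b: number): number {
  return (a * b) % P; // products stay below 2^53 for this P
}

// Single inversion via Fermat's little theorem, a^(P-2) mod P.
// The real kernel uses a safegcd inversion; this stand-in only has to be correct.
function inv(a: number): number {
  inversionCount++;
  let result = 1;
  let base = a % P;
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mulmod(result, base);
    base = mulmod(base, base);
    e >>= 1;
  }
  return result;
}

// Montgomery's trick: invert every element with one inversion.
// All inputs must be nonzero mod P.
function batchInverse(xs: number[]): number[] {
  const prefix: number[] = new Array(xs.length);
  let acc = 1;
  for (let i = 0; i < xs.length; i++) {
    prefix[i] = acc; // product of xs[0..i-1]
    acc = mulmod(acc, xs[i]);
  }
  let accInv = inv(acc); // the only inversion in the batch
  const out: number[] = new Array(xs.length);
  for (let i = xs.length - 1; i >= 0; i--) {
    out[i] = mulmod(accInv, prefix[i]); // (x0..xi)^-1 * (x0..x(i-1)) = xi^-1
    accInv = mulmod(accInv, xs[i]); // drop xi from the running inverse
  }
  return out;
}

const batch = [3, 5, 7, 11, 13, 17, 19, 23]; // S = 8 denominators
const inverses = batchInverse(batch);
console.log(inverses.every((v, i) => mulmod(v, batch[i]) === 1), inversionCount);
```

Doubling the batch length leaves the inversion count at one, which is why the inversion's per-add cost falls as 1/S.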

What this change does

STREAM_S (=8), WALKER_TPB (=64) and STREAM_T (NUM_THREADS, =8192) were hardcoded at the MsmV2.create() call site even though shader_manager, ensureScratch, the planner (cumsum/partition_task), the walker, and walker_combine are already fully parameterized by them. This PR un-hardcodes them into MsmConfig:

walkerS — one inversion per S adds (default 8)
walkerTpb — walker workgroup size (default 64)
walkerNumThreads — walker NUM_THREADS (default 8192)

These are exposed as dev-page URL knobs (?walkers=16&walkertpb=32&walkert=4096). There are no shader, buffer, or bind-layout changes: an empty {} config reproduces the old behaviour byte for byte.
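As an illustration of how such URL knobs could map onto the three config fields, here is a hedged sketch. Only the field names, defaults, and query parameter names come from the text above. The WalkerKnobs interface, readKnob, knobsFromQuery, and the fallback-on-malformed-input policy are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical shape holding just the three walker fields named in the PR.
interface WalkerKnobs {
  walkerS: number;
  walkerTpb: number;
  walkerNumThreads: number;
}

// Defaults as stated in the PR description.
const DEFAULT_KNOBS: WalkerKnobs = { walkerS: 8, walkerTpb: 64, walkerNumThreads: 8192 };

// Read a positive-integer query parameter; fall back when absent or malformed.
function readKnob(params: URLSearchParams, name: string, fallback: number): number {
  const raw = params.get(name);
  if (raw === null) return fallback;
  const value = Number(raw);
  const ok = isFinite(value) && Math.floor(value) === value && value > 0;
  return ok ? value : fallback;
}

// Map ?walkers / ?walkertpb / ?walkert onto the config fields.
function knobsFromQuery(query: string): WalkerKnobs {
  const params = new URLSearchParams(query);
  return {
    walkerS: readKnob(params, "walkers", DEFAULT_KNOBS.walkerS),
    walkerTpb: readKnob(params, "walkertpb", DEFAULT_KNOBS.walkerTpb),
    walkerNumThreads: readKnob(params, "walkert", DEFAULT_KNOBS.walkerNumThreads),
  };
}

console.log(knobsFromQuery("?walkers=16&walkertpb=32&walkert=4096"));
```

An empty query yields the defaults, mirroring the statement that an empty config keeps the old behaviour.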

A new GPU-only autorun msm-walker-bench runs MsmV2 without the (stubbed) bb.js WASM: a @noble/curves correctness check at a small size, then timed reps at the bench size, honoring the walker knobs and ?warmup=N. run-browserstack.mjs gains --query to forward the sweep knobs.
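The timing part of such a harness might look roughly like the sketch below: untimed warmup iterations, then timed reps, then a median that prints "n/a" when reps=0 (the case the follow-up commit mentions). The function names, the run callback, and the output format are assumptions for illustration, not the actual msm-walker-bench code.

```typescript
// Median of a list of timings; null when there are no samples (reps=0).
function median(samples: number[]): number | null {
  if (samples.length === 0) return null;
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Format for a summary line; avoids printing NaN for correctness-only runs.
function formatMedianMs(samples: number[]): string {
  const m = median(samples);
  return m === null ? "n/a" : `${m.toFixed(2)} ms`;
}

// Run `warmup` untimed iterations, then `reps` timed ones.
// `run` is a hypothetical callback that resolves once the GPU work is done.
async function timeReps(
  run: () => Promise<void>,
  reps: number,
  warmup: number,
): Promise<number[]> {
  for (let i = 0; i < warmup; i++) await run();
  const samples: number[] = [];
  for (let i = 0; i < reps; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  return samples;
}
```

A caller would pass the MSM invocation as run and log formatMedianMs of the returned samples. With reps=0 the sample list is empty and the summary shows "n/a".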

Why this avoids a memory regression

Workgroup memory: pref_scratch = TPB·S·2 vec4. Pairing larger S with smaller walkerTpb keeps it ≤16 KB (Mali) / ≤32 KB (Apple/Adreno). S=16/TPB=32 = same 16 KB as S=8/TPB=64.
Device memory: walkerPartials (+ linked-list node buffers) scale as NUM_THREADS·S. Lowering walkerNumThreads ∝ 1/S holds NUM_THREADS·S — and peak GPU footprint — constant.
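The two scaling rules above can be checked numerically. This sketch assumes a vec4 occupies 16 bytes (four 32-bit lanes), which is consistent with the stated 16 KB for S=8/TPB=64. The helper names and the budget check are illustrative only.

```typescript
// Assumption: vec4 of 32-bit lanes; matches 64 * 8 * 2 entries = 16 KB.
const VEC4_BYTES = 16;

// Workgroup memory for pref_scratch: TPB * S * 2 vec4 entries.
function prefScratchBytes(tpb: number, s: number): number {
  return tpb * s * 2 * VEC4_BYTES;
}

// Device-side partials relative to the S=8, T=8192 baseline (scales as NUM_THREADS * S).
function devicePartialsScale(numThreads: number, s: number): number {
  return (numThreads * s) / (8192 * 8);
}

// Hypothetical helper: does a config fit a workgroup-memory budget in bytes?
function fitsWorkgroupBudget(tpb: number, s: number, budgetBytes: number): boolean {
  return prefScratchBytes(tpb, s) <= budgetBytes;
}

const configs = [
  { s: 8, tpb: 64, t: 8192 },
  { s: 16, tpb: 32, t: 8192 },
  { s: 16, tpb: 64, t: 4096 },
];
for (const c of configs) {
  const kb = prefScratchBytes(c.tpb, c.s) / 1024;
  const scale = devicePartialsScale(c.t, c.s);
  console.log(`S=${c.s} TPB=${c.tpb} T=${c.t}: ${kb} KB workgroup, ${scale}x partials`);
}
```

Under these assumptions, halving TPB when doubling S holds workgroup memory at 16 KB, and halving NUM_THREADS holds the device partials at 1×. Doing only the latter leaves pref_scratch at 32 KB, above the 16 KB Mali limit quoted above.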

Correctness (local, SwiftShader headless + @noble/curves, logn=10)

config	pref_scratch	device partials (∝ T·S)	noble
S=8 / TPB=64 / T=8192 (baseline = prior default)	16 KB	1×	green (prior)
S=16 / TPB=32 / T=8192	16 KB (unchanged)	2×	MATCH ✓
S=16 / TPB=64 / T=4096 (device-memory-neutral)	32 KB	1× (unchanged)	(running)

The S=16/TPB=32 walker compiled, bound, dispatched and produced a noble-correct result with no WGSL/validation errors — the parameterization is structurally sound.

Remaining for "done" (strict bar)

Real-hardware BrowserStack timing of S=8 baseline vs S=16 on ≥1 Apple, ≥1 Adreno, ≥1 Mali, showing a significant time win with no memory regression. The harness (msm-walker-bench + run-browserstack.mjs --query) is ready; gated only on a free BrowserStack seat (2 shared seats, busy all session). Apple gives per-kernel stream_walker timestamps; Adreno/Mali use GPU wall time (Android Chrome lacks timestamp-query).

🤖 Generated with Claude Code

…-browserstack --query passthrough WASM oracle is a stub, so msm-bench/msm-cross-check (which click Run -> boot WASM) can't benchmark on a clean device. msm-walker-bench runs MsmV2 GPU-only: a noble correctness check at a small size then timed reps at the bench size, honoring the walker S/TPB/NumThreads knobs from the URL, posting median wall + per-phase GPU times. run-browserstack gains --query to forward the sweep knobs to the page URL.

perf(bb/msm): sweepable stream-walker inversion-amortization knobs (S…

3af0f72

…/TPB/NumThreads)

AztecBot added the ci-barretenberg and claudebox labels May 30, 2026

AztecBot added 2 commits May 30, 2026 02:44

test(bb/msm): walker-bench DONE line tolerates reps=0 (n/a instead of…

2b77482

… NaN) Correctness-only runs use reps=0 (the noble check is the signal); guard the median formatting so the summary line stays clean.

AztecBot wants to merge 3 commits into stream-walker-impl from cb/walker-inv-amortize-S
