perf(bb/msm): sweepable stream-walker inversion-amortization knobs (S/TPB/NumThreads)#23735
Draft
AztecBot wants to merge 3 commits into
…-browserstack --query passthrough
WASM oracle is a stub, so msm-bench/msm-cross-check (which click Run -> boot WASM) can't benchmark on a clean device. msm-walker-bench runs MsmV2 GPU-only: a noble correctness check at a small size then timed reps at the bench size, honoring the walker S/TPB/NumThreads knobs from the URL, posting median wall + per-phase GPU times. run-browserstack gains --query to forward the sweep knobs to the page URL.
… NaN)
Correctness-only runs use reps=0 (the noble check is the signal); guard the median formatting so the summary line stays clean.
Axis A (speed): cut the stream-walker's field-inversion share
Status: knob infrastructure + GPU-only test harness landed and SwiftShader-correctness-proven. Real-hardware BrowserStack timing pending a free seat (account has been 2/2 busy throughout this session).
Rationale (from PR #23732, real Apple M2 hardware)
The safegcd field inversion (`fr_inv_by_loop_pk`, Bernstein-Yang) is ~47% of the `ba_stream_walker` accumulate kernel. The walker does exactly one inversion per batch of S affine adds (Montgomery's trick), so the inversion's per-add share is |inv|/S. Raising S directly amortizes it.
Prior threads chasing memory-load wins (dx-cache) got only ~1–2% because the kernel is compute-bound on the inversion. Amortizing the inversion is therefore the higher-leverage change.
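The amortization argument can be made concrete with a back-of-envelope model. This is a sketch, not a measurement: it assumes the ~47% inversion share applies at the baseline S=8 and that all non-inversion work per add is independent of S (neither assumption is verified here). The function name `relativeKernelTime` is hypothetical.

```typescript
// Back-of-envelope model of walker kernel time versus S. Not a measurement.
// Assumptions (not verified by the PR):
//   - the inversion takes `invShare` of kernel time at `baselineS`
//   - the remaining per-add work does not change when S changes
function relativeKernelTime(s: number, baselineS = 8, invShare = 0.47): number {
  const otherShare = 1 - invShare;                 // work that does not amortize
  const amortizedInv = invShare * (baselineS / s); // per-add inversion share scales as |inv|/S
  return otherShare + amortizedInv;                // equals 1.0 when s === baselineS
}

for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ${relativeKernelTime(s).toFixed(3)}x baseline kernel time`);
}
```

Under these assumptions S=16 lands at roughly three quarters of the baseline kernel time. The model ignores effects such as register pressure or occupancy changes at larger S, which is why the real-hardware timing is still needed.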
What this change does
`STREAM_S` (=8), `WALKER_TPB` (=64) and `STREAM_T` (NUM_THREADS, =8192) were hardcoded at the `MsmV2.create()` call site even though `shader_manager`, `ensureScratch`, the planner (cumsum/partition_task), the walker, and walker_combine are already fully parameterized by them. This PR un-hardcodes them into `MsmConfig`:

- `walkerS`: one inversion per S adds (default 8)
- `walkerTpb`: walker workgroup size (default 64)
- `walkerNumThreads`: walker NUM_THREADS (default 8192)

exposed as dev-page URL knobs (`?walkers=16&walkertpb=32&walkert=4096`). No shader/buffer/bind-layout changes: a `{}` config is byte-for-byte the old behaviour.
A new GPU-only autorun `msm-walker-bench` runs MsmV2 without the (stubbed) bb.js WASM: a `@noble/curves` correctness check at a small size, then timed reps at the bench size, honoring the walker knobs and `?warmup=N`. `run-browserstack.mjs` gains `--query` to forward the sweep knobs.
Why this avoids a memory regression

- `pref_scratch = TPB·S·2` vec4. Pairing larger S with smaller `walkerTpb` keeps it ≤16 KB (Mali) / ≤32 KB (Apple/Adreno). S=16/TPB=32 = same 16 KB as S=8/TPB=64.
- `walkerPartials` (+ linked-list node buffers) scale as `NUM_THREADS·S`. Lowering `walkerNumThreads` ∝ 1/S holds `NUM_THREADS·S`, and with it peak GPU footprint, constant.

Correctness (local, SwiftShader headless + @noble/curves, logn=10)
The S=16/TPB=32 walker compiled, bound, dispatched and produced a noble-correct result with no WGSL/validation errors — the parameterization is structurally sound.
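The S=16/TPB=32 configuration checked here is the memory-neutral pairing described under "Why this avoids a memory regression". A minimal sketch of that arithmetic follows, along with the matching dev-page query string. The knob names, defaults, and URL keys come from this PR; the helper names (`pairedKnobs`, `scratchBytes`, `toQuery`) are hypothetical, and the 16-byte vec4 is an assumption chosen because it reproduces the 16 KB figure quoted for S=8/TPB=64.

```typescript
// Hypothetical helpers; only the knob names, defaults and URL keys are from the PR text.
interface WalkerKnobs {
  walkerS: number;
  walkerTpb: number;
  walkerNumThreads: number;
}

const BASELINE: WalkerKnobs = { walkerS: 8, walkerTpb: 64, walkerNumThreads: 8192 };
const VEC4_BYTES = 16; // assumption: a vec4 of 32-bit lanes

// pref_scratch = TPB * S * 2 vec4, expressed in bytes.
function scratchBytes(k: WalkerKnobs): number {
  return k.walkerTpb * k.walkerS * 2 * VEC4_BYTES;
}

// Scale TPB and NUM_THREADS by 1/S so TPB*S and NUM_THREADS*S stay at baseline.
function pairedKnobs(s: number): WalkerKnobs {
  const tpb = (BASELINE.walkerTpb * BASELINE.walkerS) / s;
  const threads = (BASELINE.walkerNumThreads * BASELINE.walkerS) / s;
  if (!Number.isInteger(tpb) || !Number.isInteger(threads)) {
    throw new Error(`S=${s} does not divide the baseline products evenly`);
  }
  return { walkerS: s, walkerTpb: tpb, walkerNumThreads: threads };
}

// Render the knobs with the URL keys used by the dev page.
function toQuery(k: WalkerKnobs): string {
  return new URLSearchParams({
    walkers: String(k.walkerS),
    walkertpb: String(k.walkerTpb),
    walkert: String(k.walkerNumThreads),
  }).toString();
}

const s16 = pairedKnobs(16);
console.log(toQuery(s16), `${scratchBytes(s16)} bytes of scratch`);
```

For S=16 this reproduces the example knobs from the description (TPB=32, NUM_THREADS=4096) with the same scratch size as the baseline. How the resulting string is passed to `run-browserstack.mjs --query` is not shown here.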
Remaining for "done" (strict bar)
Real-hardware BrowserStack timing of S=8 baseline vs S=16 on ≥1 Apple, ≥1 Adreno, ≥1 Mali, showing a significant time win with no memory regression. The harness (`msm-walker-bench` + `run-browserstack.mjs --query`) is ready; gated only on a free BrowserStack seat (2 shared seats, busy all session). Apple gives per-kernel `stream_walker` timestamps; Adreno/Mali use GPU wall time (Android Chrome lacks timestamp-query).
🤖 Generated with Claude Code