perf(bb/msm): configurable stream-walker S to amortize the dominant field inversion#23734
AztecBot wants to merge 11 commits into
…(KNOB 1), TPB 64→128
…on-amortization sweep
…unner for S sweep
…ization curve on one device
Status update: tooling complete, real-hardware numbers blocked on BrowserStack seats
Landed + validated on this branch:
Sweep command (run per family when a seat frees):
Then real-GPU correctness at the winning S via
Blocker: BrowserStack is at 2/2 running (two long-lived Galaxy S25 sessions from other agents) with a 3-deep queue. Per the shared-seat rule I will not preempt; retrying for a free seat. No fabricated numbers: the "done" bar (significant win on Apple+Adreno+Mali, memory not regressed, cross-checked) is not met until these runs land.
Created by claudebox · group:
Real-hardware data #1: Apple M2 (macOS Sequoia, Chrome 148), logn=17, reps=4 (medians)
Single-session
The inversion-amortization hypothesis is confirmed. The accumulate kernel (
But there's a competing cost the operator's framing didn't account for:
Net end-to-end on Apple: best at S≈20 → −3.8% GPU wall (78.58 vs 81.66 ms). Real and repeatable, but modest, below the "significant" bar.
Takeaway / revised plan
The inversion lever works (kernel −19%), but a significant end-to-end win needs Stage 2: stop
Next
No fabrication: this is one device, net win modest; the strict bar is not met yet.
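The amortization argument behind this result can be written as a simple cost model: one inversion of cost `I` per batch plus `S` affine adds of cost `a` each, so the inversion's share of kernel time is `I / (I + S·a)`. The sketch below calibrates `I/a` from the ~47% share at `S=8` reported in the PR description and evaluates other values of `S`. The function names are hypothetical, and the model deliberately ignores register pressure and any other `S`-dependent cost, so it illustrates only the amortization term, not measured kernel times.

```typescript
// Hypothetical cost model: one inversion (cost I) per batch of S affine adds (cost a each).
// Returns the ratio I/a implied by a measured inversion share at a reference S.
function inversionToAddRatio(shareAtRef: number, refS: number): number {
  // share = I / (I + refS * a)  =>  I / a = refS * share / (1 - share)
  return (refS * shareAtRef) / (1 - shareAtRef);
}

// Predicted inversion share of kernel time at batch size s, for a given I/a ratio.
function inversionShare(ratio: number, s: number): number {
  return ratio / (ratio + s);
}

const ratio = inversionToAddRatio(0.47, 8); // calibrated from ~47% at S=8
const shares = [8, 16, 32].map((s) => Math.round(inversionShare(ratio, s) * 100));
console.log(shares); // [47, 31, 18]
```

The rounded shares match the `S=8` → ~47%, `S=16` → ~31%, `S=32` → ~18% figures quoted in the PR description, which suggests those figures come from the same one-inversion-per-batch model.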
Mobile (Adreno/Mali): blocked by BrowserStack tunnel infra (not code)
The single-session sweep runs cleanly on macOS (real Apple data above) and locally under SwiftShader (validated end-to-end, incl. the new
Conclusion: BrowserStack real Android devices cannot reach the cloudflare
Consolidated status
Confirmed on real Apple M2 (the operator's #1 lever works): raising S amortizes the safegcd inversion: accumulate kernel
New finding the framing missed:
Tooling delivered & validated (not just designed):
What "significant" needs (Stage 2)
Decouple
Honest bottom line: hypothesis confirmed at the kernel level on real Apple hardware; end-to-end win currently modest (−3.8%) and Apple-only (mobile blocked by tunnel infra). The strict bar is not met; no fabricated numbers.
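The sweep tooling in this PR maps a `?ss=N` URL parameter onto `MsmConfig.streamS`. A minimal sketch of such a mapping follows, assuming a fallback to the baseline value `8` when the parameter is missing or malformed. The function name, the validation policy, and the config shape are hypothetical and are not taken from the harness code.

```typescript
// Hypothetical helper: read the stream-walker batch size S from a URL query string.
// Falls back to the baseline (8) when `ss` is absent or not a positive integer.
function streamSFromQuery(search: string, fallback = 8): number {
  const raw = new URLSearchParams(search).get("ss");
  if (raw === null) return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Example: build the config fragment a harness could pass on (shape assumed).
const config = { streamS: streamSFromQuery("?ss=16") };
console.log(config.streamS); // 16
```

Falling back to `8` keeps a page opened without the parameter on the baseline configuration, in line with the PR's statement that the change is inert by default.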
Goal (Axis A — stream-walker SPEED, focus: inversion cost)
Measured fact (real hardware, PR #23732): the safegcd field inversion is ~47% of the stream-walker accumulate kernel on Apple M2. The kernel does exactly one inversion per batch of `S` affine adds, so inversion-cost-per-add = `|inversion| / S`. Raising `S` directly amortizes the single largest cost (`S=8` → ~47%, `S=16` → ~31%, `S=32` → ~18%). This is the highest-leverage speed lever and composes with the device/private `pref_scratch` work from #23726.
This PR makes `S` a first-class, sweepable knob so the per-arch knee can be mapped on real hardware, then baked as the arch-aware default.
What this PR does (so far)
- `STREAM_S` is now configurable via `MsmConfig.streamS` instead of being hard-coded to `8` inside `MsmV2.create()`. All dependent pipelines (cumsum, emit, partition-task, walker, walker-combine) and every `∝ S` buffer (`walkerPref`, `walkerPartials`, `accBuf`, `taskCuts`, …) already size off the shared `streamS`, so the knob flows end-to-end with no other code change. At `streamS = 8` the behavior is byte-identical to baseline (the change is inert by default).
- `?ss=N` URL knob wired through both dev harnesses (`index.html`/`main.ts` and `msm-correctness.ts`) → `MsmConfig.streamS`. A single browser/device session can now sweep `S = 8,12,16,20,24,32,…` with no rebuild (the WGSL `{{ s }}` is rendered at runtime via mustache; `_generated/shaders.ts` keeps the placeholder).
- `run-browserstack.mjs --ss N` (and `--wgi N`) so the `msm-bench` autorun can map the inversion-amortization curve on a real device. On Apple, `__lastPhaseMs.stream_walker` isolates the accumulate kernel directly; on Adreno/Mali (no timestamp-query) we use GPU wall time.
- `var<private> pref_scratch` + TPB=128 (the device-buffer-at-binding-11 variant is not portable: SwiftShader/Dawn and most Adreno/Mali adapters cap at 10 storage buffers per stage, confirmed below).
Why private (not device-buffer) pref_scratch
The walker already uses 10 storage bindings (0–9) + 1 uniform. A device `pref_scratch` would be an 11th storage buffer, which Dawn/SwiftShader rejects (`storage buffers (11) … exceeds the maximum per-stage limit (10)`), and the WebGPU baseline for many mobile GPUs is ≤10 too. So the portable way to free workgroup memory for larger `S` is `var<private>` (no extra binding) + register `acc`. That is what #23726 landed and what this PR raises `S` on top of.
Status: numbers PENDING (not done yet)
This is the enabling change. The "done" bar (significant real-hardware time win on ≥1 Apple + ≥1 Adreno + ≥1 Mali, memory not regressed, cross-checked) is not met yet. Next steps on this branch:
- `msm-bench` S-sweep on macOS (Apple M2): `stream_walker` phase + wall at `S=8,12,16,20,24,32`; find the knee where register pressure (`acc_x`/`acc_y` `∝ S`) cancels the inversion amortization.
- `autorun=msm-cross-check` (noble) at the winning `S`.
- `acc` into the device scratch buffer (reuse the single binding-11 buffer used for pref where supported, or a packed layout) so `S` can grow without spilling registers.
- `S` as an arch-aware default.
Known blocker (pre-existing): SwiftShader correctness harness
The local `msm-correctness` SwiftShader harness (added in #23726) currently FAILs even on the base `stream-walker-impl` branch (GPU result off-curve at logn=8,10), so it is not a usable local gate right now. This is independent of this PR's change (the original "cross-check GREEN" claim, commit `735d5aecf26`, predates this harness and used `index.html`'s real-SRS noble check). Likely a SwiftShader software-rasterizer miscompile of the heavy safegcd/Montgomery WGSL, since the same kernels produce correct results on real GPUs. Correctness for perf claims will therefore be gated on real-hardware `autorun=msm-cross-check` (noble), which is the stronger signal anyway. Driver fixes (`TS_ROOT`, `PW_EXECUTABLE`) are included so the harness at least runs headless here.
Memory
`streamS` raises only `∝ S` device buffers (`walkerPref = 2·S·NumThreads·16 B`, `walkerPartials`, `accBuf`). At `NumThreads=8192`: `S=16` → ~4 MB, `S=32` → ~8 MB total for pref (versus ~2 MB at `S=8`), well within the 100 MB budget to n=2²⁰. `statsBytes()` (`?mem=17,20`) is used to confirm no peak-memory regression at the chosen `S`.
Created by claudebox · group: aztec
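The `walkerPref` term of the memory estimate above follows directly from the stated formula `2·S·NumThreads·16 B`. A minimal sketch of that arithmetic, with a hypothetical function name; it covers only `walkerPref`, not `walkerPartials` or `accBuf`, whose sizing formulas are not given here:

```typescript
// Size of the walkerPref buffer in bytes, per the formula 2 · S · NumThreads · 16 B.
function walkerPrefBytes(streamS: number, numThreads: number): number {
  return 2 * streamS * numThreads * 16;
}

const MiB = 1024 * 1024;
for (const s of [8, 16, 32]) {
  console.log(`S=${s}: ${walkerPrefBytes(s, 8192) / MiB} MiB`);
}
// S=8: 2 MiB
// S=16: 4 MiB
// S=32: 8 MiB
```

Because the buffer scales linearly in `S`, doubling `S` doubles this term, which stays small next to the 100 MB budget.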