perf(bb/msm): stream-walker S-sweep — runtime walker_s knob + 3-device measurement (result: S=8 is already the knee)#23738
Draft
AztecBot wants to merge 4 commits into
…(KNOB 1), TPB 64→128
…untime walker_s knob + single-session real-HW sweep)
…s sweep
Android intents truncate the worker URL at the first unescaped '&', so multi-param sweep query strings arrived with only autorun set (logN defaulted to 17, which OOM'd the Adreno device). Pass the query with '&' encoded as %26 and re-expand it page-side in getSearchParams().
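The re-expansion described in this commit message could look roughly like the sketch below. It is not the PR's actual getSearchParams(): the function name getSearchParamsSketch and the splitting rule are assumptions made for illustration. It relies on the standard URLSearchParams behaviour of percent-decoding values, so a query sent as `autorun=msm-walker-sweep%26logN=16` arrives as one parameter whose decoded value still contains the smuggled `&logN=16` tail.

```typescript
// Hypothetical sketch; the real getSearchParams() in dev/msm-webgpu/main.ts may differ.
// Assumption: no legitimate parameter value contains a literal '&'.
function getSearchParamsSketch(search: string): URLSearchParams {
  // URLSearchParams percent-decodes values, so "%26" becomes "&" inside the value.
  const decoded = new URLSearchParams(search);
  const expanded = new URLSearchParams();
  decoded.forEach((value, key) => {
    const parts = value.split("&");
    // The first piece is the real value of this key.
    expanded.append(key, parts[0] ?? "");
    // Every later piece is a smuggled "name=value" pair.
    for (let i = 1; i < parts.length; i++) {
      const pair = parts[i] ?? "";
      if (pair === "") continue;
      const eq = pair.indexOf("=");
      if (eq < 0) expanded.append(pair, "");
      else expanded.append(pair.slice(0, eq), pair.slice(eq + 1));
    }
  });
  return expanded;
}

const params = getSearchParamsSketch("?autorun=msm-walker-sweep%26logN=16%26verify=1");
console.log(params.get("autorun")); // "msm-walker-sweep"
console.log(params.get("logN"));    // "16"
```

A query that was never truncated (plain `&` separators) passes through unchanged, so the same code path can serve desktop and Android runs.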
TL;DR — the assigned lever does not pay off, proven on real hardware
The operator hypothesis (Axis A, lever #1): the stream-walker's batched-inversion slot count S=8 is too small; raising S amortises the ~47%-of-kernel field inversion (inversion-cost-per-add = |inversion|/S) and should be a large speed win.
I made S a runtime knob and swept it on real Apple, Adreno, and Mali GPUs via BrowserStack, with a per-S @noble/curves cross-check. The result is unambiguous and reproducible across all three architectures:
S=8 is already the global optimum. Every other S is slower, in both directions. Raising S is monotonically worse; lowering it is worse. There is no speedup available from S-tuning, so this branch does not clear the "significant win" bar. The walker holds its accumulators (acc_x/acc_y) and pref_scratch in per-thread private memory, so larger S grows per-thread private state (~33·S u32) and the register/occupancy loss outruns the inversion amortisation. This is a clean negative result that redirects the effort (see "What would actually win").
Real-hardware data (logn=16, median wall ms, BrowserStack)
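As a yardstick for the measurements, the operator hypothesis from the TL;DR can be written as a toy cost model. The sketch below uses only the two figures quoted in the description (inversion at ~47% of the kernel, ~33·S u32 of private state per thread). It assumes that the 47% share was measured at S=8 and that all non-inversion work is independent of S. The constant and function names are made up for illustration. It shows what the hypothesis predicted, not what the hardware did: the model has no term for register pressure or occupancy, which is the effect the sweep found to dominate.

```typescript
// Idealised model of the operator hypothesis; figures come from the PR text,
// everything else is an assumption for illustration.
const BASELINE_S = 8;
const INVERSION_SHARE_AT_BASELINE = 0.47; // "~47%-of-kernel" (assumed measured at S=8)
const PRIVATE_U32_PER_SLOT = 33;          // "~33·S u32" of per-thread private state

// Per-add cost relative to S=8, if the inversion cost scaled as |inversion|/S
// and nothing else changed.
function idealRelativeCost(s: number): number {
  const fixed = 1 - INVERSION_SHARE_AT_BASELINE;
  const inversion = INVERSION_SHARE_AT_BASELINE * (BASELINE_S / s);
  return fixed + inversion;
}

// Per-thread private state in bytes (a u32 is 4 bytes).
function privateBytesPerThread(s: number): number {
  return PRIVATE_U32_PER_SLOT * s * 4;
}

for (const s of [4, 8, 16, 32]) {
  console.log(
    `S=${s}: ideal relative cost ${idealRelativeCost(s).toFixed(3)}, ` +
      `private state ${privateBytesPerThread(s)} B/thread`,
  );
}
```

Under these assumptions, doubling S to 16 would cut per-add cost to about 0.765 of the S=8 value while doubling private state from 1056 to 2112 bytes per thread. The sweep shows the second effect outweighing the first.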
Apple M2 (macOS Sequoia / Chrome 148, timestamp-query available — gpu ms shown; noble cross-check PASS at every S):
Adreno (Samsung S25 Ultra / Snapdragon 8 Elite, Android Chrome — wall time only; WebGPU timestamp-query returns garbage ~12–91 s, confirming the known Adreno issue):
Mali (Pixel 9 Pro XL / Tensor G4, Android Chrome — wall time; noble cross-check PASS at every S):
Reading the three curves:
streamNumThreads=8192), so any S>8 is both slower and larger: strictly dominated.
What would actually win (evidence-backed next step)
The S-cliff proves the binding constraint is private-memory pressure, not the inversion's arithmetic cost per se. So the real unlock is to move the walker's acc_x/acc_y and pref_scratch out of per-thread private memory into a device storage buffer (the V2 pair-tree pipeline and the older ba_stream_accum already keep accumulators device-side). Then large S would amortise the inversion without spilling registers, and the S>8 curve should flip from "cliff" to "win", especially on Adreno. Blocker: the walker already binds 10 storage buffers (the SwiftShader/Mali/Adreno floor), so a device scratch buffer needs a freed binding first (e.g. interleave sorted_bucket_list + sorted_count_list, or pack acc + pref into one buffer). That refactor, not S-tuning, is where the inversion-amortisation win lives. The other operator lever (drop/trim Montgomery form for the kernel's modest mul/reduce mix) attacks the per-mul cost orthogonally, so the two changes compose.
What landed here (reusable infrastructure)
msm_v2.ts: MsmConfig.walkerS/walkerTpb runtime knobs (default 8 / 128); all S-dependent buffer sizing already keyed off m.streamS.
dev/msm-webgpu/main.ts: autorun=msm-walker-sweep, in which one BrowserStack seat maps the whole S curve: per-S median/min wall, GPU ms (Apple), pool.statsBytes() peak memory, and an optional ?verify=1 once-computed noble cross-check applied to every S. Fresh pool per S (the pool's realloc keys on bTotal/numThreads, not streamS). Plus a getSearchParams() that re-expands the query when BrowserStack's Android intent truncates the URL at the first & (pass & as %26).
run-browserstack.mjs: --walker-s-list, --verify, --no-coi.
msm-correctness.ts: forwards walker_s/walker_tpb.
Notes / caveats
dispatchWorkgroupsIndirect, and miscomputes the 13-bit-limb f32 Montgomery/safegcd math → off-curve). Verified on pristine stream-walker-impl and perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128 #23726. Correctness was therefore gated on real hardware at logn=16 via ?verify=1.
Base: stream-walker-impl. Draft: this proves a negative on the assigned lever and ships the harness; it is not a merge candidate as a perf win.
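For readers who want to reuse the harness, the per-S median/min wall statistic listed under "What landed here" can be sketched as below. This is not the code in dev/msm-webgpu/main.ts: the type and function names (SweepSamples, SweepRow, summariseSweep, fastestS) are hypothetical, and the sketch assumes the wall-time samples for each S have already been collected.

```typescript
// Hypothetical sketch of the sweep summary; names and shapes are assumptions.
interface SweepSamples {
  s: number;        // walker slot count used for this run
  wallMs: number[]; // wall-clock samples in milliseconds
}

interface SweepRow {
  s: number;
  medianMs: number;
  minMs: number;
}

// Median of a non-empty sample set (mean of the two middle values if even).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample set");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// One summary row per S value in the sweep.
function summariseSweep(runs: SweepSamples[]): SweepRow[] {
  return runs.map((run) => ({
    s: run.s,
    medianMs: median(run.wallMs),
    minMs: Math.min(...run.wallMs),
  }));
}

// The S with the lowest median wall time (first one wins on ties).
function fastestS(rows: SweepRow[]): number {
  if (rows.length === 0) throw new Error("empty sweep");
  let best = rows[0]!;
  for (const row of rows) {
    if (row.medianMs < best.medianMs) best = row;
  }
  return best.s;
}
```

Ranking by the median rather than the minimum keeps a single lucky run from deciding the knee, which matters when timings come from shared cloud devices.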