perf(bb/msm): stream-walker S-sweep — runtime walker_s knob + 3-device measurement (result: S=8 is already the knee)#23738

Draft
AztecBot wants to merge 4 commits into stream-walker-impl from cb/msm-walker-bigS-inversion

AztecBot commented May 30, 2026

TL;DR — the assigned lever does not pay off, proven on real hardware

The operator hypothesis (Axis A, lever #1): the stream-walker's batched-inversion slot count S=8 is too small; raising S amortises the ~47%-of-kernel field inversion (inversion-cost-per-add = |inversion|/S) and should be a large speed win.
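To make the amortisation argument concrete, here is a minimal sketch of batched inversion (Montgomery's trick) over a prime field using BigInt. It is not the walker's WGSL kernel: the modulus `P` and the helpers `modInv` and `batchInverse` are illustrative assumptions. It shows how a single field inversion can serve S slots at the price of roughly 3(S-1) extra multiplications, which is where the |inversion|/S term comes from. It also shows that the per-slot prefix array grows linearly with S.

```typescript
// Illustrative prime modulus (assumption: any prime works for this sketch).
const P = 2n ** 61n - 1n;

const mod = (a: bigint): bigint => ((a % P) + P) % P;

// One inversion via Fermat's little theorem: a^(P-2) mod P.
// This stands in for the expensive step that batching amortises.
function modInv(a: bigint): bigint {
  let result = 1n;
  let base = mod(a);
  let exp = P - 2n;
  while (exp > 0n) {
    if ((exp & 1n) === 1n) result = mod(result * base);
    base = mod(base * base);
    exp >>= 1n;
  }
  return result;
}

// Montgomery's trick: invert all (nonzero) values with ONE modInv call.
function batchInverse(values: bigint[]): bigint[] {
  const n = values.length;
  if (n === 0) return [];
  // prefix[i] = values[0] * ... * values[i]  (scratch that grows with S)
  const prefix: bigint[] = new Array<bigint>(n);
  let running = 1n;
  for (let i = 0; i < n; i++) {
    running = mod(running * values[i]);
    prefix[i] = running;
  }
  // inv is the inverse of values[0] * ... * values[i] as i walks downward.
  let inv = modInv(running);
  const out: bigint[] = new Array<bigint>(n);
  for (let i = n - 1; i > 0; i--) {
    out[i] = mod(inv * prefix[i - 1]); // isolates 1 / values[i]
    inv = mod(inv * values[i]);        // drops values[i] from the product
  }
  out[0] = inv;
  return out;
}
```

With S=8 inputs the sketch performs one `modInv` and 21 additional multiplications instead of eight separate inversions.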

I made S a runtime knob and swept it on real Apple, Adreno, and Mali GPUs via BrowserStack, with a per-S @noble/curves cross-check. The result is unambiguous and reproducible across all three architectures:

S=8 is already the global optimum. Every other S tested is slower, in both directions: lowering S is worse, and raising S is worse on all three devices (monotonically so on Apple and Adreno; Mali plateaus about 3–4% below S=8). There is no speedup available from S-tuning, so this branch does not clear the "significant win" bar. The walker holds its accumulators (acc_x/acc_y) and pref_scratch in per-thread private memory, so a larger S grows per-thread private state (~33·S u32) and the register/occupancy loss outruns the inversion amortisation. This is a clean negative result that redirects the effort (see "What would actually win").
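As a rough illustration of how private state scales with S, the sketch below converts the ~33·S u32 estimate quoted above into bytes per thread. The constant 33 comes from the text. The function names are illustrative, as is the assumption that every word is an unpadded 4-byte u32. Actual register allocation depends on the shader compiler and the GPU.

```typescript
// Approximate u32 words of private state per unit S (taken from the estimate above).
const WORDS_PER_SLOT = 33;
const BYTES_PER_U32 = 4;

function privateWordsPerThread(s: number): number {
  return WORDS_PER_SLOT * s;
}

function privateBytesPerThread(s: number): number {
  return privateWordsPerThread(s) * BYTES_PER_U32;
}

for (const s of [4, 8, 16, 32]) {
  console.log(
    `S=${s}: ~${privateWordsPerThread(s)} u32 (~${privateBytesPerThread(s)} bytes) per thread`,
  );
}
```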

Real-hardware data (logn=16, median wall ms, BrowserStack)

Apple M2 (macOS Sequoia / Chrome 148, timestamp-query available — gpu ms shown; noble cross-check PASS at every S):

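| S | 2 | 3 | 4 | 5 | 6 | 8 | 12 | 16 | 20 | 24 | 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| wall ms | 72.9 | 67.2 | 60.6 | 57.6 | 55.1 | 50.0 | 51.5 | 52.5 | 55.0 | 57.3 | 62.5 |
| vs S8 | .69× | .75× | .83× | .87× | .91× | 1.00× | .97× | .95× | .91× | .87× | .80× |
| peak MiB | 39.9 | 41.1 | 42.4 | 43.6 | 44.9 | 47.4 | 52.4 | 57.4 | 62.4 | 67.4 | 77.4 |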

Adreno (Samsung S25 Ultra / Snapdragon 8 Elite, Android Chrome — wall time only; WebGPU timestamp-query returns garbage ~12–91 s, confirming the known Adreno issue):

| S | 4 | 6 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| wall ms | 333.6 | 270.2 | 240.6 | 1505.4 | 1766.3 |
| vs S8 | .72× | .89× | 1.00× | .16× | .14× |

Mali (Pixel 9 Pro XL / Tensor G4, Android Chrome — wall time; noble cross-check PASS at every S):

| S | 4 | 6 | 8 | 12 | 16 |
| --- | --- | --- | --- | --- | --- |
| wall ms | 304.8 | 287.2 | 276.6 | 288.5 | 286.3 |
| vs S8 | .91× | .96× | 1.00× | .96× | .97× |

Reading the three curves:

  • Every arch's optimum is S=8. Below 8 the inversion isn't amortised enough; above 8 private-state pressure dominates.
  • Memory is monotonic in S (~1.25 MiB per unit S at the fixed streamNumThreads=8192), so any S>8 is both slower and larger — strictly dominated.
  • Adreno hits a register-spill cliff for S>8 (6× slower at S=12) — the clearest evidence the walker is register/occupancy-bound, not inversion-throughput-bound, at S≥8.
  • Mali is flattest (only ~3–4% loss at S=12–16): more register headroom, but still no win.
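The reading above can be reproduced mechanically from the measured medians. The sketch below copies the wall-time rows from the three tables, picks the fastest S per device, and recomputes the "vs S8" ratio as baseline time divided by time at S. The type and helper names (`Sweep`, `bestS`, `speedVsBaseline`) are illustrative and are not part of the sweep harness.

```typescript
// S -> median wall ms, copied from the tables above.
type Sweep = Record<number, number>;

const apple: Sweep = {
  2: 72.9, 3: 67.2, 4: 60.6, 5: 57.6, 6: 55.1, 8: 50.0,
  12: 51.5, 16: 52.5, 20: 55.0, 24: 57.3, 32: 62.5,
};
const adreno: Sweep = { 4: 333.6, 6: 270.2, 8: 240.6, 12: 1505.4, 16: 1766.3 };
const mali: Sweep = { 4: 304.8, 6: 287.2, 8: 276.6, 12: 288.5, 16: 286.3 };

// The S with the lowest median wall time.
function bestS(sweep: Sweep): number {
  let best = NaN;
  let bestMs = Infinity;
  for (const [s, ms] of Object.entries(sweep)) {
    if (ms < bestMs) {
      bestMs = ms;
      best = Number(s);
    }
  }
  return best;
}

// Relative speed against a baseline S: values below 1 mean slower than baseline.
function speedVsBaseline(sweep: Sweep, baselineS = 8): Map<number, number> {
  const base = sweep[baselineS];
  const out = new Map<number, number>();
  for (const [s, ms] of Object.entries(sweep)) {
    out.set(Number(s), base / ms);
  }
  return out;
}

// Peak-memory slope in MiB per unit S, from the Apple rows at S=8 and S=32.
const mibPerS = (77.4 - 47.4) / (32 - 8);

for (const [name, sweep] of [["apple", apple], ["adreno", adreno], ["mali", mali]] as const) {
  console.log(name, "best S =", bestS(sweep));
}
console.log("MiB per unit S ~", mibPerS.toFixed(2));
```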

What would actually win (evidence-backed next step)

The S-cliff proves the binding constraint is private-memory pressure, not the inversion's arithmetic cost per se. So the real unlock is to move the walker's acc_x/acc_y and pref_scratch out of per-thread private memory into a device storage buffer (the V2 pair-tree pipeline and the older ba_stream_accum already keep accumulators device-side). Then large S would amortise the inversion without spilling registers, and the S>8 curve should flip from "cliff" to "win", especially on Adreno.

Blocker: the walker already binds 10 storage buffers (the SwiftShader/Mali/Adreno floor), so a device scratch buffer needs a freed binding first (e.g. interleave sorted_bucket_list+sorted_count_list, or pack acc+pref into one buffer). That refactor, not S-tuning, is where the inversion-amortisation win lives.

The other operator lever (drop/trim Montgomery form for the kernel's modest mul/reduce mix) attacks the per-mul cost orthogonally and also composes.
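One way to picture the "free a binding by interleaving" idea is the host-side sketch below, which packs two u32 lists into a single buffer laid out as [bucket0, count0, bucket1, count1, ...]. It assumes sorted_bucket_list and sorted_count_list are equal-length u32 arrays, which the description does not state. `interleaveU32` and `readEntry` are hypothetical helpers, not existing code in the pipeline.

```typescript
// Pack two equal-length u32 lists into one buffer so they need a single binding.
// Assumption: both lists have one u32 entry per bucket.
function interleaveU32(buckets: Uint32Array, counts: Uint32Array): Uint32Array {
  if (buckets.length !== counts.length) {
    throw new Error("bucket and count lists must have the same length");
  }
  const packed = new Uint32Array(buckets.length * 2);
  for (let i = 0; i < buckets.length; i++) {
    packed[2 * i] = buckets[i];    // even slots: bucket id
    packed[2 * i + 1] = counts[i]; // odd slots: count for that bucket
  }
  return packed;
}

// Mirror of the indexing a shader would use to read entry i from the packed buffer.
function readEntry(packed: Uint32Array, i: number): { bucket: number; count: number } {
  return { bucket: packed[2 * i], count: packed[2 * i + 1] };
}
```

The trade-off is that every reader of either list changes its indexing, so the freed binding costs a small edit in each kernel that touches these buffers.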

What landed here (reusable infrastructure)

  • msm_v2.ts: MsmConfig.walkerS / walkerTpb runtime knobs (default 8 / 128); all S-dependent buffer sizing already keyed off m.streamS.
  • dev/msm-webgpu/main.ts: autorun=msm-walker-sweep. One BrowserStack seat maps the whole S curve: per-S median/min wall, GPU ms (Apple), pool.statsBytes() peak memory, and an optional ?verify=1 once-computed noble cross-check applied to every S. Fresh pool per S (the pool's realloc keys on bTotal/numThreads, not streamS). Plus a getSearchParams() that re-expands the query when BrowserStack's Android intent truncates the URL at the first & (pass & as %26).
  • run-browserstack.mjs: --walker-s-list, --verify, --no-coi.
  • msm-correctness.ts: forwards walker_s/walker_tpb.
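The query re-expansion workaround can be sketched as a pair of helpers: one that hides every & from the Android intent by sending it as %26, and one that restores the separators page-side before parsing. This is a minimal sketch and not the actual getSearchParams() from main.ts. The function names and the `logn` parameter name are assumptions, and it only works if no parameter value itself contains an ampersand.

```typescript
// Launcher side: build a query string with every '&' separator sent as "%26".
function encodeQueryForIntent(params: Record<string, string>): string {
  const plain = Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return plain.replace(/&/g, "%26");
}

// Page side: turn "%26" back into '&' and parse as usual.
// Limitation: a value containing a literal '&' would be split incorrectly.
function reexpandSearchParams(search: string): URLSearchParams {
  const raw = search.startsWith("?") ? search.slice(1) : search;
  return new URLSearchParams(raw.replace(/%26/gi, "&"));
}

// Usage: in a page this would be reexpandSearchParams(window.location.search).
const query = encodeQueryForIntent({
  autorun: "msm-walker-sweep",
  logn: "16", // hypothetical parameter name
  walker_s: "4,6,8,12,16",
  verify: "1",
});
const parsed = reexpandSearchParams("?" + query);
console.log(parsed.get("walker_s"));
```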

Notes / caveats

  • Local SwiftShader cannot validate this pipeline: it silently no-ops dispatchWorkgroupsIndirect and miscomputes the 13-bit-limb f32 Montgomery/safegcd math, producing off-curve results. This was verified on pristine stream-walker-impl and on #23726 (perf(bb/msm): stream-walker pref_scratch → private memory (frees workgroup occupancy limiter), TPB 64→128). Correctness was therefore gated on real hardware at logn=16 via ?verify=1.
  • Adreno cross-check FAILED at every S including baseline S=8 — a pre-existing pipeline correctness issue on Adreno (most likely f32-Montgomery precision), independent of this change (S=8 here is byte-identical to baseline). Worth a separate look; the timing conclusion (S=8 optimal) stands regardless.
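A cheap first-line sanity check for the "off-curve" failure mode is to test whether a returned affine point satisfies the curve equation before comparing it against a reference. The sketch below assumes the MSM runs over BN254 (y² = x³ + 3 over the BN254 base field), which the description does not state. `isOnCurve` is a hypothetical helper and does not replace the noble cross-check, because an on-curve point can still be the wrong point.

```typescript
// BN254 base-field modulus (assumption: the MSM targets BN254).
const BN254_P =
  21888242871839275222246405745257275088696311157297823662689037894645226208583n;

const reduce = (a: bigint): bigint => ((a % BN254_P) + BN254_P) % BN254_P;

// True when (x, y) satisfies y^2 = x^3 + 3 (mod p).
function isOnCurve(x: bigint, y: bigint): boolean {
  return reduce(y * y) === reduce(x * x * x + 3n);
}

console.log(isOnCurve(1n, 2n)); // (1, 2) is the standard BN254 G1 generator
```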

Base: stream-walker-impl. Draft: this proves a negative on the assigned lever and ships the harness; it is not a merge candidate as a perf win.

AztecBot added the claudebox label (Owned by claudebox. It can push to this PR.) May 30, 2026
…s sweep

Android intents truncate the worker URL at the first unescaped '&', so
multi-param sweep query strings arrived with only autorun set (logN
defaulted to 17, which OOM'd the Adreno device). Pass the query with '&'
encoded as %26 and re-expand it page-side in getSearchParams().
AztecBot changed the title May 30, 2026
from: perf(bb/msm): stream-walker S-sweep — amortise the field inversion (runtime walker_s knob + single-session real-HW sweep)
to: perf(bb/msm): stream-walker S-sweep — runtime walker_s knob + 3-device measurement (result: S=8 is already the knee)