feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere#23739
Draft
AztecBot wants to merge 13 commits into
Draft
feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere#23739AztecBot wants to merge 13 commits into
AztecBot wants to merge 13 commits into
Conversation
Sweep walker vs coop across logns in one page load (one BrowserStack worker per device) for the real-hardware comparison.
Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the sweep is not anomalously slow.
…up/scan)
Add a compile-time granularity G to ba_coop_walker controlling how many
threads share one batched inversion:
- G==1 (local): each thread inverts its own dx; no workgroup memory,
no in-loop barriers.
- 1<G<TPB (per-group): per-group serial Montgomery batch inversion; 2
barriers/round regardless of G (vs scan's 2*log2 TPB),
TPB/G inversions run concurrently across group leaders.
- G==TPB (scan): existing workgroup prefix/suffix scan (default).
Wired via MsmV2 coopG config, a coopg query param, and a new
msm-coop-gsweep autorun that benches walker + coop at every G in one
pass. Cross-checked GREEN vs noble (SwiftShader) for G=1,8,16,32,64 at
logn 8 and 10.
…arness fixes - Add accum:'auto' (new default) resolving per-device via adapterInfo; gated behind COOP_AUTO_ON_STARVED_MOBILE (off) so it picks the kernel proven fastest on measurable hardware (walker) until a WebGPU-capable Android A/B proves coop's mobile niche. Documents why coop is, by analysis, dominated by the walker + #23726's var<private> occupancy lever. - msm-accum-ab autorun: emit /progress heartbeats (boot-start, srs, build, rep) under one shared runId so the BrowserStack watchdog survives slow mobile SRS loads; add ?srs_logn=N to cap the SRS download.
Adds ?autorun=env-probe to the dev page: posts UA + navigator.gpu presence + adapter info (vendor/arch/limits) BEFORE any SRS download or pipeline build, then exits. Isolates the three confounded failure modes that made earlier BrowserStack-Android attempts ambiguous (page didn't load / no WebGPU / SRS download stalled over the tunnel) into one decisive signal.
BrowserStack's /5/worker launcher truncates the device URL at the first unescaped '&', so only the first query param survived and every later param silently fell back to its page default (observed: a gsweep meant for logns=12,14,16 ran only the default logn=14, full SRS). Decode an optional ?cfg=<base64 query string> before any URLSearchParams read and rewrite location.search, so a full multi-param config reaches the page as one truncation-proof param.
…ix adapterInfo plumbing - get_device: stash adapter info under device.__adapterInfo (Chrome's device.adapterInfo is a read-only getter — assigning to it throws; the prior selector also read an always-undefined device.adapterInfo, so it could never recognise a mobile GPU). - resolveAccum: 'auto' now selects coop with G=1 on memory-/register-starved mobile GPUs (Adreno/Mali) and the walker on cache-rich desktop GPUs, and returns the per-device default inversion granularity. Measured on a real Galaxy S25 Ultra (Adreno, Chrome 145): coop G=1 is 1.67-2.05x faster than the stream-walker across logN 12/14/16; speedup decays monotonically with G; the workgroup-scan default (G=TPB) is the worst mode and triggered a device-lost at logN>=14, so mobile must use G=1, not the scan.
…ing) Pixel 9 Pro XL (Mali/Tensor, Chrome 145): coop G=1 wins only at logN 12 (1.31x) and regresses at logN 14/16 (0.87x, 0.79x) — Mali does not hide G=1's ~Sx extra safegcd inversions the way Adreno does. So 'auto' now selects coop G=1 only on Adreno/Qualcomm; Mali and all other GPUs keep the walker, which is faster there. Avoids a one-axis Mali regression.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A single adaptive codebase, proven fastest on real Apple, Adreno, and Mali GPUs:
coopaccumulator with inversion-granularity G=1: measured 1.67–2.05× faster than the stream-walker on a Galaxy S25 Ultra across logN 12/14/16.accum:'auto'(the default) reads the adapter vendor/architecture and selects per device. The choice is data-driven, not a blanket "mobile → coop" — Mali measured opposite to Adreno, so coop is selected only on Adreno.The earlier "mobile is BLOCKED" conclusion was wrong
The previous revision claimed BrowserStack's real-Android target served a non-WebGPU "stock browser" and that no Adreno/Mali number was obtainable. That was a blind-diagnostic artifact: the cross-check autorun emitted no early heartbeat and the full-size SRS download stalled over the tunnel, so the page looked dead.
Two harness fixes settle it:
?autorun=env-probe— posts UA +navigator.gpu+ adapter info before any SRS/WebGPU/WASM work.?cfg=<base64>— BrowserStack's/5/workertruncates the device URL at the first&; encoding the whole query as one base64 param survives it.Result: BrowserStack's "Android Browser" on a Galaxy S25 Ultra is Chrome 145 with full WebGPU on a real Adreno (
vendor:qualcomm, architecture:adreno-8xx); the Pixel 9 Pro XL is Chrome 145 on Mali (vendor:arm, architecture:valhall). Both fully measurable.Real-hardware A/B (BrowserStack, min ms over 8 reps, drift-controlled single page load)
coop G=1 speedup vs stream-walker:
Full inversion-granularity sweep (speedup vs walker; G = threads sharing one batched inversion):
Adreno · Galaxy S25 Ultra — coop wins, monotonic in G:
(g64 — the old
coopdefault scan — is worst here and triggered a device-lost at logN≥14.)Mali · Pixel 9 Pro XL — coop wins only at logN 12, regresses at 14/16:
Apple M2 — walker wins at the sizes that matter:
(Apple g64 0.54×/0.46× at 14/16 matches this PR's prior Apple table — same methodology.)
Why G=1, and why Adreno only
coopsets slots-per-thread=1 and shares the affine-add batch inversion across the workgroup. The inversion-granularity knob G controls how many threads share one safegcd:dx: no workgroup memory, no in-loop barriers, one accumulator/thread ⇒ maximal occupancy. But it does ~S× more safegcd inversions than the walker's S-wide batch (no amortisation).2·log2(TPB)barriers/round + a large workgroup-memory footprint.Adreno is memory-/occupancy-bound enough that G=1's huge occupancy gain hides the extra inversions → a real 1.67–2.05× win. Mali does not hide them (it has fewer resident-workgroup headroom / different inversion throughput), so G=1 wins only at tiny logN 12 and loses at 14/16. Apple (cache-rich, not occupancy-starved) loses outright above logN 12. Hence
autoships coop only on Adreno/Qualcomm; everything else stays on the walker. This avoids a one-axis Mali/Apple regression while capturing the Adreno win.Correctness
GPU vs
@noble/curves, headless SwiftShader: GREEN for bothwalkerandcoop, logN 8 and 10 (multiple configs incl.accum:'auto'). All perf runs are cross-checked.What changed
env-probeautorun + base64cfgparam (truncation-proof BrowserStack config).get_device: stash adapter info underdevice.__adapterInfo(Chrome'sdevice.adapterInfois a read-only getter — the prior selector read an always-undefineddevice.adapterInfoand could never recognise a mobile GPU; this was a latent bug that made the old "adaptive" path inert).resolveAccum:auto→ coop G=1 on Adreno/Qualcomm (measured win), walker elsewhere; returns the per-device default granularity. Explicitaccum/coopGalways honoured.Honest status
autocorrectly keeps them on the walker (no regression).qualcomm/adreno-8xx, Maliarm/valhall, Apple, SwiftShader).autoconservatively keeps them on the walker to avoid the larger-n regression.