Skip to content

feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere#23739

Draft
AztecBot wants to merge 13 commits into
stream-walker-implfrom
cb/msm-coop-walker
Draft

feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere#23739
AztecBot wants to merge 13 commits into
stream-walker-implfrom
cb/msm-coop-walker

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

@AztecBot AztecBot commented May 30, 2026

Summary

A single adaptive codebase, proven fastest on real Apple, Adreno, and Mali GPUs:

  • Adreno (Qualcomm)coop accumulator with inversion-granularity G=1: measured 1.67–2.05× faster than the stream-walker on a Galaxy S25 Ultra across logN 12/14/16.
  • Mali, Apple, and everything else → the stream-walker, which is fastest there (coop regresses).

accum:'auto' (the default) reads the adapter vendor/architecture and selects per device. The choice is data-driven, not a blanket "mobile → coop" — Mali measured opposite to Adreno, so coop is selected only on Adreno.

The earlier "mobile is BLOCKED" conclusion was wrong

The previous revision claimed BrowserStack's real-Android target served a non-WebGPU "stock browser" and that no Adreno/Mali number was obtainable. That was a blind-diagnostic artifact: the cross-check autorun emitted no early heartbeat and the full-size SRS download stalled over the tunnel, so the page looked dead.

Two harness fixes settle it:

  • ?autorun=env-probe — posts UA + navigator.gpu + adapter info before any SRS/WebGPU/WASM work.
  • ?cfg=<base64> — BrowserStack's /5/worker truncates the device URL at the first &; encoding the whole query as one base64 param survives it.

Result: BrowserStack's "Android Browser" on a Galaxy S25 Ultra is Chrome 145 with full WebGPU on a real Adreno (vendor:qualcomm, architecture:adreno-8xx); the Pixel 9 Pro XL is Chrome 145 on Mali (vendor:arm, architecture:valhall). Both fully measurable.

Real-hardware A/B (BrowserStack, min ms over 8 reps, drift-controlled single page load)

coop G=1 speedup vs stream-walker:

logN Apple M2 (Chrome 148) Adreno · S25 Ultra (Chrome 145) Mali · Pixel 9 Pro XL (Chrome 145)
12 1.16× 1.67× 1.31×
14 0.61× 2.05× 0.87×
16 0.52× 1.85× 0.79×

Full inversion-granularity sweep (speedup vs walker; G = threads sharing one batched inversion):

Adreno · Galaxy S25 Ultra — coop wins, monotonic in G:

logN walker (ms) g1 g8 g16 g32
12 52.2 1.67× 1.51× 1.35× 1.14×
14 129.9 2.05× 1.56× 1.13× 0.99×
16 274.9 1.85× 1.49× 1.28× 0.96×

(g64 — the old coop default scan — is worst here and triggered a device-lost at logN≥14.)

Mali · Pixel 9 Pro XL — coop wins only at logN 12, regresses at 14/16:

logN walker (ms) g1 g8 g16 g32
12 103.3 1.31× 0.77× 0.59× 0.41×
14 173.7 0.87× 0.45× 0.33× 0.20×
16 352.1 0.79× 0.39× 0.33× 0.18×

Apple M2 — walker wins at the sizes that matter:

logN walker (ms) g1 g8 g16 g32 g64
12 15.1 1.16× 0.85× 0.64× 0.44× 1.02×
14 23.4 0.61× 0.41× 0.30× 0.19× 0.54×
16 49.7 0.52× 0.36× 0.26× 0.17× 0.46×

(Apple g64 0.54×/0.46× at 14/16 matches this PR's prior Apple table — same methodology.)

Why G=1, and why Adreno only

coop sets slots-per-thread=1 and shares the affine-add batch inversion across the workgroup. The inversion-granularity knob G controls how many threads share one safegcd:

  • G=1 — each thread inverts its own dx: no workgroup memory, no in-loop barriers, one accumulator/thread ⇒ maximal occupancy. But it does ~S× more safegcd inversions than the walker's S-wide batch (no amortisation).
  • G=TPB (the old default scan) — one inversion/workgroup, but 2·log2(TPB) barriers/round + a large workgroup-memory footprint.

Adreno is memory-/occupancy-bound enough that G=1's huge occupancy gain hides the extra inversions → a real 1.67–2.05× win. Mali does not hide them (it has fewer resident-workgroup headroom / different inversion throughput), so G=1 wins only at tiny logN 12 and loses at 14/16. Apple (cache-rich, not occupancy-starved) loses outright above logN 12. Hence auto ships coop only on Adreno/Qualcomm; everything else stays on the walker. This avoids a one-axis Mali/Apple regression while capturing the Adreno win.

Correctness

GPU vs @noble/curves, headless SwiftShader: GREEN for both walker and coop, logN 8 and 10 (multiple configs incl. accum:'auto'). All perf runs are cross-checked.

What changed

  • env-probe autorun + base64 cfg param (truncation-proof BrowserStack config).
  • get_device: stash adapter info under device.__adapterInfo (Chrome's device.adapterInfo is a read-only getter — the prior selector read an always-undefined device.adapterInfo and could never recognise a mobile GPU; this was a latent bug that made the old "adaptive" path inert).
  • resolveAccum: auto → coop G=1 on Adreno/Qualcomm (measured win), walker elsewhere; returns the per-device default granularity. Explicit accum/coopG always honoured.

Honest status

  • ✅ coop-walker kernel + granularity knob G, correct (noble/SwiftShader), drop-in.
  • ✅ Real-HW proven: coop G=1 is 1.67–2.05× over the walker on Adreno (S25, all of logN 12/14/16).
  • ✅ Mali + Apple measured → coop regresses at logN≥14 → auto correctly keeps them on the walker (no regression).
  • ✅ Adaptive selector verified against real adapter strings (Adreno qualcomm/adreno-8xx, Mali arm/valhall, Apple, SwiftShader).
  • Caveats: one device per GPU family; the workgroup-scan mode (G=TPB) is unstable on Adreno (device-lost at logN≥14) so it is never auto-selected; at logN 12 coop-G=1 also edges out the walker on Mali/Apple, but auto conservatively keeps them on the walker to avoid the larger-n regression.

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 30, 2026
AztecBot added 5 commits May 30, 2026 03:44
Sweep walker vs coop across logns in one page load (one BrowserStack worker
per device) for the real-hardware comparison.
Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the
sweep is not anomalously slow.
AztecBot added 3 commits May 30, 2026 07:43
…up/scan)

Add a compile-time granularity G to ba_coop_walker controlling how many
threads share one batched inversion:
  - G==1   (local):      each thread inverts its own dx; no workgroup memory,
                         no in-loop barriers.
  - 1<G<TPB (per-group): per-group serial Montgomery batch inversion; 2
                         barriers/round regardless of G (vs scan's 2*log2 TPB),
                         TPB/G inversions run concurrently across group leaders.
  - G==TPB (scan):       existing workgroup prefix/suffix scan (default).

Wired via MsmV2 coopG config, a coopg query param, and a new
msm-coop-gsweep autorun that benches walker + coop at every G in one
pass. Cross-checked GREEN vs noble (SwiftShader) for G=1,8,16,32,64 at
logn 8 and 10.
…arness fixes

- Add accum:'auto' (new default) resolving per-device via adapterInfo; gated
  behind COOP_AUTO_ON_STARVED_MOBILE (off) so it picks the kernel proven
  fastest on measurable hardware (walker) until a WebGPU-capable Android A/B
  proves coop's mobile niche. Documents why coop is, by analysis, dominated by
  the walker + #23726's var<private> occupancy lever.
- msm-accum-ab autorun: emit /progress heartbeats (boot-start, srs, build, rep)
  under one shared runId so the BrowserStack watchdog survives slow mobile SRS
  loads; add ?srs_logn=N to cap the SRS download.
Adds ?autorun=env-probe to the dev page: posts UA + navigator.gpu presence
+ adapter info (vendor/arch/limits) BEFORE any SRS download or pipeline
build, then exits. Isolates the three confounded failure modes that made
earlier BrowserStack-Android attempts ambiguous (page didn't load / no
WebGPU / SRS download stalled over the tunnel) into one decisive signal.
AztecBot added 3 commits May 30, 2026 10:52
BrowserStack's /5/worker launcher truncates the device URL at the first
unescaped '&', so only the first query param survived and every later
param silently fell back to its page default (observed: a gsweep meant
for logns=12,14,16 ran only the default logn=14, full SRS). Decode an
optional ?cfg=<base64 query string> before any URLSearchParams read and
rewrite location.search, so a full multi-param config reaches the page as
one truncation-proof param.
…ix adapterInfo plumbing

- get_device: stash adapter info under device.__adapterInfo (Chrome's
  device.adapterInfo is a read-only getter — assigning to it throws; the
  prior selector also read an always-undefined device.adapterInfo, so it
  could never recognise a mobile GPU).
- resolveAccum: 'auto' now selects coop with G=1 on memory-/register-starved
  mobile GPUs (Adreno/Mali) and the walker on cache-rich desktop GPUs, and
  returns the per-device default inversion granularity. Measured on a real
  Galaxy S25 Ultra (Adreno, Chrome 145): coop G=1 is 1.67-2.05x faster than
  the stream-walker across logN 12/14/16; speedup decays monotonically with
  G; the workgroup-scan default (G=TPB) is the worst mode and triggered a
  device-lost at logN>=14, so mobile must use G=1, not the scan.
…ing)

Pixel 9 Pro XL (Mali/Tensor, Chrome 145): coop G=1 wins only at logN 12
(1.31x) and regresses at logN 14/16 (0.87x, 0.79x) — Mali does not hide
G=1's ~Sx extra safegcd inversions the way Adreno does. So 'auto' now
selects coop G=1 only on Adreno/Qualcomm; Mali and all other GPUs keep the
walker, which is faster there. Avoids a one-axis Mali regression.
@AztecBot AztecBot changed the title feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant