feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere by AztecBot · Pull Request #23739 · AztecProtocol/aztec-packages

AztecBot · 2026-05-30T03:23:35Z

Summary

A single adaptive codebase, proven fastest on real Apple, Adreno, and Mali GPUs:

Adreno (Qualcomm) → coop accumulator with inversion-granularity G=1: measured 1.67–2.05× faster than the stream-walker on a Galaxy S25 Ultra across logN 12/14/16.
Mali, Apple, and everything else → the stream-walker, which is fastest there at logN 14 and 16 (coop G=1 is ahead only at logN 12 and regresses above it).

accum:'auto' (the default) reads the adapter vendor/architecture and selects per device. The choice is data-driven, not a blanket "mobile → coop" — Mali measured opposite to Adreno, so coop is selected only on Adreno.

The earlier "mobile is BLOCKED" conclusion was wrong

The previous revision claimed BrowserStack's real-Android target served a non-WebGPU "stock browser" and that no Adreno/Mali number was obtainable. That was a blind-diagnostic artifact: the cross-check autorun emitted no early heartbeat and the full-size SRS download stalled over the tunnel, so the page looked dead.

Two harness fixes settle it:

?autorun=env-probe — posts UA + navigator.gpu + adapter info before any SRS/WebGPU/WASM work.
?cfg=<base64> — BrowserStack's /5/worker truncates the device URL at the first &; encoding the whole query as one base64 param survives it.

Result: BrowserStack's "Android Browser" on a Galaxy S25 Ultra is Chrome 145 with full WebGPU on a real Adreno (vendor:qualcomm, architecture:adreno-8xx); the Pixel 9 Pro XL is Chrome 145 on Mali (vendor:arm, architecture:valhall). Both fully measurable.
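The ?cfg=<base64> fix can be sketched roughly as below. This is a minimal illustration, not the PR's code: expandCfg and truncateAtAmp are hypothetical names, and the sketch assumes the payload is the base64 encoding of the whole query string. It uses only the standard atob, btoa, and URLSearchParams globals.

```typescript
// Hypothetical helper: expand a single ?cfg=<base64> parameter into the full
// query string it encodes. Assumes the payload is base64 of e.g. "a=1&b=2".
function expandCfg(search: string): string {
  const cfg = new URLSearchParams(search).get("cfg");
  if (cfg === null) return search; // no cfg param: leave the query untouched
  // URLSearchParams decodes '+' as a space, so restore it; also accept the
  // URL-safe base64 alphabet ('-' and '_').
  const b64 = cfg.replace(/ /g, "+").replace(/-/g, "+").replace(/_/g, "/");
  const decoded = atob(b64);
  return decoded.startsWith("?") ? decoded : "?" + decoded;
}

// Models a launcher that cuts the URL at the first '&' (hypothetical stand-in).
function truncateAtAmp(search: string): string {
  const i = search.indexOf("&");
  return i === -1 ? search : search.slice(0, i);
}

const full = "autorun=msm-coop-gsweep&logns=12,14,16";
// The base64 alphabet contains no '&', so truncation leaves the param intact.
const survived = truncateAtAmp("?cfg=" + btoa(full));
const params = new URLSearchParams(expandCfg(survived));
console.log(params.get("autorun"), params.get("logns"));
```

Encoding the whole query as one parameter means there is nothing after the first & to lose, which is why every later parameter reaches the page.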

Real-hardware A/B (BrowserStack, min ms over 8 reps, drift-controlled single page load)

coop G=1 speedup vs stream-walker:

logN	Apple M2 (Chrome 148)	Adreno · S25 Ultra (Chrome 145)	Mali · Pixel 9 Pro XL (Chrome 145)
12	1.16×	1.67×	1.31×
14	0.61×	2.05×	0.87×
16	0.52×	1.85×	0.79×
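For reading the tables, here is a small sketch of how one entry could be derived. It assumes the speedup is the walker's minimum time divided by coop's minimum time (so values above 1 mean coop is faster); the function names are hypothetical. Only the 129.9 ms walker minimum is taken from the Adreno sweep; every other timing is made up for illustration.

```typescript
// Minimum over repetitions: the run least disturbed by noise.
function minMs(reps: number[]): number {
  if (reps.length === 0) throw new Error("no repetitions");
  return Math.min(...reps);
}

// Speedup of coop relative to the walker; > 1 means coop is faster.
// Assumption: ratio of minima, as the table header suggests.
function speedup(walkerReps: number[], coopReps: number[]): number {
  return minMs(walkerReps) / minMs(coopReps);
}

// Illustrative timings in ms (NOT measured data, apart from the 129.9 minimum).
const walker = [131.0, 130.4, 129.9, 133.2];
const coop = [64.1, 63.4, 65.0, 63.9];
console.log(speedup(walker, coop).toFixed(2) + "×"); // prints "2.05×"
```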

Full inversion-granularity sweep (speedup vs walker; G = threads sharing one batched inversion):

Adreno · Galaxy S25 Ultra: coop wins at every logN for G ≤ 16, and the speedup falls monotonically as G grows:

logN	walker (ms)	g1	g8	g16	g32
12	52.2	1.67×	1.51×	1.35×	1.14×
14	129.9	2.05×	1.56×	1.13×	0.99×
16	274.9	1.85×	1.49×	1.28×	0.96×

(g64 — the old coop default scan — is worst here and triggered a device-lost at logN≥14.)

Mali · Pixel 9 Pro XL: coop wins only at logN 12 with G=1, and regresses everywhere else:

logN	walker (ms)	g1	g8	g16	g32
12	103.3	1.31×	0.77×	0.59×	0.41×
14	173.7	0.87×	0.45×	0.33×	0.20×
16	352.1	0.79×	0.39×	0.33×	0.18×

Apple M2: walker wins at logN 14 and 16; coop is ahead only at logN 12 (G=1 and G=64):

logN	walker (ms)	g1	g8	g16	g32	g64
12	15.1	1.16×	0.85×	0.64×	0.44×	1.02×
14	23.4	0.61×	0.41×	0.30×	0.19×	0.54×
16	49.7	0.52×	0.36×	0.26×	0.17×	0.46×

(Apple g64 0.54×/0.46× at 14/16 matches this PR's prior Apple table — same methodology.)

Why G=1, and why Adreno only

coop sets slots-per-thread=1 and shares the affine-add batch inversion across the workgroup. The inversion-granularity knob G controls how many threads share one safegcd:

G=1 — each thread inverts its own dx: no workgroup memory, no in-loop barriers, one accumulator/thread ⇒ maximal occupancy. But it does ~S× more safegcd inversions than the walker's S-wide batch (no amortisation).
G=TPB (the old default scan) — one inversion/workgroup, but 2·log2(TPB) barriers/round + a large workgroup-memory footprint.

Adreno is memory-/occupancy-bound enough that G=1's occupancy gain hides the extra inversions, giving a measured 1.67–2.05× win. Mali does not hide them (it has less resident-workgroup headroom or different inversion throughput), so G=1 wins only at logN 12 and loses at 14/16. Apple (cache-rich, not occupancy-starved) loses outright above logN 12. Hence auto ships coop only on Adreno/Qualcomm; everything else stays on the walker. This captures the Adreno win without regressing Mali or Apple at logN 14/16.
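The trade-off behind G can be tabulated with a small cost model. This is a sketch under stated assumptions: coopRoundCost is a hypothetical name, TPB=64 is inferred from g64 being the workgroup-wide scan, and the per-group case (1<G<TPB) uses the figures from the commit that introduced the knob (2 barriers per round, TPB/G inversions per workgroup). It counts inversions and barriers only; it does not model occupancy or workgroup memory.

```typescript
interface RoundCost {
  inversions: number; // safegcd inversions per workgroup per round
  barriers: number;   // in-loop workgroup barriers per round
}

// Hypothetical cost model for one coop round at granularity g.
function coopRoundCost(tpb: number, g: number): RoundCost {
  if (g < 1 || g > tpb || tpb % g !== 0) {
    throw new Error("g must divide tpb and lie in [1, tpb]");
  }
  if (g === 1) return { inversions: tpb, barriers: 0 };                 // local
  if (g === tpb) return { inversions: 1, barriers: 2 * Math.log2(tpb) }; // scan
  return { inversions: tpb / g, barriers: 2 };                           // per-group
}

const TPB = 64; // assumption, see lead-in
for (const g of [1, 8, 16, 32, 64]) {
  const c = coopRoundCost(TPB, g);
  console.log(`G=${g}: ${c.inversions} inversions, ${c.barriers} barriers`);
}
// G=1 pays 64 inversions but no barriers; G=64 pays 1 inversion and 12 barriers.
```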

Correctness

GPU vs @noble/curves, headless SwiftShader: GREEN for both walker and coop, logN 8 and 10 (multiple configs incl. accum:'auto'). All perf runs are cross-checked.

What changed

env-probe autorun + base64 cfg param (truncation-proof BrowserStack config).
get_device: stash adapter info under device.__adapterInfo. Chrome's device.adapterInfo is a read-only getter, so assigning to it throws; the prior selector also read an always-undefined device.adapterInfo and could never recognise a mobile GPU. This was a latent bug that made the old "adaptive" path inert.
resolveAccum: auto → coop G=1 on Adreno/Qualcomm (measured win), walker elsewhere; returns the per-device default granularity. Explicit accum/coopG always honoured.
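The selection rule described above can be sketched as follows. This is an illustration only, not the PR's resolveAccum: the function name, types, and option shape are hypothetical, and the G=1 fallback for an explicit coop request on a non-Adreno device is an assumption (the PR states G=1 only for Adreno). The info argument stands for whatever was stashed under device.__adapterInfo.

```typescript
type AccumMode = "walker" | "coop";

interface AdapterInfoLike { vendor?: string; architecture?: string; }
interface AccumOptions { accum?: AccumMode | "auto"; coopG?: number; }
interface AccumChoice { accum: AccumMode; coopG?: number; }

// Hypothetical sketch of the 'auto' selector.
function resolveAccumSketch(
  info: AdapterInfoLike | undefined,
  opts: AccumOptions = {},
): AccumChoice {
  const requested = opts.accum ?? "auto";
  const vendor = (info?.vendor ?? "").toLowerCase();
  const arch = (info?.architecture ?? "").toLowerCase();
  // Matches the adapter strings reported for the S25 Ultra
  // (vendor:qualcomm, architecture:adreno-8xx).
  const isAdreno = vendor === "qualcomm" || arch.startsWith("adreno");

  // An explicit accum always wins; 'auto' picks coop only on Adreno/Qualcomm.
  const mode: AccumMode =
    requested === "auto" ? (isAdreno ? "coop" : "walker") : requested;
  if (mode === "walker") return { accum: "walker" };

  // An explicit coopG is honoured; otherwise default to G=1
  // (assumption for non-Adreno devices, see lead-in).
  return { accum: "coop", coopG: opts.coopG ?? 1 };
}

console.log(resolveAccumSketch({ vendor: "qualcomm", architecture: "adreno-8xx" }));
console.log(resolveAccumSketch({ vendor: "arm", architecture: "valhall" }));
```

Missing adapter info falls through to the walker, so an unrecognised device never lands on coop by accident.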

Honest status

✅ coop-walker kernel + granularity knob G, correct (noble/SwiftShader), drop-in.
✅ Real-HW proven: coop G=1 is 1.67–2.05× over the walker on Adreno (S25, all of logN 12/14/16).
✅ Mali + Apple measured → coop regresses at logN≥14 → auto correctly keeps them on the walker (no regression).
✅ Adaptive selector verified against real adapter strings (Adreno qualcomm/adreno-8xx, Mali arm/valhall, Apple, SwiftShader).
Caveats: one device per GPU family; the workgroup-scan mode (G=TPB) is unstable on Adreno (device-lost at logN≥14) so it is never auto-selected; at logN 12 coop-G=1 also edges out the walker on Mali/Apple, but auto conservatively keeps them on the walker to avoid the larger-n regression.

…up/scan) Add a compile-time granularity G to ba_coop_walker controlling how many threads share one batched inversion:
- G==1 (local): each thread inverts its own dx; no workgroup memory, no in-loop barriers.
- 1<G<TPB (per-group): per-group serial Montgomery batch inversion; 2 barriers/round regardless of G (vs scan's 2*log2 TPB), TPB/G inversions run concurrently across group leaders.
- G==TPB (scan): existing workgroup prefix/suffix scan (default).
Wired via MsmV2 coopG config, a coopg query param, and a new msm-coop-gsweep autorun that benches walker + coop at every G in one pass. Cross-checked GREEN vs noble (SwiftShader) for G=1,8,16,32,64 at logn 8 and 10.
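For readers unfamiliar with the batch inversion that G shares out, here is a minimal CPU sketch of Montgomery's trick. It is not the kernel's code: it works over a toy 16-bit prime with plain numbers, and it swaps in extended Euclid for the kernel's safegcd as the single inversion. All names are hypothetical.

```typescript
const P = 65521; // toy prime; products stay below 2^32, so doubles are exact

const mulmod = (a: number, b: number): number => (a * b) % P;

let inversionCount = 0; // counts real inversions, to show the amortisation

// Single modular inverse by extended Euclid (stand-in for safegcd).
function invmod(a: number): number {
  inversionCount++;
  let [r0, r1] = [P, ((a % P) + P) % P];
  let [t0, t1] = [0, 1];
  while (r1 !== 0) {
    const q = Math.floor(r0 / r1);
    [r0, r1] = [r1, r0 - q * r1];
    [t0, t1] = [t1, t0 - q * t1];
  }
  if (r0 !== 1) throw new Error("not invertible");
  return ((t0 % P) + P) % P;
}

// Montgomery's trick: invert n nonzero elements with one inversion
// plus about 3n multiplications. Inputs are assumed to lie in [1, P).
function batchInverse(xs: number[]): number[] {
  const n = xs.length;
  const prefix = new Array<number>(n);
  let acc = 1;
  for (let i = 0; i < n; i++) {
    prefix[i] = acc;             // product of xs[0..i-1]
    acc = mulmod(acc, xs[i]);
  }
  let inv = invmod(acc);         // inverse of the full product
  const out = new Array<number>(n);
  for (let i = n - 1; i >= 0; i--) {
    out[i] = mulmod(inv, prefix[i]); // (x0..xi)^-1 * (x0..x(i-1)) = xi^-1
    inv = mulmod(inv, xs[i]);        // now the inverse of x0..x(i-1)
  }
  return out;
}

const dxs = [3, 7, 11, 65520];
const invs = batchInverse(dxs);
console.log(invs.map((v, i) => mulmod(v, dxs[i])), "inversions:", inversionCount);
```

In these terms, G is the batch size: G=1 corresponds to calling invmod once per dx, while larger G amortises one inversion over G values at the price of the coordination between threads.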

…arness fixes
- Add accum:'auto' (new default) resolving per-device via adapterInfo; gated behind COOP_AUTO_ON_STARVED_MOBILE (off) so it picks the kernel proven fastest on measurable hardware (walker) until a WebGPU-capable Android A/B proves coop's mobile niche. Documents why coop is, by analysis, dominated by the walker + #23726's var<private> occupancy lever.
- msm-accum-ab autorun: emit /progress heartbeats (boot-start, srs, build, rep) under one shared runId so the BrowserStack watchdog survives slow mobile SRS loads; add ?srs_logn=N to cap the SRS download.

Adds ?autorun=env-probe to the dev page: posts UA + navigator.gpu presence + adapter info (vendor/arch/limits) BEFORE any SRS download or pipeline build, then exits. Isolates the three confounded failure modes that made earlier BrowserStack-Android attempts ambiguous (page didn't load / no WebGPU / SRS download stalled over the tunnel) into one decisive signal.

BrowserStack's /5/worker launcher truncates the device URL at the first unescaped '&', so only the first query param survived and every later param silently fell back to its page default (observed: a gsweep meant for logns=12,14,16 ran only the default logn=14, full SRS). Decode an optional ?cfg=<base64 query string> before any URLSearchParams read and rewrite location.search, so a full multi-param config reaches the page as one truncation-proof param.

…ix adapterInfo plumbing
- get_device: stash adapter info under device.__adapterInfo (Chrome's device.adapterInfo is a read-only getter, so assigning to it throws; the prior selector also read an always-undefined device.adapterInfo, so it could never recognise a mobile GPU).
- resolveAccum: 'auto' now selects coop with G=1 on memory-/register-starved mobile GPUs (Adreno/Mali) and the walker on cache-rich desktop GPUs, and returns the per-device default inversion granularity.
Measured on a real Galaxy S25 Ultra (Adreno, Chrome 145): coop G=1 is 1.67-2.05x faster than the stream-walker across logN 12/14/16; speedup decays monotonically with G; the workgroup-scan default (G=TPB) is the worst mode and triggered a device-lost at logN>=14, so mobile must use G=1, not the scan.

…ing) Pixel 9 Pro XL (Mali/Tensor, Chrome 145): coop G=1 wins only at logN 12 (1.31x) and regresses at logN 14/16 (0.87x, 0.79x) — Mali does not hide G=1's ~Sx extra safegcd inversions the way Adreno does. So 'auto' now selects coop G=1 only on Adreno/Qualcomm; Mali and all other GPUs keep the walker, which is faster there. Avoids a one-axis Mali regression.

feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator

868c887

AztecBot added the claudebox label May 30, 2026

AztecBot added 5 commits May 30, 2026 03:44

update PR #23739

b8a2bba

update PR #23739

c28816a

update PR #23739

8e6432c

feat(bb/msm): multi-logn same-device A/B autorun for coop-walker bench

04685ab

Sweep walker vs coop across logns in one page load (one BrowserStack worker per device) for the real-hardware comparison.

feat(bb/msm): pre-sweep GPU warmup in A/B autorun

cc462e2

Absorbs one-time driver JIT / cold GPU-clock cost so the first logN in the sweep is not anomalously slow.

AztecBot mentioned this pull request May 30, 2026

feat(bb/msm): tunable coop-walker TPB — occupancy sweep for memory-starved mobile #23746

Draft

AztecBot added 3 commits May 30, 2026 07:43

AztecBot added 3 commits May 30, 2026 10:52

AztecBot changed the title ~~feat(bb/msm): coop-walker — workgroup-cooperative inversion bucket accumulator~~ feat(bb/msm): adaptive bucket-accumulate — coop-walker G=1 on Adreno (1.67–2.05× over stream-walker), walker elsewhere May 30, 2026

docs(bb/msm): record measured coop-walker outcome (Adreno win, Mali/Apple regress)

29ca0ca

AztecBot wants to merge 13 commits into stream-walker-impl from cb/msm-coop-walker
