perf(bb/msm): stream-walker l0 load-reuse — portable Adreno+Mali win; register-budget boundary mapped (fused overshoots Adreno)#23730
Draft
AztecBot wants to merge 8 commits into
Add ?autorun=msm-gpu-bench: drives the GPU MSM directly (no Run/WASM gate), captures per-phase GPU timestamps + allocated bytes. The Run-gated msm-bench stalls on devices without threaded-WASM/cross-origin isolation. Also trim the autorun SRS download to the chosen logn so mobile page-load does not stall on the full 2^20 (64 MB) prefix.
…ate cache
The prior pass cached both the packed l0_index handles and the 32-byte x-coordinates (plx/prx, ~1 KB/thread private) across the three batch-inversion passes. First principles + the #23726 occupancy profiling show the x-coord cache is the wrong cost: the re-reads it eliminates are same-address cache hits (no DRAM bandwidth saved), and ~1 KB/thread of extra private state competes with the very occupancy that limits this kernel (pref_scratch→private, TPB 64→128 in #23726). Cache only the 4-byte packed handles (l0a/l0b, 8 bytes/slot) so the dependent l0_index gather is issued once per point; re-read point_x from the cached handle in the inverse pass and backward peel (a cache hit, point index already resolved). Bit-identical arithmetic; ~1 KB/thread less private state, composing cleanly with #23726. Cross-checked GREEN vs Noble at logn 10/11/12 (SwiftShader).
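The handle-only cache described in this commit can be illustrated with a toy gather counter. This is a minimal sketch, not the shader: `walkSlots`, the array layout, and the fixed three-pass structure are assumptions made for illustration. It only shows that caching the resolved handle cuts the `l0_index` gathers to one per slot while the `point_x` reads (and therefore the values consumed) stay the same.

```typescript
// Toy model of one thread walking its slots across the batch-inversion passes.
// Names and layout are illustrative assumptions, not the real kernel.
interface WalkStats {
  l0IndexGathers: number; // dependent index gathers issued
  pointXGathers: number;  // point_x reads issued (cache hits on re-read)
  checksum: number;       // sum of all x values consumed, to compare variants
}

const PASSES = 3; // assumed: forward product, inverse pass, backward peel

function walkSlots(
  slots: number[],
  l0Index: number[],
  pointX: number[],
  reuseLoads: boolean,
): WalkStats {
  const stats: WalkStats = { l0IndexGathers: 0, pointXGathers: 0, checksum: 0 };
  // One small handle per slot; no x-coordinate values are cached.
  const cachedHandles: number[] = new Array(slots.length);

  for (let pass = 0; pass < PASSES; pass++) {
    for (let k = 0; k < slots.length; k++) {
      let handle: number;
      if (reuseLoads && pass > 0) {
        // Later passes reuse the handle resolved in the forward pass.
        handle = cachedHandles[k];
      } else {
        handle = l0Index[slots[k]];
        stats.l0IndexGathers++;
        cachedHandles[k] = handle;
      }
      // point_x is always re-read through the resolved handle.
      stats.checksum += pointX[handle];
      stats.pointXGathers++;
    }
  }
  return stats;
}

const slots = [0, 1, 2, 3];
const l0Index = [3, 1, 0, 2];
const pointX = [10, 20, 30, 40];
const base = walkSlots(slots, l0Index, pointX, false);
const reuse = walkSlots(slots, l0Index, pointX, true);
console.log(base.l0IndexGathers, reuse.l0IndexGathers);
```

In this model the baseline issues one index gather per slot per pass, while the reuse variant issues one per slot in total; both consume identical x values.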
…o WebGPU probe
Adds a reuse_loads knob (MsmConfig.reuseLoads, ?reuse / ?variants) so one page load can build and time both the pre-reuse baseline and the handle-only load-reuse variant on the same device and inputs: a controlled A/B that does not depend on a second BrowserStack session. msm-gpu-bench now interleaves variants (?rounds) and reports per-variant median/avg wall and per-phase GPU time, plus the reuse-minus-base delta. Fixes the BrowserStack Android caps: /5/worker selects the browser via `browser`, so real-Android Chrome needs browser:"chrome" (the prior browser:"android"+browserName:"chrome" launched the device default browser, labelled "Android Browser", and produced no telemetry). Adds probe.html, a dependency-free (ES5 + XHR, no ESM) WebGPU capability probe that posts telemetry immediately so a mobile run can never come back as 'zero telemetry, no error row'. Both walker variants cross-check GREEN vs Noble under SwiftShader at logn 10 and 12.
… real Android
BrowserStack's real-mobile intent launch drops everything after the first unescaped '&', so the interleaved-variant bench URL (?autorun=msm-gpu-bench&logn=…&variants=reuse,base&rounds=…) arrived truncated to ?autorun=msm-gpu-bench. That reverted logn to its default and pulled the full 2^20 SRS (64 MB), which loses the mobile GPU device. Pass a single percent-encoded q= and expand it into location.search before any module reads the query. Complements the runtime reuseLoads toggle so a single Android session can A/B both variants.
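The q= workaround described above can be sketched as a small pure function. This is an illustration under assumptions, not the code in index.html: the name `expandPackedQuery` and the concrete parameter values (logn=16, rounds=12) are hypothetical. It relies only on the standard `URLSearchParams` API, whose `get` already percent-decodes the value.

```typescript
// Expand a single percent-encoded q= parameter back into a full query string.
// Hypothetical helper; the real expander in index.html may differ.
function expandPackedQuery(search: string): string {
  const params = new URLSearchParams(search);
  const packed = params.get("q"); // already percent-decoded by URLSearchParams
  if (packed === null) {
    return search; // nothing packed: leave the query untouched
  }
  // Tolerate a packed value that already carries a leading '?'.
  const inner = packed.startsWith("?") ? packed.slice(1) : packed;
  return "?" + inner;
}

// The launcher side packs the multi-parameter query so it contains no raw '&'.
const original = "autorun=msm-gpu-bench&logn=16&variants=reuse,base&rounds=12";
const packedUrlQuery = "?q=" + encodeURIComponent(original);
const expanded = expandPackedQuery(packedUrlQuery);
console.log(packedUrlQuery);
console.log(expanded);
```

In a page, the expanded string would presumably be applied with something like `history.replaceState(null, "", expanded)` before any module reads `location.search`; that wiring is an assumption and is not shown in the commit message.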
…int_x 3→2 gathers, register-free
The inverse pass and the backward peel were two separate backward loops over the S slots, and each gathered point_x. Merge them: one backward walk derives inv_dx[k] = inv * prefix[k-1] and consumes it immediately for the affine add, so each slot's point_x is gathered twice (forward product + fused peel) instead of three times. The per-slot inv_dx scratch round-trip is also removed. No new per-thread private state, so it stays inside the Adreno register budget that the x-coord cache broke. Idle pad slots still peel their dx from the running inverse so the batched inversion stays correct. Gated behind fusePeel (implies reuseLoads); A/B'd as the 'fused' variant. Cross-check GREEN vs Noble at logn 10 and 12 on SwiftShader for base/reuse/fused.
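The merge described in this commit is a restructuring of Montgomery-style batch inversion. The sketch below shows the two shapes side by side over a toy prime field. Everything here is an assumption for illustration: the modulus 65521 is not the real base field, `dy` stands in for the numerator of the affine-add slope, and pad-slot handling is omitted. It assumes a non-empty batch of non-zero `dx` values.

```typescript
const P = 65521; // toy prime modulus; products stay below 2^32, so plain numbers are exact

const mulMod = (a: number, b: number): number => (a * b) % P;

// Modular exponentiation by squaring; the inverse uses Fermat's little theorem.
function powMod(base: number, exp: number): number {
  let result = 1;
  let b = base % P;
  let e = exp;
  while (e > 0) {
    if (e % 2 === 1) result = mulMod(result, b);
    b = mulMod(b, b);
    e = Math.floor(e / 2);
  }
  return result;
}
const inverse = (a: number): number => powMod(a, P - 2);

// Forward pass: prefix[k] = dx[0] * ... * dx[k].
function prefixProducts(dx: number[]): number[] {
  const prefix: number[] = [];
  let acc = 1;
  for (const d of dx) {
    acc = mulMod(acc, d);
    prefix.push(acc);
  }
  return prefix;
}

// Unfused: one backward loop writes inv_dx to scratch, a second consumes it.
function slopesTwoPass(dx: number[], dy: number[]): number[] {
  const prefix = prefixProducts(dx);
  let inv = inverse(prefix[prefix.length - 1]);
  const invDx: number[] = new Array(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    invDx[k] = k > 0 ? mulMod(inv, prefix[k - 1]) : inv;
    inv = mulMod(inv, dx[k]); // running inverse now covers dx[0..k-1]
  }
  const lambda: number[] = new Array(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    lambda[k] = mulMod(dy[k], invDx[k]);
  }
  return lambda;
}

// Fused: derive inv_dx[k] and consume it in the same backward walk, no scratch.
function slopesFused(dx: number[], dy: number[]): number[] {
  const prefix = prefixProducts(dx);
  let inv = inverse(prefix[prefix.length - 1]);
  const lambda: number[] = new Array(dx.length);
  for (let k = dx.length - 1; k >= 0; k--) {
    const invDxK = k > 0 ? mulMod(inv, prefix[k - 1]) : inv;
    lambda[k] = mulMod(dy[k], invDxK);
    inv = mulMod(inv, dx[k]);
  }
  return lambda;
}

const dx = [3, 5, 7, 11];
const dy = [2, 4, 6, 8];
const twoPass = slopesTwoPass(dx, dy);
const fused = slopesFused(dx, dy);
console.log(twoPass, fused);
```

Both functions perform the same field operations in the same order per slot, which is consistent with the bit-identical claim; the difference is only whether `inv_dx` makes a round trip through a per-slot array.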
…u-bench progress beacons
- run-browserstack.mjs: add --rounds/--variants and q-encode the query for real-mobile targets (the index.html q= expander already existed but the runner never used it, so multi-param mobile URLs truncated at the first &).
- main.ts msm-gpu-bench: map the 'fused' variant to reuse+fusePeel; post progress beacons during warmup and per rep so the runner's first-progress watchdog sees liveness through the (slow on mobile) shader JITs; report fused−base and fused−reuse wall deltas.
….6× logn14, device-loss logn16)
Stream-walker l0 load-reuse — a portable, register-safe memory-latency win
TL;DR
The stream-walker is memory-bandwidth-bound on real mobile hardware. Caching each slot's packed `l0_index` handle once in the forward pass and reusing it (instead of re-issuing the dependent gather 3–4× per point) is a real, cross-vendor win, proven on both Adreno and Mali with drift-controlled in-session A/Bs.

This PR ships that (`reuseLoads`, default on) and maps the exact register-budget boundary: pushing the reuse one step further (`fusePeel`, which merges the inverse and backward-peel passes to also cut a `point_x` gather) overshoots Adreno's per-thread register file and is catastrophic there, while Mali tolerates it. So `reuseLoads` is the maximum register-safe reuse; `fusePeel` stays gated off.

`reuseLoads` caches only the 4-byte handle and re-reads `point_x` from the resolved index (a cache hit). It adds zero net live per-thread state, which is why it stays under the occupancy cliff. Bit-identical arithmetic; neutral-to-positive everywhere measured.

Real-hardware A/B: Adreno (Galaxy S25 Ultra · Adreno 830 · Chrome 145), GPU wall time
Drift-controlled interleaved A/B (both variants in one page load, shared thermal state). The win grows with size, which is the memory-bound signature: reducing the dependent `l0_index` gather pays off once the working set exceeds cache.

Re-confirmed this session with single-variant solo runs (n=20): reuse logn14 = 117.8 ms, logn16 = 231.2 ms, matching the interleaved numbers.
Real-hardware A/B — Mali (Pixel 9 Pro XL · Tensor G4 · Chrome 145), GPU wall time — the portability proof
Same harness, interleaved `reuse,base,fused`, n=12/variant, logn16. Wall is the trustworthy metric on Android (uncalibrated timestamp-query); the `stream_walker` phase delta is shown as corroboration and agrees in direction.

The load-reuse win generalizes to Mali: smaller in magnitude than on Adreno (different memory hierarchy/cache), but real and directionally consistent across both wall and the `stream_walker` phase. This was the missing portability proof.
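The interleaved A/B reporting used in these sections (per-variant median wall time and a variant-minus-base delta) can be sketched as follows. The helper names and the round-robin ordering are assumptions about how such a harness might work, and the sample timings are made-up placeholders, not measurements from this PR.

```typescript
// Median of a sample; copies before sorting so the caller's array is untouched.
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("median of empty sample");
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Relative change of the variant's median against the baseline's, in percent.
// Negative means the variant is faster.
function deltaPercent(variantMs: number[], baseMs: number[]): number {
  const b = median(baseMs);
  return ((median(variantMs) - b) * 100) / b;
}

// Assumed schedule: run every variant once per round, so slow thermal drift
// affects all variants roughly equally instead of biasing whichever ran last.
function interleave(variants: string[], rounds: number): string[] {
  const order: string[] = [];
  for (let r = 0; r < rounds; r++) {
    order.push(...variants);
  }
  return order;
}

const schedule = interleave(["reuse", "base", "fused"], 2);
// Placeholder wall times in milliseconds, for illustration only.
const baseWall = [100, 110, 90];
const reuseWall = [90, 95, 85];
console.log(schedule.join(","), deltaPercent(reuseWall, baseWall).toFixed(1) + " %");
```

Reporting the median rather than the mean keeps a single slow repetition (for example a first run that includes shader compilation) from dominating the comparison.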
The register-budget boundary — how far the reuse can be pushed (gap closed)
`fusePeel` extends the idea: merge the inverse pass into the backward peel so each slot's `point_x` is gathered 2× instead of 3×, and the per-slot `inv_dx` scratch round-trip vanishes. Expected: a small further win on top of `reuse`. On Mali it delivers exactly that and is the best variant (fused−reuse = −1.2 % wall). On Adreno it is catastrophic.

Why: the merged backward loop holds `inv_dx` + both points' x/y + `lambda` + the result point live simultaneously (~8 field elements ≈ 64 registers/thread). That overshoots Adreno's small per-thread register file → register spill / occupancy collapse (the deterministic, low-variance ~10× slowdown at logn14) → at logn16 the pressure loses the device entirely. This is the same register-file-competition failure mode that makes this PR non-composable with #23726, and the same class as the earlier x-coord value-cache dead-end (~4×). Mali's larger budget hides all of it.

Boundary, stated precisely: caching only the 4-byte handle and re-reading `point_x` (`reuseLoads`) adds +0 net live per-thread registers → at/under Adreno's budget → the maximum register-safe reuse. Any extension that adds live per-thread field-element state, whether the x-coord value cache (4×) or the `fusePeel` merge (≈10× / device-loss), exceeds Adreno's budget. The lever is exhausted at `reuseLoads`; `fusePeel` is kept as an A/B knob, default-off, documented Adreno-unsafe in `MsmConfig`.

Correctness (SwiftShader, headless, WASM-free Noble cross-check): PASS
All three variants (`base`, `reuse`, `fused`) agree with @noble/curves at logn 10 and 12 (`autorun=msm-ref-check`; the in-tree bb wasm is a 213-byte stub, so Noble is the oracle). `_generated/shaders.ts` regenerated and in sync.

Harness notes
… `MsmV2` on every variant switch, and on Adreno the rebuild churn following a `fused` pipeline build reproducibly loses the device. The fused-on-Adreno characterization above therefore uses single-variant solo runs (one build, no rebuild). A proper fix (cache pipelines per variant; the ~49 MB SRS pool is shared and per-`MsmV2` state is tiny) would enable interleaved A/B on Adreno too.

Recommendation
- `reuseLoads` (this PR's default). Proven win on Adreno (~10–16 %) and Mali (~2–4 %), register-safe on both, bit-identical.
- `fusePeel` off. Adreno-unsafe (≈10× / device-loss); on Mali it's only ~1 % better than `reuseLoads`, not worth a vendor-specific gate.

Base: `stream-walker-impl`.