perf(bb/msm): arch-aware stream-walker TPB (128 on ≥32 KB devices, 64 on Mali)#23729
Draft
AztecBot wants to merge 2 commits into
Summary
The memory-light stream-walker accumulator hardcoded `WALKER_TPB = 64` in `msm_v2.ts`, sizing its `var<workgroup> pref_scratch` to 16 KB so it always fits Mali Bifrost's 16 KB workgroup-storage cap. On Apple (TBDR) and Adreno, which grant 32 KB, that leaves half the workgroup-memory budget and a slice of occupancy unused, and is part of why the walker trails the V2 pair-tree near logn≈17.

This wires the walker's TPB to the device's granted `maxComputeWorkgroupStorageSize` (already requested at the adapter max in `gpu.ts`): TPB = 128 on devices that grant ≥32 KB (`pref_scratch` = 128·S·2·16 B = 32 KB at S=8), and TPB = 64 on Mali.

The constants `STREAM_WALKER_TPB` / `STREAM_WALKER_TPB_MALI_FALLBACK` already lived in `ba_stream_plan.ts` for exactly this ("Mali targets … must drop this to 64 at adapter init") but were never read; this connects them.

Changes
- `msm_v2.ts`: arch-aware `WALKER_TPB` from `device.limits.maxComputeWorkgroupStorageSize`; new `walkerTpb` `MsmConfig` override.
- `dev/msm-webgpu/main.ts`: `?walkertpb=` URL knob so the bench can A/B 64 vs 128 on one device.

Why this is correctness-safe (structural argument)
In `ba_stream_walker.template.wgsl` the walker has no cross-thread sharing and no workgroup barriers (KNOB 1): `pref_scratch` is sized `TPB * S * 2` and each thread touches only `pref_scratch[local_id*S*2 .. +S*2)`; partial slots are `2*(t*S+k)+{0,1}` (disjoint per thread); accumulators are private registers. `partition_task` and the walker are generated from the same `WALKER_TPB`, so `nwg = ceil(num_active/TPB)` stays consistent. Changing TPB therefore changes only (a) how the fixed `NUM_THREADS=8192` threads pack into workgroups and (b) the symbolic `pref_scratch` length, never the per-thread computation, so the result is bit-identical across TPB by construction.

Memory
Workgroup-shared only; device buffers unchanged → still within the ≤100 MB budget up to n=2²⁰. 32 KB is exactly the Apple/Adreno cap and 2× Mali's, so the 64 fallback is required there (a 32 KB request fails device validation on Mali).
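The sizing arithmetic and the per-thread slot layout described above can be sketched as follows. This is a minimal illustration, not the code in `msm_v2.ts`: the helper names (`prefScratchBytes`, `selectWalkerTpb`, `slotRange`, `rangesTileScratch`) are hypothetical, the 16 B per slot is taken from the `128·S·2·16 B` formula in the summary, and the sketch assumes `pref_scratch` is the only workgroup variable counted against the limit.

```typescript
// Constants named in the PR text; the values 128 and 64 come from the title.
const STREAM_WALKER_TPB = 128;
const STREAM_WALKER_TPB_MALI_FALLBACK = 64;
// Assumption: 16 bytes per pref_scratch slot, per "128·S·2·16 B = 32 KB".
const BYTES_PER_SLOT = 16;

/** Workgroup bytes used by pref_scratch: TPB * S * 2 slots of 16 B each. */
function prefScratchBytes(tpb: number, s: number): number {
  return tpb * s * 2 * BYTES_PER_SLOT;
}

/**
 * Hypothetical selection rule: an explicit override wins; otherwise use 128
 * when its scratch fits the granted limit, else fall back to 64.
 */
function selectWalkerTpb(
  maxComputeWorkgroupStorageSize: number,
  s: number = 8,
  override?: number,
): number {
  if (override !== undefined) return override;
  return prefScratchBytes(STREAM_WALKER_TPB, s) <= maxComputeWorkgroupStorageSize
    ? STREAM_WALKER_TPB
    : STREAM_WALKER_TPB_MALI_FALLBACK;
}

/** Half-open slot range [start, end) owned by thread t in pref_scratch. */
function slotRange(t: number, s: number): [number, number] {
  return [t * s * 2, (t + 1) * s * 2];
}

/** True if the per-thread ranges tile [0, TPB*S*2) with no gap or overlap. */
function rangesTileScratch(tpb: number, s: number): boolean {
  let expectedStart = 0;
  for (let t = 0; t < tpb; t++) {
    const [start, end] = slotRange(t, s);
    if (start !== expectedStart) return false;
    // Highest partial slot thread t can write: 2*(t*S + (S-1)) + 1.
    const highestSlot = 2 * (t * s + (s - 1)) + 1;
    if (highestSlot !== end - 1) return false;
    expectedStart = end;
  }
  return expectedStart === tpb * s * 2;
}

console.log(prefScratchBytes(128, 8)); // 32768
console.log(selectWalkerTpb(32768));   // 128
console.log(selectWalkerTpb(16384));   // 64
```

Under these assumptions, a 32 KB grant selects 128 and a 16 KB grant selects 64, and the slot ranges tile the scratch array for either TPB, which is the property the structural argument relies on.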
Validation status — full transparency
- Type-check: no type errors from this change (the dev `tsconfig`'s only complaints are pre-existing `TS6059 rootDir` structural notes affecting every `src/` import).
- SwiftShader local cross-check (logn 8/10): attempted, BLOCKED by the environment. The WebGPU harness cannot run in this container: the only available browser (Playwright's pre-installed `/opt/ms-playwright/chromium-1148`, launched via `executablePath`) does not expose `navigator.gpu` under any configuration tried: `--enable-unsafe-webgpu` with `--use-vulkan=swiftshader` / `--use-angle=swiftshader`, in both headless=old and headless=new under `xvfb`. `navigator.gpu` is `undefined` in every case (no working Dawn↔SwiftShader WebGPU backend in this build on a GPU-less host). `yarn install`, `@noble/curves`, and `playwright-core` all work; only the WebGPU adapter is missing. Correctness therefore rests on the structural argument above pending real-GPU confirmation.
- BrowserStack device numbers: attempted, BLOCKED by seat contention. The shared BrowserStack pool (2 seats across ~10 concurrent agents) was at 2/2 for essentially the entire session (confirmed across ~90 min of escalating 4→10→30 min backoffs). When a seat momentarily freed, staging the cloudflared tunnel and completing the MCP worker handshake before another agent re-claimed the seat did not succeed within the worker-id window. The driver and tunnel are fully working (`run-browserstack.mjs`, `cloudflared` installed). One worker created during an attempt was pointed at a stale tunnel URL by mistake and was immediately deleted (`browserstack_workers` now returns `[]`, so no seats were leaked).

To complete the measurements (exact repro, when a seat is free)
No device numbers are reported here because none were obtained — I will not publish unverified figures.
Base: `stream-walker-impl`.