feat(bb/msm): per-architecture autotuner for stream-walker TPB (Mali/Adreno/Apple) #23725
Draft
AztecBot wants to merge 6 commits into
… knob

Cache the forward-pass dx per slot (private registers, 0 device/workgroup memory) and reuse it in the batched-inversion peel, so the inverse pass no longer reloads point_x and re-subtracts, removing ~1/3 of the walker's uncoalesced SRS x-coordinate reads. Correctness-neutral; the cached value is bit-identical to the recomputed dx.

Add an opt-in walkerPrefDevice knob that places pref_scratch in a device storage buffer (occupancy lever). The walker already binds 10 storage buffers, so this needs maxStorageBuffersPerShaderStage >= 11; when the device caps at 10 (SwiftShader, many mobile GPUs) it falls back to var<workgroup> placement so the walker still launches. Wire ?tpb / ?prefdev into the xcheck harness for real-device A/B.

Cross-check GREEN vs @noble/curves at logn=8,10 under SwiftShader for workgroup (TPB 64/128) and device-fallback paths; autotune unit tests 10/10.
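The fallback rule in the commit message can be illustrated with a minimal sketch. This is not the PR's code: `choosePrefPlacement`, `LimitsSubset` and `WALKER_STORAGE_BUFFERS` are hypothetical names, and only `maxStorageBuffersPerShaderStage` is a real WebGPU limit.

```typescript
// Minimal subset of WebGPU's GPUSupportedLimits that this sketch needs.
interface LimitsSubset {
  maxStorageBuffersPerShaderStage: number;
}

type PrefPlacement = "device" | "workgroup";

// The commit message states the walker already binds 10 storage buffers.
const WALKER_STORAGE_BUFFERS = 10;

// Hypothetical helper: decide where pref_scratch lives.
// Device placement needs one extra storage buffer; otherwise fall back to
// var<workgroup> so the pipeline layout is still accepted.
function choosePrefPlacement(
  limits: LimitsSubset,
  walkerPrefDevice: boolean,
): PrefPlacement {
  if (!walkerPrefDevice) {
    return "workgroup"; // knob is opt-in
  }
  const needed = WALKER_STORAGE_BUFFERS + 1;
  return limits.maxStorageBuffersPerShaderStage >= needed
    ? "device"
    : "workgroup";
}
```

With a limit of 10 (as SwiftShader reports) the helper returns `"workgroup"` even when the knob is set; with 11 or more it returns `"device"`. Note that a WebGPU device only exposes limits above the defaults if they were requested through `requiredLimits` at device creation, so the value checked here should come from the created device.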
Summary
Makes the stream-walker (the slow path) faster by combining two correctness-neutral, memory-free levers on top of the per-architecture TPB autotuner:
- dx-caching: the forward pass keeps `dx = p_rx − p_lx` in private registers; the inverse pass reuses it instead of reloading `point_x` and re-subtracting. This removes ~1/3 of the walker's uncoalesced x-coordinate reads at zero device/workgroup memory cost (S×8 = 64 u32 of registers per thread; the swarm established that register pressure is not the walker's binding constraint).
- Per-architecture TPB autotuner: selects a TPB whose `pref_scratch` footprint fits the device's `maxComputeWorkgroupStorageSize` (TPB=128 on 32 KB Apple/Adreno, 64 on 16 KB Mali). TPB only repacks a fixed set of `NUM_THREADS = nwg×256` logical threads into workgroups; each thread owns its own `[local_id×S, +S)` scratch slice with no cross-thread sharing and no barrier, so it is correctness-neutral.

Both compose: dx-caching cuts per-add memory traffic; the autotuner raises occupancy/packing per arch.
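A rough sketch of how a TPB choice could follow from the workgroup-storage limit. The names (`pickTpb`, `prefScratchBytes`, `workgroupCount`, `scratchSlice`), the candidate list, and the "largest fitting candidate" rule are assumptions for illustration, not the PR's implementation. The footprint model (S = 8 slots per thread, 8 u32 per slot) is also an assumption, chosen because it matches the "S×8 = 64 u32" figure above.

```typescript
// Assumed constants (see lead-in): 8 slots per thread, 8 u32 limbs per slot.
const S = 8;
const LIMBS_PER_ELEMENT = 8;
const BYTES_PER_U32 = 4;
const LOGICAL_THREADS_PER_NWG = 256; // NUM_THREADS = nwg × 256

// Workgroup-memory bytes needed when each of `tpb` threads owns S slots.
function prefScratchBytes(tpb: number): number {
  return tpb * S * LIMBS_PER_ELEMENT * BYTES_PER_U32;
}

// Hypothetical rule: take the largest candidate whose scratch fits the limit.
function pickTpb(
  maxComputeWorkgroupStorageSize: number,
  candidates: number[] = [256, 128, 64, 32],
): number {
  let best = 0;
  for (const tpb of candidates) {
    if (prefScratchBytes(tpb) <= maxComputeWorkgroupStorageSize && tpb > best) {
      best = tpb;
    }
  }
  if (best === 0) {
    throw new Error("no candidate TPB fits the workgroup storage limit");
  }
  return best;
}

// TPB only repacks the same logical threads into more or fewer workgroups.
function workgroupCount(nwg: number, tpb: number): number {
  return (nwg * LOGICAL_THREADS_PER_NWG) / tpb;
}

// Half-open scratch range [local_id × S, local_id × S + S) owned by one thread.
function scratchSlice(localId: number): [number, number] {
  return [localId * S, localId * S + S];
}
```

Under this footprint model, a 32 KB limit (32768 bytes) yields TPB=128 and a 16 KB limit (16384 bytes) yields TPB=64, which agrees with the figures quoted above. Adjacent slices touch but never overlap, which is why no barrier is needed.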
dx-caching (the speed lever)
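To make the idea concrete, here is a minimal CPU-side sketch of batched inversion with and without a cached `dx`. It is an illustration only: the real walker is WGSL over the BN254 base field, whereas this uses a toy 16-bit prime so that plain numbers suffice, and `batchInverseDx` with its load counter is a hypothetical helper.

```typescript
// Toy prime standing in for the BN254 base field (assumption: keeps all
// products below 2^53 so ordinary numbers are exact).
const P = 65521;
const mod = (a: number): number => ((a % P) + P) % P;
const mul = (a: number, b: number): number => (a * b) % P;

// Modular inverse via Fermat's little theorem: a^(P-2) mod P.
function inv(a: number): number {
  let result = 1;
  let base = mod(a);
  let e = P - 2;
  while (e > 0) {
    if (e & 1) result = mul(result, base);
    base = mul(base, base);
    e >>= 1;
  }
  return result;
}

// Computes 1/(rx[i] - lx[i]) for every i with a single field inversion.
// `xLoads` counts x-coordinate reads so the two variants can be compared.
function batchInverseDx(lx: number[], rx: number[], cacheDx: boolean) {
  const n = lx.length;
  let xLoads = 0;
  const load = (xs: number[], i: number): number => {
    xLoads++;
    return xs[i];
  };
  const dxCache: number[] = new Array<number>(n);
  const prefix: number[] = new Array<number>(n);

  // Forward pass: running product of dx values.
  let acc = 1;
  for (let i = 0; i < n; i++) {
    const dx = mod(load(rx, i) - load(lx, i));
    if (cacheDx) dxCache[i] = dx; // kept "in registers"
    acc = mul(acc, dx);
    prefix[i] = acc;
  }

  // Inverse pass (the "peel"): walk back, dividing out one dx per step.
  let running = inv(acc);
  const inverses: number[] = new Array<number>(n);
  for (let i = n - 1; i >= 0; i--) {
    const before = i > 0 ? prefix[i - 1] : 1;
    inverses[i] = mul(running, before);
    // Without the cache, dx has to be rebuilt from two more x loads.
    const dx = cacheDx ? dxCache[i] : mod(load(rx, i) - load(lx, i));
    running = mul(running, dx);
  }
  return { inverses, xLoads };
}
```

Both variants return identical inverses; only the load count differs. The counter tracks just the loads that feed `dx`, so it is not meant to reproduce the ~1/3 figure, which is stated relative to all of the walker's x-coordinate reads.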
`_generated/shaders.ts` is regenerated (inlined WGSL) and included in the diff.
Device-memory pref_scratch: explored, gated (honest finding)
The swarm flagged moving `pref_scratch` to device storage as the #1 occupancy lever. Wiring it revealed a real hardware ceiling: the walker already binds all 10 storage buffers, so a device-memory `pref_scratch` needs an 11th. SwiftShader and many mobile GPUs cap `maxStorageBuffersPerShaderStage` at 10, which rejects the pipeline layout.
This PR adds the `walkerPrefDevice` knob + WGSL variant (binding 11, global `t*S*2` slice) but gates it on `maxStorageBuffersPerShaderStage ≥ 11`, falling back to `var<workgroup>` otherwise so it never breaks. To engage it on the ≤10-buffer mobile targets, a storage binding must first be freed (e.g. by interleaving `sorted_bucket_list` + `sorted_count_list`, which share an index). That is tracked as follow-up; it has not landed because it can't be perf-validated without a real-hardware seat.
Validation
- Cross-check against `@noble/curves` BN254 Pippenger: GREEN at logn=8 and logn=10 for workgroup `pref_scratch` at autotuned TPB=128, forced TPB=64, and the device-pref fallback path. SwiftShader reports `maxStorageBuffersPerShaderStage=10`, confirming the ceiling above.
- `tsc --noEmit` clean for `msm_webgpu/`.
Real-hardware A/B: still outstanding (seat contention)
The 2 shared BrowserStack seats have been continuously saturated by other agents' Mali (Pixel 9 Pro XL) and Adreno (Galaxy S25 Ultra) jobs throughout this session's backoff. The A/B was not run, rather than displace other agents' work. The harness is ready (`?tpb` / `?prefdev` in the xcheck page; `--qp` passthrough in run-browserstack; cloudflared installed). When a seat frees, the dx-caching delta is a direct apples-to-apples comparison against this PR's previously recorded macOS M2 TPB=128 `stream_walker` accumulate of 33.1 ms (same code minus dx-caching).
Status
Draft. dx-caching + autotuner + gated device-pref knob landed and SwiftShader-GREEN. Mobile Apple/Adreno/Mali numbers are the only outstanding item, blocked purely on shared-seat contention.
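For reference, the `?tpb` / `?prefdev` overrides mentioned under the real-hardware A/B could be parsed along these lines. This is a sketch under assumptions: the PR does not show the harness code, so `parseWalkerOverrides`, the accepted value formats (`prefdev` enabled unless set to `0`, `tpb` restricted to powers of two), and the `WalkerOverrides` shape are all hypothetical.

```typescript
// Hypothetical shape of the overrides read from the xcheck page URL.
interface WalkerOverrides {
  tpb?: number; // forced threads-per-workgroup, if valid
  prefDevice: boolean; // request device-memory pref_scratch
}

function parseWalkerOverrides(query: string): WalkerOverrides {
  // URLSearchParams ignores a single leading "?" in its input string.
  const params = new URLSearchParams(query);

  const overrides: WalkerOverrides = {
    // Assumption: "?prefdev" or "?prefdev=1" enables, "?prefdev=0" disables.
    prefDevice: params.has("prefdev") && params.get("prefdev") !== "0",
  };

  const raw = params.get("tpb");
  if (raw !== null) {
    const tpb = Number(raw);
    // Assumption: only positive powers of two are meaningful TPB values.
    const isPowerOfTwo =
      Number.isInteger(tpb) && tpb > 0 && (tpb & (tpb - 1)) === 0;
    if (isPowerOfTwo) {
      overrides.tpb = tpb;
    }
  }
  return overrides;
}
```

An invalid or missing `tpb` leaves the field unset, so the autotuned value would still apply; the `prefDevice` flag would then still be subject to the storage-buffer gate described above.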