feat(bb/msm): tunable coop-walker TPB — occupancy sweep for memory-starved mobile#23746
Draft
AztecBot wants to merge 2 commits into
Draft
feat(bb/msm): tunable coop-walker TPB — occupancy sweep for memory-starved mobile#23746AztecBot wants to merge 2 commits into
AztecBot wants to merge 2 commits into
Conversation
…e-device TPB A/B sweep
…up-mem at coopTpb=128)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Lever: coop-walker workgroup size (cooperative batch-inversion width) as an occupancy knob
Builds on the coop-walker (#23739), which won −6% on Adreno S25 by trading the
stream-walker's 16 KB
pref_scratchfor a ~4 KB cooperative batch inversion →more resident workgroups on memory-starved mobile. This PR pushes the same
occupancy lever further: the coop kernel's workgroup size
TPBsets its4·TPBvec4 workgroup footprint (TPB=64 → 4 KB, 32 → 2 KB, 128 → 8 KB),the cooperative batch width, and the Hillis-Steele scan depth (
log2 TPB). Itwas pinned at 64 (inherited from the stream-walker's KNOB 1); for the coop
kernel that is just a default, not a derived optimum.
Crucially
TPBis decoupled from the work partition:partition_threadslices at a fixed 256-grain (
NUM_THREADS = nwg*256) andpartition_taskemits the indirect grain
planner_meta[15] = ceil(num_active / WALKER_TPB), sochanging
TPBonly changes the workgroup grouping + dispatch count — theper-thread slices are identical. Lower
TPB= smaller workgroup footprint =more resident workgroups to hide memory latency (the design's own thesis),
at the cost of more (cheaper, non-bottleneck) batch inversions.
What changed
MsmConfig.coopTpb(default 64);WALKER_TPBfor the coop path is nowconfig.coopTpb, threaded intopartition_task(so the indirect grainmatches) and the coop kernel. Stream-walker stays pinned at 64.
msm-accum-abautorun:orderentries acceptaccum[:tpb](e.g.
order=walker,coop:64,coop:32,coop:128) so several TPB variants arebenchmarked in one page load (identical thermal state, one BrowserStack
worker) with min/median ms + speedup vs the first entry.
Correctness — GPU vs
@noble/curves, headless SwiftShaderGREEN (
gpu == noble) at logn 8 and 10, seed 2, for coop TPB ∈ {32, 64, 128}.Real-hardware timing
Adreno (S25) / Mali A/B sweep in flight on BrowserStack — numbers to follow.
If a smaller TPB wins on mobile, it's a free occupancy gain; if 64 is already
optimal, that's an honest negative bounding the occupancy lever.
Created by claudebox · group:
aztec