
feat(bb/msm): tunable coop-walker TPB — occupancy sweep for memory-starved mobile #23746

Draft
AztecBot wants to merge 2 commits into cb/msm-coop-walker from cb/msm-coop-tpb


Lever: coop-walker workgroup size (cooperative batch-inversion width) as an occupancy knob

Builds on the coop-walker (#23739), which won −6% on Adreno S25 by trading the
stream-walker's 16 KB pref_scratch for a ~4 KB cooperative batch inversion,
leaving room for more resident workgroups on memory-starved mobile. This PR
pushes the same occupancy lever further. The coop kernel's workgroup size TPB
sets three things: its 4·TPB vec4 workgroup footprint (TPB=32 → 2 KB,
64 → 4 KB, 128 → 8 KB), the cooperative batch width, and the Hillis-Steele
scan depth (log2 TPB). TPB was pinned at 64 (inherited from the stream-walker's
KNOB 1); for the coop kernel that value is only a default, not a derived optimum.
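The footprint and scan-depth arithmetic above can be sketched in a few lines of TypeScript. This is an illustration only: `coopShape`, `BYTES_PER_VEC4`, and `VEC4_PER_THREAD` are hypothetical names, and the 16-byte vec4 size is an assumption chosen because it reproduces the stated 64 → 4 KB figure. The power-of-two check is likewise an assumption made so that log2 TPB is an integer.

```typescript
// Assumption: each vec4 holds four 32-bit lanes, i.e. 16 bytes.
// This matches the stated footprint of 4 KB at TPB = 64.
const BYTES_PER_VEC4 = 16;
// "4·TPB vec4" from the description: four vec4 slots per thread.
const VEC4_PER_THREAD = 4;

interface CoopShape {
  tpb: number;            // workgroup size = cooperative batch width
  footprintBytes: number; // workgroup-memory footprint in bytes
  scanDepth: number;      // Hillis-Steele scan steps, log2(TPB)
}

// Hypothetical helper: derives the quantities that TPB controls.
function coopShape(tpb: number): CoopShape {
  // Assumption: TPB is a power of two so the scan depth is an integer.
  if (!Number.isInteger(tpb) || tpb <= 0 || (tpb & (tpb - 1)) !== 0) {
    throw new Error(`TPB must be a positive power of two, got ${tpb}`);
  }
  return {
    tpb,
    footprintBytes: VEC4_PER_THREAD * tpb * BYTES_PER_VEC4,
    scanDepth: Math.round(Math.log2(tpb)),
  };
}

const shapes = [32, 64, 128].map(coopShape);
for (const s of shapes) {
  console.log(`TPB=${s.tpb}: ${s.footprintBytes / 1024} KB, scan depth ${s.scanDepth}`);
}
```

Halving TPB halves the footprint and removes one scan step; whether that translates into more resident workgroups depends on the device's per-core workgroup-memory budget, which this sketch does not model.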

Crucially, TPB is decoupled from the work partition: partition_thread
slices at a fixed 256-grain (NUM_THREADS = nwg*256) and partition_task
emits the indirect grain planner_meta[15] = ceil(num_active / WALKER_TPB), so
changing TPB only changes the workgroup grouping and dispatch count; the
per-thread slices are identical. A lower TPB means a smaller workgroup
footprint, hence more resident workgroups to hide memory latency (the design's
own thesis), at the cost of more (cheaper, non-bottleneck) batch inversions.
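A small TypeScript sketch of the two formulas quoted above shows why the partition is unaffected by TPB. The function names and the `numActive` and `nwg` values are hypothetical and chosen only for illustration; they are not taken from the planner code.

```typescript
// Fixed per-thread partition grain quoted in the description.
const THREAD_GRAIN = 256;

// NUM_THREADS = nwg * 256: note that TPB does not appear here.
function numThreads(nwg: number): number {
  return nwg * THREAD_GRAIN;
}

// Indirect grain written to planner_meta[15]:
// ceil(num_active / WALKER_TPB).
function indirectGrain(numActive: number, walkerTpb: number): number {
  return Math.ceil(numActive / walkerTpb);
}

// Hypothetical inputs, for illustration only.
const nwg = 4;
const numActive = 1000;

const grains = [32, 64, 128].map((tpb) => ({
  tpb,
  threads: numThreads(nwg),                  // same for every TPB
  workgroups: indirectGrain(numActive, tpb), // varies with TPB
}));

for (const g of grains) {
  console.log(`TPB=${g.tpb}: ${g.threads} slices, ${g.workgroups} workgroups dispatched`);
}
```

With these inputs the slice count stays at 1024 for every TPB while the dispatched workgroup count goes from 8 (TPB=128) to 16 (TPB=64) to 32 (TPB=32), which is the "same slices, different grouping" property the paragraph describes.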

What changed

  • MsmConfig.coopTpb (default 64); WALKER_TPB for the coop path is now
    config.coopTpb, threaded into partition_task (so the indirect grain
    matches) and the coop kernel. Stream-walker stays pinned at 64.
  • msm-accum-ab autorun: order entries accept accum[:tpb]
    (e.g. order=walker,coop:64,coop:32,coop:128) so several TPB variants are
    benchmarked in one page load (identical thermal state, one BrowserStack
    worker) with min/median ms + speedup vs the first entry.
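The `accum[:tpb]` entry syntax and the min/median/speedup summary could be handled roughly as follows. This is a sketch, not the autorun page's code: `parseOrder`, `median`, and `summarize` are hypothetical helpers, and computing speedup from the median (rather than the min) is an assumption.

```typescript
interface OrderEntry {
  accum: string; // accumulator variant, e.g. "walker" or "coop"
  tpb?: number;  // optional workgroup size; absent means "use the default"
}

// Parses a string such as "walker,coop:64,coop:32,coop:128".
function parseOrder(order: string): OrderEntry[] {
  return order.split(",").map((raw) => {
    const [accum, tpbText] = raw.trim().split(":");
    if (!accum) throw new Error(`empty entry in order "${order}"`);
    if (tpbText === undefined) return { accum };
    const tpb = Number(tpbText);
    if (!Number.isInteger(tpb) || tpb <= 0) {
      throw new Error(`invalid tpb "${tpbText}" in entry "${raw}"`);
    }
    return { accum, tpb };
  });
}

function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("median of empty sample set");
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface Summary {
  minMs: number;
  medianMs: number;
  speedup: number; // baseline median / this entry's median (assumption)
}

// samplesMs[i] holds the timing samples for the i-th order entry;
// the first entry is the baseline.
function summarize(samplesMs: number[][]): Summary[] {
  const baseline = median(samplesMs[0]);
  return samplesMs.map((s) => ({
    minMs: Math.min(...s),
    medianMs: median(s),
    speedup: baseline / median(s),
  }));
}

console.log(parseOrder("walker,coop:64,coop:32,coop:128"));
```

Keeping `tpb` optional lets a bare `walker` entry fall through to whatever default the page already uses, so existing `order=` strings without a suffix would keep working under this scheme.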

Correctness — GPU vs @noble/curves, headless SwiftShader

GREEN (gpu == noble) at logn 8 and 10, seed 2, for coop TPB ∈ {32, 64, 128}.

Real-hardware timing

Adreno (S25) / Mali A/B sweep in flight on BrowserStack — numbers to follow.
If a smaller TPB wins on mobile, it's a free occupancy gain; if 64 is already
optimal, that's an honest negative bounding the occupancy lever.


Created by claudebox · group: aztec

AztecBot added the claudebox label (owned by claudebox; it can push to this PR) on May 30, 2026.