Refactor: enforce mix strict priority in scheduler dispatch #855

Merged
ChaoWao merged 1 commit into hw-native-sys:main from poursoul:refactor/dispatch-mix-strict-priority
May 26, 2026

Conversation

@poursoul
Collaborator

@poursoul poursoul commented May 25, 2026

Summary

Reshape Phase 4 of SchedulerContext::resolve_and_dispatch from shape-outer/phase-inner into a new dispatch_ready_tasks pass with phase-split semantics and cross-thread idle gating. Applied to both a2a3 and a5 tensormap_and_ringbuffer runtimes.

  • MIX strict priority: IDLE-MIX runs first. If mix residual is detected (local_buf + ready_queue), AIC/AIV yield both their IDLE and PENDING stages for this pass; only the next loop iteration (after another completion poll) lets them re-enter.
  • MIX-PENDING always considered: gated only on has_idle_in_other_threads(MIX), not on the skip_aic_aiv flag — pending slots keep draining mix even when AIC/AIV are blocked.
  • Cross-thread idle gating for PENDING: AIC/AIV-PENDING skip themselves if any peer scheduler thread has an idle core of the same shape, biasing pending fills to threads whose own cores would otherwise sit idle.
  • Local-buf flush points: between IDLE and PENDING so peer threads see IDLE-stage release_fanin output; again at function end so PENDING-stage release_fanin output does not carry across iterations.
  • PMU single-issue short-circuit and sync_start drain protocol preserved unchanged. a5 picks up the PMU guard alongside the new policy (its prior implementation lacked it).
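The stage ordering in the bullets above can be sketched as a small planning function. This is an illustrative model only, not the code in dispatch_ready_tasks: PassInputs, plan_pass, and the stage labels are hypothetical names, a single mix_residual flag stands in for the local_buf + ready_queue check, the AIC-before-AIV order is an assumption, and MIX-PENDING is assumed to skip when a peer has idle capacity, by analogy with the AIC/AIV rule.

```cpp
#include <string>
#include <vector>

// Hypothetical per-pass observations; none of these field names come from the PR.
struct PassInputs {
    bool mix_residual;   // mix tasks remain (local_buf + ready_queue) after IDLE-MIX
    bool peer_idle_mix;  // stands in for has_idle_in_other_threads(MIX)
    bool peer_idle_aic;  // a peer thread has an idle AIC core
    bool peer_idle_aiv;  // a peer thread has an idle AIV core
};

// Returns the stages one pass would run, in order, under the described policy.
std::vector<std::string> plan_pass(const PassInputs& in) {
    std::vector<std::string> stages{"IDLE-MIX"};

    // Mix residual makes AIC/AIV yield both their IDLE and PENDING stages.
    const bool skip_aic_aiv = in.mix_residual;
    if (!skip_aic_aiv) {
        stages.push_back("IDLE-AIC");
        stages.push_back("IDLE-AIV");
    }

    // Mid flush: peers see IDLE-stage release_fanin output before PENDING.
    stages.push_back("FLUSH");

    // MIX-PENDING ignores skip_aic_aiv; only the peer-idle hint gates it
    // (direction of the gate is an assumption, see lead-in).
    if (!in.peer_idle_mix) stages.push_back("PENDING-MIX");

    // AIC/AIV-PENDING defer to peers that have an idle core of the same shape.
    if (!skip_aic_aiv && !in.peer_idle_aic) stages.push_back("PENDING-AIC");
    if (!skip_aic_aiv && !in.peer_idle_aiv) stages.push_back("PENDING-AIV");

    // Tail flush: PENDING-stage output does not carry into the next iteration.
    stages.push_back("FLUSH");
    return stages;
}
```

With mix_residual set and no idle peers, the plan reduces to IDLE-MIX, FLUSH, PENDING-MIX, FLUSH, which is the "AIC/AIV yield for this pass" case.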

has_idle_in_other_threads reads peer trackers' core_states_ without explicit synchronization; aarch64 8-byte aligned single-copy atomicity covers the load, and the value is consumed as a scheduling hint (stale reads self-correct on the next iteration).

Testing

  • a2a3 sim + onboard and a5 sim + onboard variants build clean (incremental + clang-format + clang-tidy + cpplint)
  • cpput a2a3: test_scheduler_state (10/10), test_ready_queue (25/25); a5: test_a5_fatal (3/3) — green
  • st on a2a3sim: spmd_multiblock_mix, spmd_sync_start, spmd_sync_start_stress, spmd_sync_start_edge, spmd_sync_start_aiv, spmd_starvation, mixed_example — 7/7 passed (60.96s)
  • st on a5sim: same 7 scenarios — 7/7 passed (72.62s)
  • Hardware perf bench (benchmark_bgemm) on both arches — recommended before merge


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the scheduler's dispatch logic to implement a MIX-strict-priority policy, introducing a more structured two-phase dispatch process (IDLE and PENDING) within the new dispatch_ready_tasks method. It also adds cross-thread idle gating via has_idle_in_other_threads. Feedback was provided regarding the implementation of has_idle_in_other_threads, specifically noting that performing cross-thread reads without std::atomic constitutes a data race and undefined behavior under the C++ memory model, regardless of hardware-level atomicity guarantees on specific architectures.
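For reference, the alternative the review points at can be sketched as follows. This is not the merged implementation (the PR keeps plain reads): CoreTracker, idle_mask, kMaxThreads, and peer_has_idle are hypothetical names, and the one-bit-per-core encoding is an assumption. The idea is that a relaxed std::atomic load keeps the "scheduling hint" semantics while removing the data race under the C++ memory model; on aarch64 an aligned 64-bit relaxed load typically compiles to an ordinary load instruction.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxThreads = 4;  // hypothetical scheduler thread count

// Hypothetical per-thread tracker: bit i set means core i is idle (assumed encoding).
struct CoreTracker {
    std::atomic<std::uint64_t> idle_mask{0};
};

// Hint-only check: returns true if any thread other than `self` reports an idle core.
// Relaxed ordering is enough because a stale answer is corrected on the next pass.
bool peer_has_idle(const std::array<CoreTracker, kMaxThreads>& trackers,
                   std::size_t self) {
    for (std::size_t t = 0; t < trackers.size(); ++t) {
        if (t == self) continue;  // only peers count
        if (trackers[t].idle_mask.load(std::memory_order_relaxed) != 0) {
            return true;
        }
    }
    return false;
}
```

The owning thread would publish its state with a matching relaxed store; no ordering with other memory is implied, which matches treating the value purely as a hint.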

@poursoul poursoul force-pushed the refactor/dispatch-mix-strict-priority branch 3 times, most recently from 1e30ea2 to e1728ea on May 26, 2026 02:27
… dispatch

Apply to both a2a3 and a5 runtimes. Phase 4 of resolve_and_dispatch is
reshaped from shape-outer/phase-inner into a new dispatch_ready_tasks
pass with phase-split semantics:

  * IDLE-MIX runs first. If mix tasks remain (local_buf + ready_queue),
    AIC and AIV yield both their IDLE and PENDING stages for the pass.
  * MIX-PENDING is always considered next, gated only on whether any
    peer scheduler thread has an idle cluster — so residual mix continues
    to drain via pending slots regardless of skip_aic_aiv.
  * After MIX-PENDING, AIC/AIV-PENDING runs only when mix is fully
    drained and the corresponding shape has no peer idle core.
  * Local buffers are flushed between the IDLE and PENDING stages so
    PENDING-stage queue checks and peer threads see IDLE-stage results,
    and again on every return path via an RAII FlushGuard so
    release_fanin output during PENDING does not carry into the next
    iteration's IDLE.
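A minimal sketch of the flush pattern described in the last bullet, assuming simplified stand-in types: LocalBuf, GlobalQueue, flush_local_buf, and dispatch_pass here are hypothetical and only model the control flow, not the real buffers or the real FlushGuard.

```cpp
#include <vector>

// Hypothetical stand-ins for the per-thread buffer and the shared ready queue.
struct LocalBuf { std::vector<int> entries; };
struct GlobalQueue { std::vector<int> entries; };

// Moves everything from the local buffer into the global queue.
void flush_local_buf(LocalBuf& buf, GlobalQueue& q) {
    q.entries.insert(q.entries.end(), buf.entries.begin(), buf.entries.end());
    buf.entries.clear();
}

// RAII guard: flushes on every exit path from the enclosing scope.
class FlushGuard {
public:
    FlushGuard(LocalBuf& buf, GlobalQueue& q) : buf_(buf), q_(q) {}
    ~FlushGuard() { flush_local_buf(buf_, q_); }
    FlushGuard(const FlushGuard&) = delete;
    FlushGuard& operator=(const FlushGuard&) = delete;

private:
    LocalBuf& buf_;
    GlobalQueue& q_;
};

// Models one pass: IDLE work, mid flush, optional early return, PENDING work.
void dispatch_pass(LocalBuf& buf, GlobalQueue& q, bool early_return) {
    FlushGuard guard(buf, q);    // tail flush happens in the destructor
    buf.entries.push_back(1);    // IDLE-stage release_fanin output
    flush_local_buf(buf, q);     // mid flush between IDLE and PENDING
    if (early_return) return;    // guard still runs here
    buf.entries.push_back(2);    // PENDING-stage release_fanin output
}
```

The guard avoids repeating the tail flush before each return statement, so a return path added later cannot leave entries behind in the local buffer.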

The PMU single-issue short-circuit and the sync_start drain protocol
are preserved unchanged. a5 picks up the PMU guard alongside the new
policy (its prior implementation lacked it); there's no automated test
for this — PMU profiling correctness requires hardware PMU counters
and a single-issue baseline to compare against, neither of which the
sim suite provides. The change brings a5 in line with a2a3.

Cross-thread peer-tracker reads in has_idle_in_other_threads stay
plain (not atomic) and consume the value as a hint; the comment on
the implementation spells out the aarch64 single-copy-atomicity
argument and the drain-protocol exclusion.

PTO2_SCHED_PROFILING note: local_overflow_count now accumulates each
batch separately as flush_local_bufs is called multiple times per
pass (mid flush + RAII tail flush). Each entry is still counted
exactly once (count is zeroed after push_batch), but the per-pass
total reflects "entries pushed to the global queue this pass" rather
than the pre-refactor "buf residual at pass end". Comparing traces
across commits, expect the post-refactor number to be greater-or-equal.
@ChaoWao ChaoWao merged commit 4400558 into hw-native-sys:main May 26, 2026
15 checks passed
2 participants