WIP: ep_dispatch_combine idx channel uses INT32 TROWSUM compaction #843
zhangqi-chen wants to merge 1 commit into
The idx stage-out channel previously fell back to a scalar GM copy of column 0 because INT32 TROWSUM hung on a2a3 hardware. pto-isa now supports INT32 TROWSUM, so switch the idx channel to the same TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight channel, restoring symmetry between the two channels.
Refs hw-native-sys/pto-isa#119
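To make the equivalence between the two approaches concrete, here is a minimal host-side reference sketch in plain C++. It is not the pto-isa API: `copy_column0`, `row_sum`, and `make_wide` are hypothetical helper names, and the sketch assumes that each row of the wide tile carries its payload in column 0 with every other column zero. Under that assumption a row sum returns the same values as copying column 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model only; this is NOT the pto-isa TROWSUM API.
// A "wide" tile is stored row-major with `cols` INT32 entries per row.
// ASSUMPTION: the payload sits in column 0 and all other columns are zero.

// Scalar fallback: copy column 0 of every row.
std::vector<int32_t> copy_column0(const std::vector<int32_t>& wide,
                                  std::size_t rows, std::size_t cols) {
    std::vector<int32_t> out(rows);
    for (std::size_t r = 0; r < rows; ++r) out[r] = wide[r * cols];
    return out;
}

// Row-sum compaction: reduce each row to a single value.
std::vector<int32_t> row_sum(const std::vector<int32_t>& wide,
                             std::size_t rows, std::size_t cols) {
    std::vector<int32_t> out(rows, 0);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c) out[r] += wide[r * cols + c];
    return out;
}

// Hypothetical helper: build a zero-padded wide tile from a list of indices.
std::vector<int32_t> make_wide(const std::vector<int32_t>& idx,
                               std::size_t cols) {
    std::vector<int32_t> wide(idx.size() * cols, 0);
    for (std::size_t r = 0; r < idx.size(); ++r) wide[r * cols] = idx[r];
    return wide;
}
```

If any padding column were nonzero, the two functions would disagree, so the compaction path depends on the zero-padding assumption holding for the idx window.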
Code Review
This pull request updates the index processing in the dispatch kernel by replacing a scalar copy loop with TROWSUM compaction, consistent with the weight channel implementation. The review feedback identifies redundant pipe_barrier calls that can be removed to optimize performance, as the vector pipe executes in-order and synchronization is already handled by flags.
TLOAD(idx_wide_tile, idx_win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
pipe_barrier(PIPE_V);
TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(idx_out_g, idx_sum_tile);
The pipe_barrier(PIPE_V) calls (lines 541 and 543) are redundant. In the AIV architecture, the vector pipe (PIPE_V) executes instructions in-order. Since TROWSUM is a vector instruction and is already synchronized with the MTE pipes via the set_flag/wait_flag pairs (lines 539-540 and 544-545), these barriers do not provide additional safety and only consume cycles. Removing them simplifies the code and avoids unnecessary pipeline stalls. Note that the same redundancy exists in the weight loop above.
TLOAD(idx_wide_tile, idx_win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(idx_out_g, idx_sum_tile);
References
- A pipe_barrier is not required for performance profiling records on AICore after task execution.
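The ordering argument in the review comment can be illustrated with a toy, single-threaded scheduler model in plain C++. Everything here is hypothetical (`Instr`, `run`, `idx_stage_out_trace`, and the pipe numbering are made up for the sketch); it treats each pipe as an in-order queue and each flag as a counter, and it does not model data hazards inside a pipe, so it shows only that the set/wait pairs order the three stages across pipes. It does not by itself establish that the barriers are safe to remove on hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model, not hardware-accurate: each pipe is an in-order queue, and a
// flag is a counter keyed by (source pipe, destination pipe).
enum class Kind { Op, SetFlag, WaitFlag };

struct Instr {
    Kind kind;
    std::string name;  // label for Op; empty otherwise
    int src;           // flag source pipe (SetFlag/WaitFlag only)
    int dst;           // flag destination pipe (SetFlag/WaitFlag only)
};

// Polls all pipes round-robin; returns the order in which Ops retired.
std::vector<std::string> run(const std::vector<std::vector<Instr>>& pipes) {
    std::vector<std::size_t> pc(pipes.size(), 0);
    std::map<std::pair<int, int>, int> flags;
    std::vector<std::string> trace;
    bool progress = true;
    while (progress) {
        progress = false;
        for (std::size_t p = 0; p < pipes.size(); ++p) {
            if (pc[p] >= pipes[p].size()) continue;  // pipe finished
            const Instr& in = pipes[p][pc[p]];
            if (in.kind == Kind::WaitFlag) {
                int& f = flags[{in.src, in.dst}];
                if (f == 0) continue;  // blocked until the flag is set
                --f;
            } else if (in.kind == Kind::SetFlag) {
                ++flags[{in.src, in.dst}];
            } else {
                trace.push_back(in.name);
            }
            ++pc[p];
            progress = true;
        }
    }
    return trace;
}

// The barrier-free sequence from the suggestion, split per pipe.
// Pipes are deliberately polled in reverse (MTE3, V, MTE2) to show that
// the flags, not the polling order, determine the result.
std::vector<std::string> idx_stage_out_trace() {
    const int MTE2 = 0, V = 1, MTE3 = 2;  // hypothetical pipe ids
    std::vector<Instr> mte3 = {{Kind::WaitFlag, "", V, MTE3},
                               {Kind::Op, "TSTORE", 0, 0}};
    std::vector<Instr> vec = {{Kind::WaitFlag, "", MTE2, V},
                              {Kind::Op, "TROWSUM", 0, 0},
                              {Kind::SetFlag, "", V, MTE3}};
    std::vector<Instr> mte2 = {{Kind::Op, "TLOAD", 0, 0},
                               {Kind::SetFlag, "", MTE2, V}};
    return run({mte3, vec, mte2});
}
```

Calling `idx_stage_out_trace()` returns the retired operations in order; in this model the flags alone force load, then row sum, then store.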
Summary
- Switch the ep_dispatch_combine dispatch kernel's idx stage-out channel from a scalar GM copy of column 0 to the same TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight channel.
- The tiles (idx_wide_tile / idx_sum_tile / idx_tmp_tile) and tensor types (IWideG / ISumG) were already declared and TASSIGN'd; only the loop body and the surrounding comments changed.
- Update the main.py docstring to drop the a2a3-hang caveat and describe TROWSUM compaction for both channels.
Context
The idx channel previously used a scalar fallback because INT32 TROWSUM hung on a2a3 hardware (hw-native-sys/pto-isa#119). pto-isa now declares this path fixed; this PR exercises it.
Status / open items (WIP)
- Confirm that the pto-isa commit (ddafa8da, ci.yml) actually contains the INT32 TROWSUM fix; issue Fix: replace dcci no-op with acquire fence in a2a3sim #119 reported the hang on 687af1a6 and is still open.
- Run ep_dispatch_combine on a2a3 hardware to verify INT32 TROWSUM no longer hangs (sim alone cannot validate the hardware behavior).
- Decide whether the ep_dispatch_distributed example (also carrying an INT32 fallback) should be updated in the same PR.
Testing
Refs hw-native-sys/pto-isa#119