WIP: ep_dispatch_combine idx channel uses INT32 TROWSUM compaction by zhangqi-chen · Pull Request #843 · hw-native-sys/simpler

zhangqi-chen · 2026-05-22T01:58:48Z

Summary

Switch the ep_dispatch_combine dispatch kernel's idx stage-out channel from a scalar GM copy of column 0 to the same TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight channel.
The INT32 tiles (idx_wide_tile / idx_sum_tile / idx_tmp_tile) and tensor types (IWideG / ISumG) were already declared and TASSIGN'd — only the loop body and the surrounding comments changed.
Updated the kernel header comments and main.py docstring to drop the a2a3-hang caveat and describe TROWSUM compaction for both channels.
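
For readers unfamiliar with the compaction, the following host-side C++ sketch models why a row sum can stand in for a scalar copy of column 0. It rests on an assumption this PR does not state: each row of the wide INT32 tile carries the index in column 0 and zeros in the remaining columns. The tile width (`kWide`) and both function names are hypothetical, and nothing here calls the real pto-isa API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile width; the real kernel's tile shape is not shown in this PR.
constexpr std::size_t kWide = 8;
using WideRow = std::array<std::int32_t, kWide>;

// Model of the old path: copy column 0 of each row, one scalar at a time.
std::vector<std::int32_t> compact_scalar(const std::vector<WideRow>& tile) {
    std::vector<std::int32_t> out;
    out.reserve(tile.size());
    for (const WideRow& row : tile) {
        out.push_back(row[0]);
    }
    return out;
}

// Model of the new path: reduce each row to its sum, as a row-sum
// instruction would. Equals column 0 only if the other columns are zero
// (an assumption of this sketch).
std::vector<std::int32_t> compact_rowsum(const std::vector<WideRow>& tile) {
    std::vector<std::int32_t> out;
    out.reserve(tile.size());
    for (const WideRow& row : tile) {
        std::int32_t sum = 0;
        for (std::int32_t v : row) {
            sum += v;
        }
        out.push_back(sum);
    }
    return out;
}
```

Under the zero-padding assumption the two functions return identical vectors for any tile. If the padding columns ever held nonzero data, the row sum would diverge from the column-0 copy, so the equivalence depends on how the window is populated upstream.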

Context

The idx channel previously used a scalar fallback because INT32 TROWSUM hung on a2a3 hardware (hw-native-sys/pto-isa#119). pto-isa now declares this path fixed; this PR exercises it.

Status / open items (WIP)

Confirm the CI-pinned pto-isa commit (ddafa8da, ci.yml) actually contains the INT32 TROWSUM fix; issue hw-native-sys/pto-isa#119 reported the hang on 687af1a6 and is still open.
Run ep_dispatch_combine on a2a3 hardware to verify INT32 TROWSUM no longer hangs (sim alone cannot validate the hardware behavior).
Decide whether the sibling ep_dispatch_distributed example (also carrying an INT32 fallback) should be updated in the same PR.

Testing

Simulation tests pass
Hardware (a2a3) tests pass

Refs hw-native-sys/pto-isa#119

The idx stage-out channel previously fell back to a scalar GM copy of column 0 because INT32 TROWSUM hung on a2a3 hardware. pto-isa now supports INT32 TROWSUM, so switch the idx channel to the same TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight channel, restoring symmetry between the two channels. Refs hw-native-sys/pto-isa#119

gemini-code-assist

Code Review

This pull request updates the index processing in the dispatch kernel by replacing a scalar copy loop with TROWSUM compaction, consistent with the weight channel implementation. The review feedback identifies redundant pipe_barrier calls that can be removed to optimize performance, as the vector pipe executes in-order and synchronization is already handled by flags.

gemini-code-assist · 2026-05-22T02:01:48Z

+        TLOAD(idx_wide_tile, idx_win_g);
+        set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
+        wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
+        pipe_barrier(PIPE_V);
+        TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
+        pipe_barrier(PIPE_V);
+        set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
+        wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
+        TSTORE(idx_out_g, idx_sum_tile);


The pipe_barrier(PIPE_V) calls (lines 541 and 543) are redundant. In the AIV architecture, the vector pipe (PIPE_V) executes instructions in-order. Since TROWSUM is a vector instruction and is already synchronized with the MTE pipes via the set_flag/wait_flag pairs (lines 539-540 and 544-545), these barriers do not provide additional safety and only consume cycles. Removing them simplifies the code and avoids unnecessary pipeline stalls. Note that the same redundancy exists in the weight loop above.

        TLOAD(idx_wide_tile, idx_win_g);
        set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
        wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
        TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
        set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
        wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
        TSTORE(idx_out_g, idx_sum_tile);

References

A pipe_barrier is not required for performance profiling records on AICore after task execution.

zhangqi-chen wants to merge 1 commit into hw-native-sys:main from zhangqi-chen:ep-dispatch-combine-int32-trowsum
