Skip to content

WIP: ep_dispatch_combine idx channel uses INT32 TROWSUM compaction#843

Draft
zhangqi-chen wants to merge 1 commit into
hw-native-sys:mainfrom
zhangqi-chen:ep-dispatch-combine-int32-trowsum
Draft

WIP: ep_dispatch_combine idx channel uses INT32 TROWSUM compaction#843
zhangqi-chen wants to merge 1 commit into
hw-native-sys:mainfrom
zhangqi-chen:ep-dispatch-combine-int32-trowsum

Conversation

@zhangqi-chen
Copy link
Copy Markdown
Contributor

Summary

  • Switch the ep_dispatch_combine dispatch kernel's idx stage-out channel from a scalar GM copy of column 0 to the same TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight channel.
  • The INT32 tiles (idx_wide_tile / idx_sum_tile / idx_tmp_tile) and tensor types (IWideG / ISumG) were already declared and TASSIGN'd — only the loop body and the surrounding comments changed.
  • Updated the kernel header comments and main.py docstring to drop the a2a3-hang caveat and describe TROWSUM compaction for both channels.

Context

The idx channel previously used a scalar fallback because INT32 TROWSUM hung on a2a3 hardware (hw-native-sys/pto-isa#119). pto-isa now declares this path fixed; this PR exercises it.

Status / open items (WIP)

  • Confirm the CI-pinned pto-isa commit (ddafa8da, ci.yml) actually contains the INT32 TROWSUM fix — issue Fix: replace dcci no-op with acquire fence in a2a3sim #119 reported the hang on 687af1a6 and is still open.
  • Run ep_dispatch_combine on a2a3 hardware to verify INT32 TROWSUM no longer hangs (sim alone cannot validate the hardware behavior).
  • Decide whether the sibling ep_dispatch_distributed example (also carrying an INT32 fallback) should be updated in the same PR.

Testing

  • Simulation tests pass
  • Hardware (a2a3) tests pass

Refs hw-native-sys/pto-isa#119

The idx stage-out channel previously fell back to a scalar GM copy of
column 0 because INT32 TROWSUM hung on a2a3 hardware. pto-isa now
supports INT32 TROWSUM, so switch the idx channel to the same
TLOAD + TROWSUM + TSTORE compaction already used for the FP32 weight
channel, restoring symmetry between the two channels.

Refs hw-native-sys/pto-isa#119
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the index processing in the dispatch kernel by replacing a scalar copy loop with TROWSUM compaction, consistent with the weight channel implementation. The review feedback identifies redundant pipe_barrier calls that can be removed to optimize performance, as the vector pipe executes in-order and synchronization is already handled by flags.

Comment on lines +538 to +546
TLOAD(idx_wide_tile, idx_win_g);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
pipe_barrier(PIPE_V);
TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
pipe_barrier(PIPE_V);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
TSTORE(idx_out_g, idx_sum_tile);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pipe_barrier(PIPE_V) calls (lines 541 and 543) are redundant. In the AIV architecture, the vector pipe (PIPE_V) executes instructions in-order. Since TROWSUM is a vector instruction and is already synchronized with the MTE pipes via the set_flag/wait_flag pairs (lines 539-540 and 544-545), these barriers do not provide additional safety and only consume cycles. Removing them simplifies the code and avoids unnecessary pipeline stalls. Note that the same redundancy exists in the weight loop above.

        TLOAD(idx_wide_tile, idx_win_g);
        set_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
        wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID1);
        TROWSUM(idx_sum_tile, idx_wide_tile, idx_tmp_tile);
        set_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
        wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID1);
        TSTORE(idx_out_g, idx_sum_tile);
References
  1. A pipe_barrier is not required for performance profiling records on AICore after task execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant