Add a weight-stationary INT4 GEMM for small M by digantdesai · Pull Request #20154 · pytorch/executorch

digantdesai · 2026-06-09T16:03:47Z

The INT4 dp4a kernel launched one block-row per activation row (grid.y = M) and
re-read the packed weights for every row, so weight traffic scaled with M. That
is fine for M=1 decode, but it makes a small-M forward (EAGLE speculative
verification over chain_len+1 tokens) cost roughly as much as M separate decodes.
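
As a rough illustration of why the loop order matters, here is a minimal pure-Python model that counts weight reads for the per-row order described above and for a weight-stationary order. It is a sketch, not the CUDA kernel: the function names are hypothetical, and the weights are plain integers rather than packed INT4 values.

```python
def per_row_matvec(acts, weights):
    """Per-row order: every activation row walks the whole weight matrix.

    acts is M x K, weights is N x K (hypothetical layout for illustration).
    Returns the M x N output and the number of weight reads.
    """
    m_rows, n_out, k_dim = len(acts), len(weights), len(weights[0])
    out = [[0] * n_out for _ in range(m_rows)]
    weight_reads = 0
    for m in range(m_rows):            # models grid.y = M
        for n in range(n_out):
            for k in range(k_dim):
                out[m][n] += acts[m][k] * weights[n][k]
                weight_reads += 1      # weight fetched again for each row
    return out, weight_reads


def weight_stationary_gemm(acts, weights):
    """Weight-stationary order: read each weight once, update all M rows."""
    m_rows, n_out, k_dim = len(acts), len(weights), len(weights[0])
    out = [[0] * n_out for _ in range(m_rows)]
    weight_reads = 0
    for n in range(n_out):             # models grid.y = 1
        for k in range(k_dim):
            w = weights[n][k]          # single read of this weight
            weight_reads += 1
            for m in range(m_rows):
                out[m][n] += acts[m][k] * w
    return out, weight_reads
```

Both functions compute the same product. For M=3, N=2, K=4 the first performs 24 weight reads (M x N x K) and the second performs 8 (N x K), which is the 1x weight traffic the weight-stationary order aims for.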

Add an int4_w4a8_gemm_kernel that loads each weight chunk once and accumulates it
into all M output rows (grid.y = 1), so weight traffic is 1x regardless of M;
int4_plain_mm uses it for 2 <= M <= GEMM_MAX_M (8) and keeps the matvec for M=1.
MATVEC_MAX_M (the Python dispatch threshold) stays at 4 by default so that other
models' dynamic-prefill exports are unaffected; an export that needs a larger M
raises it locally. The dispatch asserts MATVEC_MAX_M <= SHIM_GEMM_MAX_M so that
the Python and C++ limits cannot silently diverge.
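
The dispatch rules in the paragraph above can be sketched as follows. This is a hedged illustration, not the code in this PR: `choose_path` and the returned labels are hypothetical, `SHIM_GEMM_MAX_M = 8` is assumed to mirror the C++ GEMM_MAX_M, and `"other"` stands in for whatever path handles M above the threshold, which the description does not name.

```python
MATVEC_MAX_M = 4      # Python dispatch threshold, default per the description
SHIM_GEMM_MAX_M = 8   # assumption: Python-side mirror of the C++ GEMM_MAX_M


def choose_path(m, matvec_max_m=MATVEC_MAX_M):
    """Return a label for the kernel path a given M would take (hypothetical)."""
    # Guard so the Python threshold can never exceed what the C++ side accepts.
    assert matvec_max_m <= SHIM_GEMM_MAX_M, (
        "MATVEC_MAX_M must not exceed SHIM_GEMM_MAX_M"
    )
    if m < 1:
        raise ValueError("M must be at least 1")
    if m > matvec_max_m:
        return "other"    # placeholder: path not specified in the description
    if m == 1:
        return "matvec"   # M=1 decode keeps the existing matvec
    return "gemm"         # 2 <= M <= threshold uses the weight-stationary GEMM
```

With the default threshold, `choose_path(3)` returns `"gemm"` and `choose_path(5)` returns `"other"`; an export that raises the threshold locally, for example `choose_path(5, matvec_max_m=8)`, gets `"gemm"` instead.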

Authored with assistance from Claude Code.

[ghstack-poisoned]

digantdesai · 2026-06-09T16:03:49Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2026-06-09T16:03:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20154

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Pending, 1 Unrelated Failure, 1 Unclassified Failure

As of commit c7e303c with merge base dc55469:

NEW FAILURES - The following jobs have failed:

pull / android / run-emulator (gh)
The process '/usr/local/lib/android/sdk/platform-tools/adb' failed with exit code 224
Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh)
Process completed with exit code 1.

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

Test CUDA Builds / export-model-cuda-artifact (openai, whisper-large-v3-turbo, quantized-int4-weight-only) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t c2240f0ade14ad6f138c32517cb8420e513db45655c3d3fbf9a8f4248a8764fe /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

Test CUDA Builds / unittest-cuda / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Update

c7e303c

[ghstack-poisoned]

This was referenced Jun 9, 2026

Add EAGLE-3 draft head #20149 (Draft)
Add Gemma4 EAGLE-3 hidden-state taps #20150 (Draft)
Add Gemma4 EAGLE-3 eager reference #20151 (Draft)
Add KV cache to the EAGLE-3 draft head #20152 (Draft)
Add the Eagle3Speculator module for speculative decoding #20153 (Draft)

meta-cla Bot added the CLA Signed label Jun 9, 2026

This was referenced Jun 9, 2026

Add the EAGLE-3 speculator CUDA export #20155 (Draft)
Add the EAGLE-3 speculative-decoding runner (CUDA) #20156 (Draft)
Add EAGLE-3 end-to-end speculative-decode test #20157 (Draft)

digantdesai wants to merge 1 commit into gh/digantdesai/58/head from gh/digantdesai/59/head
