[cuda backend] int4/8 matvec: vectorized activation load (#20144) by johnny90 · Pull Request #20233 · pytorch/executorch

johnny90 · 2026-06-12T08:39:53Z

The decode-only int4_plain_mm matvec was bound by activation
load-instruction throughput, not by DRAM bandwidth (already ~64% of
peak) or latency. Each inner iteration issued ~15 loads per 16-byte
weight chunk, including 8 scalar int32 activation loads plus the same
per-block scale d reloaded 4x. The same applies to int8_plain_mm.
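
The scalar load pattern described above can be modeled on the host. The sketch below is plain C++, not the actual CUDA kernel: `Q8BlockPacked`, `dp4a_ref`, `LoadStats`, and `scalar_chunk_dot` are hypothetical names, the `float` scale type, the field order, and the pairing of `qs_even`/`qs_odd` with low/high weight nibbles are assumptions, and weight scales and zero-points are omitted. It only illustrates where 8 scalar activation loads and 4 scale reloads per 16-byte weight chunk could come from.

```cpp
#include <cstdint>
#include <cstring>

// Assumed pre-change block: 4-byte scale + 32 int8 activations = 36 bytes.
struct Q8BlockPacked {
    float d;                  // per-block activation scale (type assumed)
    std::int8_t qs_even[16];  // assumed to pair with low weight nibbles
    std::int8_t qs_odd[16];   // assumed to pair with high weight nibbles
};
static_assert(sizeof(Q8BlockPacked) == 36, "36-byte block, as in the description");

// Portable stand-in for CUDA's __dp4a: four int8 x int8 products added to acc.
inline std::int32_t dp4a_ref(std::int32_t a, std::int32_t b, std::int32_t acc) {
    std::int8_t av[4], bv[4];
    std::memcpy(av, &a, 4);
    std::memcpy(bv, &b, 4);
    for (int i = 0; i < 4; ++i) {
        acc += std::int32_t(av[i]) * std::int32_t(bv[i]);
    }
    return acc;
}

// Counters for the loads this model issues against the activation block.
struct LoadStats {
    int act_loads = 0;
    int scale_loads = 0;
};

// One 16-byte weight chunk = four 32-bit words = 32 int4 weights.
// Each word triggers two 4-byte activation loads and one reload of d.
inline float scalar_chunk_dot(const std::uint32_t w[4],
                              const Q8BlockPacked& blk,
                              LoadStats& stats) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        std::int32_t a_even, a_odd;
        std::memcpy(&a_even, blk.qs_even + 4 * i, 4);
        ++stats.act_loads;
        std::memcpy(&a_odd, blk.qs_odd + 4 * i, 4);
        ++stats.act_loads;
        const float d = blk.d;  // same scale, fetched again for every word
        ++stats.scale_loads;

        const std::int32_t lo = std::int32_t(w[i] & 0x0F0F0F0Fu);
        const std::int32_t hi = std::int32_t((w[i] >> 4) & 0x0F0F0F0Fu);
        std::int32_t acc = dp4a_ref(lo, a_even, 0);
        acc = dp4a_ref(hi, a_odd, acc);
        sum += d * float(acc);
    }
    return sum;
}
```

In this model the counters end at 8 activation loads and 4 scale loads per chunk, which accounts for 12 of the ~15 loads; the description does not itemize the rest.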

Align Q8Block to 16 bytes (sizeof 36 -> 48) so that each block's
qs_even/qs_odd 16B halves are 16B-aligned, then load a whole activation
block with two vectorized uint4 loads plus one d load (~4x fewer
activation loads). The dp4a math and accumulation order are unchanged,
so results are bit-identical; the int8 activation values and scale are
also unchanged.
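
A minimal host-side sketch of the aligned layout follows, written in plain C++ so it can be compiled without a GPU. The field order, the `float` type of `d`, and the names `U32x4`, `ActBlock`, `load_block_vectorized`, `load_word_scalar`, and `lanes_match` are assumptions for illustration; only `Q8Block`, `qs_even`, `qs_odd`, `d`, and the 36 -> 48 size change come from the description. The check at the end shows that two 16-byte loads yield the same 32-bit lanes as eight 4-byte loads, which is why the dp4a inputs would not change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Host-side stand-in for CUDA's 16-byte uint4 vector type.
struct alignas(16) U32x4 {
    std::uint32_t x, y, z, w;
};

// Assumed post-change layout: alignas(16) pads 36 payload bytes to 48.
struct alignas(16) Q8Block {
    std::int8_t qs_even[16];  // offset 0
    std::int8_t qs_odd[16];   // offset 16
    float d;                  // offset 32, followed by 12 padding bytes
};
static_assert(sizeof(Q8Block) == 48, "block grows from 36 to 48 bytes");
static_assert(alignof(Q8Block) == 16, "block is 16-byte aligned");
static_assert(offsetof(Q8Block, qs_even) % 16 == 0, "qs_even is 16B-aligned");
static_assert(offsetof(Q8Block, qs_odd) % 16 == 0, "qs_odd is 16B-aligned");

// Everything one block contributes, held in registers after loading.
struct ActBlock {
    U32x4 even;
    U32x4 odd;
    float d;
};

// Two 16-byte loads plus one scale load: 3 loads instead of 8 + 4.
inline ActBlock load_block_vectorized(const Q8Block* blk) {
    assert(reinterpret_cast<std::uintptr_t>(blk) % 16 == 0);
    ActBlock out;
    std::memcpy(&out.even, blk->qs_even, 16);
    std::memcpy(&out.odd, blk->qs_odd, 16);
    out.d = blk->d;
    return out;
}

// Scalar reference: the i-th 32-bit word of a 16-byte half.
inline std::uint32_t load_word_scalar(const std::int8_t* half, int i) {
    std::uint32_t v;
    std::memcpy(&v, half + 4 * i, 4);
    return v;
}

// True when the vectorized loads produce exactly the scalar path's lanes.
inline bool lanes_match(const Q8Block& blk) {
    const ActBlock v = load_block_vectorized(&blk);
    const std::uint32_t e[4] = {v.even.x, v.even.y, v.even.z, v.even.w};
    const std::uint32_t o[4] = {v.odd.x, v.odd.y, v.odd.z, v.odd.w};
    for (int i = 0; i < 4; ++i) {
        if (e[i] != load_word_scalar(blk.qs_even, i)) return false;
        if (o[i] != load_word_scalar(blk.qs_odd, i)) return false;
    }
    return true;
}
```

In a CUDA kernel the two `memcpy` calls would typically become loads through a `const uint4*`, which is only valid when the address is 16-byte aligned; that requirement is what the `alignas(16)` padding pays for.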

gemma4_31b decode (long-ctx harness, stacked on optimize_1):
decode 43.98 -> 46.557 tok/s (+6.4%); +12.7% compared with llama.cpp
(41.5 tok/s)

profile result: int4 matvec avg 38.4 -> 34.75 us (-9.5%); quant kernel
unchanged.

pytorch-bot · 2026-06-12T08:39:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20233

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 135 Pending

As of commit d1d9256 with merge base 630ddba:

NEW FAILURES - The following jobs have failed:

Cadence Build & Test / hifi-build / hifi4 (gh)
Input required and not supplied: aws-region
Cadence Build & Test / vision-build / vision (gh)
Input required and not supplied: aws-region

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-06-12T08:40:00Z

meta-codesync · 2026-06-12T08:40:01Z

@johnny90 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108299291.

github-actions · 2026-06-12T08:40:45Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

johnny90 force-pushed the export-D108299291 branch from a9d2322 to d1d9256 Compare June 12, 2026 08:39

meta-cla Bot added the CLA Signed label Jun 12, 2026

meta-codesync Bot added the meta-exported label Jun 12, 2026

johnny90 had a problem deploying to cadence June 12, 2026 08:40 — with GitHub Actions Failure

johnny90 closed this Jun 12, 2026

johnny90 force-pushed the export-D108299291 branch from d1d9256 to 630ddba Compare June 12, 2026 08:59

meta-codesync Bot changed the title ~~Migrate non-MIG A100 RE tests to A100-exclusive subplatform~~ [cuda backend] int4/8 matvec: vectorized activation load (#20144) Jun 12, 2026

Gasoonjia temporarily deployed to update-commit-hash June 12, 2026 09:04 — with GitHub Actions Inactive

johnny90 deleted the export-D108299291 branch June 12, 2026 09:05

Gasoonjia temporarily deployed to upload-benchmark-results June 12, 2026 09:06 — with GitHub Actions Inactive

meta-codesync Bot temporarily deployed to cadence June 12, 2026 09:17 Inactive

facebook-github-bot had a problem deploying to update-viable-strict June 12, 2026 09:39 — with GitHub Actions Failure

pytorch-bot Bot temporarily deployed to update-commit-hash June 12, 2026 09:42 Inactive

pytorch-bot Bot temporarily deployed to update-commit-hash June 12, 2026 09:49 Inactive

pytorch-bot Bot temporarily deployed to update-commit-hash June 12, 2026 10:05 Inactive

johnny90 wants to merge 0 commits into pytorch:main from johnny90:export-D108299291
