[cuda backend] int4/8 matvec: vectorized activation load (#20144) #20233
johnny90 wants to merge 0 commit into
@johnny90 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108299291.
A100-exclusive subplatform
The decode-only int4_plain_mm matvec was bound by activation
load-instruction throughput, not by DRAM bandwidth (already at ~64% of
peak) or latency. Each inner iteration issued ~15 loads per 16-byte
weight chunk, including 8 scalar int32 activation loads plus the same
per-block scale d reloaded 4x. The same applies to int8_plain_mm.
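
As a rough check of the activation-side load count, the sketch below tallies the loads itemized above and compares them with the two vector loads plus one scale load that this PR proposes. The constant names are hypothetical and exist only for illustration. The remaining loads in the ~15 figure are not itemized in the description, so they are left out.

```cpp
#include <cassert>

// Activation-side loads per 16-byte weight chunk before the change,
// as itemized in the description (names are illustrative only).
constexpr int kScalarQsLoads = 8;  // eight scalar int32 loads of int8 activations
constexpr int kScaleReloads  = 4;  // the same per-block scale d, loaded four times
constexpr int kLoadsBefore   = kScalarQsLoads + kScaleReloads;  // 12

// After the change: two 16-byte vector loads plus a single load of d.
constexpr int kVectorQsLoads = 2;
constexpr int kScaleLoads    = 1;
constexpr int kLoadsAfter    = kVectorQsLoads + kScaleLoads;    // 3

// Ratio of activation load instructions, before vs. after.
constexpr int activation_load_reduction() {
    return kLoadsBefore / kLoadsAfter;
}

static_assert(activation_load_reduction() == 4,
              "12 activation loads shrink to 3, i.e. 4x fewer");
```

This only counts instructions. It says nothing about bytes moved, which stay the same for the int8 values.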
Align Q8Block to 16 bytes (sizeof 36->48) so each block's qs_even/qs_odd
16B halves are 16B-aligned, then load a whole activation block with two
vectorized uint4 loads + one d load (~4x fewer activation loads). dp4a
math and accumulation order are bit-identical; the int8 activation
values and scale are unchanged.
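
The layout change and the bit-identical claim can be sketched in host-side C++. This is not the kernel code from this PR. The field order, the float scale type (chosen because it yields sizeof 36), the `Vec4` stand-in for CUDA's `uint4`, the `dp4a_ref` stand-in for the `__dp4a` intrinsic, and the pairing of weight words with activation words are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed pre-change layout: float scale + 32 int8 values, 4-byte aligned.
struct Q8BlockPacked {
    float  d;
    int8_t qs_even[16];
    int8_t qs_odd[16];
};

// Assumed post-change layout: 16-byte aligned, so in an array of blocks
// both halves always start on a 16-byte boundary.
struct alignas(16) Q8BlockAligned {
    int8_t qs_even[16];  // offset 0
    int8_t qs_odd[16];   // offset 16
    float  d;            // offset 32, then 12 bytes of tail padding
};

static_assert(sizeof(Q8BlockPacked) == 36, "36-byte block before alignment");
static_assert(sizeof(Q8BlockAligned) == 48, "48-byte block after alignment");
static_assert(offsetof(Q8BlockAligned, qs_even) % 16 == 0, "even half aligned");
static_assert(offsetof(Q8BlockAligned, qs_odd) % 16 == 0, "odd half aligned");

// Host stand-in for a 16-byte vector register (CUDA's uint4 plays this role).
struct alignas(16) Vec4 { int32_t x, y, z, w; };

// Portable reference for a 4-way int8 dot product with accumulate.
inline int32_t dp4a_ref(int32_t a, int32_t b, int32_t acc) {
    int8_t a8[4], b8[4];
    std::memcpy(a8, &a, 4);
    std::memcpy(b8, &b, 4);
    for (int i = 0; i < 4; ++i) acc += int32_t(a8[i]) * int32_t(b8[i]);
    return acc;
}

// Old pattern: eight separate 4-byte activation loads.
int32_t dot_scalar(const Q8BlockAligned& b, const int32_t w[8]) {
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i) {
        int32_t a;
        std::memcpy(&a, b.qs_even + 4 * i, 4);
        acc = dp4a_ref(a, w[i], acc);
    }
    for (int i = 0; i < 4; ++i) {
        int32_t a;
        std::memcpy(&a, b.qs_odd + 4 * i, 4);
        acc = dp4a_ref(a, w[4 + i], acc);
    }
    return acc;
}

// New pattern: two 16-byte loads, then the same dp4a sequence in the same order.
int32_t dot_vector(const Q8BlockAligned& b, const int32_t w[8]) {
    Vec4 e, o;
    std::memcpy(&e, b.qs_even, 16);
    std::memcpy(&o, b.qs_odd, 16);
    int32_t acc = 0;
    acc = dp4a_ref(e.x, w[0], acc);
    acc = dp4a_ref(e.y, w[1], acc);
    acc = dp4a_ref(e.z, w[2], acc);
    acc = dp4a_ref(e.w, w[3], acc);
    acc = dp4a_ref(o.x, w[4], acc);
    acc = dp4a_ref(o.y, w[5], acc);
    acc = dp4a_ref(o.z, w[6], acc);
    acc = dp4a_ref(o.w, w[7], acc);
    return acc;
}

// Fills one block with arbitrary values and compares the two load patterns.
bool paths_match() {
    Q8BlockAligned blk{};
    int32_t w[8];
    for (int i = 0; i < 16; ++i) {
        blk.qs_even[i] = static_cast<int8_t>(i * 7 - 50);
        blk.qs_odd[i]  = static_cast<int8_t>(40 - i * 5);
    }
    for (int i = 0; i < 8; ++i) w[i] = 0x01020304 * (i + 1);
    return dot_scalar(blk, w) == dot_vector(blk, w);
}
```

Only the way the bytes reach registers differs between the two functions. The operands and the accumulation order are the same, which is why the result is expected to match exactly.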
gemma4_31b decode (long-ctx harness, stacked on optimize_1):
decode 43.98 -> 46.8 tok/s (+6.4%), +12.7% compared with llama.cpp
(41.5 tok/s)
profile result: int4 matvec avg 38.4 -> 34.75 us (-9.5%); quant kernel
unchanged.