Add GEMV INT 4 by albiol2004 · Pull Request #101 · amd/IRON

albiol2004 · 2026-04-09T17:50:14Z

Added

aie_kernels/generic/fused_dequant_gemv.cc : C++ AIE kernel that fuses INT4 weight dequantization with matrix-vector multiplication in a single pass. Uses the proven aie::unpack chain from expand.cc (uint4 → uint8 → uint16 → bf16 → scale → MAC). Includes a double-pump optimization that processes 2 groups per iteration for instruction interleaving, and compile-time DIM_K/GROUP_SIZE for loop specialization.
iron/operators/gemv_int4/ : Python operator (op.py, design.py, reference.py, test.py) following the existing gemv/ and dequant/ patterns. Packed weight buffer layout: [uint4 weights | bf16 per-group scales] per tile.
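As a rough illustration of what the fused kernel computes, here is a minimal pure-Python sketch of group-quantized INT4 dequantization followed by a matrix-vector product. It is not the operator's reference.py. The low-nibble-first packing order, the plain floats standing in for bf16 scales, the optional `zero_point`, and all function names are assumptions made for the example.

```python
def pack_uint4(values):
    """Pack ints in [0, 15] into bytes, two per byte (low nibble first: an assumed order)."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))


def unpack_uint4(packed):
    """Inverse of pack_uint4: expand each byte into two 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out


def dequant_gemv(packed_rows, scales, x, group_size=32, zero_point=0):
    """y[m] = sum over groups g of scales[m][g] * sum_{k in g} (q[m][k] - zero_point) * x[k].

    packed_rows: one bytes object per output row (K/2 bytes each).
    scales:      one list of K/group_size floats per output row.
    zero_point:  hypothetical offset; 0 matches a scale-only dequantization.
    """
    y = []
    for packed, row_scales in zip(packed_rows, scales):
        q = unpack_uint4(packed)
        acc = 0.0
        for g, s in enumerate(row_scales):
            lo = g * group_size
            partial = sum((q[k] - zero_point) * x[k] for k in range(lo, lo + group_size))
            acc += s * partial  # scale applied once per group
        y.append(acc)
    return y


# Tiny example: one row, K=4, two groups of two weights each.
row = pack_uint4([9, 7, 8, 10])
y = dequant_gemv([row], [[0.5, 2.0]], [1.0, 2.0, 3.0, 4.0], group_size=2, zero_point=8)
print(y)
```

The AIE kernel performs the same arithmetic on vectors in registers; this sketch only pins down the math, not the tile layout or the unpack instructions.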

Changed

Nothing. This is a new operator with no modifications to existing code.

Removed

Nothing.

Motivation

Decode inference is bandwidth-bound: GEMV loads each weight once per token. INT4 weights are half the size of INT8 and a quarter of bf16, making INT4 GEMV the highest-impact single operator for decode throughput.
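The size argument can be checked with a few lines of arithmetic. The sketch below counts weight bytes only and ignores the per-group scales; the helper name and the choice of the 8192×2048 shape are just for illustration.

```python
def weight_bytes(rows, cols, bits_per_weight):
    """Bytes needed to store a rows x cols weight matrix at the given precision."""
    total_bits = rows * cols * bits_per_weight
    if total_bits % 8:
        raise ValueError("matrix does not fill a whole number of bytes")
    return total_bits // 8


shape = (8192, 2048)  # one of the shapes tested in this PR
sizes = {name: weight_bytes(*shape, bits)
         for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]}

for name, size in sizes.items():
    # Each decode step reads every weight once, so this is the traffic per token.
    print(f"{name}: {size / 2**20:.0f} MiB read per token for this matrix")
```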

This is a foundation operator: a standalone, individually tested building block for INT4 weight inference. It is not intended as an optimized fused decode kernel; the goal is to provide a correct, well-tested INT4 GEMV that can be composed into fused pipelines.

Credit to @jgmelber for the original fused dequant-GEMV concept in PR #79 and the INT4 GEMV benchmarks in PR #71 (21.4 GB/s on 8192×2048). That work demonstrated the viability of fused INT4 inference on AIE and informed the design here. If you have insights on further kernel optimizations I'd love to hear them; my focus was on getting the foundation block right rather than on a fully optimized fused kernel.

Test results

7/7 parametrized tests pass (2048×2048, 8192×2048, 2048×8192 at 4/8 columns, tsi=1/4)
Integration tested with real Llama 3.2 1B weights (cosine similarity >0.999 vs CPU reference)
Performance (8 columns): 2048×8192 at 561 us / 16.9 GB/s, 8192×2048 at 665 us / 14.2 GB/s
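For readers who want to relate the latency and bandwidth figures, here is one possible accounting, offered as a sketch rather than as the method used to produce the numbers above. It assumes the bytes moved are the packed uint4 weights plus one 2-byte bf16 scale per group, a group size of 32 (inferred from the "2 groups (64 elements)" wording in the commit description), and 1 GB = 1e9 bytes. Both function names are hypothetical.

```python
def packed_int4_bytes(rows, cols, group_size=32, scale_bytes=2):
    """Bytes in a packed buffer: two weights per byte plus one scale per group (assumed layout)."""
    weights = rows * cols // 2
    scales = (rows * cols // group_size) * scale_bytes
    return weights + scales


def effective_bandwidth_gbs(num_bytes, latency_us):
    """Bytes moved divided by wall time, in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (latency_us * 1e-6) / 1e9


for shape, latency_us in [((2048, 8192), 561), ((8192, 2048), 665)]:
    nbytes = packed_int4_bytes(*shape)
    print(f"{shape[0]}x{shape[1]}: {nbytes} bytes, "
          f"{effective_bandwidth_gbs(nbytes, latency_us):.1f} GB/s")
```

Under these assumptions the results land within about 1% of the reported 16.9 GB/s and 14.2 GB/s; the small gap may come from also counting the activation and output vectors.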

Checklist

Tests pass locally
Code formatted (black + clang-format)
License headers on all files
No changes to existing code

Fused INT4 weight dequantization + matrix-vector multiplication in a single kernel pass. Loads packed uint4 weights from DDR, dequantizes in-register using the aie::unpack chain, and MACs with bf16 activation vector, 4x DDR bandwidth reduction vs bf16 GEMV.

Kernel optimizations:
- Compile-time GROUP_SIZE and DIM_K for loop count optimization
- Double-pump: processes 2 groups (64 elements) per iteration, giving the compiler two independent unpack chains to interleave
- AIE_PREPARE_FOR_PIPELINING and AIE_LOOP_MIN_ITERATION_COUNT hints

Tested on AMD Ryzen AI 9 HX 370 (NPU2, 8 columns):
- 2048x8192 (Llama up_proj): 561 us, 16.9 GB/s effective bandwidth
- 8192x2048 (Llama down_proj): 665 us, 14.2 GB/s effective bandwidth
- Integration tested with real Llama 3.2 1B weights (cosine sim >0.999)
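The double-pump idea is a scheduling optimization, so Python cannot show the speedup, but it can show the loop structure: each iteration handles two groups through chains that share no intermediate values. The sketch below is a structural illustration only; the function names and the use of two separate accumulators are assumptions, not a transcription of fused_dequant_gemv.cc.

```python
def dot_single_pump(q, scales, x, group_size=32):
    """One group per iteration: each step waits on the previous accumulation."""
    acc = 0.0
    for g, s in enumerate(scales):
        lo = g * group_size
        acc += s * sum(q[k] * x[k] for k in range(lo, lo + group_size))
    return acc


def dot_double_pump(q, scales, x, group_size=32):
    """Two groups per iteration with independent chains A and B."""
    if len(scales) % 2:
        raise ValueError("double-pump needs an even number of groups")
    acc_a = 0.0
    acc_b = 0.0
    for g in range(0, len(scales), 2):
        lo_a = g * group_size
        lo_b = lo_a + group_size
        # Chain A and chain B do not depend on each other, so a VLIW
        # compiler is free to overlap their unpack/convert/scale steps.
        acc_a += scales[g] * sum(q[k] * x[k] for k in range(lo_a, lo_a + group_size))
        acc_b += scales[g + 1] * sum(q[k] * x[k] for k in range(lo_b, lo_b + group_size))
    return acc_a + acc_b


# Four groups of 32 elements; values chosen so float sums are exact.
q = [k % 16 for k in range(128)]
x = [float(k % 3) for k in range(128)]
scales = [1.0, 2.0, 0.5, 4.0]
print(dot_single_pump(q, scales, x), dot_double_pump(q, scales, x))
```

Both functions compute the same dot product; only the dependency structure of the loop body differs.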

hunhoffe · 2026-04-10T14:19:11Z

@@ -0,0 +1,161 @@
+# SPDX-FileCopyrightText: Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.


How much of this is similar to other GEMM ops? Can we find a way to reuse code?

Like 70% similar to gemv. But different layout and kernel. I could import from gemv to reuse _build_gemv_program()

or have it as a common helper (only 2 ops using it I think) which I would leave for refactor in another PR

Basically it’s the fused kernel that is important so depends on what you prefer for readability/maintenance

github-actions · 2026-04-10T17:12:46Z

📊 Test Results for Test Example Applications

2703e79 (2026_04_10_17_11_49)

IRONCLAD

Tested on 2026_04_10_17_11_49 at commit 2703e79.

Test	Checks	TTFT (mean)	TPS (mean)

📈 Trends (vs main branch) for Test Example Applications

2703e79 (2026_04_10_17_11_49)

IRONCLAD Trends

llama_3.2_1b

Commit/Date	Num Tokens (max)	Num Tokens (mean)	Num Tokens (median)	Num Tokens (min)	Num Tokens (stddev)	TPS (max)	TPS (mean)	TPS (median)	TPS (min)	TPS (stddev)	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)	Total (max)	Total (mean)	Total (median)	Total (min)	Total (stddev)
`130b6ea` — 2025-12-05 21:33:12	40.00 (+0.00%)	40.00 (+0.00%)	40.00 (+0.00%)	40.00 (+0.00%)	0.00 (n/a)	4.71 (-0.42%)	4.64 (-0.09%)	4.64 (+0.65%)	4.55 (-0.22%)	0.05 (-17.66%)	4.41 (-0.34%)	4.39 (-0.19%)	4.38 (-0.33%)	4.37 (-0.15%)	0.01 (-25.90%)	12.96 (-0.00%)	12.80 (+0.07%)	12.80 (-0.23%)	12.67 (+0.44%)	0.09 (-21.12%)
`0a6c11c` — 2025-12-03 23:35:15	40.00 (n/a)	40.00 (n/a)	40.00 (n/a)	40.00 (n/a)	0.00 (n/a)	4.73 (n/a)	4.64 (n/a)	4.61 (n/a)	4.56 (n/a)	0.06 (n/a)	4.42 (n/a)	4.40 (n/a)	4.40 (n/a)	4.37 (n/a)	0.02 (n/a)	12.96 (n/a)	12.79 (n/a)	12.83 (n/a)	12.62 (n/a)	0.12 (n/a)

llama_3.2_1b_prompt_1024_tokens_1

Commit/Date	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`912e6bc` — 2026-04-07 19:08:43	2.15 (-0.46%)	2.13 (-0.47%)	2.13 (-0.19%)	2.11 (-0.85%)	0.02 (+39.39%)
`2371174` — 2026-04-06 17:38:48	2.16 (n/a)	2.14 (n/a)	2.14 (n/a)	2.12 (n/a)	0.01 (n/a)

llama_3.2_1b_prompt_1024_tokens_40

Commit/Date	TPS (max)	TPS (mean)	TPS (median)	TPS (min)	TPS (stddev)	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`912e6bc` — 2026-04-07 19:08:43	4.21 (+0.86%)	4.17 (+0.13%)	4.16 (-0.05%)	4.14 (-0.34%)	0.03 (+220.64%)	2.28 (+0.71%)	2.16 (-0.29%)	2.13 (-0.84%)	2.12 (+0.19%)	0.07 (+18.56%)
`2371174` — 2026-04-06 17:38:48	4.17 (n/a)	4.16 (n/a)	4.16 (n/a)	4.15 (n/a)	0.01 (n/a)	2.27 (n/a)	2.16 (n/a)	2.15 (n/a)	2.11 (n/a)	0.06 (n/a)

llama_3.2_1b_prompt_13_tokens_1

Commit/Date	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`912e6bc` — 2026-04-07 19:08:43	2.10 (+0.29%)	2.09 (+0.31%)	2.09 (+0.43%)	2.09 (+0.58%)	0.01 (-40.31%)
`2371174` — 2026-04-06 17:38:48	2.10 (n/a)	2.08 (n/a)	2.08 (n/a)	2.07 (n/a)	0.01 (n/a)

llama_3.2_1b_prompt_13_tokens_40

Commit/Date	TPS (max)	TPS (mean)	TPS (median)	TPS (min)	TPS (stddev)	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`912e6bc` — 2026-04-07 19:08:43	4.18 (+0.12%)	4.16 (+0.13%)	4.16 (+0.14%)	4.15 (+0.00%)	0.01 (+12.35%)	2.10 (-0.76%)	2.09 (-0.22%)	2.09 (-0.10%)	2.07 (-0.48%)	0.01 (-20.82%)
`2371174` — 2026-04-06 17:38:48	4.18 (n/a)	4.16 (n/a)	4.16 (n/a)	4.15 (n/a)	0.01 (n/a)	2.12 (n/a)	2.09 (n/a)	2.09 (n/a)	2.08 (n/a)	0.02 (n/a)

llama_3.2_1b_prompt_2048_tokens_1

Commit/Date	Num_Tokens (max)	Num_Tokens (mean)	Num_Tokens (median)	Num_Tokens (min)	Num_Tokens (stddev)	TPS (max)	TPS (mean)	TPS (median)	TPS (min)	TPS (stddev)	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`897d04e` — 2026-03-06 22:56:07	1.00 (+0.00%)	1.00 (+0.00%)	1.00 (+0.00%)	1.00 (+0.00%)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	2.68 (-1.06%)	2.68 (-1.06%)	2.68 (-1.06%)	2.68 (-1.06%)	0.00 (n/a)
`84d3478` — 2026-02-17 23:16:23	1.00 (n/a)	1.00 (n/a)	1.00 (n/a)	1.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	0.00 (n/a)	2.70 (n/a)	2.70 (n/a)	2.70 (n/a)	2.70 (n/a)	0.00 (n/a)

llama_3.2_1b_prompt_2048_tokens_40

Commit/Date	Num_Tokens (max)	Num_Tokens (mean)	Num_Tokens (median)	Num_Tokens (min)	Num_Tokens (stddev)	TPS (max)	TPS (mean)	TPS (median)	TPS (min)	TPS (stddev)	TTFT (max)	TTFT (mean)	TTFT (median)	TTFT (min)	TTFT (stddev)
`897d04e` — 2026-03-06 22:56:07	40.00 (+0.00%)	40.00 (+0.00%)	40.00 (+0.00%)	40.00 (+0.00%)	0.00 (n/a)	4.00 (-1.72%)	4.00 (-1.72%)	4.00 (-1.72%)	4.00 (-1.72%)	0.00 (n/a)	2.70 (-0.44%)	2.70 (-0.44%)	2.70 (-0.44%)	2.70 (-0.44%)	0.00 (n/a)
`84d3478` — 2026-02-17 23:16:23	40.00 (n/a)	40.00 (n/a)	40.00 (n/a)	40.00 (n/a)	0.00 (n/a)	4.07 (n/a)	4.07 (n/a)	4.07 (n/a)	4.07 (n/a)	0.00 (n/a)	2.71 (n/a)	2.71 (n/a)	2.71 (n/a)	2.71 (n/a)	0.00 (n/a)

Added external benchmark reference points (PR amd#80, PR amd#101, FastFlowLM) to recontextualize the Phase 8.2 regression.

Key data:
- PR amd#101 reports 561 us standalone INT4 GEMV (K=2048 N=8192). My Phase 8.2 dispatch budgets ~14 ms/FFN-layer = 7x slower than 3 * 561 + 665 = 1.9 ms expected. So per-kernel compute is not the bottleneck -- the chained-xclbin orchestration is.
- FastFlowLM ships 66 t/s on Llama 3.2 1B Q4_1 on Strix Point using the same IRON+AIE-MLIR stack -- ~20x faster than our 3.4 t/s. Implies the real win is transformer-block-level fusion, not microkernel tuning.

Negative result (kernel-side bias rewrite): Tried moving aie::sub(8) out of per-block inner loop into per-group scalar accumulator. Hypothesis: data dependency (as_bf16 -> sub -> mul -> mac) breaks AIE pipelining. Result: WORSE -- 2.80 -> 1.10 t/s (2.5x regression). Reverted.

Conclusion: the per-block sub is not the bottleneck. AIE compiler likely spills the static bfloat16 S_g[256] buffer or fails to pipeline the scalar float row_bias accumulator. Real kernel optimization needs AIE compiler/IR analysis, not high-level algebra tricks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n revised) XRT.pdf projected ~5x decode improvement from async dispatch (3.4 -> ~16 t/s INT4). Day 1 spike on real Windows STX NPU2 hardware disproves this:

Shape	sync us/op	async us/op	win
K=2048 N=8192	2434	2210	1.10x	(ffn_gate/up)
K=8192 N=2048	2405	2182	1.10x	(ffn_down)
K=2048 N=2048	640	566	1.13x	(attn_q/o)
K=2048 N=512	227	157	1.44x	(attn_k/v)

xrt::run::start() IS confirmed non-blocking (~6-9 us submit). But the NPU itself serializes within one hw_context -- each kernel runs to completion before the next starts. Async wins only the ~80 us host overhead per op, which is 3-4% on 2 ms kernels.

End-to-end projection for Llama 3.2 1B Q4_0:
NPU-only matmul budget = 128 ms (async) vs 143 ms (sync) per token
= ~7.8 t/s ceiling vs 3.4 t/s today
= realistic 5-5.5 t/s after non-NPU overhead

That's 1.6x, not 5x. Phase 9 still worth shipping to reach parity with NPU bf16 (5.2 t/s), but no longer "the breakthrough".

The real bottleneck identified: my INT4 GEMV kernels run ~4x slower than the upstream PR amd#101 reference (2.2 ms vs ~561 us on same shape). Tracking that gap is the next investigation -- could be wrong tile sizes, missing compile flags, or design mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Direct port of amd/IRON PR amd#101 (albiol2004/IRON@gemv-int4) to the IRON-windows aie2p tree. Three optimizations over the existing v1 kernel:

1. Compile-time DIM_K and GROUP_SIZE via -D flags -- compiler unrolls the inner loop, eliminating runtime arithmetic and enabling constexpr branch selection.
2. AIE_PREPARE_FOR_PIPELINING + AIE_LOOP_MIN_ITERATION_COUNT annotations -- AIE compiler schedules the software pipeline.
3. Double-pump path (2 groups/iter, independent A/B unpack chains) so the compiler can interleave them, hiding the dequant chain latency behind activation loads + MAC ops.

Files:
- aie_kernels/aie2p/fused_dequant_gemv_v2.cc (~150 LOC, kernel)
- iron/operators/fused_dequant_gemv_v2/__init__.py
- iron/operators/fused_dequant_gemv_v2/op.py (~170 LOC, IRON op)
- iron/operators/fused_dequant_gemv_v2/design.py (~140 LOC, MLIR)

V1 left untouched -- v2 is opt-in via a separate compile entry point (`compile_fused_dequant_gemv_v2`, ll.cpp side).

Measured speedup on K=2048 N=8192 8col g32 via xrt_async_spike.exe:
- v1 sync 2307 us/op, v2 sync 570 us/op (4.05x)
- v1 async 2204 us/op, v2 async 494 us/op (4.47x)

This is THE single biggest perf lever in Priority 8 -- ~4x decode improvement without touching dispatch architecture. Phase 9 async gives an additional 1.1x on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rsed

The "4x kernel gap" diagnosis was correct: the actual 4-4.5x was in the kernel, not the dispatch. Direct port of amd/IRON PR amd#101 (with compile-time DIM_K/GROUP_SIZE + double-pump + AIE pipelining hints) landed today as v2.

End-to-end measurements on Llama 3.2 1B Q4_0:

Config	decode t/s
CPU Q4_0	11.10
NPU bf16	5.30
NPU INT4 (8.1 only)	3.40	<- regression
NPU INT4 v2	5.60	<- +65% vs 8.1, +6% vs bf16
NPU INT4 (8.1+8.2)	2.80

INT4 is now FASTER than bf16 on NPU for the first time. All 3 INT4 correctness tests pass byte-exact vs cpu_baseline through v2.

Per-kernel spike measured 4.08-5.09x sync speedup on the dominant shapes; end-to-end is "only" 1.65x because Amdahl's law caps the win at the non-NPU fraction of per-token cost (RMSNorm, sampling, FlowKV).

Added new roadmap row "8.1.v2 Optimized INT4 GEMV kernel" marked DONE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
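The Amdahl's law remark can be made concrete with a back-of-envelope calculation. The sketch below inverts the standard formula to estimate what fraction of the original per-token time must have been in the accelerated kernels, given the 3.40 -> 5.60 t/s end-to-end change and the 4.08-5.09x per-kernel range quoted above. It assumes a single uniform kernel speedup applies to that whole fraction, which is a simplification; the function names are made up for the example.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the runtime gets `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def accelerated_fraction(overall_speedup, kernel_speedup):
    """Invert Amdahl's law: which fraction of the original runtime was accelerated?"""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


overall = 5.60 / 3.40  # end-to-end decode speedup from the table above
for kernel_speedup in (4.08, 5.09):
    p = accelerated_fraction(overall, kernel_speedup)
    ceiling = 1.0 / (1.0 - p)  # limit as the kernels become infinitely fast
    print(f"kernel {kernel_speedup:.2f}x -> accelerated fraction {p:.2f}, "
          f"ceiling {ceiling:.2f}x")
```

Under that simplification, roughly half of the pre-v2 per-token time was spent in the kernels that got faster, which is why a 4-5x kernel win shows up as about 1.65x end to end.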

Apply the v2 optimizations (compile-time DIM_K/GROUP_SIZE, AIE pipelining hints, double-pump 2-group interleaved unpack) to the dual fused dequant GEMV + SiLU + mul kernel that powers the Phase 8.2 SwiGLU INT4 chain.

Keep the inline aie::sub(8) inside each pump chain since SiLU is non-linear -- the host-side bias compensation trick used by the standalone v2 kernel does not apply here.

Also swap the down stage inside swiglu_decode_int4 from AIEFusedDequantGEMV (v1) to AIEFusedDequantGEMVv2 so the entire SwiGLU INT4 chain runs on v2 kernels end-to-end.

Result on Llama 3.2 1B Q4_0 (correctness_test.py --bench, median of 2):

Config	decode t/s
NPU INT4 +SwiGLU (before)	2.80
NPU INT4 +SwiGLU (this commit)	5.20 (1.86x, parity with NPU bf16)

paris_short_q4_0_int4_swiglu passes byte-exact vs cpu_baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
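The commit does not spell out how the bias compensation works, so the following is only a guess at the underlying algebra: for a plain GEMV, subtracting a constant offset from every weight is equivalent to computing the raw product and subtracting a correction term afterwards, because the operation is linear. Once a SiLU is applied inside the same kernel, the correction can no longer be applied after the fact. All names here are hypothetical, and the offset of 8 is taken from the aie::sub(8) mentioned in the message.

```python
import math


def row_inline_sub(q, scales, x, group_size=32, offset=8):
    """Subtract the offset from every weight inside the inner loop."""
    acc = 0.0
    for g, s in enumerate(scales):
        lo = g * group_size
        acc += s * sum((q[k] - offset) * x[k] for k in range(lo, lo + group_size))
    return acc


def row_compensated(q, scales, x, group_size=32, offset=8):
    """Use raw unsigned weights, then subtract offset * sum_g(scale_g * sum(x in g))."""
    raw = 0.0
    bias = 0.0
    for g, s in enumerate(scales):
        lo = g * group_size
        raw += s * sum(q[k] * x[k] for k in range(lo, lo + group_size))
        bias += s * sum(x[k] for k in range(lo, lo + group_size))
    return raw - offset * bias  # the correction could be applied outside the kernel


def silu(v):
    return v / (1.0 + math.exp(-v))


# Values chosen so every float operation is exact.
q = [k % 16 for k in range(64)]
x = [float(k % 4) for k in range(64)]
scales = [0.5, 2.0]
print(row_inline_sub(q, scales, x), row_compensated(q, scales, x))

# With a non-linearity in between, correcting afterwards gives a different answer.
raw, correction = 2.0, 1.0
print(silu(raw - correction), silu(raw) - correction)
```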

… 10x lever

Re-read ggml-org/llama.cpp#21725 in full. ic's exact wording is "64K alignment to match SMMU page AND OTHER PARAMETERS got 41 t/s". The "and other parameters" qualifier matters: 41 t/s is a tuned bundle, not a single-lever win. Our 0% gain from naive alignment flips is consistent.

Also: ic tested on Strix Halo (XDNA2 large NPU); we run Strix Point (smaller compute tile + bandwidth budget). Numbers don't translate 1:1 -- our hardware ceiling is lower regardless.

Key context from issue author albiol2004:
- FLM (61 t/s) wins via "optimized fused kernels for specific model architectures" -- not magic, just shape-specific fused operators. albiol2004 plans both "foundational blocks of individual kernels and fused kernels for common models".
- decode is memory-bound; NPU compute is largely unused at M=1 -- speculative decoding is an orthogonal lever to fix that.

Implication for our roadmap:
- Biggest AVAILABLE lever for us: Phase 8.3 mode B (fused INT4 QKV xclbin, 3 parallel workers on disjoint AIE columns). Profile already shows +12% on 3B from parallel Q+K+V.
- Speculative decoding: orthogonal, 2-3x potential.
- Priority 9 SMMU 64K: deprioritised. It's a knob in a tuned bundle; the "other parameters" we already cover via PR amd#101 v2 kernel optimisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

albiol2004 added 4 commits April 9, 2026 19:04

first working version of INT 4 GEMV

4a3314d

1.6x speedup, GROUP_SIZE at compile time

9cd050a

double-pump kernel + K at compile time

6c48321

albiol2004 requested review from andrej, hunhoffe and jgmelber as code owners April 9, 2026 17:50

hunhoffe reviewed Apr 10, 2026

ran clang formatter

32571c6

albiol2004 wants to merge 5 commits into amd:devel from albiol2004:gemv-int4
