Add GEMV INT 4 #101

Open
albiol2004 wants to merge 5 commits into amd:devel from albiol2004:gemv-int4

@albiol2004
Contributor

Added

  • aie_kernels/generic/fused_dequant_gemv.cc : C++ AIE kernel that fuses INT4 weight dequantization with matrix-vector multiplication in a single pass. Uses the proven aie::unpack chain from expand.cc (uint4 → uint8 → uint16 → bf16 → scale → MAC). Includes a double-pump optimization that processes 2 groups per iteration for instruction interleaving, and compile-time DIM_K/GROUP_SIZE for loop specialization.
  • iron/operators/gemv_int4/ : Python operator (op.py, design.py, reference.py, test.py) following the existing gemv/ and dequant/ patterns. Packed weight buffer layout: [uint4 weights | bf16 per-group scales] per tile.
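
The dequantize-then-MAC chain described above can be sketched as a plain-Python reference. This is only an illustration under stated assumptions, not the kernel or `reference.py` from this PR: the nibble order (low nibble first), the unsigned `q * scale` dequantization with no zero-point, the truncating bf16 conversion, and all function names are hypothetical. Weights and scales are also kept in separate lists rather than in the packed per-tile buffer.

```python
import struct

GROUP_SIZE = 32  # assumed: weights sharing one scale ("2 groups (64 elements)")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def pack_uint4(values):
    """Pack pairs of 4-bit values into bytes, low nibble first (assumed order)."""
    assert len(values) % 2 == 0 and all(0 <= v < 16 for v in values)
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def fused_dequant_gemv(packed_rows, scale_rows, x):
    """y[m] = sum_k bf16(q[m][k] * scale[m][k // GROUP_SIZE]) * x[k]."""
    y = []
    for packed, scales in zip(packed_rows, scale_rows):
        acc = 0.0
        for k in range(len(x)):
            byte = packed[k // 2]
            q = (byte >> 4) if (k & 1) else (byte & 0x0F)  # unpack one uint4
            w = to_bf16(q * scales[k // GROUP_SIZE])        # apply group scale
            acc += w * x[k]                                 # multiply-accumulate
        y.append(acc)
    return y


# Two rows of 64 weights each, i.e. two scale groups per row.
rows_q = [[1] * 64, [k % 16 for k in range(64)]]
packed = [pack_uint4(r) for r in rows_q]
scales = [[0.5, 0.25], [1.0, 2.0]]
x = [1.0] * 64
y = fused_dequant_gemv(packed, scales, x)
```

The weights are never materialized as a full bf16 matrix; each nibble is expanded, scaled, and consumed inside the same loop, which is the property the fused kernel relies on.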

Changed

Nothing. This is a new operator with no modifications to existing code.

Removed

Nothing.

Motivation

Decode inference is bandwidth-bound: GEMV loads each weight once per token. INT4 weights are half the size of INT8 and a quarter of bf16, making INT4 GEMV the highest-impact single operator for decode throughput.
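
As a rough check of the size argument, the sketch below counts the bytes streamed per GEMV for one 8192×2048 weight matrix. It assumes one bf16 scale per group of 32 weights (the group size is inferred from the "2 groups (64 elements)" note in the commit message), and the helper name is hypothetical.

```python
def weight_bytes(rows, cols, bits_per_weight, group_size=0, scale_bytes=2):
    """Bytes of weight data for a rows x cols matrix.

    If group_size > 0, add one scale of scale_bytes per group of weights.
    """
    n = rows * cols
    total = n * bits_per_weight // 8
    if group_size:
        total += (n // group_size) * scale_bytes
    return total


M, K = 8192, 2048
bf16 = weight_bytes(M, K, 16)
int8 = weight_bytes(M, K, 8)
int4_raw = weight_bytes(M, K, 4)
int4_scaled = weight_bytes(M, K, 4, group_size=32)

ratio_raw = bf16 / int4_raw        # nibbles only
ratio_scaled = bf16 / int4_scaled  # nibbles plus per-group bf16 scales
```

Under these assumptions the raw nibbles are exactly a quarter of the bf16 footprint, while counting the per-group scales brings the effective cost to 4.5 bits per weight, so the reduction versus bf16 is closer to 3.6x.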

This is a foundation operator: a standalone, individually tested building block for INT4 weight inference. It is not intended as an optimized fused decode kernel; the goal is to provide a correct, well-tested INT4 GEMV that can be composed into fused pipelines.

Credit to @jgmelber for the original fused dequant-GEMV concept in PR #79 and the INT4 GEMV benchmarks in PR #71 (21.4 GB/s on 8192×2048). That work demonstrated the viability of fused INT4 inference on AIE and informed the design here. If you have insights on further kernel optimizations, I'd love to hear them; my focus was on getting the foundation block right rather than on a fully optimized fused kernel.

Test results

  • 7/7 parametrized tests pass (2048×2048, 8192×2048, 2048×8192 at 4/8 columns, tsi=1/4)
  • Integration tested with real Llama 3.2 1B weights (cosine similarity >0.999 vs CPU reference)
  • Performance (8 columns): 2048×8192 at 561 us / 16.9 GB/s, 8192×2048 at 665 us / 14.2 GB/s
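
The reported GB/s figures can be approximately reproduced from the latencies. The PR does not state which bytes were counted, so this sketch assumes only packed weights plus one bf16 scale per 32 weights, and ignores the activation and output vectors; the function names are hypothetical.

```python
def packed_int4_bytes(rows, cols, group_size=32):
    """Assumed traffic: two weights per byte plus one 2-byte scale per group."""
    weights = rows * cols // 2
    scales = (rows * cols // group_size) * 2
    return weights + scales


def effective_gbps(num_bytes, micros):
    """Bytes moved divided by wall time, in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (micros * 1e-6) / 1e9


nbytes = packed_int4_bytes(2048, 8192)  # same count for 8192x2048
up_proj = effective_gbps(nbytes, 561)   # 2048x8192 at 561 us
down_proj = effective_gbps(nbytes, 665) # 8192x2048 at 665 us
```

With these assumptions both values land within about 1% of the 16.9 GB/s and 14.2 GB/s quoted above, which suggests the figures are effective bandwidth over the packed buffer rather than over a bf16-equivalent size.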

Checklist

  • Tests pass locally
  • Code formatted (black + clang-format)
  • License headers on all files
  • No changes to existing code

  Fused INT4 weight dequantization + matrix-vector multiplication in a
  single kernel pass. Loads packed uint4 weights from DDR, dequantizes
  in-register using the aie::unpack chain, and MACs with the bf16
  activation vector, for a roughly 4x DDR bandwidth reduction vs bf16
  GEMV.

  Kernel optimizations:
  - Compile-time GROUP_SIZE and DIM_K for loop count optimization
  - Double-pump: processes 2 groups (64 elements) per iteration, giving
    the compiler two independent unpack chains to interleave
  - AIE_PREPARE_FOR_PIPELINING and AIE_LOOP_MIN_ITERATION_COUNT hints
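
  The double-pump restructuring can be illustrated in plain Python. This only shows the loop shape (two groups per iteration, two independent accumulators); it is not the AIE kernel, the function names are hypothetical, and the weights are assumed to be already unpacked to numbers.

```python
GROUP_SIZE = 32


def group_dot(weights, x, start):
    """Dot product over one group of GROUP_SIZE elements."""
    stop = start + GROUP_SIZE
    return sum(w * v for w, v in zip(weights[start:stop], x[start:stop]))


def dot_single(weights, scales, x):
    """Baseline: one group per iteration, one accumulator."""
    acc = 0.0
    for g in range(len(scales)):
        acc += scales[g] * group_dot(weights, x, g * GROUP_SIZE)
    return acc


def dot_double_pump(weights, scales, x):
    """Two groups (64 elements) per iteration with independent chains A and B."""
    assert len(scales) % 2 == 0
    acc_a = acc_b = 0.0
    for g in range(0, len(scales), 2):
        acc_a += scales[g] * group_dot(weights, x, g * GROUP_SIZE)
        acc_b += scales[g + 1] * group_dot(weights, x, (g + 1) * GROUP_SIZE)
    return acc_a + acc_b


weights = [float(k % 16) for k in range(128)]
scales = [1.0, 2.0, 0.5, 4.0]
x = [1.0] * 128
single = dot_single(weights, scales, x)
double = dot_double_pump(weights, scales, x)
```

  Because chain A never reads anything chain B writes, a scheduler is free to overlap them. Splitting the accumulator changes the order of floating-point additions, so results can differ in the last bits for general inputs; the example uses values that are exact in floating point.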

  Tested on AMD Ryzen AI 9 HX 370 (NPU2, 8 columns):
  - 2048x8192 (Llama up_proj): 561 us, 16.9 GB/s effective bandwidth
  - 8192x2048 (Llama down_proj): 665 us, 14.2 GB/s effective bandwidth
  - Integration tested with real Llama 3.2 1B weights (cosine sim >0.999)
@@ -0,0 +1,161 @@
# SPDX-FileCopyrightText: Copyright (C) 2025 Advanced Micro Devices, Inc. All rights reserved.
Collaborator

How much of this is similar to other GEMM ops? Can we find a way to reuse code?

Contributor Author

About 70% of it is similar to gemv, but the layout and kernel are different. I could import from gemv to reuse _build_gemv_program(), or move it into a common helper (only 2 ops would use it, I think), which I would leave for a refactor in another PR.

The fused kernel is the important part, so it depends on what you prefer for readability and maintenance.

@github-actions
Contributor

📊 Test Results for Test Example Applications

2703e79 (2026_04_10_17_11_49)

IRONCLAD

Tested on 2026_04_10_17_11_49 at commit 2703e79.

| Test | Checks | TTFT (mean) | TPS (mean) |
|---|---|---|---|
📈 Trends (vs main branch) for Test Example Applications

2703e79 (2026_04_10_17_11_49)

IRONCLAD Trends

llama_3.2_1b

| Commit/Date | Num Tokens (max) | Num Tokens (mean) | Num Tokens (median) | Num Tokens (min) | Num Tokens (stddev) | TPS (max) | TPS (mean) | TPS (median) | TPS (min) | TPS (stddev) | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) | Total (max) | Total (mean) | Total (median) | Total (min) | Total (stddev) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 130b6ea / 2025-12-05 21:33:12 | 40.00 (+0.00%) | 40.00 (+0.00%) | 40.00 (+0.00%) | 40.00 (+0.00%) | 0.00 (n/a) | 4.71 (-0.42%) | 4.64 (-0.09%) | 4.64 (+0.65%) | 4.55 (-0.22%) | 0.05 (-17.66%) | 4.41 (-0.34%) | 4.39 (-0.19%) | 4.38 (-0.33%) | 4.37 (-0.15%) | 0.01 (-25.90%) | 12.96 (-0.00%) | 12.80 (+0.07%) | 12.80 (-0.23%) | 12.67 (+0.44%) | 0.09 (-21.12%) |
| 0a6c11c / 2025-12-03 23:35:15 | 40.00 (n/a) | 40.00 (n/a) | 40.00 (n/a) | 40.00 (n/a) | 0.00 (n/a) | 4.73 (n/a) | 4.64 (n/a) | 4.61 (n/a) | 4.56 (n/a) | 0.06 (n/a) | 4.42 (n/a) | 4.40 (n/a) | 4.40 (n/a) | 4.37 (n/a) | 0.02 (n/a) | 12.96 (n/a) | 12.79 (n/a) | 12.83 (n/a) | 12.62 (n/a) | 0.12 (n/a) |

llama_3.2_1b_prompt_1024_tokens_1

| Commit/Date | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|
| 912e6bc / 2026-04-07 19:08:43 | 2.15 (-0.46%) | 2.13 (-0.47%) | 2.13 (-0.19%) | 2.11 (-0.85%) | 0.02 (+39.39%) |
| 2371174 / 2026-04-06 17:38:48 | 2.16 (n/a) | 2.14 (n/a) | 2.14 (n/a) | 2.12 (n/a) | 0.01 (n/a) |

llama_3.2_1b_prompt_1024_tokens_40

| Commit/Date | TPS (max) | TPS (mean) | TPS (median) | TPS (min) | TPS (stddev) | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|---|---|---|---|---|
| 912e6bc / 2026-04-07 19:08:43 | 4.21 (+0.86%) | 4.17 (+0.13%) | 4.16 (-0.05%) | 4.14 (-0.34%) | 0.03 (+220.64%) | 2.28 (+0.71%) | 2.16 (-0.29%) | 2.13 (-0.84%) | 2.12 (+0.19%) | 0.07 (+18.56%) |
| 2371174 / 2026-04-06 17:38:48 | 4.17 (n/a) | 4.16 (n/a) | 4.16 (n/a) | 4.15 (n/a) | 0.01 (n/a) | 2.27 (n/a) | 2.16 (n/a) | 2.15 (n/a) | 2.11 (n/a) | 0.06 (n/a) |

llama_3.2_1b_prompt_13_tokens_1

| Commit/Date | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|
| 912e6bc / 2026-04-07 19:08:43 | 2.10 (+0.29%) | 2.09 (+0.31%) | 2.09 (+0.43%) | 2.09 (+0.58%) | 0.01 (-40.31%) |
| 2371174 / 2026-04-06 17:38:48 | 2.10 (n/a) | 2.08 (n/a) | 2.08 (n/a) | 2.07 (n/a) | 0.01 (n/a) |

llama_3.2_1b_prompt_13_tokens_40

| Commit/Date | TPS (max) | TPS (mean) | TPS (median) | TPS (min) | TPS (stddev) | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|---|---|---|---|---|
| 912e6bc / 2026-04-07 19:08:43 | 4.18 (+0.12%) | 4.16 (+0.13%) | 4.16 (+0.14%) | 4.15 (+0.00%) | 0.01 (+12.35%) | 2.10 (-0.76%) | 2.09 (-0.22%) | 2.09 (-0.10%) | 2.07 (-0.48%) | 0.01 (-20.82%) |
| 2371174 / 2026-04-06 17:38:48 | 4.18 (n/a) | 4.16 (n/a) | 4.16 (n/a) | 4.15 (n/a) | 0.01 (n/a) | 2.12 (n/a) | 2.09 (n/a) | 2.09 (n/a) | 2.08 (n/a) | 0.02 (n/a) |

llama_3.2_1b_prompt_2048_tokens_1

| Commit/Date | Num_Tokens (max) | Num_Tokens (mean) | Num_Tokens (median) | Num_Tokens (min) | Num_Tokens (stddev) | TPS (max) | TPS (mean) | TPS (median) | TPS (min) | TPS (stddev) | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 897d04e / 2026-03-06 22:56:07 | 1.00 (+0.00%) | 1.00 (+0.00%) | 1.00 (+0.00%) | 1.00 (+0.00%) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 2.68 (-1.06%) | 2.68 (-1.06%) | 2.68 (-1.06%) | 2.68 (-1.06%) | 0.00 (n/a) |
| 84d3478 / 2026-02-17 23:16:23 | 1.00 (n/a) | 1.00 (n/a) | 1.00 (n/a) | 1.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 0.00 (n/a) | 2.70 (n/a) | 2.70 (n/a) | 2.70 (n/a) | 2.70 (n/a) | 0.00 (n/a) |

llama_3.2_1b_prompt_2048_tokens_40

| Commit/Date | Num_Tokens (max) | Num_Tokens (mean) | Num_Tokens (median) | Num_Tokens (min) | Num_Tokens (stddev) | TPS (max) | TPS (mean) | TPS (median) | TPS (min) | TPS (stddev) | TTFT (max) | TTFT (mean) | TTFT (median) | TTFT (min) | TTFT (stddev) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 897d04e / 2026-03-06 22:56:07 | 40.00 (+0.00%) | 40.00 (+0.00%) | 40.00 (+0.00%) | 40.00 (+0.00%) | 0.00 (n/a) | 4.00 (-1.72%) | 4.00 (-1.72%) | 4.00 (-1.72%) | 4.00 (-1.72%) | 0.00 (n/a) | 2.70 (-0.44%) | 2.70 (-0.44%) | 2.70 (-0.44%) | 2.70 (-0.44%) | 0.00 (n/a) |
| 84d3478 / 2026-02-17 23:16:23 | 40.00 (n/a) | 40.00 (n/a) | 40.00 (n/a) | 40.00 (n/a) | 0.00 (n/a) | 4.07 (n/a) | 4.07 (n/a) | 4.07 (n/a) | 4.07 (n/a) | 0.00 (n/a) | 2.71 (n/a) | 2.71 (n/a) | 2.71 (n/a) | 2.71 (n/a) | 0.00 (n/a) |

Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 21, 2026
Added external benchmark reference points (PR amd#80, PR amd#101,
FastFlowLM) to recontextualize the Phase 8.2 regression.

Key data:
  - PR amd#101 reports 561 us standalone INT4 GEMV (K=2048 N=8192).
    My Phase 8.2 dispatch budgets ~14 ms/FFN-layer = ~6x slower than
    3 * 561 + 665 = 2348 us (~2.3 ms) expected. So per-kernel compute
    is not the bottleneck -- the chained-xclbin orchestration is.
  - FastFlowLM ships 66 t/s on Llama 3.2 1B Q4_1 on Strix Point
    using the same IRON+AIE-MLIR stack -- ~20x faster than our
    3.4 t/s. Implies the real win is transformer-block-level fusion,
    not microkernel tuning.

Negative result (kernel-side bias rewrite):
  Tried moving aie::sub(8) out of per-block inner loop into
  per-group scalar accumulator. Hypothesis: data dependency
  (as_bf16 -> sub -> mul -> mac) breaks AIE pipelining. Result:
  WORSE -- 2.80 -> 1.10 t/s (2.5x regression). Reverted.
  Conclusion: the per-block sub is not the bottleneck. AIE compiler
  likely spills the static bfloat16 S_g[256] buffer or fails to
  pipeline the scalar float row_bias accumulator. Real kernel
  optimization needs AIE compiler/IR analysis, not high-level
  algebra tricks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 22, 2026
…n revised)

XRT.pdf projected ~5x decode improvement from async dispatch (3.4 ->
~16 t/s INT4). Day 1 spike on real Windows STX NPU2 hardware
disproves this:

  Shape            sync us/op  async us/op  win
  K=2048 N=8192    2434        2210         1.10x  (ffn_gate/up)
  K=8192 N=2048    2405        2182         1.10x  (ffn_down)
  K=2048 N=2048     640         566         1.13x  (attn_q/o)
  K=2048 N= 512     227         157         1.44x  (attn_k/v)

xrt::run::start() IS confirmed non-blocking (~6-9 us submit). But the
NPU itself serializes within one hw_context -- each kernel runs to
completion before the next starts. Async wins only the ~80 us host
overhead per op, which is 3-4% on 2 ms kernels.

End-to-end projection for Llama 3.2 1B Q4_0:
  NPU-only matmul budget = 128 ms (async) vs 143 ms (sync) per token
  = ~7.8 t/s ceiling vs 3.4 t/s today
  = realistic 5-5.5 t/s after non-NPU overhead

That's 1.6x, not 5x. Phase 9 still worth shipping to reach parity
with NPU bf16 (5.2 t/s), but no longer "the breakthrough".

The real bottleneck identified: my INT4 GEMV kernels run ~4x slower
than the upstream PR amd#101 reference (2.2 ms vs ~561 us on same shape).
Tracking that gap is the next investigation -- could be wrong tile
sizes, missing compile flags, or design mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 22, 2026
Direct port of amd/IRON PR amd#101 (albiol2004/IRON@gemv-int4) to the
IRON-windows aie2p tree. Three optimizations over the existing v1
kernel:

  1. Compile-time DIM_K and GROUP_SIZE via -D flags -- compiler unrolls
     the inner loop, eliminating runtime arithmetic and enabling
     constexpr branch selection.
  2. AIE_PREPARE_FOR_PIPELINING + AIE_LOOP_MIN_ITERATION_COUNT
     annotations -- AIE compiler schedules the software pipeline.
  3. Double-pump path (2 groups/iter, independent A/B unpack chains)
     so the compiler can interleave them, hiding the dequant chain
     latency behind activation loads + MAC ops.

Files:
  aie_kernels/aie2p/fused_dequant_gemv_v2.cc       (~150 LOC, kernel)
  iron/operators/fused_dequant_gemv_v2/__init__.py
  iron/operators/fused_dequant_gemv_v2/op.py       (~170 LOC, IRON op)
  iron/operators/fused_dequant_gemv_v2/design.py   (~140 LOC, MLIR)

V1 left untouched -- v2 is opt-in via a separate compile entry point
(`compile_fused_dequant_gemv_v2`, ll.cpp side).

Measured speedup on K=2048 N=8192 8col g32 via xrt_async_spike.exe:
  v1 sync   2307 us/op    v2 sync    570 us/op  (4.05x)
  v1 async  2204 us/op    v2 async   494 us/op  (4.47x)

This is THE single biggest perf lever in Priority 8 -- ~4x decode
improvement without touching dispatch architecture. Phase 9 async
gives an additional 1.1x on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 22, 2026
…rsed

The "4x kernel gap" diagnosis was correct: the actual 4-4.5x was in
the kernel, not the dispatch. Direct port of amd/IRON PR amd#101 (with
compile-time DIM_K/GROUP_SIZE + double-pump + AIE pipelining hints)
landed today as v2.

End-to-end measurements on Llama 3.2 1B Q4_0:
  CPU Q4_0             11.10 t/s
  NPU bf16              5.30 t/s
  NPU INT4 (8.1 only)   3.40 t/s   <- regression
  NPU INT4 v2           5.60 t/s   <- +65% vs 8.1, +6% vs bf16
  NPU INT4 (8.1+8.2)    2.80 t/s

INT4 is now FASTER than bf16 on NPU for the first time. All 3 INT4
correctness tests pass byte-exact vs cpu_baseline through v2.

Per-kernel spike measured 4.08-5.09x sync speedup on the dominant
shapes; end-to-end is "only" 1.65x because Amdahl's law caps the win
at the non-NPU fraction of per-token cost (RMSNorm, sampling, FlowKV).
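
The Amdahl's law argument can be made concrete with the figures quoted above (4.08-5.09x per-kernel speedup, 1.65x end-to-end). The sketch below inverts the law to estimate what fraction of per-token time the INT4 GEMV kernels occupied before the change. It assumes the v2 kernels were the only thing that changed between the two measurements, and the function names are hypothetical.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime gets `local_speedup` faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def implied_fraction(overall, local_speedup):
    """Invert Amdahl's law: accelerated fraction implied by an observed overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


# Per-kernel speedups and end-to-end speedup as reported in this commit message.
fractions = {s: implied_fraction(1.65, s) for s in (4.08, 5.09)}

# Upper bound on end-to-end speedup if the kernels became infinitely fast.
ceilings = {s: 1.0 / (1.0 - p) for s, p in fractions.items()}
```

Under that assumption roughly half of the per-token time was spent in the accelerated kernels, which puts the ceiling from further GEMV tuning alone at about 2x end-to-end.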

Added new roadmap row "8.1.v2 Optimized INT4 GEMV kernel" marked DONE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 22, 2026
Apply the v2 optimizations (compile-time DIM_K/GROUP_SIZE, AIE pipelining
hints, double-pump 2-group interleaved unpack) to the dual fused dequant
GEMV + SiLU + mul kernel that powers the Phase 8.2 SwiGLU INT4 chain.
Keep the inline aie::sub(8) inside each pump chain since SiLU is
non-linear -- the host-side bias compensation trick used by the
standalone v2 kernel does not apply here.
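
One plausible reading of the "host-side bias compensation trick", sketched here as an assumption since the commit message does not spell it out: if weights are stored as uint4 q with an implicit zero-point of 8 and a per-group scale s_g, then sum((q - 8) * s_g * x) equals sum(q * s_g * x) minus 8 * sum_g(s_g * sum of x in group g), so a linear GEMV can skip the per-element subtraction and correct the result afterwards. All names below are hypothetical and the group size is shrunk to 4 for readability.

```python
import math

GROUP_SIZE = 4  # tiny group size, for illustration only


def dot_inline(q, scales, x):
    """Subtract the zero-point 8 from every element inside the loop."""
    return sum((qk - 8) * scales[k // GROUP_SIZE] * xk
               for k, (qk, xk) in enumerate(zip(q, x)))


def dot_raw(q, scales, x):
    """Same loop without the subtraction."""
    return sum(qk * scales[k // GROUP_SIZE] * xk
               for k, (qk, xk) in enumerate(zip(q, x)))


def bias(scales, x):
    """Correction term: 8 * sum over groups of scale * sum(x in group)."""
    return 8 * sum(s * sum(x[g * GROUP_SIZE:(g + 1) * GROUP_SIZE])
                   for g, s in enumerate(scales))


def silu(v):
    return v / (1.0 + math.exp(-v))


q = [0, 15, 8, 3, 9, 7, 12, 1]
scales = [0.5, 0.25]
x = [1.0, 2.0, -1.0, 0.5, 1.0, 1.0, -2.0, 4.0]

inline = dot_inline(q, scales, x)
compensated = dot_raw(q, scales, x) - bias(scales, x)

# SiLU is not additive, so the same correction cannot be applied after it.
gap = abs(silu(dot_raw(q, scales, x) - bias(scales, x))
          - (silu(dot_raw(q, scales, x)) - silu(bias(scales, x))))
```

The two linear results agree, while the SiLU gap is large, which is consistent with keeping the subtraction inside the fused SiLU kernel where the activation is applied before the result leaves the kernel.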

Also swap the down stage inside swiglu_decode_int4 from
AIEFusedDequantGEMV (v1) to AIEFusedDequantGEMVv2 so the entire SwiGLU
INT4 chain runs on v2 kernels end-to-end.

Result on Llama 3.2 1B Q4_0 (correctness_test.py --bench, median of 2):

  Config                       decode t/s
  NPU INT4 +SwiGLU  (before)        2.80
  NPU INT4 +SwiGLU  (this commit)   5.20  (1.86x, parity with NPU bf16)

paris_short_q4_0_int4_swiglu passes byte-exact vs cpu_baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soket1 pushed a commit to Soket1/IRON-windows that referenced this pull request May 23, 2026
… 10x lever

Re-read ggml-org/llama.cpp#21725 in full. ic's exact wording is
"64K alignment to match SMMU page AND OTHER PARAMETERS got 41 t/s".
The "and other parameters" qualifier matters: 41 t/s is a tuned
bundle, not a single-lever win. Our 0% gain from naive alignment
flips is consistent.

Also: ic tested on Strix Halo (XDNA2 large NPU); we run Strix Point
(smaller compute tile + bandwidth budget). Numbers don't translate
1:1 -- our hardware ceiling is lower regardless.

Key context from issue author albiol2004:

  - FLM (61 t/s) wins via "optimized fused kernels for specific
    model architectures" -- not magic, just shape-specific
    fused operators. albiol2004 plans both "foundational blocks
    of individual kernels and fused kernels for common models".
  - decode is memory-bound; NPU compute is largely unused at M=1
    -- speculative decoding is an orthogonal lever to fix that.

Implication for our roadmap:

  - Biggest AVAILABLE lever for us: Phase 8.3 mode B (fused INT4
    QKV xclbin, 3 parallel workers on disjoint AIE columns).
    Profile already shows +12% on 3B from parallel Q+K+V.
  - Speculative decoding: orthogonal, 2-3x potential.
  - Priority 9 SMMU 64K: deprioritised. It's a knob in a tuned
    bundle; the "other parameters" we already cover via PR amd#101
    v2 kernel optimisations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>