Optimized ARM NEON q1_0 dot by pl752 · Pull Request #33 · PrismML-Eng/llama.cpp

pl752 · 2026-05-16T16:34:38Z

Continuation of #10 for ARM NEON

Reimplemented the kernel for NEON with and without the dotprod extension, and added an nrc==2 path for the i8mm extension (int8 2x8 by 8x2 -> int32 2x2 matmul), as some of the other formats already have.
The main optimization is faster unpacking of q1 values via a LUT; scaling and accumulation are also separated, as in other implementations of the Q1 dot.
Plain NEON is rewritten to follow the SSSE3/AVX2 implementation, since that approach is more efficient when dotprod is not available.
A 2x2 tile dot for the i8mm extension was added to speed up pp.
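To make the LUT idea concrete, here is a minimal scalar sketch in Python. It is not the NEON kernel from this PR. The bit order (LSB-first), the sign convention (set bit = +1, clear bit = -1), and the names `SIGN_LUT`, `unpack_q1`, and `dot_block` are assumptions made for illustration only.

```python
# Scalar model of LUT-based 1-bit unpacking; NOT the NEON kernel from this PR.
# Assumptions (hypothetical): a set bit means +1, a clear bit means -1,
# bits are stored LSB-first, and each block carries one float scale.

# 256-entry table built once: packed byte -> eight signed weights.
SIGN_LUT = [tuple(1 if (b >> i) & 1 else -1 for i in range(8)) for b in range(256)]


def unpack_q1(packed: bytes) -> list:
    """Expand packed sign bits to +1/-1 weights with one table lookup per byte."""
    out = []
    for byte in packed:
        out.extend(SIGN_LUT[byte])
    return out


def dot_block(packed: bytes, d_w: float, q8: list, d_a: float) -> float:
    """Dot product of one 1-bit weight block with one int8 activation block.

    Accumulation stays in integers; both scales are applied once at the end,
    which models "separate scaling and accumulation".
    """
    weights = unpack_q1(packed)
    assert len(weights) == len(q8)
    acc = sum(w * a for w, a in zip(weights, q8))  # integer accumulate
    return d_w * d_a * acc                         # scale once per block


if __name__ == "__main__":
    packed = bytes([0b00000101, 0xFF])
    q8 = [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 1, 1, 1, 1, 1, 1]
    print(unpack_q1(packed)[:8])
    print(dot_block(packed, 0.5, q8, 2.0))
```

The table lookup replaces per-bit shifting and masking, and keeping the inner sum in integers defers the floating-point multiply to one per block.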

Benchmarks were performed with:
Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
Android 16 via ADB
Used 4 performance cores, because using all 8 cores caused unstable performance
Command: ./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%

flow	run	baseline	updated	delta
NEON	pp128	15.81 t/s	27.55 t/s	+74.26%
NEON	tg32	12.29 t/s	21.26 t/s	+72.99%
NEON+DP	pp128	27.17 t/s	39.22 t/s	+44.35%
NEON+DP	tg32	20.80 t/s	29.54 t/s	+42.02%
NEON+DP+I8MM*	pp128	27.17 t/s	69.38 t/s	+155.36%

* The i8mm path doesn't affect tg.
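For reference, the delta column can be reproduced from the baseline and updated throughput. The helper name `delta_percent` is hypothetical; the numbers are copied from the table above.

```python
def delta_percent(baseline: float, updated: float) -> float:
    """Relative throughput change in percent: (updated / baseline - 1) * 100."""
    return (updated / baseline - 1.0) * 100.0


# (flow, run, baseline t/s, updated t/s), copied from the table above
rows = [
    ("NEON", "pp128", 15.81, 27.55),
    ("NEON", "tg32", 12.29, 21.26),
    ("NEON+DP", "pp128", 27.17, 39.22),
    ("NEON+DP", "tg32", 20.80, 29.54),
    ("NEON+DP+I8MM", "pp128", 27.17, 69.38),
]

if __name__ == "__main__":
    for flow, run, base, new in rows:
        print(f"{flow:13s} {run:6s} {delta_percent(base, new):+.2f}%")
```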

@khosravipasha, could you please compare your implementation and mine on your Mac, so we have more performance data?

As always, I would appreciate your feedback.

khosravipasha · 2026-05-16T22:10:44Z

This looks amazing, massive gain. I have an M4 Pro, I can try it on that.

These were the ARM NEON speeds from my initial PR. I can rerun both and compare, and I can try on my Android phone too, to see the difference.

ARM NEON

model	size	params	backend	threads	test	t/s
1.7B Q1_0	231.13 MiB	1.72 B	BLAS	10	pp512	502.21 ± 13.70
1.7B Q1_0	231.13 MiB	1.72 B	BLAS	10	tg128	112.46 ± 5.89

Question: How do you run the different flows? Is there a special compile flag?
NEON vs NEON+DP vs NEON+DP+I8MM*

Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%

Nice

Copilot

Pull request overview

This PR continues the ARM NEON optimization work for q1_0 dot products by introducing faster bit unpacking (LUT-based) and adding an nrc==2 path that leverages the ARM i8mm (__ARM_FEATURE_MATMUL_INT8) 2x2 int8 matmul instructions to accelerate prompt processing.

Changes:

Enable nrows==2 for GGML_TYPE_Q1_0 when compiling with __ARM_FEATURE_MATMUL_INT8 so the CPU matmul path can request 2-row dot kernels.
Reimplement ggml_vec_dot_q1_0_q8_0 for ARM with separate implementations for i8mm (nrc==2), DOTPROD, and baseline NEON.
Add q1 LUT tables for faster expansion of q1 sign/mask bytes.
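As a rough illustration of why an nrc==2 kernel fits i8mm, the sketch below models one int8 matrix multiply-accumulate step in Python: two rows of 8 int8 values from each operand produce a 2x2 tile of int32 partial dot products. This is a model of my understanding of the instruction semantics, not code from the PR; the function name `smmla_model` and the flat row-major layouts are assumptions for illustration.

```python
def smmla_model(acc, a, b):
    """Model of one int8 2x8 by 8x2 multiply-accumulate step.

    a:   16 int8 values read as a 2x8 row-major matrix (e.g. two weight rows)
    b:   16 int8 values read as a 2x8 row-major matrix, used transposed
         (e.g. two activation rows)
    acc: 4 int32 values read as a 2x2 row-major matrix
    Returns acc + A @ B^T as a flat list [r0c0, r0c1, r1c0, r1c1].
    """
    out = list(acc)
    for i in range(2):          # row of A
        for j in range(2):      # row of B (column of B^T)
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


if __name__ == "__main__":
    x0, x1 = [1] * 8, [-1] * 8              # two unpacked weight rows
    y0, y1 = list(range(1, 9)), [2] * 8     # two int8 activation rows
    print(smmla_model([0, 0, 0, 0], x0 + x1, y0 + y1))
```

Each step yields four dot products at once, which only pays off when two rows are processed together. That is consistent with the gain showing up in pp and not in tg.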

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
ggml/src/ggml-cpu/ggml-cpu.c	Advertises `GGML_TYPE_Q1_0` as supporting 2-row dot kernels when i8mm is available.
ggml/src/ggml-cpu/arch/arm/quants.c	Reworks ARM `q1_0 × q8_0` dot product with LUT-based unpacking and new i8mm/DOTPROD/baseline NEON paths.

pl752 · 2026-05-17T06:14:58Z

@khosravipasha I would be grateful if you reran on the M4 Pro, as I am interested in how the code behaves on Apple hardware; as for the smartphone, there is no need, I think.

As for compile flags, I am mostly interested in a native rerun, and I think both dotprod (aka dp) and i8mm should be available on Apple's implementation. CMake flags can be used to control the instruction set if needed: -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.6-a+dotprod+i8mm+fp16+bf16 for all extensions, or -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16+bf16 for no NEON extensions, aka plain NEON. GGML_NATIVE is enabled by default and has to be disabled when setting the target explicitly, to prevent auto-detection of available features. armv8.6-a should imply dp+i8mm (so naming the extensions is likely redundant), which is why armv8.2-a is needed for plain NEON.

pl752 · 2026-05-17T17:14:02Z

As repack for Arm is much more straightforward, I decided to implement a 4x4 NEON+DP variant in #34.

It doesn't use i8mm and gives a significant performance uplift.
The benchmark numbers in this comment differ from those in #34, because the tests in that PR were conducted without this PR's code.

flow	run	PR33	repack+PR33	delta
NEON+DP	pp128	39.22 t/s	103.01 t/s	+162.65%
NEON+DP	tg32	29.54 t/s	39.58 t/s	+33.99%
NEON+DP+I8MM*	pp128	69.38 t/s	103.01 t/s	+48.47%

* i8mm doesn't affect tg and is not used in the repack kernels; shown just for reference.

pl752 · 2026-05-18T08:08:25Z

~~UPD: I have also tried an i8mm-specific 4x8 variant, but it made the following tradeoff: (103, 40) -> (121, 37) t/s for (pp, tg), so idk...~~
The tradeoff is not that bad, and it is mostly absorbed by attention computation as the context grows, so here it is:

flow	run	4x4	4x8	delta
NEON+DP	pp128	103.01 t/s	121.93 t/s	+19.34%
NEON+DP	tg32	39.58 t/s	36.59 t/s	-7.55%

khosravipasha · 2026-05-19T00:13:39Z

It seems prompt processing improved massively, while token generation did not change much.

Apple M4 Pro / macOS 15 / 8 threads / -fa 1 -mmp 0 -r 5
Model: Bonsai-1.7B Q1_0 (231 MiB), CPU flags: +dotprod+i8mm

flow	run	baseline (`bbeb89d`)	PR (`197f7ca`)	delta
NEON+DP+I8MM	pp128	172.41 ± 2.87 t/s	337.61 ± 1.05 t/s	+95.82%
NEON+DP	tg32	114.17 ± 0.24 t/s	113.15 ± 4.98 t/s	-0.89%

Details:

cmake -B build-pr \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF \
  -DGGML_METAL=OFF \
  -DGGML_BLAS=OFF \
  -DLLAMA_BUILD_SERVER=OFF
cmake --build build-pr --target llama-bench -j$(sysctl -n hw.ncpu)

./build-pr/bin/llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 5 -t 8 -fa 1 -mmp 0

Baseline (master @ bbeb89d):

model	size	params	backend	threads	fa	mmap	test	t/s
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	pp128	172.41 ± 2.87
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	tg32	114.17 ± 0.24

build: bbeb89d (241)

PR (perf/q1_0_arm_dot @ 197f7ca):

model	size	params	backend	threads	fa	mmap	test	t/s
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	pp128	337.61 ± 1.05
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	tg32	113.15 ± 4.98

build: 197f7ca (244)

pl752 · 2026-05-19T04:07:02Z

@khosravipasha Thanks, though it was not necessary to limit the Mac to (128/32) tokens. Also, what else has changed since the run with 500+ t/s (or does BLAS affect pp that significantly)? And if Metal was enabled, could the iGPU accelerate pp even without layer upload?

pl752 · 2026-05-19T17:03:17Z

@khosravipasha Also I noticed that tg speed is very similar for the -t 8 runs and the -t 10 run; could it be that memory bandwidth is saturated? Can you try running with -t 2,4,6,8 ... <num of cores>, so we can see whether it flatlines at some point due to memory throughput? (For pp that is of course not the case, otherwise the difference would be much less significant.) For example, my laptop hit that wall at some point during optimization of the x86 kernels, despite the kernel getting more efficient (for quants with lower weight density it occurs much sooner and correlates -1:1, i.e. inversely, with model size).
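A quick way to sanity-check the bandwidth hypothesis is to estimate how many weight bytes per second a given tg rate implies. The sketch below assumes every weight byte is read exactly once per generated token and ignores KV cache and activation traffic, so it is only a rough lower bound. The function name is hypothetical; the model size and tg rate are the ones reported in this thread.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def weight_stream_gib_per_s(model_mib: float, tokens_per_s: float) -> float:
    """Implied weight read rate in GiB/s, assuming one full pass over the
    weights per generated token (an assumption, not a measurement)."""
    return model_mib * MIB * tokens_per_s / GIB


if __name__ == "__main__":
    # 231.13 MiB model, tg32 at 114.17 t/s (baseline, -t 8)
    print(f"{weight_stream_gib_per_s(231.13, 114.17):.2f} GiB/s")
```

If that figure is far below the platform's memory bandwidth (not stated in this thread), the plateau is more likely caused by something else, such as core mixing or synchronization overhead.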

khosravipasha · 2026-05-19T17:04:09Z

Yeah, I can do the pp512 and tg128 runs.
Where do you see the 500+? Was it on CPU or GPU?

Summary

Apple M4 Pro / macOS / 8 threads / -fa 1 -mmp 0 -r 5
Model: Bonsai-1.7B Q1_0 (231 MiB), CPU flags: +dotprod+i8mm

flow	run	baseline (`bbeb89d`)	PR (`197f7ca`)	delta
NEON+DP+I8MM	pp512	170.99 ± 1.05 t/s	330.36 ± 0.52 t/s	+93.20%
NEON+DP	tg128	111.85 ± 0.11 t/s	112.02 ± 1.21 t/s	+0.15%

Baseline (master @ `bbeb89d`):

model	size	params	backend	threads	fa	mmap	test	t/s
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	pp512	170.99 ± 1.05
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	tg128	111.85 ± 0.11

build: bbeb89d (241)

PR (perf/q1_0_arm_dot @ `197f7ca`):

model	size	params	backend	threads	fa	mmap	test	t/s
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	pp512	330.36 ± 0.52
qwen3 1.7B Q1_0	231.13 MiB	1.72 B	CPU	8	1	0	tg128	112.02 ± 1.21

build: 197f7ca (244)

pl752 · 2026-05-19T17:06:57Z

This looks amazing, massive gain, I have a M4 Pro, I can try on that.
...
model size params backend threads test t/s
1.7B Q1_0 231.13 MiB 1.72 B BLAS 10 pp512 (HERE) 502.21 ± 13.70
1.7B Q1_0 231.13 MiB 1.72 B BLAS 10 tg128 112.46 ± 5.89
...
Nice

(You have posted this table in first reply)

khosravipasha · 2026-05-19T17:09:04Z

10 threads seems to be best on the Mac for TG (since I guess it has 10 performance cores and 4 E-cores)

PR (perf/q1_0_arm_dot @ `197f7ca`):

threads	pp128	tg128
2	89.12 ± 0.21	33.96 ± 0.07
4	172.56 ± 3.45	64.09 ± 0.11
6	254.47 ± 0.18	89.03 ± 0.04
8	337.88 ± 0.57	113.03 ± 0.36
10	374.60 ± 37.55	133.43 ± 6.51
12	419.67 ± 0.37	75.88 ± 0.07
14	415.81 ± 9.49	86.40 ± 8.74

Baseline (master @ `bbeb89d`):

threads	pp128	tg128
2	44.69 ± 0.45	33.59 ± 0.19
4	89.42 ± 0.11	63.81 ± 0.19
6	131.15 ± 0.05	88.47 ± 0.05
8	168.24 ± 0.58	110.63 ± 0.24
10	190.26 ± 4.46	119.29 ± 11.41
12	208.09 ± 0.32	99.01 ± 1.66
14	216.26 ± 1.40	57.13 ± 15.12

pl752 · 2026-05-19T17:11:48Z

@khosravipasha Pretty funny that a user with Intel E-cores and I with a smartphone encounter the exact same performance degradation when mixing P and E cores. It would be interesting to see whether core affinity (-C <mask>, like -C 0xF0 for cores 4-7, zero-indexed) affects the result when enforcing no core mixing.
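For building such a mask, a small sketch follows. It assumes bit i of the mask selects logical CPU i (zero-indexed); which indices are performance cores is device-specific and has to be checked on the target. The helper names are hypothetical.

```python
def cores_to_mask(cores) -> int:
    """Set bit i for every logical CPU index i in `cores`."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return mask


def mask_to_cores(mask: int) -> list:
    """List the logical CPU indices whose bits are set in `mask`."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    print(hex(cores_to_mask(range(4, 8))))  # mask for CPUs 4..7
    print(mask_to_cores(0xF0))              # CPUs selected by 0xF0
```

Under that assumption, 0xF0 selects logical CPUs 4 through 7, which matches the 4-thread smartphone runs in the PR description.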

khosravipasha · 2026-05-19T17:11:56Z

(You have posted this table in first reply)

Good question, the initial table with 500 tok/s I pasted from here: ggml-org#21273

Need to see why it's slower now.

pl752 · 2026-05-19T17:15:33Z

@khosravipasha In the picture it states METAL, so that might be it. Also, if Metal was enabled in the build, it might try to accelerate large matmuls by streaming weights through the iGPU even if -ngl is 0 and no device is specified, or some other optimized kernel libraries were present (the backend is stated as BLAS instead of CPU), but I am not 100% sure.

pl752 · 2026-05-19T17:19:01Z

@khosravipasha Also, there was a comparison between NEON and the scalar fallback there, and PP didn't change between them, so some acceleration was indeed involved.

pl752 · 2026-05-19T17:20:46Z

Also kind of crazy how i8mm is ~66% of iGPU performance.

pl752 · 2026-05-19T17:23:35Z

Anyway, at least my changes didn't harm performance on hardware different from mine, so the mission can be declared successful.

khosravipasha · 2026-05-19T17:32:47Z

yeah this is amazing :D

Ran with BLAS and Metal in case you are curious:

Apple M4 Pro / -t 10 (P-cores) / -fa 1 -mmp 0 -r 5 / Bonsai-1.7B Q1_0 (231 MiB)

build flags	-ngl	test	master (`bbeb89d`)	PR (`197f7ca`)	delta
CPU-only (no BLAS)	0	pp128	172.4 ± 2.9	337.6 ± 1.1	+95.8%
CPU-only (no BLAS)	0	pp512	171.0 ± 1.1	330.4 ± 0.5	+93.2%
CPU-only (no BLAS)	0	tg128	111.9 ± 0.1	112.0 ± 1.2	+0.2%
+Metal +BLAS	0	pp128	308.2 ± 5.0	332.7 ± 7.1	+7.9%
+Metal +BLAS	0	pp512	505.7 ± 12.0	515.0 ± 6.3	+1.8%
+Metal +BLAS	0	tg128	119.6 ± 1.9	123.6 ± 13.6	+3.3%
+Metal +BLAS	99	pp128	2032.0 ± 13.4	2044.5 ± 9.8	+0.6%
+Metal +BLAS	99	pp512	2184.4 ± 12.5	2188.6 ± 10.1	+0.2%
+Metal +BLAS	99	tg128	298.1 ± 3.4	301.4 ± 6.1	+1.1%

pl752 · 2026-05-19T17:38:29Z

tg is mostly apples to apples (pun intended), pp gets nice boost

pl752 added 2 commits May 16, 2026 16:44

Optimized arm NEON(+DOTPROD) q1 dot

3773f67

Implemented arm I8MM nrc==2 for q1 dot

18acd09

github-actions Bot added the ggml label May 16, 2026

khosravipasha requested a review from Copilot May 16, 2026 22:11

Copilot started reviewing on behalf of khosravipasha May 16, 2026 22:11

Copilot AI reviewed May 16, 2026

pl752 mentioned this pull request May 17, 2026

Q1_0 repack kernels for Arm NEON+DP #34

Open

Applied copilot advice about feature guards for Q1 Arm LUTs

197f7ca
