Optimized ARM NEON q1_0 dot #33

Open
pl752 wants to merge 3 commits into PrismML-Eng:master from pl752:perf/q1_0_arm_dot

@pl752 pl752 commented May 16, 2026

Continuation of #10 for ARM NEON

Reimplemented the q1_0 dot kernel for NEON, both with and without the dotprod extension, and added an nrc==2 path for the i8mm extension (int8 2x2x8 -> int32 2x2 matmul), as some of the other formats already have.
The main optimization is faster unpacking of q1 values via a LUT; scaling and accumulation are also separated, as in the other implementations of the Q1 dot.
Plain NEON is rewritten to follow the SSSE3/AVX2 implementation, which is more efficient when dotprod is not available.
The 2x2 tile dot for the i8mm extension was added to speed up pp.

Benchmarks were performed with:
Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
Android 16 via ADB
Used the 4 performance cores, because using all 8 cores gave unstable performance
Command: ./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%

| flow | run | baseline | updated | delta |
|---|---|---|---|---|
| NEON | pp128 | 15.81 t/s | 27.55 t/s | +74.26% |
| NEON | tg32 | 12.29 t/s | 21.26 t/s | +72.99% |
| NEON+DP | pp128 | 27.17 t/s | 39.22 t/s | +44.35% |
| NEON+DP | tg32 | 20.80 t/s | 29.54 t/s | +42.02% |
| NEON+DP+I8MM* | pp128 | 27.17 t/s | 69.38 t/s | +155.36% |
  • * doesn't affect tg.

@khosravipasha, could you please compare the performance of your implementation and mine on your Mac, so that we have more data?

As always, I would appreciate your feedback

@github-actions github-actions Bot added the ggml label May 16, 2026
@khosravipasha
Collaborator

This looks amazing, massive gain, I have a M4 Pro, I can try on that.

These were the ARM NEON speeds from my initial PR. I can rerun both and compare, and I can try on my Android phone too, to see the difference.

ARM NEON

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| 1.7B Q1_0 | 231.13 MiB | 1.72 B | BLAS | 10 | pp512 | 502.21 ± 13.70 |
| 1.7B Q1_0 | 231.13 MiB | 1.72 B | BLAS | 10 | tg128 | 112.46 ± 5.89 |

Question: How do you run the different flows? Is there a special compile flag?
NEON vs NEON+DP vs NEON+DP+I8MM*

Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99,22%

Nice


Copilot AI left a comment

Pull request overview

This PR continues the ARM NEON optimization work for q1_0 dot products by introducing faster bit unpacking (LUT-based) and adding an nrc==2 path that leverages the ARM i8mm (__ARM_FEATURE_MATMUL_INT8) 2x2 int8 matmul instructions to accelerate prompt processing.

Changes:

  • Enable nrows==2 for GGML_TYPE_Q1_0 when compiling with __ARM_FEATURE_MATMUL_INT8 so the CPU matmul path can request 2-row dot kernels.
  • Reimplement ggml_vec_dot_q1_0_q8_0 for ARM with separate implementations for i8mm (nrc==2), DOTPROD, and baseline NEON.
  • Add q1 LUT tables for faster expansion of q1 sign/mask bytes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| ggml/src/ggml-cpu/ggml-cpu.c | Advertises GGML_TYPE_Q1_0 as supporting 2-row dot kernels when i8mm is available. |
| ggml/src/ggml-cpu/arch/arm/quants.c | Reworks ARM q1_0 × q8_0 dot product with LUT-based unpacking and new i8mm/DOTPROD/baseline NEON paths. |

Comment thread ggml/src/ggml-cpu/arch/arm/quants.c Outdated
@pl752
Author

pl752 commented May 17, 2026

@khosravipasha I would be grateful if you reran on the M4 Pro, as I am interested in how the code behaves on Apple hardware; as for the smartphone, I think there is no need.

As for compile flags, I am mostly interested in a native rerun, and I think both dotprod (aka dp) and i8mm should be available on Apple's implementation. CMake flags can be used to control the instruction set if needed: -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.6-a+dotprod+i8mm+fp16+bf16, or -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16+bf16 for no NEON extensions, aka plain NEON. GGML_NATIVE is enabled by default and has to be disabled to prevent auto-detection of the available features when the target is set explicitly. armv8.6-a should already imply dp+i8mm, so listing those extensions is likely redundant, which is why armv8.2-a is needed for plain NEON.

@pl752
Author

pl752 commented May 17, 2026

As repack for ARM is much more straightforward, I decided to implement a 4x4 NEON+DP variant in #34.

It doesn't use i8mm and gives a significant performance uplift.
The benchmark numbers in this comment differ from those in #34, because the tests in that PR were run without this PR's code.

| flow | run | PR33 | repack+PR33 | delta |
|---|---|---|---|---|
| NEON+DP | pp128 | 39.22 t/s | 103.01 t/s | +162.65% |
| NEON+DP | tg32 | 29.54 t/s | 39.58 t/s | +33.99% |
| NEON+DP+I8MM* | pp128 | 69.38 t/s | 103.01 t/s | +48.47% |
  • * doesn't affect tg and is not used in repack kernels, just for reference.

@pl752
Author

pl752 commented May 18, 2026

UPD: I have also tried an i8mm-specific 4x8, but it made the following tradeoff: (103, 40) -> (121, 37), so idk...
The tradeoff is not that bad, and it is mostly absorbed by attention computations when the context grows, so here it is:

| flow | run | 4x4 | 4x8 | delta |
|---|---|---|---|---|
| NEON+DP | pp128 | 103.01 t/s | 121.93 t/s | +18.37% |
| NEON+DP | tg32 | 39.58 t/s | 36.59 t/s | -7.55% |

@khosravipasha
Collaborator

Seems prompt processing improved massively, and token generation did not change much.

Apple M4 Pro / macOS 15 / 8 threads / -fa 1 -mmp 0 -r 5
Model: Bonsai-1.7B Q1_0 (231 MiB), CPU flags: +dotprod+i8mm

| flow | run | baseline (bbeb89d) | PR (197f7ca) | delta |
|---|---|---|---|---|
| NEON+DP+I8MM | pp128 | 172.41 ± 2.87 t/s | 337.61 ± 1.05 t/s | +95.82% |
| NEON+DP | tg32 | 114.17 ± 0.24 t/s | 113.15 ± 4.98 t/s | -0.89% |

Details:

cmake -B build-pr \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF \
  -DGGML_METAL=OFF \
  -DGGML_BLAS=OFF \
  -DLLAMA_BUILD_SERVER=OFF
cmake --build build-pr --target llama-bench -j$(sysctl -n hw.ncpu)
./build-pr/bin/llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 5 -t 8 -fa 1 -mmp 0

Baseline (master @ bbeb89d):

| model | size | params | backend | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp128 | 172.41 ± 2.87 |
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg32 | 114.17 ± 0.24 |

build: bbeb89d (241)

PR (perf/q1_0_arm_dot @ 197f7ca):

| model | size | params | backend | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp128 | 337.61 ± 1.05 |
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg32 | 113.15 ± 4.98 |

build: 197f7ca (244)

@pl752
Author

pl752 commented May 19, 2026

@khosravipasha Thanks, though it was not necessary to limit the Mac to (128/32) tokens. Also, what else has changed since the run with 500+ t/s (or does BLAS affect pp that significantly)? And if Metal was enabled, could the iGPU accelerate pp even without layer upload?

@pl752
Author

pl752 commented May 19, 2026

@khosravipasha Also, I noticed that tg speed is very similar for both the -t 8 runs and the -t 10 run; could it be that memory bandwidth is saturated? Can you try running with -t 2,4,6,8 ... <num of cores> so we can see whether it flatlines at some point due to memory throughput? (For pp that is of course not the case, otherwise the difference would be much less significant.) For example, while I was optimizing the x86 kernels, my laptop at some point hit the wall too, despite the kernel getting more efficient (for quants with lower weight density it occurs much sooner and correlates -1:1 with model size).

@khosravipasha
Collaborator

Yeah I can do the pp512, tg128
Where do you see the 500+? was it on cpu or gpu?

Summary

Apple M4 Pro / macOS / 8 threads / -fa 1 -mmp 0 -r 5
Model: Bonsai-1.7B Q1_0 (231 MiB), CPU flags: +dotprod+i8mm

| flow | run | baseline (bbeb89d) | PR (197f7ca) | delta |
|---|---|---|---|---|
| NEON+DP+I8MM | pp512 | 170.99 ± 1.05 t/s | 330.36 ± 0.52 t/s | +93.20% |
| NEON+DP | tg128 | 111.85 ± 0.11 t/s | 112.02 ± 1.21 t/s | +0.15% |

Baseline (master @ bbeb89d):

| model | size | params | backend | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp512 | 170.99 ± 1.05 |
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg128 | 111.85 ± 0.11 |

build: bbeb89d (241)

PR (perf/q1_0_arm_dot @ 197f7ca):

| model | size | params | backend | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp512 | 330.36 ± 0.52 |
| qwen3 1.7B Q1_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg128 | 112.02 ± 1.21 |

build: 197f7ca (244)

@pl752
Author

pl752 commented May 19, 2026

This looks amazing, massive gain, I have a M4 Pro, I can try on that.
...
model size params backend threads test t/s
1.7B Q1_0 231.13 MiB 1.72 B BLAS 10 pp512 (HERE) 502.21 ± 13.70
1.7B Q1_0 231.13 MiB 1.72 B BLAS 10 tg128 112.46 ± 5.89
...
Nice

(You have posted this table in first reply)

@khosravipasha
Collaborator

khosravipasha commented May 19, 2026

10 cores seems to be best on the Mac for TG (since I guess it has 10 performance cores and 4 E-cores)

PR (perf/q1_0_arm_dot @ 197f7ca):

| threads | pp128 | tg128 |
|---|---|---|
| 2 | 89.12 ± 0.21 | 33.96 ± 0.07 |
| 4 | 172.56 ± 3.45 | 64.09 ± 0.11 |
| 6 | 254.47 ± 0.18 | 89.03 ± 0.04 |
| 8 | 337.88 ± 0.57 | 113.03 ± 0.36 |
| 10 | 374.60 ± 37.55 | 133.43 ± 6.51 |
| 12 | 419.67 ± 0.37 | 75.88 ± 0.07 |
| 14 | 415.81 ± 9.49 | 86.40 ± 8.74 |

Baseline (master @ bbeb89d):

| threads | pp128 | tg128 |
|---|---|---|
| 2 | 44.69 ± 0.45 | 33.59 ± 0.19 |
| 4 | 89.42 ± 0.11 | 63.81 ± 0.19 |
| 6 | 131.15 ± 0.05 | 88.47 ± 0.05 |
| 8 | 168.24 ± 0.58 | 110.63 ± 0.24 |
| 10 | 190.26 ± 4.46 | 119.29 ± 11.41 |
| 12 | 208.09 ± 0.32 | 99.01 ± 1.66 |
| 14 | 216.26 ± 1.40 | 57.13 ± 15.12 |

@pl752
Author

pl752 commented May 19, 2026

@khosravipasha Pretty funny that a user with Intel E-cores and I with a smartphone encounter the exact same performance degradation when mixing P- and E-cores. It would be interesting to see whether core affinity (-C <mask>, like -C 0xF0 for cores 4-7) affects the result when enforcing no core mixing.

@khosravipasha
Collaborator

khosravipasha commented May 19, 2026

(You have posted this table in first reply)

Good question, the initial table with 500 tok/s I pasted from here: ggml-org#21273

Need to see why it's slower now

@pl752
Author

pl752 commented May 19, 2026

@khosravipasha The picture there says METAL, so that might be it. Also, if Metal was enabled in the build, it might try to accelerate large matmuls by streaming weights through the iGPU even if -ngl is 0 and no device is specified, or some other optimized kernel libraries were present (the backend is stated as BLAS instead of CPU, but I am not 100% sure).

@pl752
Author

pl752 commented May 19, 2026

@khosravipasha Also, there was a comparison between NEON and the scalar fallback, and PP didn't change there, so some acceleration was indeed involved.

@pl752
Author

pl752 commented May 19, 2026

Also kind of crazy how i8mm is ~66% of iGPU performance.

@pl752
Author

pl752 commented May 19, 2026

Anyway, at least my changes didn't harm performance on hardware different from mine, so the mission can be declared successful.

@khosravipasha
Collaborator

yeah this is amazing :D

ran with BLAS and Metal in case you are curious:

Apple M4 Pro / -t 10 (P-cores) / -fa 1 -mmp 0 -r 5 / Bonsai-1.7B Q1_0 (231 MiB)

| build flags | -ngl | test | master (bbeb89d) | PR (197f7ca) | delta |
|---|---|---|---|---|---|
| CPU-only (no BLAS) | 0 | pp128 | 172.4 ± 2.9 | 337.6 ± 1.1 | +95.8% |
| CPU-only (no BLAS) | 0 | pp512 | 171.0 ± 1.1 | 330.4 ± 0.5 | +93.2% |
| CPU-only (no BLAS) | 0 | tg128 | 111.9 ± 0.1 | 112.0 ± 1.2 | +0.2% |
| +Metal +BLAS | 0 | pp128 | 308.2 ± 5.0 | 332.7 ± 7.1 | +7.9% |
| +Metal +BLAS | 0 | pp512 | 505.7 ± 12.0 | 515.0 ± 6.3 | +1.8% |
| +Metal +BLAS | 0 | tg128 | 119.6 ± 1.9 | 123.6 ± 13.6 | +3.3% |
| +Metal +BLAS | 99 | pp128 | 2032.0 ± 13.4 | 2044.5 ± 9.8 | +0.6% |
| +Metal +BLAS | 99 | pp512 | 2184.4 ± 12.5 | 2188.6 ± 10.1 | +0.2% |
| +Metal +BLAS | 99 | tg128 | 298.1 ± 3.4 | 301.4 ± 6.1 | +1.1% |

@pl752
Author

pl752 commented May 19, 2026

tg is mostly apples to apples (pun intended), pp gets nice boost
