Optimized ARM NEON q1_0 dot #33
|
This looks amazing, a massive gain. I have an M4 Pro, so I can try it on that. These were the ARM NEON speeds from my initial PR; I can rerun both and compare. I can also try it on my Android phone to see the difference. ARM NEON
Question: How do you run the different flows? Is there a special compile flag?
Nice |
Pull request overview
This PR continues the ARM NEON optimization work for q1_0 dot products by introducing faster bit unpacking (LUT-based) and adding an nrc==2 path that leverages the ARM i8mm (__ARM_FEATURE_MATMUL_INT8) 2x2 int8 matmul instructions to accelerate prompt processing.
Changes:
- Enable nrows==2 for GGML_TYPE_Q1_0 when compiling with __ARM_FEATURE_MATMUL_INT8 so the CPU matmul path can request 2-row dot kernels.
- Reimplement ggml_vec_dot_q1_0_q8_0 for ARM with separate implementations for i8mm (nrc==2), DOTPROD, and baseline NEON.
- Add q1 LUT tables for faster expansion of q1 sign/mask bytes.
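To make the LUT idea in the list above concrete, here is a minimal scalar sketch, not the PR's NEON code. It assumes that bit i of each packed byte selects +1 (bit set) or -1 (bit clear); the actual q1_0 bit order, block size, and table layout may differ. The names `q1_lut`, `q1_dot_lut`, `q1_dot_ref`, and `q1_block_dot` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: entry b holds the 8 weights encoded by byte b,
 * assuming bit i set means +1 and bit i clear means -1. */
static int8_t q1_lut[256][8];

static void q1_lut_init(void) {
    for (int b = 0; b < 256; ++b) {
        for (int i = 0; i < 8; ++i) {
            q1_lut[b][i] = ((b >> i) & 1) ? 1 : -1;
        }
    }
}

/* Integer dot product of n sign bits with n int8 values, using the LUT. */
static int32_t q1_dot_lut(const uint8_t *bits, const int8_t *q8, int n) {
    assert(n % 8 == 0);
    int32_t sum = 0;
    for (int j = 0; j < n / 8; ++j) {
        const int8_t *w = q1_lut[bits[j]];  /* one lookup expands 8 weights */
        for (int i = 0; i < 8; ++i) {
            sum += w[i] * q8[8 * j + i];
        }
    }
    return sum;
}

/* Bit-by-bit reference, used only to cross-check the LUT path. */
static int32_t q1_dot_ref(const uint8_t *bits, const int8_t *q8, int n) {
    int32_t sum = 0;
    for (int k = 0; k < n; ++k) {
        int bit = (bits[k / 8] >> (k % 8)) & 1;
        sum += bit ? q8[k] : -q8[k];
    }
    return sum;
}

/* Scaling kept separate from accumulation: the integer sum is computed
 * first and multiplied by the two block scales once at the end. */
static float q1_block_dot(float d_w, float d_a,
                          const uint8_t *bits, const int8_t *q8, int n) {
    return d_w * d_a * (float)q1_dot_lut(bits, q8, n);
}
```

Call `q1_lut_init()` once before the first dot product. A vectorized kernel would presumably load table entries straight into vector registers instead of looping over eight lanes, but the lookup-then-multiply structure stays the same.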
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ggml/src/ggml-cpu/ggml-cpu.c | Advertises GGML_TYPE_Q1_0 as supporting 2-row dot kernels when i8mm is available. |
| ggml/src/ggml-cpu/arch/arm/quants.c | Reworks ARM q1_0 × q8_0 dot product with LUT-based unpacking and new i8mm/DOTPROD/baseline NEON paths. |
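The compile-time selection described in the table can be sketched with the standard ACLE feature macros. This is an illustration of the dispatch order only, not the code in quants.c or ggml-cpu.c; `q1_0_dot_path` and `q1_0_nrows` are hypothetical helper names. With GCC or Clang the macros are typically defined by a target flag such as `-march=armv8.2-a+dotprod+i8mm` or by a native CPU target.

```c
#include <string.h>

/* Name of the kernel path this translation unit would select,
 * checked from the most capable extension down to the scalar fallback. */
static const char *q1_0_dot_path(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    return "i8mm";      /* 2x2 tile path, requested with nrc == 2 */
#elif defined(__ARM_FEATURE_DOTPROD)
    return "dotprod";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "scalar";
#endif
}

/* Rows per dot-kernel call: 2 only when i8mm is available, else 1. */
static int q1_0_nrows(void) {
    return strcmp(q1_0_dot_path(), "i8mm") == 0 ? 2 : 1;
}
```

Because the choice is made by the preprocessor, comparing the different paths on one machine means rebuilding with different target flags rather than passing a runtime switch.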
|
@khosravipasha I would be grateful if you reran on the M4 Pro, as I am interested in how the code behaves on Apple hardware; as for the smartphone, there is no need, I think. As for compile flags, I am mostly interested in a native rerun and I think that both |
|
As repack for ARM is much more straightforward, I decided to implement a 4x4 NEON+DP variant in #34. It doesn't use
|
Prompt processing seems to improve massively, while token generation did not change much. Apple M4 Pro / macOS 15 / 8 threads /
Details:
cmake -B build-pr \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF \
  -DGGML_METAL=OFF \
  -DGGML_BLAS=OFF \
  -DLLAMA_BUILD_SERVER=OFF
cmake --build build-pr --target llama-bench -j$(sysctl -n hw.ncpu)
./build-pr/bin/llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 5 -t 8 -fa 1 -mmp 0
Baseline (master @ bbeb89d):
build: bbeb89d (241)
build: 197f7ca (244) |
|
@khosravipasha Thanks, though it was not necessary to limit the Mac run to 128/32 tokens. Also, what else has changed since the run with 500+ t/s (or does BLAS affect pp that significantly)? And if Metal was enabled, could the iGPU accelerate pp even without layer upload? |
|
@khosravipasha Also I noticed that |
|
Yeah, I can do the pp512 and tg128 runs.
Summary
Apple M4 Pro / macOS / 8 threads /
Baseline (master @ bbeb89d):
build: bbeb89d (241)
PR (perf/q1_0_arm_dot @ 197f7ca):
build: 197f7ca (244) |
(You posted this table in your first reply) |
|
Using 10 cores seems to be best on the Mac for TG (I guess because it has 10 performance cores and 4 E-cores).
PR (perf/q1_0_arm_dot @ 197f7ca):
Baseline (master @ bbeb89d):
|
|
@khosravipasha Pretty funny that a user with Intel E-cores and I with a smartphone encounter the exact same performance degradation when mixing P and E cores. It would be interesting to see if core affinity ( |
Good question. I pasted the initial table with 500 tok/s from here: ggml-org#21273. I need to see why it's slower now. |
|
@khosravipasha The picture states METAL, so that might be it. Also, if the Metal build was enabled, it might try to accelerate large matmuls by streaming weights through the iGPU even if -ngl is 0 and no device is specified, or some other optimized kernel libs were present (aka backend is stated as |
|
@khosravipasha Also, there was a comparison between NEON and the scalar fallback, and |
|
Also kind of crazy how |
|
Anyway, at least my changes didn't harm performance on hardware different from mine, so the mission can be declared successful. |
|
Yeah, this is amazing :D I ran with BLAS and Metal in case you are curious:
Apple M4 Pro / -t 10 (P-cores) / -fa 1 -mmp 0 -r 5 / Bonsai-1.7B Q1_0 (231 MiB)
|
Continuation of #10 for ARM NEON
Reimplemented the kernel for NEON with and without the dotprod extension, and added an nrc==2 path for the i8mm (int8 2x2x8 -> int32 2x2 matmul) extension, like some of the other formats.
The main optimization is faster unpacking of q1 values via LUT; scaling and accumulation are also done separately, as in other implementations of Q1 dot.
Plain NEON is rewritten like the SSSE3/AVX2 impl, since that approach is more efficient when dotprod is not available.
A 2x2 tile dot for the i8mm ext was added to speed up pp.
Benchmarks were performed with:
Honor 400 (smartphone) with Snapdragon 7 Gen 3, 12 GB RAM
Android 16 via ADB
Used 4 performance cores due to all 8 cores causing unstable performance
Command:
./llama-bench -m Bonsai-1.7B.gguf -p 128 -n 32 -r 3 -C 0xF0 -t 4 -fa 1 -mmp 0
Perplexity for 5x512 chunks: Mean KLD 0.00021, PPL 21.08, Same top p 99.22%
*doesn't affect tg.
@khosravipasha, could you please compare performance between your implementation and mine on your Mac, so we have more data on performance?
As always, I would appreciate your feedback
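For readers unfamiliar with the i8mm operation mentioned in the description (int8 2x2x8 -> int32 2x2), the following is a scalar model of one such step and of a 2x2 tile of dot products built from it. It is a sketch for illustration, not the PR's kernel: the function names `mmla_s8_ref` and `dot_tile_2x2` are hypothetical, and the real path would use the hardware instruction on already unpacked q1 and q8 values.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one i8mm step: acc (2x2, row-major) += A * B^T,
 * where A and B are both 2x8 int8 matrices stored row after row. */
static void mmla_s8_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int r = 0; r < 2; ++r) {
        for (int c = 0; c < 2; ++c) {
            int32_t s = 0;
            for (int k = 0; k < 8; ++k) {
                s += (int32_t)a[8 * r + k] * (int32_t)b[8 * c + k];
            }
            acc[2 * r + c] += s;
        }
    }
}

/* 2x2 tile of dot products over n elements:
 * out[2*r + c] = dot(x_r, y_c) for rows x0, x1 and columns y0, y1. */
static void dot_tile_2x2(int32_t out[4],
                         const int8_t *x0, const int8_t *x1,
                         const int8_t *y0, const int8_t *y1, int n) {
    assert(n % 8 == 0);
    out[0] = out[1] = out[2] = out[3] = 0;
    for (int k = 0; k < n; k += 8) {
        int8_t a[16], b[16];
        for (int i = 0; i < 8; ++i) {
            a[i] = x0[k + i];  a[8 + i] = x1[k + i];  /* two weight rows */
            b[i] = y0[k + i];  b[8 + i] = y1[k + i];  /* two activation rows */
        }
        mmla_s8_ref(out, a, b);
    }
}
```

Each pair of loaded rows contributes to four dot products at once, which is the likely reason the tile helps pp (many activation rows per weight row) while tg, which processes one row at a time, sees little change.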