hexagon: allow dflash lm-head offload experiment by Salanfeng · Pull Request #25166 · ggml-org/llama.cpp

Salanfeng · 2026-06-30T09:43:59Z

Overview

This PR fixes several Hexagon issues that prevented DFlash from working correctly and improves speculative decoding support on Hexagon.

- guard optional rotary inputs that may exist without a backend buffer in hybrid/iSWA graphs
- add opt-in HTP lm-head placement for large quantized models used by DFlash verification
- make llama-speculative-simple use the common speculative path, matching server behavior for DFlash/EAGLE/MTP-style drafts
- handle tied embedding/lm-head fallback when DFlash shares target tensors through ctx_other

Additional information

Tested on an SM8750 Android device with Qwen3.5-4B Q8_0 / Q4_0 target GGUFs and a Qwen3.5-4B DFlash F16 draft GGUF.

GGML_HEXAGON_LM_HEAD=1 is kept opt-in to preserve existing behavior until large lm-head kernels are validated across more Hexagon deployments. When embeddings are tied, the output tensor may point to an HTP-only repacked buffer while tok_embd still refers to the original CPU-accessible tensor. The tied fallback therefore uses tok_embd, which avoids building CPU draft graphs with HTP-only buffers.
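The opt-in switch and the tied fallback can be pictured with a small toy model. This is only a sketch, not the llama.cpp implementation: the `Tensor` class, its `cpu_accessible` flag, and both function names are hypothetical, and treating the variable as enabled only for the value `1` is an assumption based on the commands in this PR.

```python
import os
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for a model tensor and where its buffer lives."""
    name: str
    cpu_accessible: bool  # False models an HTP-only repacked buffer


def lm_head_on_htp(env=None) -> bool:
    """Opt-in check; assumed to be enabled only when the variable equals "1"."""
    env = os.environ if env is None else env
    return env.get("GGML_HEXAGON_LM_HEAD") == "1"


def pick_draft_lm_head(output: Tensor, tok_embd: Tensor, tied: bool) -> Tensor:
    """With tied embeddings, fall back to tok_embd so a CPU draft graph does
    not reference a buffer that only the HTP can read."""
    return tok_embd if tied else output


tok_embd = Tensor("tok_embd", cpu_accessible=True)
output = Tensor("output", cpu_accessible=False)  # repacked for HTP (toy value)

chosen = pick_draft_lm_head(output, tok_embd, tied=True)
print(chosen.name, chosen.cpu_accessible)
```

In this toy model the tied case always selects the CPU-accessible tensor, regardless of whether the lm-head offload is enabled.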

Build and run commands

Build:

cmake -S . -B build-android-hexagon \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-34 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_ARM_ARCH=armv8.7-a+fp16+dotprod+i8mm \
  -DGGML_HEXAGON=ON \
  -DHEXAGON_SDK_ROOT=$HEXAGON_SDK_ROOT \
  -DHEXAGON_TOOLS_ROOT=$HEXAGON_TOOLS_ROOT \
  -DPREBUILT_LIB_DIR=android_aarch64 \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_EXAMPLES=ON

cmake --build build-android-hexagon \
  --target llama-cli llama-bench llama-speculative-simple htp-v79 -j 16

Target-only run:

HEXAGON_LIB_DIR is the directory containing the Hexagon shared libraries.

LD_LIBRARY_PATH=$HEXAGON_LIB_DIR ADSP_LIBRARY_PATH=$HEXAGON_LIB_DIR \
./bin/llama-cli \
  -m models/Qwen3.5-4B-Q8_0.gguf \
  -dev HTP0 -ngl 99 \
  -c 4096 -b 2048 -ub 1024 \
  -t 6 --threads-batch 6 \
  -C 0xfc --cpu-strict 1 --poll 100 \
  --temp 0 -cnv -st --jinja \
  --no-display-prompt --show-timings \
  -p "$PROMPT" -n 128

DFlash run:

GGML_HEXAGON_LM_HEAD=1 \
LD_LIBRARY_PATH=$HEXAGON_LIB_DIR ADSP_LIBRARY_PATH=$HEXAGON_LIB_DIR \
./bin/llama-speculative-simple \
  -m models/Qwen3.5-4B-Q8_0.gguf \
  -md models/Qwen3.5-4B-DFlash-F16.gguf \
  --spec-type draft-dflash --spec-draft-n-max N \
  -dev HTP0 -ngl 99 -devd HTP0 -ngld 99 \
  -c 4096 -b 2048 -ub 1024 \
  -t 6 -tb 6 -td 6 -tbd 6 \
  -C 0xfc --cpu-strict 1 \
  --temp 0 \
  -p "$PROMPT" -n 128
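Since only n_max changes between DFlash runs, the sweep can be scripted. The following is a minimal sketch that prints one command line per n_max value. The flags are copied from the DFlash command above; the helper names `dflash_cmd` and `dflash_env`, the default prompt, and the idea of driving the runs from Python are assumptions made for illustration.

```python
import os
import shlex


def dflash_cmd(n_max: int,
               target: str = "models/Qwen3.5-4B-Q8_0.gguf",
               draft: str = "models/Qwen3.5-4B-DFlash-F16.gguf",
               prompt: str = "Explain the Pythagorean theorem") -> list[str]:
    """Build the argv for one DFlash run; only --spec-draft-n-max varies."""
    return [
        "./bin/llama-speculative-simple",
        "-m", target, "-md", draft,
        "--spec-type", "draft-dflash", "--spec-draft-n-max", str(n_max),
        "-dev", "HTP0", "-ngl", "99", "-devd", "HTP0", "-ngld", "99",
        "-c", "4096", "-b", "2048", "-ub", "1024",
        "-t", "6", "-tb", "6", "-td", "6", "-tbd", "6",
        "-C", "0xfc", "--cpu-strict", "1",
        "--temp", "0",
        "-p", prompt, "-n", "128",
    ]


def dflash_env(lib_dir: str) -> dict[str, str]:
    """Environment for the run: lm-head offload on, Hexagon libraries on the path."""
    return {**os.environ,
            "GGML_HEXAGON_LM_HEAD": "1",
            "LD_LIBRARY_PATH": lib_dir,
            "ADSP_LIBRARY_PATH": lib_dir}


for n_max in range(1, 4):
    # Printed rather than executed, so the sketch runs without a device.
    print(shlex.join(dflash_cmd(n_max)))
```

To actually execute a run, the argv and environment could be passed to `subprocess.run(dflash_cmd(n), env=dflash_env(lib_dir), check=True)` in a shell that has access to the binaries.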

Decode results use 128 generated tokens and temperature 0. Target-only rows use llama-cli. DFlash rows use llama-speculative-simple to report drafted and accepted tokens. n_max was swept from 1 upward for each prompt and backend.

The same runtime settings were used across runs except for n_max and target/draft backend placement. Across the tested prompts, DFlash benefits from running the draft model on HTP when acceptance is reasonable. The peak speedup is 2.51x over target-only decoding (quicksort prompt, Q8_0, HTP0 target and draft, n_max=9). For HTP targets, enabling the lm-head on HTP improved Q8_0 DFlash decode from 15.0 to 17.1 tok/s in the tested n_max=9 configuration.

Summary:

Prompt	Target	Target/Draft	Best n_max	Target-only tok/s	DFlash tok/s	Accept rate	Mean accept len	Speedup
quicksort	Q4_0	HTP0/HTP0	11	10.6	16.535	67.045%	8.375	1.56x
quicksort	Q8_0	HTP0/HTP0	9	8.4	21.095	78.704%	8.083	2.51x
Pythagorean	Q4_0	CPU/HTP0	3	16.7	21.855	72.358%	3.171	1.31x
Pythagorean	Q8_0	HTP0/HTP0	3	8.4	15.753	76.923%	3.308	1.88x
DC trip	Q4_0	HTP0/CPU	1	10.4	9.798	66.667%	1.667	0.94x
DC trip	Q8_0	HTP0/HTP0	2	8.4	9.750	43.478%	1.870	1.16x
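The Speedup and Accept rate columns can be reproduced from the other columns. A short sketch, assuming speedup is DFlash tok/s divided by target-only tok/s on the same target backend, and accept rate is accepted tokens over drafted tokens (the 108 / 85 counts come from the per-prompt quicksort table); the function names are illustrative:

```python
def speedup(dflash_tps: float, target_only_tps: float) -> float:
    """Ratio of DFlash decode speed to target-only decode speed."""
    return dflash_tps / target_only_tps


def accept_rate(accepted: int, drafted: int) -> float:
    """Percentage of drafted tokens that the target accepted."""
    return 100.0 * accepted / drafted


# quicksort, Q8_0, HTP0/HTP0, n_max=9
print(f"{speedup(21.095, 8.4):.2f}x")    # 2.51x
print(f"{accept_rate(85, 108):.3f}%")    # 78.704%

# DC trip, Q4_0, HTP0/CPU, n_max=1: below 1.00x, so DFlash is slower here
print(f"{speedup(9.798, 10.4):.2f}x")    # 0.94x
```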

Q8_0 n_max=9 lm-head placement with HTP0 target and HTP0 draft:

lm-head placement	Env	Decode tok/s
CPU fallback	unset	15.0
HTP0	`GGML_HEXAGON_LM_HEAD=1`	17.1

Full n_max sweep and per-prompt results

Prompt: `Write a quicksort algorithm in Python. Write code only.`

Best results:

Target	Run	Target backend	Draft backend	Best n_max	Decode tok/s	Draft / accepted	Accept rate	Mean accept len	Speedup
Q4_0	target-only	CPU	n/a	n/a	16.8	n/a	n/a	n/a	1.00x
Q4_0	target-only	HTP0	n/a	n/a	10.6	n/a	n/a	n/a	1.00x
Q4_0	DFlash	CPU	CPU	3	22.113	108 / 96	88.889%	3.667	1.32x
Q4_0	DFlash	CPU	HTP0	3	26.259	108 / 96	88.889%	3.667	1.56x
Q4_0	DFlash	HTP0	CPU	7	16.333	147 / 115	78.231%	6.476	1.54x
Q4_0	DFlash	HTP0	HTP0	11	16.535	176 / 118	67.045%	8.375	1.56x
Q8_0	target-only	CPU	n/a	n/a	11.4	n/a	n/a	n/a	1.00x
Q8_0	target-only	HTP0	n/a	n/a	8.4	n/a	n/a	n/a	1.00x
Q8_0	DFlash	CPU	CPU	3	19.860	78 / 71	91.026%	3.731	1.74x
Q8_0	DFlash	CPU	HTP0	3	23.713	78 / 71	91.026%	3.731	2.08x
Q8_0	DFlash	HTP0	CPU	9	17.133	108 / 85	78.704%	8.083	2.04x
Q8_0	DFlash	HTP0	HTP0	9	21.095	108 / 85	78.704%	8.083	2.51x
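"Mean accept len" is not defined in this PR. The reported values are consistent with the reading sketched below, which is an inference from the numbers and not a statement about how llama-speculative-simple computes the metric: assuming every draft round proposes exactly n_max tokens, the number of rounds is drafted / n_max, and the mean accept length is accepted / rounds + 1, where the extra 1 would be the token the target itself contributes per verification step.

```python
def mean_accept_len(drafted: int, accepted: int, n_max: int) -> float:
    """Inferred formula (assumption): accepted tokens per draft round, plus one."""
    rounds = drafted / n_max          # assumes each round drafts exactly n_max tokens
    return accepted / rounds + 1.0


# (label, drafted, accepted, best n_max, reported mean accept len) from the table above
rows = [
    ("Q4_0 CPU/CPU",   108,  96,  3, 3.667),
    ("Q4_0 HTP0/CPU",  147, 115,  7, 6.476),
    ("Q4_0 HTP0/HTP0", 176, 118, 11, 8.375),
    ("Q8_0 CPU/CPU",    78,  71,  3, 3.731),
    ("Q8_0 HTP0/HTP0", 108,  85,  9, 8.083),
]

for label, drafted, accepted, n_max, reported in rows:
    computed = mean_accept_len(drafted, accepted, n_max)
    print(f"{label:15s} computed={computed:.3f} reported={reported:.3f}")
```

Under this reading, a larger n_max raises the mean accept length only as long as the accept rate does not fall faster than n_max grows, which matches the HTP0-target rows peaking at n_max 7 to 11.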

n_max sweep decode tok/s:

Target	Target/Draft	n=1	n=2	n=3	n=4	n=5	n=6	n=7	n=8	n=9	n=10	n=11	n=12	n=13	n=14
Q4_0	CPU/CPU	12.686	12.068	22.113	18.153	18.092	16.159
Q4_0	CPU/HTP0	12.449	12.659	26.259	18.408	17.528	16.262
Q4_0	HTP0/CPU	11.624	12.037	15.148	14.128	14.906	13.519	16.333	14.725	15.462	13.944	15.722	13.813	13.082	11.300
Q4_0	HTP0/HTP0	9.827	10.832	14.912	13.049	12.719	13.103	15.585	11.547	14.568	14.339	16.535	15.192	14.402	13.485
Q8_0	CPU/CPU	9.444	10.611	19.860	17.146	15.374	15.231
Q8_0	CPU/HTP0	10.868	11.981	23.713	19.156	16.794	16.650
Q8_0	HTP0/CPU	10.191	12.564	14.437	13.462	13.790	14.631	14.854	15.197	17.133	15.600
Q8_0	HTP0/HTP0	11.557	15.133	17.585	14.949	15.835	17.735	17.485	18.137	21.095	20.564	19.351	18.572
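The "Best n_max" column is the position of the maximum in each sweep row. A minimal sketch using two rows from the sweep table above; `best_n_max` is an illustrative helper, and it assumes the first value in a row corresponds to n_max=1:

```python
def best_n_max(sweep: list[float]) -> tuple[int, float]:
    """Return (n_max, tok/s) for the fastest entry; sweep[i] is n_max = i + 1."""
    best_idx = max(range(len(sweep)), key=sweep.__getitem__)
    return best_idx + 1, sweep[best_idx]


# quicksort prompt, HTP0 target and HTP0 draft
q4_htp_htp = [9.827, 10.832, 14.912, 13.049, 12.719, 13.103, 15.585,
              11.547, 14.568, 14.339, 16.535, 15.192, 14.402, 13.485]
q8_htp_htp = [11.557, 15.133, 17.585, 14.949, 15.835, 17.735,
              17.485, 18.137, 21.095, 20.564, 19.351, 18.572]

print(best_n_max(q4_htp_htp))  # (11, 16.535)
print(best_n_max(q8_htp_htp))  # (9, 21.095)
```

The rows are not monotonic (both show a local peak at n_max=3), so a sweep that stops at the first drop would miss the later maximum.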

Prompt: `Explain the Pythagorean theorem`

Best results:

Target	Run	Target backend	Draft backend	Best n_max	Decode tok/s	Draft / accepted	Accept rate	Mean accept len	Speedup
Q4_0	target-only	CPU	n/a	n/a	16.7	n/a	n/a	n/a	1.00x
Q4_0	target-only	HTP0	n/a	n/a	10.5	n/a	n/a	n/a	1.00x
Q4_0	DFlash	CPU	CPU	3	16.312	129 / 86	66.667%	3.000	0.98x
Q4_0	DFlash	CPU	HTP0	3	21.855	123 / 89	72.358%	3.171	1.31x
Q4_0	DFlash	HTP0	CPU	3	11.336	132 / 87	65.909%	2.977	1.08x
Q4_0	DFlash	HTP0	HTP0	3	13.431	120 / 90	75.000%	3.250	1.28x
Q8_0	target-only	CPU	n/a	n/a	11.3	n/a	n/a	n/a	1.00x
Q8_0	target-only	HTP0	n/a	n/a	8.4	n/a	n/a	n/a	1.00x
Q8_0	DFlash	CPU	CPU	3	17.193	123 / 91	73.984%	3.220	1.52x
Q8_0	DFlash	CPU	HTP0	3	20.913	123 / 91	73.984%	3.220	1.85x
Q8_0	DFlash	HTP0	CPU	3	12.805	117 / 90	76.923%	3.308	1.52x
Q8_0	DFlash	HTP0	HTP0	3	15.753	117 / 90	76.923%	3.308	1.88x

n_max sweep decode tok/s:

Target	Target/Draft	n=1	n=2	n=3	n=4	n=5	n=6	n=7	n=8	n=9	n=10	n=11	n=12
Q4_0	CPU/CPU	11.304	10.628	16.312	12.122	11.433	8.807
Q4_0	CPU/HTP0	11.667	11.083	21.855	14.348	13.026	11.942
Q4_0	HTP0/CPU	10.816	9.876	11.336	9.737	9.583	8.115
Q4_0	HTP0/HTP0	9.290	9.392	13.431	10.251	10.056	9.500
Q8_0	CPU/CPU	9.922	10.082	17.193	14.179	11.956	9.428
Q8_0	CPU/HTP0	10.037	9.838	20.913	14.106	11.958	10.766
Q8_0	HTP0/CPU	9.790	11.425	12.805	11.031	10.683	10.238
Q8_0	HTP0/HTP0	11.016	13.458	15.753	12.342	12.112	12.204	11.630	10.895	11.428	11.153	10.972	10.384

Prompt: `Plan a 1 day trip to DC`

Best results:

Target	Run	Target backend	Draft backend	Best n_max	Decode tok/s	Draft / accepted	Accept rate	Mean accept len	Speedup
Q4_0	target-only	CPU	n/a	n/a	16.8	n/a	n/a	n/a	1.00x
Q4_0	target-only	HTP0	n/a	n/a	10.4	n/a	n/a	n/a	1.00x
Q4_0	DFlash	CPU	CPU	3	11.232	192 / 65	33.854%	2.016	0.67x
Q4_0	DFlash	CPU	HTP0	3	14.999	183 / 69	37.705%	2.131	0.89x
Q4_0	DFlash	HTP0	CPU	1	9.798	78 / 52	66.667%	1.667	0.94x
Q4_0	DFlash	HTP0	HTP0	3	9.270	171 / 73	42.690%	2.281	0.89x
Q8_0	target-only	CPU	n/a	n/a	11.4	n/a	n/a	n/a	1.00x
Q8_0	target-only	HTP0	n/a	n/a	8.4	n/a	n/a	n/a	1.00x
Q8_0	DFlash	CPU	CPU	3	10.592	195 / 64	32.821%	1.985	0.93x
Q8_0	DFlash	CPU	HTP0	3	12.835	195 / 64	32.821%	1.985	1.13x
Q8_0	DFlash	HTP0	CPU	2	8.231	138 / 60	43.478%	1.870	0.98x
Q8_0	DFlash	HTP0	HTP0	2	9.750	138 / 60	43.478%	1.870	1.16x

n_max sweep decode tok/s:

Target	Target/Draft	n=1	n=2	n=3	n=4	n=5	n=6
Q4_0	CPU/CPU	9.487	8.102	11.232	7.330	6.658	5.370
Q4_0	CPU/HTP0	9.528	8.177	14.999	9.344	7.881	6.680
Q4_0	HTP0/CPU	9.798	7.661	8.253	6.211	5.972	5.105
Q4_0	HTP0/HTP0	8.701	7.398	9.270	6.783	6.230	5.432
Q8_0	CPU/CPU	8.428	7.222	10.592	7.899	6.538	5.216
Q8_0	CPU/HTP0	8.176	7.206	12.835	8.309	6.640	5.636
Q8_0	HTP0/CPU	7.466	8.231	7.220	6.296	5.909
Q8_0	HTP0/HTP0	8.658	9.750	8.544	7.040	6.707

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. AI assisted with summarizing benchmark results.

hexagon: allow dflash lm-head offload experiment

9e25eff

Salanfeng requested review from a team and CISC as code owners June 30, 2026 09:44

github-actions Bot added the ggml and Hexagon labels Jun 30, 2026

Salanfeng force-pushed the hexagon-dflash branch from 9e25eff to 904876f July 1, 2026 14:31

Salanfeng requested a review from ggerganov as a code owner July 1, 2026 14:31

dflash: fix speculative simple Hexagon path

8e84bbb

github-actions Bot added the model and examples labels Jul 1, 2026

Salanfeng force-pushed the hexagon-dflash branch from 904876f to 8e84bbb July 1, 2026 14:32

Salanfeng wants to merge 2 commits into ggml-org:master from Salanfeng:hexagon-dflash
