
hexagon: allow dflash lm-head offload experiment #25166

Open: Salanfeng wants to merge 2 commits into ggml-org:master from Salanfeng:hexagon-dflash

@Salanfeng commented Jun 30, 2026


Overview

This PR fixes several Hexagon issues that prevented DFlash from working correctly and improves speculative decoding support on Hexagon.

  • guard optional rotary inputs that may exist without a backend buffer in hybrid/iSWA graphs
  • add opt-in HTP lm-head placement for large quantized models used by DFlash verification
  • make llama-speculative-simple use the common speculative path, matching server behavior for DFlash/EAGLE/MTP-style drafts
  • handle tied embedding/lm-head fallback when DFlash shares target tensors through ctx_other

Additional information

Tested on an SM8750 Android device with Qwen3.5-4B Q8_0 / Q4_0 target GGUFs and a Qwen3.5-4B DFlash F16 draft GGUF.

HTP lm-head placement is opt-in via GGML_HEXAGON_LM_HEAD=1. This preserves existing behavior until large lm-head kernels are validated across more Hexagon deployments.

When embeddings are tied, the output tensor may point to an HTP-only repacked buffer while tok_embd still refers to the original CPU-accessible tensor. The tied fallback therefore uses tok_embd, which avoids building CPU draft graphs with HTP-only buffers.
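The opt-in check and the tied fallback described above can be sketched in a few lines. This is an illustrative Python model only, not the C++ in this PR: the `Tensor` class, the `cpu_accessible` flag, and both function names are hypothetical, and treating only the literal value `1` as enabling is an assumption about how the variable is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tensor:
    """Hypothetical stand-in for a model tensor; not a ggml type."""
    name: str
    cpu_accessible: bool  # False models an HTP-only repacked buffer


def lm_head_on_htp(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: only the literal value "1" opts in; the real parsing may differ.
    return env.get("GGML_HEXAGON_LM_HEAD") == "1"


def draft_lm_head(output: Optional[Tensor], tok_embd: Tensor, tied: bool) -> Tensor:
    # With tied embeddings, `output` may live in an HTP-only repacked buffer,
    # so fall back to tok_embd, which the CPU can still read.
    if tied or output is None:
        return tok_embd
    return output


tok_embd = Tensor("tok_embd", cpu_accessible=True)
output = Tensor("output", cpu_accessible=False)
print(lm_head_on_htp({"GGML_HEXAGON_LM_HEAD": "1"}))
print(draft_lm_head(output, tok_embd, tied=True).name)
```

The point of the fallback is that the draft graph never receives a tensor the CPU cannot read, regardless of where the target's lm-head was placed.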

Build and run commands

Build:

cmake -S . -B build-android-hexagon \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-34 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_ARM_ARCH=armv8.7-a+fp16+dotprod+i8mm \
  -DGGML_HEXAGON=ON \
  -DHEXAGON_SDK_ROOT=$HEXAGON_SDK_ROOT \
  -DHEXAGON_TOOLS_ROOT=$HEXAGON_TOOLS_ROOT \
  -DPREBUILT_LIB_DIR=android_aarch64 \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_EXAMPLES=ON

cmake --build build-android-hexagon \
  --target llama-cli llama-bench llama-speculative-simple htp-v79 -j 16

Target-only run:

HEXAGON_LIB_DIR is the directory containing the Hexagon shared libraries.

LD_LIBRARY_PATH=$HEXAGON_LIB_DIR ADSP_LIBRARY_PATH=$HEXAGON_LIB_DIR \
./bin/llama-cli \
  -m models/Qwen3.5-4B-Q8_0.gguf \
  -dev HTP0 -ngl 99 \
  -c 4096 -b 2048 -ub 1024 \
  -t 6 --threads-batch 6 \
  -C 0xfc --cpu-strict 1 --poll 100 \
  --temp 0 -cnv -st --jinja \
  --no-display-prompt --show-timings \
  -p "$PROMPT" -n 128

DFlash run:

GGML_HEXAGON_LM_HEAD=1 \
LD_LIBRARY_PATH=$HEXAGON_LIB_DIR ADSP_LIBRARY_PATH=$HEXAGON_LIB_DIR \
./bin/llama-speculative-simple \
  -m models/Qwen3.5-4B-Q8_0.gguf \
  -md models/Qwen3.5-4B-DFlash-F16.gguf \
  --spec-type draft-dflash --spec-draft-n-max N \
  -dev HTP0 -ngl 99 -devd HTP0 -ngld 99 \
  -c 4096 -b 2048 -ub 1024 \
  -t 6 -tb 6 -td 6 -tbd 6 \
  -C 0xfc --cpu-strict 1 \
  --temp 0 \
  -p "$PROMPT" -n 128

Decode results use 128 generated tokens and temperature 0. Target-only rows use llama-cli. DFlash rows use llama-speculative-simple to report drafted and accepted tokens. n_max was swept from 1 upward for each prompt and backend.

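The same runtime settings were used across runs except for n_max and target/draft backend placement. In most tested configurations, moving the draft model to HTP raised DFlash decode speed, and the gain over target-only decoding tracked acceptance. The peak speedup is 2.51x (quicksort, Q8_0, HTP0/HTP0, n_max=9, mean accept length 8.083). On the DC trip prompt, where mean accept length stays near 2, DFlash reaches at most 1.16x and is often slower than target-only decoding. For HTP targets, enabling the lm-head on HTP improved Q8_0 DFlash decode from 15.0 to 17.1 tok/s in the tested n_max=9 configuration.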

Summary:

| Prompt | Target | Target/Draft | Best n_max | Target-only tok/s | DFlash tok/s | Accept rate | Mean accept len | Speedup |
|---|---|---|---|---|---|---|---|---|
| quicksort | Q4_0 | HTP0/HTP0 | 11 | 10.6 | 16.535 | 67.045% | 8.375 | 1.56x |
| quicksort | Q8_0 | HTP0/HTP0 | 9 | 8.4 | 21.095 | 78.704% | 8.083 | 2.51x |
| Pythagorean | Q4_0 | CPU/HTP0 | 3 | 16.7 | 21.855 | 72.358% | 3.171 | 1.31x |
| Pythagorean | Q8_0 | HTP0/HTP0 | 3 | 8.4 | 15.753 | 76.923% | 3.308 | 1.88x |
| DC trip | Q4_0 | HTP0/CPU | 1 | 10.4 | 9.798 | 66.667% | 1.667 | 0.94x |
| DC trip | Q8_0 | HTP0/HTP0 | 2 | 8.4 | 9.750 | 43.478% | 1.870 | 1.16x |
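As a quick sanity check, the Speedup column in the summary is consistent with DFlash tok/s divided by target-only tok/s for the same target backend. The sketch below recomputes it from the summary rows; the variable and function names are illustrative and not part of the PR.

```python
# (prompt, target, target-only tok/s, DFlash tok/s, reported speedup)
SUMMARY = [
    ("quicksort", "Q4_0", 10.6, 16.535, 1.56),
    ("quicksort", "Q8_0", 8.4, 21.095, 2.51),
    ("Pythagorean", "Q4_0", 16.7, 21.855, 1.31),
    ("Pythagorean", "Q8_0", 8.4, 15.753, 1.88),
    ("DC trip", "Q4_0", 10.4, 9.798, 0.94),
    ("DC trip", "Q8_0", 8.4, 9.750, 1.16),
]


def speedup(target_only_tok_s: float, dflash_tok_s: float) -> float:
    # Ratio of speculative decode throughput to the target-only baseline.
    return dflash_tok_s / target_only_tok_s


for prompt, target, base, dflash, reported in SUMMARY:
    computed = round(speedup(base, dflash), 2)
    print(f"{prompt:12s} {target}: computed {computed:.2f}x, reported {reported:.2f}x")
```

A value below 1.00x, as in the DC trip Q4_0 row, means speculative decoding was slower than running the target alone.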

Q8_0 n_max=9 lm-head placement with HTP0 target and HTP0 draft:

| lm-head placement | Env | Decode tok/s |
|---|---|---|
| CPU fallback | unset | 15.0 |
| HTP0 | GGML_HEXAGON_LM_HEAD=1 | 17.1 |

Full n_max sweep and per-prompt results

Prompt: Write a quicksort algorithm in Python. Write code only.

Best results:

| Target | Run | Target backend | Draft backend | Best n_max | Decode tok/s | Draft / accepted | Accept rate | Mean accept len | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | target-only | CPU | n/a | n/a | 16.8 | n/a | n/a | n/a | 1.00x |
| Q4_0 | target-only | HTP0 | n/a | n/a | 10.6 | n/a | n/a | n/a | 1.00x |
| Q4_0 | DFlash | CPU | CPU | 3 | 22.113 | 108 / 96 | 88.889% | 3.667 | 1.32x |
| Q4_0 | DFlash | CPU | HTP0 | 3 | 26.259 | 108 / 96 | 88.889% | 3.667 | 1.56x |
| Q4_0 | DFlash | HTP0 | CPU | 7 | 16.333 | 147 / 115 | 78.231% | 6.476 | 1.54x |
| Q4_0 | DFlash | HTP0 | HTP0 | 11 | 16.535 | 176 / 118 | 67.045% | 8.375 | 1.56x |
| Q8_0 | target-only | CPU | n/a | n/a | 11.4 | n/a | n/a | n/a | 1.00x |
| Q8_0 | target-only | HTP0 | n/a | n/a | 8.4 | n/a | n/a | n/a | 1.00x |
| Q8_0 | DFlash | CPU | CPU | 3 | 19.860 | 78 / 71 | 91.026% | 3.731 | 1.74x |
| Q8_0 | DFlash | CPU | HTP0 | 3 | 23.713 | 78 / 71 | 91.026% | 3.731 | 2.08x |
| Q8_0 | DFlash | HTP0 | CPU | 9 | 17.133 | 108 / 85 | 78.704% | 8.083 | 2.04x |
| Q8_0 | DFlash | HTP0 | HTP0 | 9 | 21.095 | 108 / 85 | 78.704% | 8.083 | 2.51x |
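The acceptance columns can be reproduced from the drafted and accepted counts. The sketch below is an inference from the reported rows, not a confirmed description of how llama-speculative-simple computes them: it assumes every draft round proposes exactly n_max tokens, takes the accept rate as accepted / drafted, and takes the mean accept length as one target token plus the accepted tokens per round. Function names are illustrative.

```python
def accept_rate_pct(drafted: int, accepted: int) -> float:
    # Share of drafted tokens that the target model accepted, in percent.
    return 100.0 * accepted / drafted


def mean_accept_len(drafted: int, accepted: int, n_max: int) -> float:
    # Assumption: each round drafts exactly n_max tokens, so
    # rounds = drafted / n_max, and each round also yields one target token.
    rounds = drafted / n_max
    return 1.0 + accepted / rounds


# (n_max, drafted, accepted, reported accept rate %, reported mean accept len)
ROWS = [
    (3, 108, 96, 88.889, 3.667),
    (7, 147, 115, 78.231, 6.476),
    (11, 176, 118, 67.045, 8.375),
    (3, 78, 71, 91.026, 3.731),
    (9, 108, 85, 78.704, 8.083),
]

for n_max, drafted, accepted, rate, length in ROWS:
    print(
        f"n_max={n_max:2d}: rate {accept_rate_pct(drafted, accepted):.3f}% "
        f"(reported {rate}), len {mean_accept_len(drafted, accepted, n_max):.3f} "
        f"(reported {length})"
    )
```

Under these assumptions, a higher n_max can raise the mean accept length even while the accept rate falls, which matches the Q4_0 HTP0/HTP0 row (n_max=11, 67.045%, 8.375).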

n_max sweep decode tok/s:

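| Target | Target/Draft | n=1 | n=2 | n=3 | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 | n=10 | n=11 | n=12 | n=13 | n=14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | CPU/CPU | 12.686 | 12.068 | 22.113 | 18.153 | 18.092 | 16.159 | | | | | | | | |
| Q4_0 | CPU/HTP0 | 12.449 | 12.659 | 26.259 | 18.408 | 17.528 | 16.262 | | | | | | | | |
| Q4_0 | HTP0/CPU | 11.624 | 12.037 | 15.148 | 14.128 | 14.906 | 13.519 | 16.333 | 14.725 | 15.462 | 13.944 | 15.722 | 13.813 | 13.082 | 11.300 |
| Q4_0 | HTP0/HTP0 | 9.827 | 10.832 | 14.912 | 13.049 | 12.719 | 13.103 | 15.585 | 11.547 | 14.568 | 14.339 | 16.535 | 15.192 | 14.402 | 13.485 |
| Q8_0 | CPU/CPU | 9.444 | 10.611 | 19.860 | 17.146 | 15.374 | 15.231 | | | | | | | | |
| Q8_0 | CPU/HTP0 | 10.868 | 11.981 | 23.713 | 19.156 | 16.794 | 16.650 | | | | | | | | |
| Q8_0 | HTP0/CPU | 10.191 | 12.564 | 14.437 | 13.462 | 13.790 | 14.631 | 14.854 | 15.197 | 17.133 | 15.600 | | | | |
| Q8_0 | HTP0/HTP0 | 11.557 | 15.133 | 17.585 | 14.949 | 15.835 | 17.735 | 17.485 | 18.137 | 21.095 | 20.564 | 19.351 | 18.572 | | |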
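The "Best n_max" entries are the argmax of each sweep row. A minimal sketch of that selection, using the two HTP0/HTP0 quicksort rows and the HTP0 target-only baselines from this PR (the dictionary layout and function name are illustrative):

```python
# Decode tok/s per n_max for the quicksort prompt; list index 0 is n_max=1.
SWEEP = {
    ("Q4_0", "HTP0/HTP0"): [9.827, 10.832, 14.912, 13.049, 12.719, 13.103, 15.585,
                            11.547, 14.568, 14.339, 16.535, 15.192, 14.402, 13.485],
    ("Q8_0", "HTP0/HTP0"): [11.557, 15.133, 17.585, 14.949, 15.835, 17.735, 17.485,
                            18.137, 21.095, 20.564, 19.351, 18.572],
}
TARGET_ONLY_HTP0 = {"Q4_0": 10.6, "Q8_0": 8.4}  # target-only tok/s on HTP0


def best_n_max(rates):
    # Return (n_max, tok/s) for the fastest configuration in one sweep row.
    best_idx = max(range(len(rates)), key=rates.__getitem__)
    return best_idx + 1, rates[best_idx]


for (quant, placement), rates in SWEEP.items():
    n, tok_s = best_n_max(rates)
    ratio = tok_s / TARGET_ONLY_HTP0[quant]
    print(f"{quant} {placement}: best n_max={n}, {tok_s} tok/s, {ratio:.2f}x")
```

The sweep is not monotonic, so the best value has to be found by measuring each n_max rather than by stopping at the first drop.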

Prompt: Explain the Pythagorean theorem

Best results:

| Target | Run | Target backend | Draft backend | Best n_max | Decode tok/s | Draft / accepted | Accept rate | Mean accept len | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | target-only | CPU | n/a | n/a | 16.7 | n/a | n/a | n/a | 1.00x |
| Q4_0 | target-only | HTP0 | n/a | n/a | 10.5 | n/a | n/a | n/a | 1.00x |
| Q4_0 | DFlash | CPU | CPU | 3 | 16.312 | 129 / 86 | 66.667% | 3.000 | 0.98x |
| Q4_0 | DFlash | CPU | HTP0 | 3 | 21.855 | 123 / 89 | 72.358% | 3.171 | 1.31x |
| Q4_0 | DFlash | HTP0 | CPU | 3 | 11.336 | 132 / 87 | 65.909% | 2.977 | 1.08x |
| Q4_0 | DFlash | HTP0 | HTP0 | 3 | 13.431 | 120 / 90 | 75.000% | 3.250 | 1.28x |
| Q8_0 | target-only | CPU | n/a | n/a | 11.3 | n/a | n/a | n/a | 1.00x |
| Q8_0 | target-only | HTP0 | n/a | n/a | 8.4 | n/a | n/a | n/a | 1.00x |
| Q8_0 | DFlash | CPU | CPU | 3 | 17.193 | 123 / 91 | 73.984% | 3.220 | 1.52x |
| Q8_0 | DFlash | CPU | HTP0 | 3 | 20.913 | 123 / 91 | 73.984% | 3.220 | 1.85x |
| Q8_0 | DFlash | HTP0 | CPU | 3 | 12.805 | 117 / 90 | 76.923% | 3.308 | 1.52x |
| Q8_0 | DFlash | HTP0 | HTP0 | 3 | 15.753 | 117 / 90 | 76.923% | 3.308 | 1.88x |

n_max sweep decode tok/s:

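| Target | Target/Draft | n=1 | n=2 | n=3 | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 | n=10 | n=11 | n=12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | CPU/CPU | 11.304 | 10.628 | 16.312 | 12.122 | 11.433 | 8.807 | | | | | | |
| Q4_0 | CPU/HTP0 | 11.667 | 11.083 | 21.855 | 14.348 | 13.026 | 11.942 | | | | | | |
| Q4_0 | HTP0/CPU | 10.816 | 9.876 | 11.336 | 9.737 | 9.583 | 8.115 | | | | | | |
| Q4_0 | HTP0/HTP0 | 9.290 | 9.392 | 13.431 | 10.251 | 10.056 | 9.500 | | | | | | |
| Q8_0 | CPU/CPU | 9.922 | 10.082 | 17.193 | 14.179 | 11.956 | 9.428 | | | | | | |
| Q8_0 | CPU/HTP0 | 10.037 | 9.838 | 20.913 | 14.106 | 11.958 | 10.766 | | | | | | |
| Q8_0 | HTP0/CPU | 9.790 | 11.425 | 12.805 | 11.031 | 10.683 | 10.238 | | | | | | |
| Q8_0 | HTP0/HTP0 | 11.016 | 13.458 | 15.753 | 12.342 | 12.112 | 12.204 | 11.630 | 10.895 | 11.428 | 11.153 | 10.972 | 10.384 |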

Prompt: Plan a 1 day trip to DC

Best results:

| Target | Run | Target backend | Draft backend | Best n_max | Decode tok/s | Draft / accepted | Accept rate | Mean accept len | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | target-only | CPU | n/a | n/a | 16.8 | n/a | n/a | n/a | 1.00x |
| Q4_0 | target-only | HTP0 | n/a | n/a | 10.4 | n/a | n/a | n/a | 1.00x |
| Q4_0 | DFlash | CPU | CPU | 3 | 11.232 | 192 / 65 | 33.854% | 2.016 | 0.67x |
| Q4_0 | DFlash | CPU | HTP0 | 3 | 14.999 | 183 / 69 | 37.705% | 2.131 | 0.89x |
| Q4_0 | DFlash | HTP0 | CPU | 1 | 9.798 | 78 / 52 | 66.667% | 1.667 | 0.94x |
| Q4_0 | DFlash | HTP0 | HTP0 | 3 | 9.270 | 171 / 73 | 42.690% | 2.281 | 0.89x |
| Q8_0 | target-only | CPU | n/a | n/a | 11.4 | n/a | n/a | n/a | 1.00x |
| Q8_0 | target-only | HTP0 | n/a | n/a | 8.4 | n/a | n/a | n/a | 1.00x |
| Q8_0 | DFlash | CPU | CPU | 3 | 10.592 | 195 / 64 | 32.821% | 1.985 | 0.93x |
| Q8_0 | DFlash | CPU | HTP0 | 3 | 12.835 | 195 / 64 | 32.821% | 1.985 | 1.13x |
| Q8_0 | DFlash | HTP0 | CPU | 2 | 8.231 | 138 / 60 | 43.478% | 1.870 | 0.98x |
| Q8_0 | DFlash | HTP0 | HTP0 | 2 | 9.750 | 138 / 60 | 43.478% | 1.870 | 1.16x |

n_max sweep decode tok/s:

| Target | Target/Draft | n=1 | n=2 | n=3 | n=4 | n=5 | n=6 |
|---|---|---|---|---|---|---|---|
| Q4_0 | CPU/CPU | 9.487 | 8.102 | 11.232 | 7.330 | 6.658 | 5.370 |
| Q4_0 | CPU/HTP0 | 9.528 | 8.177 | 14.999 | 9.344 | 7.881 | 6.680 |
| Q4_0 | HTP0/CPU | 9.798 | 7.661 | 8.253 | 6.211 | 5.972 | 5.105 |
| Q4_0 | HTP0/HTP0 | 8.701 | 7.398 | 9.270 | 6.783 | 6.230 | 5.432 |
| Q8_0 | CPU/CPU | 8.428 | 7.222 | 10.592 | 7.899 | 6.538 | 5.216 |
| Q8_0 | CPU/HTP0 | 8.176 | 7.206 | 12.835 | 8.309 | 6.640 | 5.636 |
| Q8_0 | HTP0/CPU | 7.466 | 8.231 | 7.220 | 6.296 | 5.909 | |
| Q8_0 | HTP0/HTP0 | 8.658 | 9.750 | 8.544 | 7.040 | 6.707 | |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. AI assisted with summarizing benchmark results.

@Salanfeng requested review from a team and CISC as code owners on June 30, 2026 09:44
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels on Jun 30, 2026
@Salanfeng requested a review from ggerganov as a code owner on July 1, 2026 14:31
@github-actions bot added the model (Model specific) and examples labels on Jul 1, 2026