hexagon: allow dflash lm-head offload experiment #25166
Open
Salanfeng wants to merge 2 commits into
Overview
This PR fixes several Hexagon issues that prevented DFlash from working correctly and improves speculative decoding support on Hexagon.
llama-speculative-simple: use the common speculative path, matching server behavior for DFlash/EAGLE/MTP-style drafts
ctx_other
Additional information
Tested on an SM8750 Android device with Qwen3.5-4B Q8_0 / Q4_0 target GGUFs and a Qwen3.5-4B DFlash F16 draft GGUF.
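The opt-in flag and the tied-embedding fallback discussed in this section can be pictured with a small sketch. It is not the code from this PR: the `Tensor` class, `lm_head_offload_enabled`, and `draft_lm_head` are hypothetical names used only for illustration, and the way the real backend parses `GGML_HEXAGON_LM_HEAD` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Tensor:
    """Hypothetical stand-in for a model tensor (not a ggml type)."""
    name: str
    cpu_accessible: bool


def lm_head_offload_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: the feature is on only when the variable is exactly "1".
    # Unset means off, which keeps the existing default behavior.
    return env.get("GGML_HEXAGON_LM_HEAD", "0") == "1"


def draft_lm_head(output: Optional[Tensor], tok_embd: Tensor, tied: bool) -> Tensor:
    # With tied embeddings, `output` may point at an HTP-only repacked buffer,
    # so the CPU-accessible tok_embd is used for the fallback instead.
    if tied or output is None:
        return tok_embd
    return output


tok_embd = Tensor("token_embd.weight", cpu_accessible=True)
repacked = Tensor("output.weight (repacked)", cpu_accessible=False)
chosen = draft_lm_head(repacked, tok_embd, tied=True)
print(chosen.name, chosen.cpu_accessible)
```

The point of the sketch is only the selection rule: the tied fallback never hands a CPU graph a tensor that lives in an HTP-only buffer.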
GGML_HEXAGON_LM_HEAD=1 is kept opt-in to preserve existing behavior until large lm-head kernels are validated across more Hexagon deployments. When embeddings are tied, the output tensor may point to an HTP-only repacked buffer while tok_embd still refers to the original CPU-accessible tensor. Use tok_embd for the tied fallback to avoid building CPU draft graphs with HTP-only buffers.
Build and run commands
Build:
Target-only run:
HEXAGON_LIB_DIR is the directory containing the Hexagon shared libraries.
DFlash run:
Decode results use 128 generated tokens and temperature 0. Target-only rows use llama-cli. DFlash rows use llama-speculative-simple to report drafted and accepted tokens. n_max was swept from 1 upward for each prompt and backend.
The same runtime settings were used across runs except for n_max and target/draft backend placement. Across the tested prompts, DFlash benefits from running the draft model on HTP when acceptance is reasonable. Peak speedups reach 2.51x over target-only decoding. For HTP targets, enabling the lm-head on HTP improved Q8_0 DFlash decode from 15.0 to 17.1 tok/s in the tested n_max=9 configuration.
Summary:
Q8_0 n_max=9 lm-head placement with HTP0 target and HTP0 draft:
GGML_HEXAGON_LM_HEAD=1
Full n_max sweep and per-prompt results
Prompt: Write a quicksort algorithm in Python. Write code only.
Best results:
n_max sweep decode tok/s:
Prompt: Explain the Pythagorean theorem
Best results:
n_max sweep decode tok/s:
Prompt: Plan a 1 day trip to DC
Best results:
n_max sweep decode tok/s:
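For readers interpreting the drafted/accepted counts and speedup figures quoted in this description, the sketch below shows one common way such numbers relate under greedy (temperature 0) verification. It is an illustrative assumption, not the accounting used by llama-speculative-simple: the helper names are hypothetical, and the only measured values used are the 15.0 and 17.1 tok/s figures stated in the text.

```python
def count_accepted(draft_tokens, target_tokens):
    """Greedy verification: accept draft tokens until the first mismatch
    with the tokens the target model would have produced."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n


def acceptance_rate(n_drafted, n_accepted):
    """Fraction of drafted tokens that the target accepted."""
    return n_accepted / n_drafted if n_drafted else 0.0


def speedup(tok_s, baseline_tok_s):
    """Ratio of decode throughput against a baseline run."""
    return tok_s / baseline_tok_s


# Toy token IDs (made up): the third drafted token disagrees with the target.
print(count_accepted([1, 2, 3, 4], [1, 2, 9, 4]))  # 2

# lm-head comparison quoted in the text: 15.0 -> 17.1 tok/s at n_max=9.
print(round(speedup(17.1, 15.0), 2))  # 1.14
```

A larger n_max drafts more tokens per step, so it only pays off when the acceptance rate stays high enough to offset the extra draft work, which is consistent with the note that DFlash benefits "when acceptance is reasonable".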
Requirements