CUDA: faster Deepseek FA, add Turing support #13435
Conversation
Yeah, I'll try and have a look at this next week. |
@JohannesGaessler Is there any advantage to making Q contiguous by doing this alternative permutation order: https://github.com/ggml-org/llama.cpp/compare/master...jukofyork:llama.cpp:alt-perm-mla?expand=1 which then gets undone here: line 1338 in 62d4250
This, along with the 2D view, improved the non-FA performance, but I'm not sure if it could also help with the FA version. |
For CUDA FA I think it doesn't matter because each CUDA block loads the Q data from VRAM exactly once. For the CPU code I don't know because there is no explicit allocation of SRAM. |
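To make the "loaded exactly once" point concrete, here is a deliberately simplified sketch (illustrative names and layout only, not the actual fattn kernel: one Q row per block, only the QK^T scores, no softmax or V accumulation; launch with `D*sizeof(float)` bytes of dynamic shared memory):

```cuda
// Deliberately simplified: each block caches one Q row in shared memory once and
// reuses it for every KV position, so a strided (non-contiguous) global load of Q
// is only paid a single time per block.
__global__ void fa_q_reuse_sketch(const float * Q, const float * K, float * S,
                                  const int D, const int n_kv, const int stride_q) {
    extern __shared__ float Q_sh[]; // D floats of dynamic shared memory per block

    // Load this block's Q row from VRAM exactly once (stride_q may be != D).
    for (int i = threadIdx.x; i < D; i += blockDim.x) {
        Q_sh[i] = Q[(size_t) blockIdx.x*stride_q + i];
    }
    __syncthreads();

    // Reuse the cached Q row for all KV positions.
    for (int j = threadIdx.x; j < n_kv; j += blockDim.x) {
        float sum = 0.0f;
        for (int i = 0; i < D; ++i) {
            sum += Q_sh[i]*K[(size_t) j*D + i];
        }
        S[(size_t) blockIdx.x*n_kv + j] = sum;
    }
}
```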
So I'm just looking at this now and could do with some input from @ggerganov on how best to proceed here, I think. If we go back to disabling context shifting:

```cpp
cache.can_shift = !(model.arch == LLM_ARCH_DEEPSEEK2 && cparams.flash_attn);
```

then it looks very easy to do:

```cpp
ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, model.arch == LLM_ARCH_DEEPSEEK2 && cparams.flash_attn ? 0 : n_embd_v_gqa*kv_size);
```

and then similarly to not copy it here:

```cpp
if (!v_trans) {
    v_cache_view = ggml_view_1d(ctx0, kv_self->v_l[il], n_tokens*n_embd_v_gqa, ggml_row_size(kv_self->v_l[il]->type, n_embd_v_gqa)*kv_head);
} else if (!(model.arch == LLM_ARCH_DEEPSEEK2 && cparams.flash_attn)) {
    // note: the V cache is transposed when not using flash attention
    v_cache_view = ggml_view_2d(ctx0, kv_self->v_l[il], n_tokens, n_embd_v_gqa,
            (  n_ctx)*ggml_element_size(kv_self->v_l[il]),
            (kv_head)*ggml_element_size(kv_self->v_l[il]));

    v_cur = ggml_transpose(ctx0, v_cur);
}
```

(I've not run this code so it may have typos, but the general idea should be this, I think.)

The beauty of the "treat MLA as MQA with large heads" method was that all the context-shifting stuff "just worked", but I fear this could end up a huge mess with lots of special cases in due to having to alter every |
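For context, the reason the V cache can be empty at all here is that, with the MLA-as-MQA layout, the "V" values are just the first `kv_lora_rank` elements of each cached K entry, so the FA path can take V as a view into the K cache. A rough sketch of that idea (not the actual code of either PR; the tensor names and the `n_embd_k_gqa = kv_lora_rank + n_rot` layout are assumptions):

```cpp
// Rough sketch, not the PR's code: build V for the FA path as a view of the K cache
// instead of reading a separate V cache. Assumes the per-layer K cache stores
// n_embd_k_gqa (= kv_lora_rank + n_rot) values per cached token, with the
// kv_lora_rank "value" part stored first.
ggml_tensor * v = ggml_view_3d(ctx0, kv_self->k_l[il],
        kv_lora_rank, n_kv, 1,
        ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),       // nb1: stride between cached tokens
        ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa)*n_kv,  // nb2: single KV "head"
        0);                                                        // offset 0: V is the prefix of K
```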
The FA issues on master seem to have been sorted out and in terms of debugging I think this PR would now be fine to merge. |
Testing this PR on top of the latest commits and all is working fine. I get about 10-11% better perf on DeepSeek 14GB (Q5_K_M) on a single 5090, and about 1% on DeepSeek V3 0324 with CUDA + CPU (Q2_K_XL/IQ3_XXS). |
@JohannesGaessler Is it expected that speculative sampling performance is worse for the FA kernels now compared to non-FA? I can't remember the exact test case I was running, but for coding tasks I was getting around a 2x increase from speculative sampling before (eg: 5.5 tokens/s --> 10.5 tokens/s). I have now got up to 6.0 tokens/s using this PR, but haven't managed to get more than about 7.5 tokens/s when using speculative sampling. Is this expected (eg: due to the performance of the very small batches speculative sampling uses)? |
Looking at the numbers I posted in the OP again, while the performance did increase for all batch sizes, the performance for batch size 1 (16 Q columns/CUDA block) now seems to be better than with batch size 2 (32 Q columns/CUDA block). So far the kernel selection logic always just picks the largest possible tile sizes, but I'll check whether this is suboptimal here. |
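To illustrate what "picks the largest possible tile sizes" means (purely illustrative logic, not the actual dispatch code, and the set of supported tile sizes is an assumption):

```cpp
// Purely illustrative: choose the number of Q columns per CUDA block as the largest
// supported tile the available Q columns (sequences x heads) can fill. The question
// above is whether "largest possible" is actually optimal for very small batches.
static int pick_cols_per_block(const int n_q_columns) {
    const int supported[] = {64, 32, 16, 8};
    for (const int cols : supported) {
        if (n_q_columns >= cols) {
            return cols;
        }
    }
    return 8; // fall back to the smallest supported tile
}
```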
#13529 now works with this PR to give the 47% reduction in KV-cache size (less than 11GB for the full 160k tokens!). I don't feel confident making all the changes needed to get context shifting working with the empty V cache, so I have just disabled it for now. |
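For anyone wondering where those numbers come from, a rough back-of-the-envelope check (treating the constants as assumptions: kv_lora_rank = 512 plus 64 RoPE dims per cached key, 61 layers, F16 cache):

```cpp
#include <cstdio>

int main() {
    const double n_ctx   = 160*1024;  // "the full 160k tokens"
    const double n_layer = 61;        // assumed DeepSeek R1 layer count
    const double k_dims  = 512 + 64;  // compressed KV (kv_lora_rank) + RoPE part, cached as "K"
    const double v_dims  = 512;       // what a separate V cache would duplicate
    const double bytes   = 2;         // F16 cache

    const double gib     = 1024.0*1024.0*1024.0;
    const double k_only  = n_ctx*n_layer*k_dims*bytes;            // with the empty V cache
    const double k_and_v = n_ctx*n_layer*(k_dims + v_dims)*bytes; // with a separate V cache

    printf("K-only cache: %.1f GiB\n", k_only/gib);                 // ~10.7 GiB ("less than 11GB")
    printf("K + V cache : %.1f GiB\n", k_and_v/gib);                // ~20.3 GiB
    printf("reduction   : %.0f%%\n", 100.0*(1.0 - k_only/k_and_v)); // ~47%
    return 0;
}
```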
There seems to be a hard limit of |
Probably the same problem as with #13384; please tell me the exact model and command you're using. |
```bash
#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
~/llama.cpp/build/bin/llama-server \
    --host "$host_address" \
    --port "$port_number" \
    --alias "deepseek-r1" \
    --chat-template deepseek3 \
    --temp 0.6 \
    --min-p 0.05 \
    --model ~/models/gguf/deepseek-r1-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    --numa distribute \
    --threads 80 \
    --override-tensor exps=CPU \
    --flash-attn \
    --ctx_size 65536 \
    --batch-size 8191 \
    --ubatch-size 8191
```

This works but
Is |
It's those. I also hack `const int min_batch_size = 2048`, as this is the break-even point where pulling through the |
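Presumably this refers to the batch-size threshold the CUDA backend uses to decide whether to run an op whose weights live in host memory (the experts kept on the CPU via `-ot exps=CPU`) on the GPU anyway; a simplified sketch of the effect of raising it (not a verbatim copy of the backend code, and the small default value is stated from memory):

```cpp
#include <cstdint>

// Simplified sketch of the hack being described: only ship host-resident weights
// (e.g. the experts kept on the CPU) across PCIe and run the op on the GPU when the
// batch is large enough to amortize the transfer; below the threshold, keep the op
// on the CPU. Raising min_batch_size to ~2048 moves the break-even point so that
// only large prompt-processing batches get offloaded.
static bool should_offload_to_gpu(const int64_t n_batch_tokens) {
    const int64_t min_batch_size = 2048; // hacked up from the (much smaller) default
    return n_batch_tokens >= min_batch_size;
}
```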
Also, the version I'm using is just the current master, but with this and the other PR merged:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/13435/head:pr-13435
git fetch origin pull/13529/head:pr-13529
git merge pr-13435 --no-edit
git merge pr-13529 --no-edit
```

so I should have the fix that helped whatever problem was solved for @danielhanchen in #13384 |
By "same problem" I meant the same problem on a conceptual level, but for a different kernel. |
Please confirm whether #13537 fixes the issue. |
With this fix and
This is with the experts offloaded (as |
Using:

```bash
#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
~/llama.cpp/build/bin/llama-server \
    --host "$host_address" \
    --port "$port_number" \
    --alias "deepseek-r1" \
    --chat-template deepseek3 \
    --temp 0.6 \
    --min-p 0.05 \
    --model ~/models/gguf/deepseek-r1-Q4_K_XL.gguf \
    --n-gpu-layers 99 \
    --numa distribute \
    --threads 80 \
    --override-tensor exps=CPU \
    --flash-attn \
    --ctx_size 65536 \
    --batch-size 10240 \
    --ubatch-size 10240 \
    --override-kv "deepseek2.expert_used_count=int:6" \
    --override-kv "deepseek2.expert_weights_scale=float:2.35"
```

The reason I settled on using The reason for scaling |
100 t/s PP on Q4_K_XL is really, really impressive while running most of the model in RAM. How much RAM and bandwidth do you have on your system? |
The machine I'm running this on has The biggest limiting factor before this PR was the fact that it uses |
Here's another test (but using all 8 experts as I'm not convinced it's worth the drop in quality using 6...): Using the
I should add that I've found using between I think this is because Interestingly though, setting much closer to the original / pre-YaRN training context of |
I see, that's a lot of RAM! After this PR got merged, I tested with DeepSeek V3 0324 on a consumer CPU (7800X3D, 192GB RAM) and 5090 + 4090x2 + 3090 + A6000 at X8/X4/X4/X4/X4 PCIe, so there's a huge bottleneck there. CPU RAM bandwidth is about 64 GB/s at 6000 MHz (limited by 1 CCD). On IQ3_XXS, running with
I get
While on Q3_K_XL, running with
I get
So despite Q3_K_XL being bigger and using one less layer on the GPU, it is faster. Maybe that is related to what you mentioned a bit above about IQ quants. |
Yeah, I think the other |
This PR optimizes the CUDA FlashAttention kernel for better Deepseek performance. Since there seem to still be issues with my last FA PR on master, I would suggest only merging this PR after they are sorted out.
`ggml_cuda_unroll`. However, the code then only compiles if a compile flag for relaxed constexpr is used. Notably, this compile flag is declared experimental with no guarantee of being forwards compatible. Since the compiler was able to do the correct loop optimization with the code as it is now, I would suggest keeping it that way.
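For readers unfamiliar with the helper mentioned above, a compile-time unroll utility of this general shape, which recursively expands `f(0) ... f(n-1)`, is the kind of thing being referred to (illustrative sketch only, not the actual `ggml_cuda_unroll`):

```cuda
// Illustrative sketch only (not the actual ggml_cuda_unroll): a recursive functor
// that expands f(0), f(1), ..., f(n-1) at compile time so the loop is fully unrolled.
template <int n>
struct unroll_sketch {
    template <typename Func>
    __device__ __forceinline__ void operator()(const Func & f) const {
        unroll_sketch<n - 1>{}(f);
        f(n - 1);
    }
};

template <>
struct unroll_sketch<1> {
    template <typename Func>
    __device__ __forceinline__ void operator()(const Func & f) const {
        f(0);
    }
};
```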
Performance
On my RTX 3090 and 4090, `-fa` now seems to universally improve performance starting at a context size of ~2000 tokens; previously this was at ~6000 tokens.