Adding PagedAttention support for CausalLM models by vaibverm · Pull Request #982 · quic/efficient-transformers

vaibverm · 2026-05-13T07:33:30Z

This PR adds PagedAttention (https://arxiv.org/pdf/2309.06180) support for all CausalLM models in QEfficient.
The major change is that the KV cache is no longer treated as one contiguous region of memory but as a collection of blocks that can reside non-contiguously in memory. As a result, cache scatter and gather operations happen per KV block.
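To illustrate what per-block scatter and gather mean, here is a minimal pure-Python sketch. It is not the QEfficient implementation: the cache is modelled as a plain list of fixed-size blocks, and the names `scatter_token`, `gather_tokens`, and `block_ids` are hypothetical and chosen only for this example.

```python
# Illustrative sketch only; names and data layout are assumptions, not QEfficient code.

def scatter_token(cache_blocks, block_ids, position, value, kv_block_size):
    """Write one token's entry into the physical block that owns `position`."""
    physical_id = block_ids[position // kv_block_size]  # logical -> physical block
    if physical_id < 0:
        raise ValueError("position maps to an unallocated block")
    cache_blocks[physical_id][position % kv_block_size] = value


def gather_tokens(cache_blocks, block_ids, num_tokens, kv_block_size):
    """Rebuild the logical token order by reading one physical block at a time."""
    out = []
    for logical_idx, physical_id in enumerate(block_ids):
        remaining = num_tokens - logical_idx * kv_block_size
        if remaining <= 0 or physical_id < 0:
            break  # nothing left to read, or the block is unallocated
        out.extend(cache_blocks[physical_id][: min(remaining, kv_block_size)])
    return out


# Six physical blocks of four slots each; one sequence owns blocks 5 and 2,
# which are neither adjacent nor in ascending order.
cache_blocks = [[None] * 4 for _ in range(6)]
block_ids = [5, 2, -1]
for pos in range(6):
    scatter_token(cache_blocks, block_ids, pos, f"t{pos}", kv_block_size=4)
tokens = gather_tokens(cache_blocks, block_ids, num_tokens=6, kv_block_size=4)
```

The sequence's six tokens end up split across physical blocks 5 and 2, yet the gather returns them in logical order, which is the property the attention computation relies on.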

Summary of changes compared to BlockedKV:

1) The cache shape changes from [BS, num_kv_heads, CL, dh] to [total_num_kv_blocks, num_kv_heads, kv_block_size, dh].
2) num_kv_blocks = -(-ctx_len // kv_block_size), i.e. ceil(ctx_len / kv_block_size): the number of physical blocks required for one batch element in the K cache.
3) total_num_kv_blocks = BS (kv_batch_size) * num_kv_blocks: the total number of physical blocks available for the K cache.
4) Two new inputs, block_table [BS, num_kv_blocks] and slot_id [BS], are passed to the ONNX model.
4) a) block_id is an entry of block_table. The entry at index (position_id // kv_block_size) points to the physical K/V block that needs to be read or written. A value of -1 signifies an invalid/unallocated block.
4) b) slot_id tells how many entries are already filled in the currently active block => read up to / write after (slot_id - 1).
5) Limitation: a cache write touches only 1 block at a time per batch element => CPL = kv_block_size. Hence, cache writes must not cross a block boundary.
6) vLLM provides a KV Cache Manager implementation that maintains the KV cache block_table (the logical-to-physical block mapping) and slot_id (the location within the active block).
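The arithmetic in the list above can be sketched as follows. This is a hedged illustration, not code from the PR: the helper names `num_kv_blocks`, `paged_cache_shape`, and `resolve_write` are hypothetical, and the sketch assumes that `position_id` is the 0-indexed position of the next token to write, so that slot_id equals `position_id % kv_block_size`.

```python
# Illustrative sketch only; helper names are assumptions, not QEfficient APIs.

def num_kv_blocks(ctx_len, kv_block_size):
    """Ceiling division, written as in the PR description."""
    return -(-ctx_len // kv_block_size)


def paged_cache_shape(batch_size, num_kv_heads, ctx_len, kv_block_size, head_dim):
    """[total_num_kv_blocks, num_kv_heads, kv_block_size, dh]."""
    total_num_kv_blocks = batch_size * num_kv_blocks(ctx_len, kv_block_size)
    return (total_num_kv_blocks, num_kv_heads, kv_block_size, head_dim)


def resolve_write(block_table_row, position_id, num_new_tokens, kv_block_size):
    """Return (block_id, slot_id) for a write starting at position_id.

    Assumption: slot_id is the count of entries already filled in the
    active block, i.e. position_id % kv_block_size.
    """
    block_id = block_table_row[position_id // kv_block_size]
    if block_id == -1:
        raise ValueError("block is invalid/unallocated")
    slot_id = position_id % kv_block_size
    if slot_id + num_new_tokens > kv_block_size:
        # Mirrors the stated limitation: one block per write, per batch element.
        raise ValueError("write would cross the block boundary")
    return block_id, slot_id


shape = paged_cache_shape(batch_size=2, num_kv_heads=8, ctx_len=10,
                          kv_block_size=4, head_dim=64)
target = resolve_write([7, 3, -1], position_id=5, num_new_tokens=1,
                       kv_block_size=4)
```

With ctx_len = 10 and kv_block_size = 4, each batch element needs three blocks, so a batch of two yields six physical blocks in total. Position 5 falls in logical block 1, which the example block_table row maps to physical block 3, with one entry already filled.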

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

changed the code from doing the exact same math repeatedly. Signed-off-by: Anuj Gupta <anujgupt@qti.qualcomm.com>

Signed-off-by: vaibverm <vaibverm@qti.qualcomm.com>

vaibverm and others added 11 commits May 13, 2026 14:46

Rebased PagedAttention support with latest Qeff for PR

c6ca357

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added block_table and slot_id inputs + minor modelling_auto.py changes

e23b829

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Working version with PagedAttention

2a71653

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Minor fixes to specialization builder

728a3e1

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added support for Qwen2.5_VL PagedAttention

e5df0a6

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

slot_id fix for Qwen2.5_VL PagedAttention decode

83dbe17

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Added support for Qwen3_VL PagedAttention

d1c6add

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Removed commented code corrected in rebase

2b7ff52

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Minor fix for enum bug

e497ebd

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

Optimize attention blocking nested loops (quic#957)

43ad29b

changed the code from doing the exact same math repeatedly. Signed-off-by: Anuj Gupta <anujgupt@qti.qualcomm.com>

Adding PagedAttention support for Qwen3_VL_MOE

f4eefaa

Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>

vaibverm force-pushed the PR_branch branch from 65bd648 to f4eefaa on May 13, 2026 19:48

Merge branch 'main' into PR_branch

d3d04e7

Signed-off-by: vaibverm <vaibverm@qti.qualcomm.com>

anujgupt-github added the enhancement New feature or request label May 20, 2026

vaibverm wants to merge 12 commits into quic:main from vaibverm:PR_branch
