
Adding PagedAttention support for CausalLM models #982

Open
vaibverm wants to merge 12 commits into quic:main from vaibverm:PR_branch


@vaibverm
Contributor

This PR adds PagedAttention (https://arxiv.org/pdf/2309.06180) support for all CausalLM models in QEfficient.
The major change is that the KV cache is no longer treated as one contiguous region of memory. Instead, it is a collection of blocks that may reside non-contiguously in memory. As a result, cache scatter and gather operations are performed per KV block.
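
To make the per-block scatter and gather concrete, here is a minimal toy sketch in plain Python. It is not the QEfficient implementation: tokens are plain integers, there is a single head and no head dimension, and the names `pool`, `scatter_block`, and `gather_sequence` are hypothetical, chosen only for illustration.

```python
# Toy model (assumption): one integer per cached token, one head, no head_dim.
KV_BLOCK_SIZE = 4


def scatter_block(pool, block_id, offset, values):
    """Write `values` into a single physical block, starting at `offset`."""
    if offset + len(values) > KV_BLOCK_SIZE:
        raise ValueError("a write must stay inside one block")
    pool[block_id][offset:offset + len(values)] = values


def gather_sequence(pool, block_ids, seq_len):
    """Rebuild a logical sequence by visiting its blocks in logical order."""
    out = []
    for block_id in block_ids:
        out.extend(pool[block_id])
    return out[:seq_len]


# Six physical blocks; the sequence below uses blocks 5 and 2, in that order,
# so its storage is neither adjacent nor in ascending order.
pool = [[None] * KV_BLOCK_SIZE for _ in range(6)]
block_ids = [5, 2]

scatter_block(pool, 5, 0, [10, 11, 12, 13])  # fills logical block 0
scatter_block(pool, 2, 0, [14, 15])          # partially fills logical block 1

gathered = gather_sequence(pool, block_ids, seq_len=6)
```

The logical order of the sequence comes only from the order of `block_ids`, not from where the blocks sit in `pool`.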

Summary of changes compared to BlockedKV:

  1. The cache shape changes from [BS, num_kv_heads, CL, dh] to [total_num_kv_blocks, num_kv_heads, kv_block_size, dh].
  2. num_kv_blocks = -(-ctx_len // kv_block_size), i.e. ceil(ctx_len / kv_block_size) = physical blocks required for 1 batch element in the K cache.
  3. total_num_kv_blocks = BS (kv_batch_size) * num_kv_blocks = total physical blocks available for the K cache.
  4. Two new inputs, block_table [BS, num_kv_blocks] and slot_id [BS], are passed to the ONNX model.
    a) Each entry in block_table is a block_id that points to a physical K/V block. The block to read/write for a given position is the one at the (position_id // kv_block_size)th entry of block_table. A value of -1 signifies an invalid/unallocated block.
    b) slot_id tells how many entries are already filled in the currently active block => read up to / write after (slot_id - 1).
  5. Limitation: cache writes go to only 1 block at a time per batch element => CPL = kv_block_size. Hence, a cache write must not cross a block boundary.
  6. vLLM provides a KV Cache Manager implementation that maintains the KV cache block_table (logical-to-physical block mapping) and the slot_id (location within the active block).
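
The arithmetic in the list above can be sketched in a few lines of Python. This is an illustrative reading of the description, not code from the PR: the function names are hypothetical, and treating `slot_id` as the offset of the next write inside the active block is an assumption based on item 4.

```python
def num_kv_blocks(ctx_len, kv_block_size):
    """Ceiling division: blocks needed for one batch element."""
    return -(-ctx_len // kv_block_size)


def paged_cache_shape(batch_size, num_kv_heads, ctx_len, kv_block_size, head_dim):
    """Shape [total_num_kv_blocks, num_kv_heads, kv_block_size, dh]."""
    total_num_kv_blocks = batch_size * num_kv_blocks(ctx_len, kv_block_size)
    return (total_num_kv_blocks, num_kv_heads, kv_block_size, head_dim)


def lookup_block_id(block_table_row, position_id, kv_block_size):
    """Map a position to its physical block via one row of block_table."""
    block_id = block_table_row[position_id // kv_block_size]
    if block_id == -1:
        raise ValueError("position falls in an invalid/unallocated block")
    return block_id


def write_fits_in_block(slot_id, num_new_tokens, kv_block_size):
    """Assumption: writes start at offset slot_id and may not cross the block."""
    return slot_id + num_new_tokens <= kv_block_size


# Example values (made up for illustration).
blocks = num_kv_blocks(ctx_len=130, kv_block_size=32)
shape = paged_cache_shape(batch_size=2, num_kv_heads=8, ctx_len=130,
                          kv_block_size=32, head_dim=64)
row = [7, 3, -1, -1, -1]  # one batch element's row of block_table
block_for_pos_40 = lookup_block_id(row, position_id=40, kv_block_size=32)
```

With these example values, a context of 130 tokens needs 5 blocks of 32, and position 40 falls in logical block 1, which the row maps to physical block 3.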

vaibverm and others added 11 commits May 13, 2026 14:46
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
changed the code from doing the exact same math repeatedly.

Signed-off-by: Anuj Gupta <anujgupt@qti.qualcomm.com>
Signed-off-by: Vaibhav Verma <vaibverm@qti.qualcomm.com>
Signed-off-by: vaibverm <vaibverm@qti.qualcomm.com>
@anujgupt-github anujgupt-github added the enhancement New feature or request label May 20, 2026
