Is your feature request related to a problem? Please describe.
In multi-turn conversations that include images, each image is re-encoded on every turn. This slows down inference considerably, especially when running purely on CPU.
Describe the solution you'd like
Once an image has been encoded, the result should be cached and reused on later turns, rather than re-encoding the image every turn.
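To illustrate the idea, here is a minimal sketch of what I mean, not the actual implementation. It keys a cache on a hash of the raw image bytes, so an image seen on an earlier turn skips the encoder. `ImageEmbeddingCache` and `fake_encoder` are hypothetical names made up for this example; the real encoder and its output format would come from the project's multimodal code.

```python
import hashlib
from typing import Callable, Dict, List


class ImageEmbeddingCache:
    """Hypothetical cache mapping a hash of the image bytes to its embedding."""

    def __init__(self, encode_fn: Callable[[bytes], List[float]]):
        self._encode_fn = encode_fn          # the expensive image encoder
        self._store: Dict[str, List[float]] = {}
        self.misses = 0                      # how many times the encoder ran

    def get(self, image_bytes: bytes) -> List[float]:
        # Identical bytes give an identical key, so repeated turns hit the cache.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode_fn(image_bytes)
        return self._store[key]


def fake_encoder(image_bytes: bytes) -> List[float]:
    # Stand-in for the real (slow) image encoder; assumption for illustration only.
    return [float(b) for b in image_bytes[:4]]


cache = ImageEmbeddingCache(fake_encoder)
turn_1 = cache.get(b"\x01\x02\x03\x04image-data")  # encoder runs
turn_2 = cache.get(b"\x01\x02\x03\x04image-data")  # reused, encoder skipped
```

A real version would also need an eviction policy and would have to keep the cached embeddings consistent with whatever is held in the KV cache, which this sketch does not attempt.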
Describe alternatives you've considered
Storing a detailed description of the image in the conversation history, and dropping the image_url part. However, this is less flexible: a later question about something specific that wasn't captured in the saved description can't be answered.
Additional context
I have implemented a solution locally by editing the source code (with some heavy AI assistance, as I am not an expert on low-level LLM coding). I seem to have this feature working by storing chunk signatures instead of immediately processing and manipulating the kv_cache_seq_rm, but I'm not sure whether it's a safe or sustainable approach.
I could share the source code or raise a draft PR if it'd be useful.
Thanks!