edg5000 • today at 8:35 AM • 1 reply

So limiting max context length also reduces VRAM needs a bit? If the KV cache is 20% of total memory, capping context at 1/10th of the original would shrink the cache by 90%, which is an 18% reduction in total memory.
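
The arithmetic above can be checked with a minimal sketch. It assumes the cache scales linearly with the context limit and that everything else (weights, activations) stays fixed; the function name and parameters are made up for illustration.

```python
def total_memory_reduction(cache_fraction: float, context_fraction: float) -> float:
    """Fraction of total memory saved when the cache shrinks linearly with context.

    cache_fraction:   share of total memory used by the cache at full context.
    context_fraction: new context limit as a share of the original limit.
    Assumption: all non-cache memory is unchanged.
    """
    if not (0.0 <= cache_fraction <= 1.0 and 0.0 <= context_fraction <= 1.0):
        raise ValueError("fractions must be in [0, 1]")
    # Only the cache shrinks, and it shrinks by (1 - context_fraction).
    return cache_fraction * (1.0 - context_fraction)


saving = total_memory_reduction(cache_fraction=0.20, context_fraction=0.10)
print(f"{saving:.0%}")  # prints "18%"
```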

Replies

valine • today at 8:44 AM

Yup, exactly. In principle it helps both with inference speed, by reducing memory bandwidth usage, and with the memory footprint of your KV cache.
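
To make the footprint part concrete, here is a rough sketch of how KV cache size grows with context. It assumes a standard attention layout where every layer stores one key and one value vector per KV head per token, with no cache quantization, sliding window, or paging overhead. The model dimensions below are hypothetical, not taken from any particular model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_elem: int = 2,  # assumption: fp16/bf16 cache
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes for a given context length."""
    # Factor of 2 covers keys and values; size is linear in context_len.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch_size


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
short = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)

print(f"full context:  {full / 2**30:.2f} GiB")
print(f"short context: {short / 2**30:.2f} GiB")
print(f"ratio: {short / full:.3f}")
```

Because the size is linear in `context_len`, cutting the limit by some factor cuts the cache by the same factor. The same applies to the bytes read from the cache per generated token, which is where the bandwidth saving comes from when the cache is actually full.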
