wolttam • yesterday at 1:31 PM • 0 replies • view on HN

This is an efficiency improvement that significantly reduces the amount of memory you have to read, on average, for each token during decode.

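It should improve performance on most hardware because most LLMs are memory-bandwidth-bound during decode: generating each token means streaming the relevant weights and cache out of memory, so reading fewer bytes per token translates fairly directly into faster decode.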
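
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes decode is limited purely by memory bandwidth, so tokens per second is at most bandwidth divided by bytes read per token. The function name and every figure below are hypothetical and chosen only for illustration; they are not measurements of any real model or device.

```python
def decode_tokens_per_second(bytes_read_per_token: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    if bytes_read_per_token <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical figures, for illustration only.
BANDWIDTH = 100e9        # assumed 100 GB/s of memory bandwidth
BASELINE_BYTES = 10e9    # assumed 10 GB read per token without the improvement
REDUCED_BYTES = 2.5e9    # assumed 2.5 GB read per token with it

baseline_tps = decode_tokens_per_second(BASELINE_BYTES, BANDWIDTH)
reduced_tps = decode_tokens_per_second(REDUCED_BYTES, BANDWIDTH)
speedup = reduced_tps / baseline_tps

print(f"baseline: {baseline_tps:.1f} tok/s")
print(f"reduced:  {reduced_tps:.1f} tok/s")
print(f"speedup:  {speedup:.1f}x")
```

Under this simple model the speedup is just the ratio of bytes read per token, regardless of the bandwidth figure. Real hardware will fall short of the bound once compute, cache effects, or other overheads start to matter.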
