Hacker News

danielabinav160 yesterday at 10:59 AM

Would love to see these numbers reproduced on consumer GPUs, not just A100s.


Replies

wolttam yesterday at 1:31 PM

This is an efficiency improvement that significantly lowers the amount of RAM you have to look at, on average, during decode.

It should improve performance on most hardware because most LLMs are memory bandwidth bound during decode.
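
A rough back-of-envelope sketch of why that follows, assuming decode is purely bandwidth bound and every byte is read once per generated token. The function name and all numbers are illustrative assumptions, not measurements from any real GPU or model:

```python
def decode_tokens_per_second_upper_bound(bandwidth_gb_per_s: float,
                                         gb_read_per_token: float) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit.

    Ignores compute, cache effects, and kernel overheads, so real
    throughput will be lower than this estimate.
    """
    if bandwidth_gb_per_s <= 0 or gb_read_per_token <= 0:
        raise ValueError("bandwidth and bytes read must be positive")
    return bandwidth_gb_per_s / gb_read_per_token


# Hypothetical card with 1000 GB/s of memory bandwidth.
bandwidth = 1000.0

# Hypothetical workload: 14 GB of weights and cache touched per token.
baseline = decode_tokens_per_second_upper_bound(bandwidth, 14.0)

# Same card, but the technique halves the memory touched per token.
improved = decode_tokens_per_second_upper_bound(bandwidth, 7.0)

print(f"baseline bound: {baseline:.1f} tok/s")
print(f"improved bound: {improved:.1f} tok/s")
```

Under this simple model the bound scales inversely with the bytes read per token, whatever the card's bandwidth is, which is why the gain should carry over to consumer GPUs.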

tommica yesterday at 11:02 AM

Maybe someday an 8 GB video card can be used for coding...
