mikeayles • today at 9:14 PM

So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a dumb 3.28M model, and even this is an order of magnitude too large to benefit.

It appears to be focussed more on latency than throughput. Happy to be corrected?

ag2718 • today at 9:18 PM

You're correct that this work is not very applicable to LLMs and that the focus here is primarily on latency.
