Hacker News

mikeayles today at 9:14 PM

So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a dumb 3.28M-parameter model, and even that is an order of magnitude too large to benefit.

It appears to be focussed more on latency than on throughput. Happy to be corrected.
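For a rough sense of scale, here is a back-of-envelope sketch in Python. The figure of about 2 FLOPs per parameter per token is a common rule of thumb for a dense forward pass, not something taken from the linked work, and the function names are made up for illustration. It also assumes a single stream decoded one token at a time, with no batching.

```python
def per_token_budget_us(tokens_per_s: float) -> float:
    """Time available per token, in microseconds, for one sequential stream."""
    return 1e6 / tokens_per_s


def required_flops_per_s(params: float, tokens_per_s: float,
                         flops_per_param: float = 2.0) -> float:
    """Sustained compute needed, assuming ~2 FLOPs per parameter per token.

    The 2.0 default is an assumed rule of thumb for a dense forward pass.
    """
    return params * flops_per_param * tokens_per_s


if __name__ == "__main__":
    target_tokens_per_s = 100_000   # target from the comment above
    model_params = 3.28e6           # 3.28M parameters

    budget = per_token_budget_us(target_tokens_per_s)
    compute = required_flops_per_s(model_params, target_tokens_per_s)

    print(f"per-token budget: {budget:.1f} us")
    print(f"sustained compute: {compute / 1e12:.3f} TFLOP/s")
```

Under those assumptions each token gets about 10 microseconds, so any fixed per-step overhead matters as much as raw compute. Batching many streams would change the picture, since it trades per-token latency for aggregate throughput.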


Replies

ag2718 today at 9:18 PM

You're correct that this work is not very applicable for LLMs and that the focus here is primarily on latency.