Hacker News

storus · yesterday at 8:03 PM · 1 reply

I run it all the time; token generation is pretty good. Only large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and offload token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.


Replies

embedding-shape · yesterday at 9:10 PM

> I run it all the time, token generation is pretty good.

Since you didn't give any actual numbers, this doesn't really convey the whole picture. What are the prompt-processing tok/s and the generation tok/s actually like?
