I run it all the time, token generation is pretty good. Just large contexts are slow but you can hook a DGX Spark via Exo Labs stack and outsource token prefill to it. Upcoming M5 Ultra should be faster than Spark in token prefill as well.
> I run it all the time, token generation is pretty good.
I feel like because you didn't actually talk about prompt processing speed or token/s, you aren't really giving the whole picture here. What is the prompt processing tok/s and the generation tok/s actually like?
> I run it all the time, token generation is pretty good.
I feel like because you didn't actually talk about prompt processing speed or token/s, you aren't really giving the whole picture here. What is the prompt processing tok/s and the generation tok/s actually like?