Hacker News

embedding-shape, yesterday at 9:10 PM

> I run it all the time, token generation is pretty good.

I feel like because you didn't actually mention prompt processing speed or tokens/s, you aren't really giving the whole picture here. What are the prompt processing and generation speeds actually like, in tok/s?


Replies

storus, yesterday at 9:21 PM

I addressed both points: I mentioned you can offload token prefill (the slow part, 9 t/s) to a DGX Spark. Token generation runs at 6 t/s, which is acceptable.
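To put those two numbers in perspective, here is a rough back-of-the-envelope sketch. It takes the quoted 9 t/s (prefill) and 6 t/s (generation) at face value and assumes both phases run back-to-back at constant rates with no other overhead. The function name `estimate_latency` and the 900-token prompt / 300-token answer are hypothetical, chosen only for illustration.

```python
# Rough latency estimate from the two throughput figures quoted in the thread.
# Assumptions (hypothetical): constant rates, prefill then generation back-to-back,
# no scheduling or transfer overhead; prompt/output lengths are made-up examples.

PREFILL_TOK_PER_S = 9.0   # prompt processing (prefill) rate quoted above
DECODE_TOK_PER_S = 6.0    # token generation rate quoted above


def estimate_latency(prompt_tokens: int, output_tokens: int,
                     prefill_rate: float = PREFILL_TOK_PER_S,
                     decode_rate: float = DECODE_TOK_PER_S) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds."""
    # Time to first token is dominated by processing the whole prompt.
    ttft = prompt_tokens / prefill_rate
    # Total time adds the time to generate every output token.
    total = ttft + output_tokens / decode_rate
    return ttft, total


if __name__ == "__main__":
    # Hypothetical request: 900-token prompt, 300-token answer.
    ttft, total = estimate_latency(900, 300)
    print(f"time to first token: {ttft:.0f} s, total: {total:.0f} s")
    # prints "time to first token: 100 s, total: 150 s"
```

Under these assumptions, prefill accounts for most of the wait on a prompt-heavy request like this one, which is the part storus suggests offloading to the DGX Spark.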