logoalt Hacker News

storusyesterday at 9:18 PM1 replyview on HN

My guess is M5 Ultra will be like DGX Spark for token prefill and M3 Ultra for token generation, i.e. the best of both worlds, at FP4. Right now you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part; with M5U that should no longer be necessary. However given RAM prices situation I am wondering if M5U will ever get close to the price/performance of Spark + M3U we have right now.


Replies

echiontoday at 12:59 AM

> you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part

Are you doing this with vLLM, or some other model-running library/setup?

show 1 reply