My guess is the M5 Ultra will be like DGX Spark for token prefill and M3 Ultra for token generation, i.e. the best of both worlds, at FP4. Right now you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part; with M5U that should no longer be necessary. However, given the current RAM price situation, I am wondering if M5U will ever get close to the price/performance of the Spark + M3U combination we have right now.
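To make the split concrete, here is a rough back-of-the-envelope sketch of why pairing a prefill-strong box with a decode-strong box can help. Everything in it is an assumption for illustration: the `Machine` class, the `request_latency` helper, and all the throughput numbers are made-up placeholders, not benchmarks of Spark, M3U, or M5U.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    prefill_tok_s: float  # prompt tokens processed per second (mostly compute-bound)
    decode_tok_s: float   # output tokens generated per second (mostly bandwidth-bound)


def request_latency(prefill_on, decode_on, prompt_tokens, output_tokens, kv_transfer_s=0.0):
    """Return (ttft_seconds, total_seconds) for one request.

    Prefill runs on `prefill_on`, decoding runs on `decode_on`.
    `kv_transfer_s` is the assumed cost of shipping the KV cache between
    the two machines; it is zero when both phases run on the same box.
    """
    ttft = prompt_tokens / prefill_on.prefill_tok_s + kv_transfer_s
    total = ttft + output_tokens / decode_on.decode_tok_s
    return ttft, total


# Hypothetical placeholder numbers, chosen only to show the shape of the trade-off.
prefill_box = Machine(prefill_tok_s=2000.0, decode_tok_s=10.0)
decode_box = Machine(prefill_tok_s=500.0, decode_tok_s=40.0)

split = request_latency(prefill_box, decode_box, 8000, 400, kv_transfer_s=1.0)
decode_only = request_latency(decode_box, decode_box, 8000, 400)
prefill_only = request_latency(prefill_box, prefill_box, 8000, 400)

print("split:", split)
print("decode box alone:", decode_only)
print("prefill box alone:", prefill_only)
```

With these placeholder rates the split setup wins on total time even after paying for the KV transfer, and a single machine that was strong at both phases would skip the transfer entirely. Whether that holds in practice depends on the real rates and on the link between the two machines.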
> you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part
Are you doing this with vLLM, or some other model-running library/setup?