Hacker News

jstummbillig today at 7:55 AM

Why are they selling compute instead of using it to build that SOTA model?


Replies

tristanj today at 9:50 AM

They tried and failed. xAI made a mistake building Colossus 1 and ended up with a heterogeneous cluster of H100/H200/GB200 GPUs. This is a nightmare to train huge models on because each card has different specs, features, and hardware requirements. During gradient synchronization, a heterogeneous cluster bottlenecks on the slowest GPU (the H100), so the faster GPUs end up idling. They also probably ran into unexpected compatibility issues, which are difficult to resolve.
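
To make the bottleneck concrete, here is a toy model of one synchronous data-parallel step where every GPU gets the same amount of work. The relative throughput numbers are made up for illustration (they are not real benchmark figures), and `sync_step_stats` is a hypothetical helper, not part of any training framework.

```python
# Hypothetical relative throughput per GPU type. Illustrative numbers only,
# not measured specs.
RELATIVE_THROUGHPUT = {"H100": 1.0, "H200": 1.4, "GB200": 2.5}


def sync_step_stats(gpus, work_per_gpu=1.0):
    """Model one synchronous step with an equal share of work on each GPU.

    Returns the step time and, per GPU, the fraction of the step spent idle.
    """
    compute_times = [work_per_gpu / RELATIVE_THROUGHPUT[g] for g in gpus]
    # Gradient synchronization cannot finish until the slowest GPU is done,
    # so the step takes as long as the slowest worker.
    step_time = max(compute_times)
    idle_fractions = [1.0 - t / step_time for t in compute_times]
    return step_time, idle_fractions


step_time, idle = sync_step_stats(["H100", "H200", "GB200"])
for gpu, frac in zip(["H100", "H200", "GB200"], idle):
    print(f"{gpu}: idle {frac:.0%} of each step")
```

Under these assumed numbers the fastest card sits idle for more than half of every step. Real frameworks can rebalance batch sizes per device to claw some of this back, but that adds its own complexity.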

It makes more sense to use this cluster for inference, since they can segment the cluster by GPU type and avoid mixing GPUs. xAI doesn't have enough inference customers, so it makes sense to monetize the capacity by selling it to companies that need inference compute, such as Anthropic or Cursor.
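
As a rough sketch of what segmenting by GPU type means, the snippet below groups nodes into homogeneous pools so that each inference deployment only ever lands on one kind of card. The node names and the `partition_by_gpu_type` helper are hypothetical and say nothing about how xAI actually schedules work.

```python
from collections import defaultdict


def partition_by_gpu_type(nodes):
    """Group (node_name, gpu_type) pairs into one pool per GPU type."""
    pools = defaultdict(list)
    for name, gpu_type in nodes:
        pools[gpu_type].append(name)
    return dict(pools)


# Hypothetical inventory, for illustration only.
nodes = [
    ("node-a", "H100"),
    ("node-b", "GB200"),
    ("node-c", "H100"),
    ("node-d", "H200"),
]
pools = partition_by_gpu_type(nodes)
for gpu_type, members in pools.items():
    print(gpu_type, members)
```

Each pool can then serve requests independently, so no job has to wait on a slower card the way a synchronized training step does.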

Apparently xAI will try building SOTA models on Colossus 2, which will be built on Blackwell GPUs only.
