logoalt Hacker News

tavavexyesterday at 10:38 PM2 repliesview on HN

Because that brings in the whole distributed computing mess. No matter how instantaneous the actual link is, you still have to deal with the problems of which satellites can see one another, how many simultaneous links can exist per satellite, the max throughput, the need for better error correction and all sorts of other things that will drastically slow the system down in the best case. Unlike something like Starlink, with GPUs you have to be ready that everyone may need to talk to everyone else at the same time while maintaining insane throughput. If you want to send GPUs up one by one, get ready to also equip each satellite with a fixed mass of everything required to transmit and receive so much data, redundant structural/power/compute mass, individual shielding and much more. All the wasted mass you have to launch with individual satellites makes the already nonsensical pricing even worse. It just makes no sense when you can build a warehouse on the ground, fill it with shoulder-to-shoulder servers that communicate in a simple, sane and well-known way and can be repaired on the spot. What's the point?


Replies

crotetoday at 12:24 AM

Isn't this already a major problem for AI clusters?

I vaguely recall an article a while ago about the impact of GPU reliability: a big problem with training is that the entire cluster basically operates in lock-step, with each node needing the data its neighbors calculated during the previous step to proceed. The unfortunate side-effect is that any failure stops the entire hundred-thousand-node cluster from proceeding - as the cluster grows even the tiniest failure rate is going to absolutely ruin your uptime. I think they managed to somehow solve this, but I have absolutely no idea how they managed to do it.

pantalaimonyesterday at 11:13 PM

Starlink already solved those problems, they do 200 GBit/s via laser between satellites.

And for data centers, the satellite wouldn't be as far apart as starlight satellites, they would be quite close instead.

show 1 reply