How much maintenance do you need? Let's say you have hardware whose useful lifespan due to obsolescence is 5 years, and in 4 years the satellite will crash into the atmosphere anyway.
Let's say that, given component failure rates, you can expect 20% of the GPUs to fail in that time. I'd say that's acceptable.
Radiation is a bitch, especially at those nanometer-scale process nodes and memory bandwidths.
And cooling. There is no cold water or air in space.
> How much maintenance do you need?
A lot. As someone who has been responsible for training runs with up to 10K GPUs, things fail all the time. By all the time I don't mean every few weeks, I mean daily. From disk failures, to GPUs overheating, to InfiniBand optical connectors not being correctly fastened and disconnecting randomly, we have to send people to manually fix/debug things in the datacenter all the time.
If one GPU fails, you essentially lose the entire node (so 8 GPUs), so if your strategy is to just turn off whatever fails forever and not deal with it, it's gonna get very expensive very fast.
And that's in an environment where temperature is very well controlled and where you don't have to put your entire cluster through 4 Gs and insane vibrations during takeoff.
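Rough back-of-envelope for the "lose the entire node" point, as a sketch only. It takes the 20% GPU failure figure from the comment upthread and the 8 GPUs per node mentioned above, and assumes (my assumption, not something measured) that GPU failures are independent and that nothing is ever repaired. The function names are made up for illustration.

```python
def node_loss_fraction(gpu_failure_prob: float, gpus_per_node: int = 8) -> float:
    """Chance that a node has at least one failed GPU.

    Assumes independent GPU failures and no repairs, so a node
    survives only if every one of its GPUs survives.
    """
    return 1.0 - (1.0 - gpu_failure_prob) ** gpus_per_node


def usable_gpu_fraction(gpu_failure_prob: float, gpus_per_node: int = 8) -> float:
    """Fraction of GPUs still usable if any single failure kills the whole node."""
    return (1.0 - gpu_failure_prob) ** gpus_per_node


if __name__ == "__main__":
    p = 0.20  # 20% of GPUs fail over the mission, per the parent comment
    print(f"nodes lost:  {node_loss_fraction(p):.1%}")   # 83.2%
    print(f"GPUs usable: {usable_gpu_fraction(p):.1%}")  # 16.8%
```

Under those assumptions, a 20% per-GPU failure rate leaves you with roughly a sixth of the cluster, not 80% of it. Real failures are correlated and some nodes can run degraded, so treat the number as an illustration of the shape of the problem, not a prediction.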