Given the price of these systems the ridiculously expensive network cards isn't such a huge huge deal, but I can't help but wonder at the absurdly amazing bandwidth hanging off Vera, the amazing brags about "7x more bandwidth than pcie gen 6" (amazing), but then having to go to pcie to network to chat with anyone else. It might be 800Gbe but it's still so many hops, pcie is weighty.
I keep expecting we see fabric gains, see something where the host chip has a better way to talk to other host chips.
It's hard to deny the advantages of central switching as something easy & effective to build, but reciprocally the amazing high radix systems Google has been building have just been amazing. Microsoft Mia 200 did a gobsmacking amount of Ethernet on chip 2.8Tbps, but it's still feels so little, like such a bare start. For reference pcie6 x16 is a bit shy of 1Tbps, vaguely ~45 ish lanes of that.
It will be interesting to see what other bandwidth massive workloads evolve over time. Or if this throughout era all really ends up serving AI alone. Hoping CXL or someone else slims down the overhead and latency of attachment, soon-ish.
Maia 200: https://www.techpowerup.com/345639/microsoft-introduces-its-...
Most of the big AI/HPC clusters these systems are aimed at aren’t running regular PCIe Ethernet between nodes, they’re usually wired up with InfiniBand fabrics (HDR/NDR now, XDR soon)
> It might be 800Gbe but it's still so many hops, pcie is weighty.
Once you need to reach beyond L2/L3 it is often the case that perfectly viable experiments cannot be executed in reasonable timeframes anymore. The current machine learning paradigm isn't that latency sensitive, but there are other paradigms that can't be parallelized in the same way and are very sensitive to latency.