logoalt Hacker News

cynicalkaneyesterday at 9:24 PM1 replyview on HN

Even before AI training clusters became important, Google has had an outstanding custom fabric (there's papers about it) together with the ability to tune NICs for their own cases, and "their own cases" meant nearly everything engineered within Google. Ethernet hardware has had low kernel latency and DMA for a long time; it's the rest of the stack that hurts. But as far back as the early 2010s (if not further back, that goes beyond my knowledge horizon), you could just make it not hurt, if you had the software engineers to do it.


Replies

mrlongrootstoday at 3:44 AM

All that would not help you with an AI training cluster interconnect. See Amin Vahdat's keynote at HotInterconnects 2025. Everyone is building a fabric for this stuff from scratch (Google/Falcon, Amazon/EFA, Azure/MANA, Cornelis/CN5000, and obviously Mellanox).