It's not cheating, and it's not a cluster-based system. All the biggest high-end servers use multiple externally cabled units (chassis, sleds, drawers). The biggest ones even span multiple racks (aka frames). These days it's HPE and IBM that remain in the game.
These all have real hardware coherency going over the external cables, using the same protocol as the on-board links. Here is a Power10 server picture: https://www.engineering.com/ibm-introduces-power-e1080-serve... The cables attach to headers brought out of the chip package right off the PHY; there's no ->PCI->Ethernet-> hop or anything like that.
The HPE systems are similar. They are actually descendants of the SGI Altix / SGI Origin systems, which HPE acquired, and they still use some of the same terminology (NUMAlink for the interconnect fabric). HP did make its own distinct line of big-iron systems back when it had PA-RISC and later Itanium, but ended up acquiring and going with SGI's technology, for whatever reason.
These HPE/SGI systems are slightly different from IBM minis/mainframes because they use "commodity" Intel CPUs that don't support glueless multi-socket configurations that large, and don't have signaling that can reach across boards. So these systems have their own chipset with special coherency directories and a bunch of NUMAlink PHYs.
SGI's systems came from HPC, so they were actually much bigger back then: the largest were somewhere around 1024 sockets, back when you only had one CPU per socket. The interconnect topology used to be some tree-like arrangement with something like 10 hops between the farthest nodes. It did run Linux and wasn't technically cheating, but you really had to program it like a cluster, because resource contention would quickly kill you if there was much cacheline traffic between nodes. Quite amazing machines, but not suitable for "enterprise" workloads, so IIRC they have cut the node count down and gone with an all-to-all interconnect.

It would be interesting to know what they did with the coherency protocol. The SGI systems used a full directory scheme, which is simple and great at scaling to huge sizes but not the best for performance. IBM systems use extremely complex broadcast source-snooping designs (highly scoped and filtered) to avoid full-directory overhead. It would be interesting to know whether HPE finally went that way with NUMAlink too.
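To illustrate why a full directory scheme scales to huge node counts, here's a toy sketch (entirely hypothetical and hugely simplified; real protocols like SGI's also track transient states, MESI-style line states, and races): one directory entry per cache line records exactly which nodes hold a copy, so a write needs only point-to-point invalidations to those nodes rather than a broadcast to all 1024 sockets.

```python
# Toy model of directory-based cache coherence (simplified sketch).
# The directory tracks sharers/owner per line; a write invalidates
# only the recorded sharers -- message count scales with sharing,
# not with total system size.

class DirectoryEntry:
    def __init__(self):
        self.sharers = set()   # nodes holding a read-only copy
        self.owner = None      # node holding an exclusive (dirty) copy

class Directory:
    def __init__(self):
        self.entries = {}      # cache-line address -> DirectoryEntry
        self.messages = 0      # point-to-point coherence messages sent

    def entry(self, addr):
        return self.entries.setdefault(addr, DirectoryEntry())

    def read(self, node, addr):
        e = self.entry(addr)
        if e.owner is not None and e.owner != node:
            self.messages += 1       # fetch dirty line from the owner
            e.sharers.add(e.owner)   # owner downgrades to shared
            e.owner = None
        e.sharers.add(node)

    def write(self, node, addr):
        e = self.entry(addr)
        for _ in e.sharers - {node}:
            self.messages += 1       # invalidate each sharer individually
        if e.owner is not None and e.owner != node:
            self.messages += 1       # invalidate the previous owner
        e.sharers = set()
        e.owner = node

d = Directory()
for n in range(8):
    d.read(n, 0x1000)   # 8 nodes share one line
d.write(0, 0x1000)      # node 0 writes: 7 targeted invalidations
print(d.messages)       # -> 7
```

The same model also shows why heavy cacheline transfer between nodes kills you: every write to a widely shared line pays one message per sharer, so hot shared data turns the fabric into the bottleneck even though the protocol itself scales fine.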
Found this diagram from an old HPE product that was an SGI derivative: https://support.hpe.com/hpesc/public/docDisplay?docId=a00062... 2 QPI buses and 16 NUMAlink ports!
Aha, it's still a directory protocol. https://support.hpe.com/hpesc/public/docDisplay?docId=sd0000...
Cheating, IMO, would be an actual cluster of systems using software (firmware/hypervisor) to present a single system image, relying on the MMU and InfiniBand/Ethernet adapters to provide coherency.