I don't even get this trend. Wouldn't OpenAI be buying only ECC RAM anyway? Who in their right mind runs this much infrastructure on non-ECC RAM? Makes no sense to me. Same with GPUs: they aren't buying your 5090s. People's perception is wild to me.
ECC memory is a bit like RAID: a consumer-level RAM stick will (traditionally) have eight 8-bit-wide chips operating basically in RAID-0 to provide 64-bit-wide access, whereas an enterprise-level RAM stick will have nine 8-bit-wide chips operating in something closer to RAID-4 or RAID-5.
But they are all exactly the same chips. The ECC magic happens in the memory controller, not the RAM stick. Anyone buying ECC RAM for servers is buying on the same market as you building a new desktop computer.
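To make the "magic happens in the memory controller" part concrete, here is a minimal sketch of how 64 data bits plus 8 check bits (the ninth chip's worth) allow single-error correction and double-error detection. It uses a textbook extended Hamming (72,64) layout as an assumption; real controllers use their own check matrices, and the names `encode`, `decode`, `PARITY_POSITIONS` and `DATA_POSITIONS` are made up for illustration.

```python
# Hedged sketch: textbook extended Hamming (72,64) SECDED, not any vendor's
# actual ECC scheme. Bit 0 is an overall parity bit; bits 1..71 form a
# Hamming code whose parity bits sit at the power-of-two positions.

PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
DATA_POSITIONS = [p for p in range(1, 72) if p not in PARITY_POSITIONS]  # 64 slots


def _syndrome(code: int) -> int:
    """XOR together the positions (1..71) of every set bit."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def _odd_weight(code: int) -> bool:
    """True if the codeword has an odd number of set bits."""
    return bin(code).count("1") % 2 == 1


def encode(data: int) -> int:
    """Spread a 64-bit word over a 72-bit codeword (returned as an int)."""
    if not 0 <= data < (1 << 64):
        raise ValueError("data must fit in 64 bits")
    code = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            code |= 1 << pos
    # Set the Hamming parity bits so the syndrome of the codeword is zero.
    s = _syndrome(code)
    for p in PARITY_POSITIONS:
        if s & p:
            code |= 1 << p
    # Overall parity bit makes the total number of set bits even.
    if _odd_weight(code):
        code |= 1
    return code


def decode(code: int):
    """Return (data, status); data is None when the error is uncorrectable."""
    s = _syndrome(code)
    if _odd_weight(code):
        # Odd weight: assume one flipped bit, located at position s
        # (s == 0 means the overall parity bit itself flipped).
        if s > 71:
            return None, "uncorrectable"
        code ^= 1 << s
        status = "corrected"
    elif s != 0:
        # Even weight but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (code >> pos) & 1:
            data |= 1 << i
    return data, status
```

With this sketch, flipping any one of the 72 bits of `encode(word)` still decodes to `word` with status `"corrected"`, while flipping any two bits is reported as `"uncorrectable"`. The DRAM chips themselves just store bits; all of the logic above would live on the controller side.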
At the chip level there’s no difference as far as I’m aware; you just have 9 bits per byte rather than 8 bits per byte physically on the module. More chips, but not different chips.
I seriously doubt that single-bit errors at the scale of OpenAI's workloads really matter very much, particularly for a domain that is already noisy.
ECC modules use the same chips as non-ECC modules, so it eats into the consumer market too.
The 5090 is the same chip as the workstation RTX 6000.
Of course, OpenAI isn't buying those either; it's buying B200 DGX systems, but those are still made on the same process at TSMC.
On the flip side, LLMs are so inconsistent you might argue ECC is a complete waste of money. But OpenAI wasting money is hardly anything new.
ECC RAM's utility is overblown. Major companies often use off-the-shelf, non-enterprise parts for huge server installations, including regular RAM. The rare bit flip is hardly a major concern at their scale and for their specific purposes.
OpenAI bought out Samsung's and SK Hynix's DRAM wafers in advance, so they'll prioritize producing whatever OpenAI wants to deploy, whether that's DDR/LPDDR/GDDR/HBM, with or without ECC. That means way fewer wafers for everything else, so even if you want a different spec you're still shit out of luck.