A trillion parameters is wild. That's not going to quantize to anything normal folks can run. E...

SwellJoe • today at 5:25 PM • 1 reply • view on HN

A trillion parameters is wild. That's not going to quantize to anything normal folks can run. Even at 1-bit, it's going to be bigger than what a Strix Halo or DGX Spark can run. Though I guess streaming from system RAM and disk makes it feasible to run it locally at <1 token per second, or whatever. GLM 5.1, at 754B parameters, is already beyond any reasonable self-hosting hardware (1-bit quantization is 206GB). Maybe a Mac Studio with 512GB can run them at very low-bit quantizations, also pretty slowly.

Replies

jauntywundrkind • today at 5:52 PM

A huge dual socket Epyc system used to be able to get to 1TB without difficulty. 16 dimms of 64gb each. Doable for ~$3000. With considerable memory bandwidth.

Our hope these days seems to be that maybe perhaps possibly High Bandwidth Flash works out. Instead of 4, 8, or maybe more for some highest end drives, having many many many dozens of channels of flash.

Ideally that can be very very near to the inference. PCIe 7.0 is 0.5Tb/s at 16x which is obviously nowhere remotely near enough throughout here. The difficulty is sort of that nand has been trying to be super dense, so as you scale channels you would normally tend to scale nand capacity too, and now instead of a 2tb drive you have a 200tb drive prices way beyond consumer means. Still, I think HBF is perhaps the only shot of the most important thing in computing going from mainframe back to consumer, and of course the models are going to balloon again if this dies hit, probably before consumers ever get a chance.

alt Hacker News

Replies