zozbot234 • yesterday at 10:17 PM

You can run the big models in RAM, including via offloading weights from disk. They will be extremely slow on ordinary hardware, but they will run. Hundreds of gigabytes of RAM is a viable purchase for many, and the footprint can be split over multiple nodes with pipeline parallelism. If that's still too slow for the total throughput you expect to need on an ongoing 24/7 basis, that's when it becomes sensible to think about adding discrete GPUs for acceleration.
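
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes decoding is bound by how fast the weights can be streamed, so each generated token reads the active weights once. The model size, quantization, and bandwidth figures are illustrative assumptions, not measurements, and the function names are made up for this example.

```python
def weight_footprint_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def decode_tokens_per_sec(active_gb_per_token: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed when every token
    has to stream the active weights once from the given storage tier."""
    return bandwidth_gb_s / active_gb_per_token


def per_node_gb(total_gb: float, n_nodes: int) -> float:
    """Weight footprint per node when layers are split evenly across nodes."""
    return total_gb / n_nodes


# Hypothetical dense 400B-parameter model quantized to 4 bits per weight.
total_gb = weight_footprint_gb(400, 4)

# Assumed bandwidths: ~100 GB/s for system RAM, ~5 GB/s for an NVMe disk.
ram_tps = decode_tokens_per_sec(total_gb, 100.0)
disk_tps = decode_tokens_per_sec(total_gb, 5.0)

print(f"weights: {total_gb:.0f} GB")
print(f"RAM-bound decode: ~{ram_tps:.2f} tok/s")
print(f"disk-offloaded decode: ~{disk_tps:.3f} tok/s")
print(f"per node across 4 nodes: {per_node_gb(total_gb, 4):.0f} GB")
```

This ignores the KV cache, activations, and compute time, so real numbers will be lower. Also note that splitting across nodes mainly reduces the memory needed per node. A single request still passes through the pipeline stages one after another, so the split helps aggregate throughput only when several requests are in flight at once.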
