
guerrilla today at 4:08 AM

Yes, this is what I meant. People are running huge models at home now; I assumed a business could do it on-premises or in a data center, presumably faster... but yeah, it definitely depends on what time scales we're talking about.


Replies

HumanOstrich today at 7:22 AM

Huge models? First you have to spend $5k-$10k or more on hardware, or maybe $3k for something extremely slow (<1 tok/sec) that is disk-bound. Either way, it won't beat batch API pricing for a long, long time.
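To get a rough sense of that break-even point, you can divide the hardware cost by an assumed batch API rate to find the token volume where local hardware pays off. A minimal sketch; the hardware cost, API price, and throughput below are illustrative placeholders, not quoted rates:

    # Rough break-even: tokens generated before local hardware beats batch API pricing.
    # All numbers are illustrative placeholders, not real quotes.
    hardware_cost_usd = 10_000      # up-front cost of a local rig
    api_usd_per_mtok = 0.60         # assumed batch price per million tokens
    local_tok_per_sec = 15          # assumed local decode speed

    breakeven_mtok = hardware_cost_usd / api_usd_per_mtok   # millions of tokens
    seconds = breakeven_mtok * 1e6 / local_tok_per_sec
    years = seconds / (86_400 * 365)
    print(f"break-even at ~{breakeven_mtok:,.0f}M tokens (~{years:.0f} years of nonstop generation)")

With these placeholder numbers you'd need to generate tokens continuously for decades before the hardware pays for itself.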

Also, you still wouldn't be able to run "huge" models at a decent quantization and token speed. Kimi K2.5 (1T params) with a very aggressive quantization level might run on one Mac Studio with 512GB RAM at a few tokens per second.
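The memory math behind that is simple: weight storage scales with parameter count times bits per weight. A quick sketch, where the bits-per-weight figures are rough approximations of common quantization schemes and KV cache/runtime overhead is ignored:

    # Approximate weight-only memory footprint of a 1T-parameter model.
    # Bits-per-weight values are rough approximations; overhead excluded.
    params = 1e12  # 1T parameters
    for name, bpw in [("FP16", 16), ("Q8", 8), ("Q4", 4.5), ("Q2", 2.6)]:
        print(f"{name:>4}: ~{params * bpw / 8 / 1e9:,.0f} GB")

At ~2.6 bits per weight that's roughly 325 GB, which squeezes into 512GB of unified memory; a mid-range ~4.5-bit quant is already over 560 GB, hence the two-machine cluster below.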

To run Kimi K2.5 at an acceptable quantization and speed, you'd need to spend $15k+ on 2 Mac Studios with 512GB RAM and cluster them. Then you'll maybe get 10-15 tok/sec.
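Those speeds line up with a bandwidth-bound estimate: during decode, each generated token has to stream the active weights through memory once, so memory bandwidth divided by bytes read per token gives a hard ceiling. A sketch, assuming ~32B active parameters per token for a large MoE like K2 and ~800 GB/s of unified-memory bandwidth (both figures are assumptions):

    # Crude decode-speed ceiling: tok/sec <= bandwidth / bytes read per token.
    bandwidth_gb_s = 800      # assumed unified-memory bandwidth of one machine
    active_params = 32e9      # assumed active (not total) params per token for an MoE
    bpw = 4.5                 # assumed quantization level, bits per weight

    bytes_per_token = active_params * bpw / 8   # ~18 GB streamed per token
    print(f"ceiling ≈ {bandwidth_gb_s * 1e9 / bytes_per_token:.0f} tok/sec")

That gives a ceiling around 44 tok/sec on one machine; real-world numbers land well below the ceiling, and clustering two machines adds interconnect latency, which is consistent with 10-15 tok/sec.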

copperx today at 7:23 AM

I'd love to know what kind of hardware it would take to do inference at the speed provided by the frontier model providers (assuming their models were available for local use).

$10k worth of hardware? $50k? $100k?

Assuming a single user.
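One way to frame that question is to reuse the bandwidth-bound estimate above in reverse: pick a target decode speed and back out the memory bandwidth required. All figures here are assumptions for illustration:

    # Rough memory bandwidth needed for a target single-user decode speed.
    target_tok_s = 100        # assumed "frontier provider" feel
    active_params = 32e9      # assumed MoE active params per token
    bpw = 8                   # assumed quality-preserving quantization

    needed_gb_s = target_tok_s * active_params * bpw / 8 / 1e9
    print(f"need ≈ {needed_gb_s:,.0f} GB/s of memory bandwidth")

That works out to ~3,200 GB/s, which is datacenter-GPU HBM territory rather than anything consumer unified memory currently offers.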