Great effort, a strong self-hosting community for LLMs is going to be as important as the FLOSS movement, imho. But right now I feel the bigger bottleneck is on the hardware side rather than the software. The amount of fast RAM you need for decent models (80B+ params) just isn't commonly available in consumer hardware right now, not even gaming machines. I've heard that Macs (minis) are great for the purpose, but you can't really get them with enough RAM at prices that still qualify as consumer-grade. I've seen people create home clusters (e.g. using Exo [0]), but I wouldn't really call it practical (single-digit tokens/sec for large models, and the price isn't exactly accessible either). Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.
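To put a rough number on the RAM requirement (back-of-the-envelope, assuming ~4-bit quantization and a bit of runtime overhead):

    # back-of-the-envelope RAM needed to load an 80B-parameter model
    params = 80e9             # 80B parameters
    bytes_per_param = 0.5     # assuming ~4-bit quantization
    overhead = 1.2            # rough allowance for KV cache and runtime overhead
    gib = params * bytes_per_param * overhead / 2**30
    print(f"~{gib:.0f} GiB")  # ~45 GiB -- well beyond a typical 16-32 GB consumer machine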
Prices are still coming down. Assuming that keeps happening, we will have laptops with enough RAM in the sub-2k range in 5 years.
The question is whether models will keep getting bigger. If useful model sizes eventually plateau, a good model becomes something many people can easily run locally. If models keep usefully growing, this doesn't happen.
The largest ones I see are in the 405B range, which quantized fits in 256GB of RAM.
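Rough math behind that, assuming ~4-bit quantization:

    # why a ~405B-parameter model fits in 256 GB once quantized
    params = 405e9
    bytes_per_param = 0.5     # ~4-bit quantization
    weights_gib = params * bytes_per_param / 2**30
    print(f"~{weights_gib:.0f} GiB of weights")  # ~189 GiB, leaving headroom under 256 GB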
Long term I expect custom hardware accelerators designed specifically for LLMs to show up, basically ASICs. If those got affordable, I could see little sub-$1k USB-C accelerator boxes that run huge LLMs fast while drawing less power.
GPUs are most efficient at batched inference, which lends itself to hosting rather than local use. What I mean is a lighter chip made to run small- or single-batch inference very fast while using less power. Small- or single-batch inference is memory-bandwidth bound, so I suspect fast RAM would be most of the cost of such a device.
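Back-of-the-envelope for why bandwidth is the ceiling (illustrative bandwidth figures, not benchmarks — single-batch decode has to stream essentially all the weights for every token, so this is an upper bound):

    # ceiling on single-batch decode speed: tok/s <= memory bandwidth / model size
    model_bytes = 70e9 * 0.5                      # e.g. a 70B model at ~4-bit
    for name, bw in [("dual-channel DDR5 desktop", 90e9),
                     ("unified-memory mini PC / laptop", 250e9),
                     ("datacenter HBM GPU", 3000e9)]:
        print(f"{name}: ~{bw / model_bytes:.0f} tok/s ceiling")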
What's the deal with Exo anyway? I've seen it described as an abandoned, unmaintained project.
Anyway, you don't really need a lot of fast RAM unless you insist on a real-time usable response. If you're fine with running a "good" model overnight or thereabouts, there are things you can do to get more out of fairly low-end hardware.
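For a sense of what "overnight" actually gets you, with a purely illustrative throughput figure:

    # what an overnight batch run buys you even at very low speeds
    tok_per_sec = 1.5          # illustrative: large model spilling to CPU/disk
    hours = 8
    print(f"~{tok_per_sec * 3600 * hours:,.0f} tokens")  # ~43,200 tokens by morning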
With smaller models becoming more efficient and hardware continually improving, I think the sweet spot for local LLM computing will arrive in a couple of years.
So many comments like to highlight that you can buy a Mac Studio with 512GB of RAM for $10K, but that's a huge amount of money to spend on something that still can't compete with a $2/hour rented cloud GPU server in terms of output speed. Even that will be lower quality and slower than the $20/month plan from the LLM provider of your choice.
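The break-even math is rough but telling (taking the $2/hour figure at face value and ignoring electricity and resale value):

    # break-even: Mac Studio purchase vs renting a cloud GPU at $2/hour
    mac_studio_usd = 10_000
    cloud_usd_per_hour = 2
    hours = mac_studio_usd / cloud_usd_per_hour
    print(f"{hours:,.0f} hours ~= {hours / (8 * 250):.1f} years of 8h/day use")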
The only reasons to go local are if you need it (privacy, contractual obligations, regulations) or if you're a hardcore hobbyist who values running it yourself over the quality and speed of the output.
> Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.
Framework is getting a lot of headlines for their brand recognition but there are a growing number of options with the same AMD Strix Halo part. Here's a random example I found from a Google search - https://www.gmktec.com/products/amd-ryzen%E2%84%A2-ai-max-39...
All of these are somewhat overpriced right now due to supply and demand. If the supply situation is alleviated they should come down in price.
They're great for what they are, but their memory bandwidth is still relatively limited. If the 128GB versions came down to $1K I might pick one up, but at the $2-3K price range I'd rather put that money toward upgrading my laptop to an M4 MacBook Pro with 128GB of RAM.
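Rough sense of the bandwidth gap, using spec-sheet figures from memory (so treat them as approximate):

    # rough tok/s ceilings for a 70B model at ~4-bit on each platform
    model_bytes = 70e9 * 0.5
    for name, bw in [("Strix Halo (quad-channel LPDDR5X)", 256e9),
                     ("M4 Max (128 GB config)", 546e9)]:
        print(f"{name}: ~{bw / model_bytes:.0f} tok/s ceiling")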