Hacker News

jauntywundrkind 08/08/2025

I agree and disagree. Many of the best models are open source, just too big to run for most people.

And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512GB of unified memory has huge capacity and a decent chunk of bandwidth (800GB/s; compare vs a 5090's ~1800GB/s). $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive. Performance is even lower than the Mac's, but a single AMD Turin chip with its 12 channels of DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is going to be $4000+ in RAM costs, plus for example $4800 for a 48-core Turin to go with it. (But if you go to older generations, affordability goes way up! It's a special part, but the 48-core 7R13 is <$1000.)
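To sanity-check those bandwidth figures, here's a rough back-of-envelope sketch (Python; assumes peak theoretical rates, real workloads see noticeably less):

    # Peak DDR5 bandwidth: channels * MT/s * 8 bytes per 64-bit transfer.
    def ddr5_bandwidth_gb_s(channels, mt_per_s, bus_width_bits=64):
        return channels * mt_per_s * (bus_width_bits // 8) / 1000

    print(ddr5_bandwidth_gb_s(12, 6000))  # 576.0 -> "almost 600GB/s" for 12-channel Turin

    # Advertised figures quoted above, for comparison:
    m3_ultra_gb_s = 800    # Mac Studio M3 Ultra unified memory
    rtx_5090_gb_s = 1800   # approx. GDDR7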

Still, those costs come to $5000 at the low end, and come with far fewer tokens/s. The "grid compute" / "utility compute" / "cloud compute" model of getting work done on a hot GPU that already has the model loaded, run by someone else, is very very direct & clear, and those data centers are very big investments. It's just not likely any of us will have anything but burst demand for GPUs, so structurally the cloud model makes sense. But it really feels like only small things are getting in the way of running big models at home!
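For a feel of what "far fewer tokens/s" means: single-stream decoding is roughly memory-bandwidth-bound, so an upper bound is bandwidth divided by the bytes of weights read per token. A rough sketch, where the model size and quantization are illustrative assumptions:

    # Upper bound on decode speed when every token streams the active weights once.
    def max_tokens_per_s(bandwidth_gb_s, active_params_billion, bytes_per_param):
        weight_gb = active_params_billion * bytes_per_param
        return bandwidth_gb_s / weight_gb

    # Hypothetical ~400B dense model at ~4.5 bits/param (~0.56 bytes/param):
    print(max_tokens_per_s(576, 400, 0.56))  # ~2.6 tok/s on the Turin build
    print(max_tokens_per_s(800, 400, 0.56))  # ~3.6 tok/s on the M3 Ultra
    # MoE models with far fewer active params per token do considerably better.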

Strix Halo is kind of close. 96GB of usable memory isn't quite enough to really do the thing, though (and only 256GB/s). Even if/when they put the new 64GB DDR5 modules onto the platform (for 256GB, let's say 224 usable), one still has to sacrifice some quality to fit 400B+ models. Next-gen Medusa Halo isn't coming for a while, but it goes from 4 to 6 memory channels, so 384GB total: not bad.
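The capacity math, as a quick sketch (weights only; KV cache and runtime overhead come on top):

    # Weight footprint of a 400B-parameter model at different quantization levels.
    def weight_footprint_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8

    for bits in (16, 8, 5, 4, 3):
        print(bits, weight_footprint_gb(400, bits))
    # 16 -> 800 GB, 8 -> 400 GB, 5 -> 250 GB, 4 -> 200 GB, 3 -> 150 GB
    # So ~224GB usable only fits a 400B model at around 4 bits/param,
    # while 384GB leaves headroom for a less aggressive quant plus KV cache.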

(It sucks that PCIe is so slow. PCIe 5.0 x16 is only ~64GB/s in one direction. Compared to the need here, that's nowhere near enough to pair a big-memory host with a smaller-memory GPU.)
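To put numbers on that: if even a slice of the weights has to cross the bus every token, the bus dominates. A rough sketch, where the 100GB offloaded slice is a made-up example:

    pcie5_x16_gb_s = 64     # one direction, approx.
    gpu_vram_gb_s  = 1800   # e.g. a 5090-class card

    offloaded_gb = 100      # hypothetical chunk of weights kept in host RAM
    print(offloaded_gb / pcie5_x16_gb_s)  # ~1.6 s per token just to stream it over PCIe
    print(offloaded_gb / gpu_vram_gb_s)   # ~0.06 s if the same bytes sat in VRAM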


Replies

Aurornis 08/08/2025

> Many of the best models are open source, just too big to run for most people.

You can find all of the open models hosted across different providers. You can pay per token to try them out.

I just don't see the open models as being at the same quality level as the best from Anthropic and OpenAI. They're good but in my experience they're not as good as the benchmarks would suggest.

> $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive.

This is why I only appreciate the local LLM scene from a distance.

It’s really cool that this can be done, but $10K to run lower-quality models at slower speeds is a hard sell. I can rent a lot of hours on an on-demand cloud server for far less than that, or I can pay $20-$200/month and get great performance and good quality from Anthropic.

I think the local LLM scene is fun where it intersects with hardware I would buy anyway (MacBook Pro with a lot of RAM) but spending $10K to run open models locally is a very expensive hobby.
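The break-even math is rough even before counting electricity or the quality gap; a crude sketch using the prices above:

    local_build = 10_000
    for monthly in (20, 100, 200):
        print(monthly, local_build / monthly)  # months to break even
    # $20/mo -> 500 months, $100/mo -> 100 months, $200/mo -> 50 months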

jstummbillig 08/08/2025

> Many of the best models are open source, just too big to run for most people

I don't think that's a likely future when you consider the enormous infrastructure projects all the big players are undertaking and the money this increasingly demands. Powerful LLMs are simply not a great open source candidate. The models are not a by-product of the bigger thing you do; they are the bigger thing. Open sourcing an LLM means you are essentially investing money just to give it away. That simply does not make a lot of sense from a business perspective. You can do it in a limited fashion for a limited time, for example while you are scaling, or when it's not really your core business and you just write it off as an expense while you try to figure yet another thing out (looking at you, Meta).

But with the current paradigm, one thing seems very clear: building and running ever-bigger LLMs is a money-burning machine the likes of which we have rarely if ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.

Rohansi 08/09/2025

You'll want to look at benchmarks rather than the theoretical maximum bandwidth available to the system. Apple has been using bandwidth as a marketing point, but you're not always able to use that full amount, depending on your workload. For example, the M1 Max has 400GB/s of advertised bandwidth, but the CPU and GPU combined cannot utilize all of it [1]. This means Strix Halo could actually be better for LLM inference than Apple Silicon if it achieves better bandwidth utilization.

[1] https://web.archive.org/web/20250516041637/https://www.anand...
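If you want a crude CPU-side data point on your own machine, timing a big copy gives achieved (not advertised) bandwidth. This is not the methodology AnandTech used, and a single thread won't saturate a wide memory system, but it illustrates the advertised-vs-achieved gap:

    import time
    import numpy as np

    src = np.ones(1 << 28, dtype=np.float32)  # 1 GiB of data
    dst = np.empty_like(src)

    t0 = time.perf_counter()
    np.copyto(dst, src)                       # reads 1 GiB, writes 1 GiB
    dt = time.perf_counter() - t0

    print(2 * src.nbytes / dt / 1e9, "GB/s effective copy bandwidth")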

esseph 08/08/2025

https://pcisig.com/pci-sig-announces-pcie-80-specification-t...

From 2003 to 2016, 13 years, we had PCIe 1, 2, and 3.

2017 - PCIe 4.0

2019 - PCIe 5.0

2022 - PCIe 6.0

2025 - PCIe 7.0

2028 - PCIe 8.0

Manufacturing and vendors are having a hard time keeping up. And the PCIe 5.0 memory is... not always the most stable.
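For reference, the approximate x16 bandwidth in one direction per generation (each roughly doubles the last):

    pcie_x16_gb_s = {1: 4, 2: 8, 3: 16, 4: 32, 5: 64, 6: 128, 7: 256}
    for gen, bw in pcie_x16_gb_s.items():
        print(f"PCIe {gen}.0 x16: ~{bw} GB/s")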

vFunct 08/09/2025

The game-changing technology that'll enable full 1TB+ LLM models for cheap is Sandisk's High Bandwidth Flash. Expect devices with that in about 3-4 years, maybe even in cellphones.
