Hacker News

Running a One Trillion-Parameter LLM Locally on AMD Ryzen AI Max+ Cluster

58 points | by mindcrime | today at 1:24 AM | 14 comments

Comments

ibeckermayer | today at 2:11 AM

Cool that it's possible, but the performance characteristics are basically unusable. For an 8192-token prompt they report a ~1.5 minute time-to-first-token and 8.30 tok/s from there. For context, ChatGPT is typically well under 1 s TTFT and ~50 tok/s.
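A back-of-the-envelope comparison of those figures (a rough sketch; the 1000-token response length and the ChatGPT numbers are the comment's illustrative assumptions, not measurements from the article):

```python
def total_latency(ttft_s: float, decode_tps: float, out_tokens: int) -> float:
    """End-to-end response time: time to first token plus time to decode the rest."""
    return ttft_s + out_tokens / decode_tps

# Figures from the comment above; 1000 output tokens is an assumed response length.
local = total_latency(90.0, 8.30, 1000)   # ~1.5 min TTFT, 8.30 tok/s
hosted = total_latency(1.0, 50.0, 1000)   # ~1 s TTFT, ~50 tok/s

print(f"local cluster: {local:.0f} s, hosted: {hosted:.0f} s")
```

So for a typical-length answer the local cluster is on the order of ten times slower end to end, not just slower to start.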

elcritch | today at 2:31 AM

That’s pretty awesome!

Though only 5 GbE? Can't they do USB-C / Thunderbolt 40 Gb/s connections like Macs?
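The gap matters because multi-node tensor- or pipeline-parallel inference moves activations between boxes on every token. A quick ideal-case sketch (the 1 GiB transfer size is illustrative, not a measured figure from the cluster):

```python
def transfer_time_s(size_bytes: float, link_gbps: float) -> float:
    """Ideal (zero-overhead) time to move size_bytes over a link of link_gbps gigabits/s."""
    return size_bytes * 8 / (link_gbps * 1e9)

# Moving 1 GiB of inter-node traffic over each link:
one_gib = 1024**3
print(f"5 GbE:              {transfer_time_s(one_gib, 5):.2f} s")
print(f"40 Gb/s Thunderbolt: {transfer_time_s(one_gib, 40):.2f} s")
```

An 8x difference in raw line rate, before protocol overhead or latency is considered.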

tills13 | today at 2:56 AM

I set up ollama today and can barely run a 3B-parameter model before the lag makes it unbearable.

How much is one of these gonna run me?

verdverm | today at 2:04 AM

The setup was around $10k, though maybe more now with memory/SSD prices.

This is a good list. I like my Beelink a lot; my Minisforum likes to turn itself off every couple of weeks, and I'm not sure why yet.

https://www.techradar.com/pro/there-are-15-amd-ryzen-ai-max-...

---

Performance is pretty bad (<10 tok/s) and context is quite limited. Still, it's good to see progress.

| Prompt Size (tokens) | TTFT (s), Flash Attention Disabled | TTFT (s), Flash Attention Enabled |
|---|---|---|
| 4096 | 53.7 | 39.7 |
| 8192 | Out Of Memory (OOM) | 90.5 |
| 16384 | Out Of Memory (OOM) | 239.1 |
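The flash-attention TTFT figures grow faster than linearly with prompt length, which is what you'd expect if a quadratic attention term sits on top of linear per-token prefill work. A quick check of the doubling ratios (using only the numbers reported above):

```python
# TTFT (s) with flash attention enabled, from the table above.
ttft = {4096: 39.7, 8192: 90.5, 16384: 239.1}

# Ratio of TTFT each time the prompt length doubles:
# purely linear prefill would give 2.0x, purely quadratic 4.0x.
sizes = sorted(ttft)
for small, big in zip(sizes, sizes[1:]):
    print(f"{small} -> {big}: {ttft[big] / ttft[small]:.2f}x")
```

The ratios land between 2x and 4x and rise with prompt size, consistent with the attention term taking over as context grows.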

burnt-resistor | today at 2:50 AM

Framework has gone fully down the Apple consumerization route of unrepairability and unupgradeability: a nonstandard machine with soldered-on RAM and no meaningful PCIe slots. There's only the superficial appearance of longevity and future-proofing when it's really yet another silo. There's no way to add IB, FC, or 100/400 GbE NICs to these machines. 5 GbE is a joke. Non-ECC RAM is a joke.