logoalt Hacker News

Yukonvyesterday at 7:12 PM2 repliesview on HN

That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see hows Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8 since output is I/O bottlenecked there should room for higher accuracy compute.


Replies

anemllyesterday at 7:19 PM

Check my repo, I had added some support for GUFF/untloth, Q3,Q5/Q8 https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

3abitonyesterday at 9:18 PM

To be fair, it's "possible" to run such setup with llama.cpp with ssd offload. It's just abysmal TG speeds. But it's possible.