Hacker News

dirk94018 today at 2:19 PM, 7 replies

On an M4 Max with 128GB we're seeing ~100 tok/s generation on a 30B parameter model in our from-scratch inference engine. Very curious what the "4x faster LLM prompt processing" translates to in practice. Smallish local 30B-70B inference is genuinely usable territory for real dev workflows, not just demos. It will require staying plugged in, though.


Replies

fotcorn today at 3:00 PM

The memory bandwidth on M4 Max is 546 GB/s, M5 Max is 614 GB/s, so not a huge jump.

The new tensor cores, sorry, "Neural Accelerator" only really help with prompt preprocessing aka prefill, and not with token generation. Token generation is memory bound.
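
A rough back-of-the-envelope sketch of why generation is bandwidth-bound: if every weight has to be read from memory once per generated token, bandwidth divided by the model's size in bytes gives an upper bound on tok/s. The function name, the dense 70B model and the 4-bit weights are illustrative assumptions, not measurements; only the two bandwidth figures come from the comment above.

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float,
                             params_billion: float,
                             bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every weight is read once per token.

    Assumes a dense model and ignores KV-cache reads and any compute cost.
    """
    gb_read_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures are the ones quoted above; model size and precision are assumptions.
m4_max = decode_upper_bound_tok_s(546, params_billion=70, bits_per_weight=4)
m5_max = decode_upper_bound_tok_s(614, params_billion=70, bits_per_weight=4)

print(f"M4 Max bound: {m4_max:.1f} tok/s")
print(f"M5 Max bound: {m5_max:.1f} tok/s")
print(f"Ratio: {m5_max / m4_max:.2f}x")
```

Whatever model size you plug in, the ratio stays at 614/546, roughly 1.12x, under these assumptions.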

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.

hu3 today at 2:32 PM

What about real workloads? Because as context gets larger, these local LLMs approach the useless end of the spectrum with regard to t/s.

storus today at 2:38 PM

4x faster is about token prefill, i.e. the time to first token. It should be on par with DGX Spark there while being slightly faster than M4 for token generation. So when you have a long context, you don't need to wait 15 minutes, only 4 minutes.
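
To make that arithmetic concrete, here is a minimal sketch: time to first token is roughly prompt length divided by prefill throughput, so a 4x prefill speedup turns a 15-minute wait into just under 4 minutes. The prompt length and the baseline prefill rate are made-up numbers chosen to reproduce the 15-minute example, not benchmarks.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Approximate TTFT, ignoring model load time and scheduling overhead."""
    return prompt_tokens / prefill_tok_s


PROMPT_TOKENS = 90_000     # hypothetical long context
BASELINE_PREFILL = 100.0   # hypothetical tok/s, gives a 15-minute prefill
SPEEDUP = 4.0              # the claimed "4x faster prompt processing"

before = time_to_first_token_s(PROMPT_TOKENS, BASELINE_PREFILL)
after = time_to_first_token_s(PROMPT_TOKENS, BASELINE_PREFILL * SPEEDUP)

print(f"before: {before / 60:.2f} min, after: {after / 60:.2f} min")
# prints "before: 15.00 min, after: 3.75 min"
```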

fulafel today at 3:49 PM

The marketing subterfuge might be about exactly this: technically, prompt processing means the prefill phase of inference. So the prompt goes in 4x as fast, but token generation is slower.

This even seems likely, as the memory bandwidth hasn't increased enough for those kinds of speedups, and I guess prefill is more likely to be compute-bound (vs. memory-bandwidth-bound).
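
A simplified roofline-style sketch of that guess: during prefill many prompt tokens share each weight read, so the FLOPs done per byte fetched are far higher than in generation, where each pass handles one token. The peak-compute number is a hypothetical placeholder rather than a published spec, the batch size and 4-bit weights are assumptions, and attention and activation traffic are ignored.

```python
def flops_per_byte(tokens_per_pass: int, bits_per_weight: float) -> float:
    """Arithmetic intensity of the weight matmuls.

    Each weight does about 2 FLOPs (multiply + add) per token processed,
    and is read from memory once per pass however many tokens share it.
    """
    bytes_per_weight = bits_per_weight / 8
    return 2 * tokens_per_pass / bytes_per_weight


def is_compute_bound(intensity: float, peak_tflops: float, bandwidth_gb_s: float) -> bool:
    """Roofline test: compute-bound when intensity exceeds the machine balance."""
    machine_balance = (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)  # FLOPs per byte
    return intensity > machine_balance


PEAK_TFLOPS = 50.0   # hypothetical placeholder, not a published spec
BANDWIDTH = 614.0    # GB/s, the M5 Max figure quoted in this thread

decode = flops_per_byte(tokens_per_pass=1, bits_per_weight=4)      # one token per pass
prefill = flops_per_byte(tokens_per_pass=2048, bits_per_weight=4)  # a prompt chunk per pass

print("decode compute-bound: ", is_compute_bound(decode, PEAK_TFLOPS, BANDWIDTH))
print("prefill compute-bound:", is_compute_bound(prefill, PEAK_TFLOPS, BANDWIDTH))
```

With these placeholder numbers, decode lands on the memory-bound side and prefill on the compute-bound side; the real crossover depends on the actual hardware figures.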

eknkc today at 2:25 PM

I find time to first token more important than tok/s generally, as these models wait an ungodly amount of time before streaming results. It looks like the claims are true based on M5: https://www.macstories.net/stories/ipad-pro-m5-neural-benchm... so this might work great.

barumrho today at 3:23 PM

100 tok/s sounds pretty good. What do you get with 70B? With 128GB, you need quantization to fit a 70B model, right?

Wondering if a local LLM (for coding) is a realistic option; otherwise I wouldn't have to max out the RAM.
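
A quick sketch of the fit question, counting weights only: parameters times bits per weight, divided by 8, gives bytes. The function name is illustrative, and this ignores the KV cache, activations, and whatever share of the 128GB the OS keeps for itself, so treat it as a lower bound on what is needed.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


RAM_GB = 128  # unified memory on the machine discussed in this thread

for bits in (16, 8, 4):
    size = weight_footprint_gb(70, bits)
    verdict = "fits" if size < RAM_GB else "does not fit"
    print(f"70B @ {bits}-bit: {size:.0f} GB -> {verdict} in {RAM_GB} GB")
```

Under these assumptions, 16-bit weights for a 70B model come to 140 GB, which is over 128 GB, while 8-bit (70 GB) and 4-bit (35 GB) leave headroom.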

butILoveLife today at 2:24 PM

[flagged]
