Hacker News

barumrho, today at 3:23 PM

100 tok/s sounds pretty good. What do you get with 70B? With 128 GB, you need quantization to fit a 70B model, right?

I'm wondering if a local LLM (for coding) is a realistic option; if it isn't, I wouldn't have to max out the RAM.
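For the 70B question, a rough back-of-the-envelope sketch may help. It only counts the weights and assumes every parameter is stored at the same nominal bit width; real quantization formats carry a little extra for scales, and the KV cache and runtime overhead come on top. The helper name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes).

    Assumption: all parameters use the same bit width; KV cache,
    activations and runtime overhead are not included.
    """
    return params_billion * bits_per_weight / 8


# Compare a 70B model at a few nominal precisions against 128 GB of RAM.
for bits in (16, 8, 4):
    size = weight_gb(70, bits)
    verdict = "fits" if size < 128 else "does not fit"
    print(f"70B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in 128 GB")
```

By this estimate the 16-bit weights alone (about 140 GB) exceed 128 GB, while 8-bit (about 70 GB) and 4-bit (about 35 GB) leave room for context, so some quantization does look necessary at that RAM size.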


Replies

super_mario, today at 3:57 PM

I run the gpt-oss 120b model on ollama (the model is about 65 GB on disk) with a 128k context size on an M4 Max Mac Studio with 128 GB RAM, and I get 65 tokens/s. The model is super optimized: the KV cache only uses 4.8 GB of additional RAM at this context size.
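To see how a KV cache figure like this can arise, here is a minimal sketch of the usual size estimate: keys plus values, per layer, per KV head, per token. The architecture numbers in the example (18 full-attention layers, 8 KV heads, head dimension 64, 16-bit cache entries) are assumptions for illustration and are not stated anywhere in this thread; layers that use a short sliding window are ignored, since they would add very little. The function name `kv_cache_bytes` is also made up.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector
    per layer, per KV head, per cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed values, not confirmed by the thread:
# 18 full-attention layers, 8 KV heads, head_dim 64, 128k tokens, fp16 cache.
size = kv_cache_bytes(layers=18, kv_heads=8, head_dim=64, tokens=128 * 1024)
print(f"~{size / 1e9:.1f} GB")
```

Under these assumed values the estimate comes out near 4.8 GB. The general point is that few KV heads and a small head dimension keep the cache small even at long context.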
