> If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s.
I think this is the difference between people who embrace hobby LLMs and people who don’t:
For large models, the token/s output speed on affordable local hardware just isn't good enough for me. I already wish the cloud-hosted solutions were several times faster. Any time I go to a local model it feels like I'm writing e-mails back and forth to an LLM, not working with it.
And also, the first Apple M1 chip was released less than 5 years ago, not 7.
> Any time I go to a local model it feels like I’m writing e-mails back and forth
Do you have a good accelerator? If you're offloading to a powerful GPU it shouldn't feel like that at all. I've gotten ChatGPT speeds from a 4060 running the gpt-oss 20B and Qwen3 30B models, both of which are competitive with OpenAI's last-gen models.
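
(If it helps anyone hitting the same wall: here's a minimal sketch of what full GPU offload looks like via llama-cpp-python. The GGUF filename and context size are placeholders, not specific recommendations.)

    from llama_cpp import Llama

    # Load a quantized GGUF model and push every layer onto the GPU.
    # n_gpu_layers=-1 means "offload all layers"; leave it at the default (0)
    # and you're back to CPU-only speeds, i.e. the e-mail feeling.
    llm = Llama(
        model_path="qwen3-30b-a3b-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=-1,
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain GPU offloading in one paragraph."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])

Same idea with plain llama.cpp (-ngl 99 on llama-cli/llama-server) or Ollama, which offloads automatically when it detects a supported GPU.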
> the first Apple M1 chip was released less than 5 years ago
Core ML has been running on Apple-designed silicon for 8 years now, if we really want to get pedantic. But sure, actual LLM/transformer use is a more recent phenomenon.