> Even if intelligence scaling stays equal, you'll lose out on speed. A sota model pumping 200 tk/s is going to be impossible to ignore with a 4 year old laptop choking itself at 3 tk/s.
Unless you're YOLOing it, you can only review at a certain speed, and only for so many hours a day.
The only tokens/s you need is a rate that keeps you busy, and I expect that even a slow 5 token/s model, utilised 60 seconds of every minute, 60 minutes of every hour and 24 hours of every day, produces way more output than you can review in a single working day.
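As a rough back-of-envelope check (the ~0.75 words-per-token ratio below is an assumption, not a measurement):

```python
# Back-of-envelope: daily output of a slow local model running non-stop.
tokens_per_second = 5
seconds_per_day = 60 * 60 * 24                         # 86,400 s
tokens_per_day = tokens_per_second * seconds_per_day   # 432,000 tokens
words_per_day = int(tokens_per_day * 0.75)             # ~324,000 words (assumed ratio)

print(f"{tokens_per_day:,} tokens/day, roughly {words_per_day:,} words")
```

That's several novels' worth of text per day, far more than anyone can carefully review.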
The goal we should be moving towards is longer-running tasks, not quicker responses. If I can queue 30 tasks for my local LLM before bed, wake up in the morning, queue a different 30, and only then start reviewing, then I'll spend the whole day just reviewing while the LLM generates code for tomorrow's review. For that workflow, a local model running at 5 tokens/s is sufficient.
If you're working serially (ask the LLM to do something, review what it gave you, then ask it to do the next thing), then sure, you need as many tokens per second as possible.
Personally, I want to move to long-running tasks and not have to babysit the thing all day, checking in at 5-minute intervals.
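Something like the sketch below is the shape of the "queue before bed" workflow I have in mind. It assumes a local Ollama server on its default port; the model name and the prompts/ and out/ directories are just placeholders.

```python
# Overnight batch runner: queue prompt files before bed, review outputs in the morning.
# Assumes an Ollama server at localhost:11434; model name and paths are illustrative.
import json
import pathlib
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def run_task(prompt: str, model: str = "llama3") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    out_dir = pathlib.Path("out")
    out_dir.mkdir(exist_ok=True)
    for task in sorted(pathlib.Path("prompts").glob("*.txt")):  # one file per queued task
        result = run_task(task.read_text())
        (out_dir / f"{task.stem}.md").write_text(result)
        print(f"done: {task.name}")
```

At 5 tokens/s it doesn't matter how long each task takes; what matters is that the outputs are waiting for you when you sit down to review.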