How is tok/s not a bottleneck? I assume most people still use AI agents interactively rather than leaving them to do their own thing during the night.
I find anything below 50 tps or so entirely unusable...
Regardless, it's apples to oranges anyway. Inference is quite cheap for open-weight models; it's just that Claude and OpenAI can charge very high margins compared to e.g. DeepSeek or the various providers on OpenRouter, since open models are a commodity.
Is interactive use for coding something that actually works today? With unsafe mode, even frontier hosted models are slow enough I end up just tabbing out to work on other tasks. It would need to be much faster if I am to sit and stare at it while it churns. Local models might be a lot slower but workflow-wise it doesn't change much for me.
It's not a bottleneck if you care about the actual code.
I start up 4 or so projects, then go do other things for 4 hours. I don’t have enough energy to steer overnight, but I’m at least “semi afk” for daytime steering. So throughput is king for me: tokens per hour, not latency or actual tokens per second.
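To put rough numbers on the tokens-per-hour point, here is a back-of-the-envelope sketch. The function name, the speeds, and the "busy fraction" (the share of wall-clock time a session spends generating rather than waiting on a human) are all made-up assumptions for illustration, not measurements from anyone in this thread.

```python
def tokens_per_hour(tok_per_sec: float, sessions: int = 1, busy_fraction: float = 1.0) -> float:
    """Aggregate tokens generated per hour across parallel sessions.

    busy_fraction is a hypothetical knob: the share of each hour a session
    is actually generating instead of idling until someone steers it.
    """
    return tok_per_sec * 3600 * sessions * busy_fraction


# Assumed scenario A: one fast interactive session that idles most of the
# time while the human reads, thinks, and types.
interactive = tokens_per_hour(50, sessions=1, busy_fraction=0.25)

# Assumed scenario B: four slower unattended sessions that run flat out.
unattended = tokens_per_hour(15, sessions=4, busy_fraction=1.0)

print(f"interactive: {interactive:,.0f} tokens/hour")
print(f"unattended:  {unattended:,.0f} tokens/hour")
```

Under these invented inputs the four slow sessions come out ahead on tokens per hour even though each one is well below the 50 tps mark; change the busy fraction or session count and the comparison can flip.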