As a user of a lot of coding tokens I’m most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.
I’m sure I’d use more tokens because I’d get more revs, but I don’t think token usage would increase linearly with speed: I need time to think about what I want to do and about what’s happened or is proposed. But I feel like I could stay in flow state if the responses were faster, and that’s super appealing.
Impressive performance work. It's interesting that 40%+ perf gains like this are still turning up.
Makes you think the cost of a fixed level of "intelligence" will keep dropping.
Now all we need is better support for AMD GPUs, both CDNA and RDNA types
Love vLLM!
Still have to update it for snakepit 0.11.0, but I did start a vLLM wrapper for Elixir
https://hex.pm/packages/vllm
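For anyone wanting to poke at it from another language in the meantime: vLLM's server speaks an OpenAI-compatible HTTP API, so a thin client is mostly just a JSON POST. A minimal stdlib-only Python sketch (the base URL and model name are placeholders, and this assumes you've started `vllm serve` locally on the default port):

```python
import json
from urllib import request

def completion_request(base_url: str, model: str, prompt: str, max_tokens: int = 128):
    """Build a request for vLLM's OpenAI-compatible /v1/completions endpoint."""
    body = json.dumps({
        "model": model,          # must match the model the server was launched with
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Usage (assumes a running server, e.g. `vllm serve <model>` on localhost:8000):
# req = completion_request("http://localhost:8000", "<model>", "Hello")
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Any wrapper (Elixir included) is essentially this plus connection pooling and streaming support.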