Good job on the launch and the write-up. I'm looking forward to playing with this API.
I'm glad to see TTFT (time to first token) discussed here. As someone who's been deep in the AI and generative AI trenches, I think latency is going to be the real bottleneck for a bunch of use cases. 1900 tps is impressive, but if TTFT is 3-5 seconds, there's a whole lot you just can't use it for.
It seems intuitive to me that once we've hit human-level tokens per second in a given modality, latency rather than throughput should be the metric we focus on. Your sub-1-second result is a big deal in that context.
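
To put rough numbers on the latency point, here's a back-of-the-envelope sketch. It assumes total response time is simply TTFT plus output tokens divided by tps, which ignores network overhead and any variation in streaming speed. The 200-token reply length and the function name are my own placeholders, not anything from the post.

```python
def total_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Simplified end-to-end time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical 200-token reply at the 1900 tps figure mentioned above.
for ttft in (0.5, 3.0, 5.0):
    total = total_latency_s(ttft, 200, 1900)
    share = ttft / total  # fraction of the wait spent before any output appears
    print(f"TTFT {ttft:.1f}s: total {total:.2f}s, {share:.0%} of it before the first token")
```

Under these assumptions, generating the 200 tokens takes only about a tenth of a second, so at a 3-5 second TTFT nearly all of the wait happens before the first token shows up.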