HELLO MR SEAN,
1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.
2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?
3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.
4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.
5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.
I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.
The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.
[flagged]
Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.