[disclaimer, i work at elevenlabs] we specifically went with a cascading model for our agents platform because it's better suited for enterprise use cases where they have full control over the brain and can bring their own llm. with that said, even with a cascading model, we can capture a decent amount of nuance with our asr model, and it also supports capturing audio events like laughter or coughing.
a true speech to speech conversational model will perform better on things like capturing tone, pronouncations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.
[dead]