Pretty cool, but the mouth / lip-sync seems quite a bit off, even for the video generation API. Is that the best rendering, or are the demo videos stale?
Also, the audio cloning sounds quite a bit different from the input on https://www.tavus.io/product/video-generation.
For live avatar conversations, it will be interesting to see how models like OpenAI's GPT-4o, with its audio-in / audio-out websocket streaming API (which came out yesterday), will work with technology like this. It looks like there will be a live audio transcript delta arriving at the same time as the audio, which could drive a mouth articulation model, and so on.
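A minimal sketch of that idea, assuming the websocket event shape OpenAI announced for the Realtime API (the response.audio_transcript.delta event type and model name are from their docs; the character-to-viseme table and the mouth-posing step are purely hypothetical stand-ins):

    import asyncio
    import json
    import os

    import websockets  # pip install websockets

    # Hypothetical character -> viseme table. A real pipeline would run
    # grapheme-to-phoneme conversion and time-align against the audio deltas.
    VISEMES = {"a": "open", "o": "round", "e": "wide",
               "m": "closed", "b": "closed", "p": "closed"}

    async def drive_mouth():
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
        headers = {
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        }
        # websockets >= 14 takes additional_headers; older versions call it extra_headers.
        async with websockets.connect(url, additional_headers=headers) as ws:
            async for raw in ws:
                event = json.loads(raw)
                if event.get("type") == "response.audio_transcript.delta":
                    # The transcript arrives in small increments alongside the
                    # audio chunks, so each delta can be fed to the articulation
                    # model at roughly the same latency as the audio itself.
                    for ch in event["delta"].lower():
                        pose = VISEMES.get(ch, "neutral")
                        print(pose)  # stand-in for posing the avatar's mouth rig

    asyncio.run(drive_mouth())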
Presumably Gaussian splatting or a physical 3D model could run locally for optimal speed?