Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/
The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.
This looks awesome. It didn’t seem to hear me, but the video looks great. Can you share which models you are using? You say these are all open models?