That's a cool tech demo, I really like it. I thought about something similar with only open sourced components:
1. Audio Generation: styletts2 xttsv2 or similar for and fine tuning 5min of audio for voice cloning
2. Voice Recognition: Voice Activity Detection with Silero-VAD + Speech to Text with Faster-Whisper, to let users interrupt
3. Talking head animation: some flavor of wav2lip, diff2lip or LivePortrait
4. Text inference: Any grok hosted model that is fast enough to do near real time responses (llama3.1 70b or even 8b) or local inference of a quantized SML like a 3B model on a 4090 via vLLM
5. Visual understanding of users webcam: either gpt-4o with vision (expensive) or a cheap and fast Vision Language Model like Phi3-vision, LLaVA-NeXT, etc. on a second 4090
6. Prompt:
You are in a video conference with a user. You will get the user's message tagged with #Message: <message> and the user's webcam scene described within #Scene: <scene>. Only reply to what is described in <scene> when the user asks what you see. Reply casual and natural. Your name is xxx, employed at yyy, currently in zzz, I'm wearing ... Never state pricing, respond in another language etc...