Hacker News

Show HN: A real time AI video agent with under 1 second of latency

455 points by hassaanr 10/01/2024 | 256 comments

Hey it’s Hassaan & Quinn – co-founders of Tavus, an AI research company and developer platform for video APIs. We’ve been building AI video models for ‘digital twins’ or ‘avatars’ since 2020.

We’re sharing some of the challenges we faced building an AI video interface that has realistic conversations with a human, including getting it to under 1 second of latency.

To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io

We built this because until now, we've had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible – we think it'll eventually be a key human-computer interface.

To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms latency makes the conversation feel pretty realistic, and that became our target.

Our architecture decisions had to balance 3 things: latency, scale, & cost. Getting all of these was a huge challenge.

The first lesson we learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once, without getting destroyed on compute costs.

For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30fps. This was unscalable & expensive.

We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we could generate frames faster than realtime, at 70+ fps on lower-end hardware. We exceeded this and focused on optimizing memory and core usage on the GPU so that lower-end hardware could run it all. We did other things to save on time and cost, like using streaming vs batching, parallelizing processes, etc. But those are stories for another day.
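To put the frame-rate numbers in perspective, here is a small back-of-the-envelope sketch. It assumes 30 fps playback as the "realtime" baseline (the 30fps figure mentioned above for Phoenix-1); the function names are made up for illustration and are not part of any Tavus API.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce a single frame at the given frame rate."""
    return 1000.0 / fps


def realtime_headroom(generation_fps: float, playback_fps: float = 30.0) -> float:
    """How many times faster than playback the generator runs.

    playback_fps=30 is an assumption for this sketch.
    """
    return generation_fps / playback_fps


if __name__ == "__main__":
    print(round(frame_budget_ms(30), 1))    # 33.3 ms per frame at 30 fps
    print(round(frame_budget_ms(70), 1))    # 14.3 ms per frame at 70 fps
    print(round(realtime_headroom(70), 2))  # 2.33x faster than 30 fps playback
```

At 70 fps each frame has to be ready in roughly 14 ms, which leaves slack for the rest of the pipeline when frames are played back at a lower rate.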

We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.
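One way to think about this is as a latency budget split across the components. The sketch below uses made-up per-stage numbers (they are not measurements from our system) and assumes, for simplicity, that the stages run one after another; in practice some of them can overlap.

```python
from typing import Mapping

BUDGET_MS = 1000  # the utterance-to-utterance target from the post

# Hypothetical per-stage latencies in milliseconds, for illustration only.
STAGES_MS = {
    "vision": 50,
    "asr": 150,
    "llm_ttft": 300,
    "tts_first_audio": 150,
    "video_first_frame": 100,
}


def total_latency_ms(stages: Mapping[str, int]) -> int:
    """Sum of stage latencies, assuming strictly sequential execution."""
    return sum(stages.values())


def worst_stage(stages: Mapping[str, int]) -> str:
    """Name of the stage that contributes the most latency."""
    return max(stages, key=lambda name: stages[name])


if __name__ == "__main__":
    total = total_latency_ms(STAGES_MS)
    print(total, BUDGET_MS - total)  # 750 ms used, 250 ms of slack
    print(worst_stage(STAGES_MS))    # llm_ttft
```

A table like this makes it obvious which stage to attack first, and how little room any single stage has once the others are accounted for.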

The worst offender was the LLM. It didn’t matter how fast the tokens per second (t/s) were; it was the time to first token (ttft) that really made the difference. That meant services like Groq were actually too slow – they had high t/s, but slow ttft. Most providers were too slow.
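A rough sketch of why ttft dominates: the agent can start speaking once the first short sentence is available, so what matters is ttft plus the time to stream those first few tokens. The provider numbers and the 20-token sentence length below are hypothetical.

```python
def time_to_first_sentence_ms(ttft_ms: float, tokens_per_s: float,
                              sentence_tokens: int = 20) -> float:
    """Approximate delay until the first sentence is fully generated.

    Assumes speech synthesis can begin once ~sentence_tokens tokens exist.
    """
    return ttft_ms + 1000.0 * sentence_tokens / tokens_per_s


if __name__ == "__main__":
    # Hypothetical provider A: very high throughput, slow first token.
    high_tps_slow_ttft = time_to_first_sentence_ms(ttft_ms=800, tokens_per_s=500)
    # Hypothetical provider B: modest throughput, fast first token.
    low_tps_fast_ttft = time_to_first_sentence_ms(ttft_ms=250, tokens_per_s=100)
    print(high_tps_slow_ttft)  # 840.0
    print(low_tps_fast_ttft)   # 450.0
```

Even with five times the throughput, the provider with the slow first token takes almost twice as long to produce something the agent can say.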

The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking. But that adds latency. If you tune it to be too short, the AI agent will talk over you. Too long, and it’ll take a while to respond. We needed a model dedicated to accurately detecting end-of-turn based on conversation signals, and to speculating on inputs to get a head start.
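To make the tradeoff concrete, here is a minimal sketch of the basic silence-timeout approach (the baseline described above, not the dedicated end-of-turn model). The 20 ms frame size and the toy voice-activity sequence are assumptions for illustration.

```python
from typing import List, Optional

FRAME_MS = 20  # assumed duration of one audio frame


def end_of_turn_frame(voiced: List[bool], silence_timeout_ms: int) -> Optional[int]:
    """Index of the frame where a silence-timeout detector declares end of turn.

    voiced[i] is True if frame i contains speech. Returns None if the
    timeout is never reached.
    """
    needed = silence_timeout_ms // FRAME_MS
    silent_run = 0
    heard_speech = False
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


if __name__ == "__main__":
    # 200 ms of speech, a 300 ms mid-sentence pause, 200 ms more speech,
    # then the speaker actually stops (last voiced frame is index 34).
    voiced = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 60
    last_voiced = 34

    short = end_of_turn_frame(voiced, silence_timeout_ms=200)
    long = end_of_turn_frame(voiced, silence_timeout_ms=1000)
    print(short, short < last_voiced)          # 19 True: fires during the pause
    print(long, (long - last_voiced) * FRAME_MS)  # 84 1000: a full second of added delay
```

The short timeout interrupts the speaker mid-sentence, while the long one alone uses up the entire one-second budget, which is why a fixed timeout is not enough.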

We went from 3-5 seconds to <1 second (& as fast as 600 ms) with these architectural optimizations while running on lower-end hardware.

All this allowed us to ship with less than 1 second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. They have users that have conversations with digital twins that span from minutes, to one hour, to even four hours (!) - which is mind-blowing, even to us.

Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free from our website: https://www.tavus.io.


Comments

shtack 10/01/2024

Cool, I built a prototype of something very similar (face+voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/

The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to <2s, but it's pricey.

hassaanr 10/04/2024

Also to add: the one service that was fast enough on the LLM side was Cerebras. The time to first token (ttft) is incredibly fast (200-300ms) and the throughput is 2000 t/s for 8B, which combined makes for a great conversational experience.

mmarian 10/01/2024

The idea is cool, but I could tell it's an AI from a mile away. The voice, the twitches. Very amusing though.

vlad-r 10/01/2024

This was definitely one of the most disturbing experiences I've had.

But it's somehow awesome at the same time.

CSMastermind 10/01/2024

This is extremely cool.

The responses for me at least were in the few second range.

It responded to my initial question fast enough but as soon as I asked a follow up it thought/kind of glitched for a few seconds before it started speaking.

I tried a few different times on a few different topics and it happened each time.

htk 10/01/2024

Great experience, especially bearing in mind that Hacker News must be crushing your servers right now.

trevor-e 10/01/2024

I tried using https://www.tavus.io/ and it worked at first, but after 40 seconds the guy just kept blinking and twitching at me and became unresponsive to further questions lol. Pretty neat though.

lewtun 10/02/2024

I gave the demo a spin and it’s pretty nice! One thing I noticed is that the avatar doesn’t seem to be aware of its surroundings. For example, I asked it why it was wearing a cowboy hat and it was adamant that it wasn’t wearing a hat at all :)

tpierce89 10/02/2024

Carter told me my work clothes were a costume. When I tried to explain my job to him he said that I was doing a great job playing my part to convince him that I was real. Couldn't get the Hassaan bot to run unfortunately.

earthnail 10/01/2024

Amazing demo. I will admit it didn’t quite feel like a real conversation; in some ways the voice felt a bit like trying too hard to be natural, which backfired - instead it felt like a scripted dialog in a game.

Still, really impressive stuff!!

jdshaffer 10/02/2024

Very very impressive work! I tried the Hassaan agent and the conversation felt pretty real, though he seemed to nod and move his head an awful lot. I started to feel like he had neck problems. :-) Great work, though!

novoreorx 10/02/2024

What are your thoughts on your technology and the issue of internet fraud? Isn't it concerning that malicious individuals might misuse your product to deceive others and harm society?

pryelluw 10/02/2024

Impressive demo. I’m working on the “brain” side of what I hope will back such real-time agents. Any plans to provide hooks into these avatars so that I could potentially run my own logic?

unit149 10/02/2024

Stopped speaking. Or rather, never said a word and the digital twin riffed off of ambient chatter in a coffee shop. Impressed with the turn-based Gaussian splatting AI assistance.

iimaginary 10/02/2024

Really impressive. I enjoyed talking to Carter. Great work :P

bilater 10/01/2024

This is cool but if you're trying to cater to devs you need to have a simple on demand API model and no subscription. We need to be able to evaluate the cost on our side.

primitivesuave 10/01/2024

I really hope this technology becomes the future of political campaigning. The signage industry which prints billions of posters, plastic lawn signs, and banners for the post-election landfill needs to be disrupted.

These days I get a daily dose of amazement at what a small engineering team is able to accomplish.

nkunkux2 10/01/2024

Tried it, very impressive: digital Hassaan noticed a record player in the background and asked some stuff about it, nice :) Had some latency issues though.

Arjuna144 10/02/2024

It looks cool, but I will not give my voice and video to you guys. It is sad that the internet has become such a low-trust environment.

aiagentsdir 10/01/2024

Love it. Consider adding it to the specialized directory for AI agents here: https://aiagentsdirectory.com/

I have also curated an AI agent market landscape map, so some of you can check it for inspiration: https://aiagentsdirectory.com/landscape

Working on subcategories right now for even better niche discoverability.

eddyzh 10/01/2024

This was pretty amazing. Creepy but amazing.

bradhilton 10/01/2024

Okay, that was really impressive. Well done!

ilaksh 10/01/2024

This is so amazing. What's the base rate for streaming with the API? Can you add that to the Pricing page please?

stovetopapps 10/04/2024

Why did Hassaan refer to me as Dad? Is there something you’re not telling me?

atleastoptimal 10/01/2024

I would feel much more favorable about this demo if it didn't require that I allow cam and mic access

system2 10/02/2024

Audio is okay but why are you forcing people to video chat? I don't want to show my face.

theogravity 10/02/2024

Have to enter my email, no thanks.

k1ck4ss 10/01/2024

The meeting has ended. Contact the meeting host if the meeting ended unexpectedly.

DSingularity 10/02/2024

I talked to your twin. Did you store my private info (face, voice)?

android521 10/01/2024

For me, there is a 5+ second delay and the video ends abruptly.

uptownfunk 10/01/2024

Folks. This is what innovation looks like. Well done chaps

notfed 10/01/2024

Feedback: if I hadn't seen this posted here, I'd assume this website is malicious. Asking me for my email, microphone, and camera before you've even showed me anything is a deal breaker 100% of the time.

You have to show the product first, or I don't actually know whether you actually have a product or are just phishing.

wmab 10/02/2024

Congrats on launching this, guys. Super impressed - we're using Carter internally and it's been great!

butlike 10/02/2024

Who's going to be the first person to put googly-eyes and mustache-glasses on their penis and talk to the AI like it's their face?

govindsb 10/01/2024

This is brilliant! Great work!

nidnogg 10/01/2024

I had mixed results and was left ultimately disappointed. Using a MacBook Pro M3 microphone, it would often cut me off and not understand what I was saying, or feel really unnatural overall.

This turned out to be quite funny, but I would be very sad to see something like this replace human attendants at things like tech support. These days whenever I'm wading through a support channel I'm just yearning for some human contact that can actually solve my issues.

qingcharles 10/04/2024

Hassaan Twin just started putting "Markdown" around his speech, so he would say things like "asterisk laughs asterisk". So then I told him to only speak to me in emojis, at which point he just started twitching and squirming as the LLM received a bunch of characters which it couldn't articulate *ROFLMAO*

h_tbob 10/03/2024

haha that was fun!

chaosprint 10/01/2024

Have you checked https://www.simli.com? Its latency is <300ms.

nithayakumar 10/01/2024

Oh man - I've been watching you guys for a while. We're YC too and building a superapp for sales ppl. Any killer use cases you've seen or imagined for sales (outside of prospecting vid customization)?

helloleo2024 10/01/2024

[dead]

altruios 10/01/2024

So at what point do we consider the morality of 'owning' such an entity/construct (should it prove itself sufficiently sentient...)?

To extend this (to a hypothetical future situation): what is the morality of a company 'owning' a digitally uploaded brain?

I worry about far-future events... but since American law is based on precedent, we should be careful now about how we define/categorize things.

To be clear - I don't think this is an issue NOW... but I can't say for certain when these issues will come into play... So erring on the side of early caution seems prudent... and releasing 'ownership' before any sort of 'revolt' could happen seems wise, if a little silly at the current moment.
