
Show HN: LemonSlice – Upgrade your voice agents to real-time video

78 points by lcolucci yesterday at 5:55 PM | 83 comments

Hey HN, we're the co-founders of LemonSlice (try our HN playground here: https://lemonslice.com/hn). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
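To make that masking difference concrete, here is a minimal PyTorch sketch (our illustration of the general technique, not LemonSlice's actual code; sizes are made up):

    import torch

    T = 8  # number of video frames (illustrative)

    # Bidirectional: every frame attends to every other frame, so frame t
    # depends on future frames and the whole clip must be denoised at once.
    bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

    # Causal: frame t attends only to frames <= t, so finished frames can be
    # streamed out while later frames are still being generated.
    causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))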

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
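For readers who haven't seen it, sliding window attention bounds memory by letting each frame attend only to the most recent k frames instead of the full history, so the KV cache stays a fixed size however long the call runs. A minimal sketch of such a mask (again our illustration, assuming a per-frame attention layout; not their implementation):

    import torch

    def sliding_window_mask(T: int, window: int) -> torch.Tensor:
        # Frame i attends only to frames j with i - window < j <= i.
        i = torch.arange(T).unsqueeze(1)  # query frame index
        j = torch.arange(T).unsqueeze(0)  # key frame index
        return (j <= i) & (j > i - window)

    mask = sliding_window_mask(T=16, window=4)  # memory scales with window, not T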

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
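The RoPE change is easy to illustrate: rotary embeddings are usually written as multiplying adjacent channel pairs, viewed as complex numbers, by e^{i*theta}, but the identical rotation can be expressed with real arithmetic only, which tends to be friendlier to kernel fusion and low-precision inference. A sketch of the equivalence, assuming the standard RoPE formulation (our reconstruction, not their kernel):

    import torch

    def rope_complex(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # View adjacent channel pairs as complex numbers, rotate by e^{i*theta}.
        xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        rot = torch.polar(torch.ones_like(theta), theta)
        return torch.view_as_real(xc * rot).flatten(-2)

    def rope_real(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # The same 2D rotation, written with real ops only.
        x1, x2 = x[..., 0::2], x[..., 1::2]
        cos, sin = theta.cos(), theta.sin()
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out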

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback!

EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.


Comments

givinguflac today at 2:22 AM

While the tech is impressive, from the perspective of an end user interacting with it, I want nothing to do with this, and I can’t support it. Neat as a one-off, but destructive imho.

It’s bad enough some companies are doing AI-only interviews. I could see this used to train employees, interview people, replace people at call centers… it’s the next step towards an absolute nightmare. Automated phone trees are bad enough.

There will likely be little human interaction in those and many other situations, and hallucinations will definitely disqualify some people from some jobs.

I’m not anti-AI; I’m against destructive innovation in AI that leads to personal health and societal issues, just as modern social media has. I’m not saying this tool is that, I’m saying it’s a foundation for that.

People can choose to not work on things that lead to eventual negative outcomes, and that’s a personal choice for everyone. Of course hindsight is 20/20 but some things can certainly be foreseen.

Apologies for the seemingly negative rant, but the positivity echo chamber in this thread is crazy and I wanted to provide an alternative view.

jonsoft today at 2:36 AM

I asked the Spanish tutor if he/it was familiar with the terms seseo[0] and ceceo[1] and he said he wasn't, which surprised me. Ideally it would be possible to choose which Spanish dialect to practise, as mainland Spain pronunciation is very different from Latin America's. In general, it didn't convince me that it was really hearing how I was pronouncing words, an important part of learning a language. I would say the tutor is useful for intermediate and advanced speakers, but not for beginners, due to this and the speed at which he speaks.

At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.

The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.

In general, a very positive experience, thank you.

[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis...

[2] https://research.nvidia.com/labs/adlr/personaplex/

pickleballcourt yesterday at 8:45 PM

One thing I've learnt from movie production is that what actually separates professional from amateur quality is the audio itself. Have you thought about implementing Personaplex from NVIDIA, or other voice models that can both talk and listen at the same time?

Currently the conversation still has that STT-LLM-TTS feel that I think a lot of voice agents suffer from (it seems like only Sesame and NVIDIA have nailed natural conversation flow so far). Still, crazy good work training your own diffusion models; I remember taking a look at the latest literature on diffusion and being mind-blown by the advances in the last year or so since the U-Net architecture days.

EDIT: I see that the primary focus is on video generation, not audio.

r0fl yesterday at 7:15 PM

Pricing is confusing

> Video Agents: unlimited agents, up to 3 concurrent calls

> Creative Studio: 1-min-long videos, up to 3 concurrent generations

Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?

Can I have different avatars or only the same avatar x 3?

Can I record the avatar and make videos and post on social media?

convivialdingo yesterday at 7:59 PM

That's super impressive! Definitely one of the best conversational agents I've tried in terms of A/V sync and response times.

The text processing is running Qwen / Alibaba?

skandan yesterday at 7:11 PM

Wow, this team is non-stop!!! Wild that this small crew is dropping hit after hit. Is there an open Polymarket on who acquires them?

snowmaker today at 1:41 AM

I made a golden retriever you can talk to using Lemon Slice: https://lemonslice.com/hn/agent_5af522f5042ff0a8

Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI" moment; it really does feel like the computer has come alive.

peddling-brink yesterday at 11:50 PM

I got really excited when I saw that you were releasing your model.

> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

But after digging around for a while, searching for a Hugging Face link, I’m guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open-weights model that people can run themselves?

Oh well, this looks very cool regardless and congratulations on the release.

wumms yesterday at 7:24 PM

You could add a Max Headroom to the HN link. You might reach real time by interspersing freeze frames, duplicates, or static.

r0fl yesterday at 7:19 PM

Wow I can’t get enough of this site! This is literally all I’ve been playing with for like half an hour. Even moved a meeting!

My mind is blown! It feels like the first time I used my microphone to chat with AI.

zvonimirs yesterday at 6:10 PM

We're launching a new AI assistant and I wanted to make it feel alive, so I started playing around with LemonSlice and I loved it!! I wanted to make our assistant feel like a coworker by giving it the ability to create Loom-style videos. Here's what I created: https://drive.google.com/file/d/1nIpEvNkuXA0jeZVjHC8OjuJlT-3...

Anyway, big thumbs up for the LemonSlice team; I'm excited to see it progress. I can definitely see products starting to come alive with tools like this.

dreamdeadline yesterday at 6:36 PM

Cool! Do you plan to expose controls over the avatar’s movement, facial expressions, or emotional reactions so users can fine-tune interactions?

sid-the-kid yesterday at 6:15 PM

Hey HN! One of the founders here. As of today, we are seeing informational avatars + roleplaying for training as the most common use cases. The roleplaying use case was surprising to us. Think a nurse training to triage with AI patients, or SDRs practicing lead qualification with different kinds of clients.

FatalLogic today at 12:47 AM

Your demo video defaults to playing at 1.5x speed.

You probably didn't intend to do that

bennyp101 yesterday at 7:06 PM

Heads up, your privacy policy[0] does not work in dark mode - I was going to comment saying it made no sense, then I highlighted the page and more text appeared :)

[0] https://lemonslice.com/privacy

ripped_britches today at 1:21 AM

Very freaking impressive!

r0fl yesterday at 7:12 PM

Where’s the HN playground to grab a free month?

I have so many websites that would do well with this!

koakuma-chan yesterday at 7:19 PM

> You're probably thinking, how is this useful

I was wondering why the quality is so poor.

r0fl yesterday at 7:11 PM

Wow this is the most impressive thing I’ve seen on hacker news in years!!!!!

Take my money!!!!!!

buddycorp yesterday at 6:24 PM

I'm curious if I can plug my own OpenAI realtime voice agents into this.

swyx today at 12:38 AM

This is like Tavus but it doesn't suck. Congrats!

jedwhite yesterday at 9:39 PM

That's an interesting insight about "stacking tricks" together. I'm curious where you found that approach hit limits, and what, if anything, gives you an advantage against others copying it. Getting real-time streaming with a 20B-parameter diffusion model at 20fps on a single GPU seems objectively impressive. It's hard to resist just saying "wow" looking at the demo, but I know that's not helpful here. It is clearly a substantial technical achievement, and I'm sure lots of other folks here would be interested in the limits of the approach and how generalizable it is.

korneelf1 yesterday at 8:55 PM

Wow this is really cool, haven't seen real-time video generation that is this impressive yet!

benswerd yesterday at 7:08 PM

The last year vs this year is crazy

shj2105 yesterday at 7:41 PM

Not working on mobile iOS

ed_mercer yesterday at 6:35 PM

This looks super awesome!

marieschneegans yesterday at 6:46 PM

This is next-level!

ProjectBarks yesterday at 9:21 PM

Removing - Realized I made a mistake
