Hacker News

modeless (yesterday at 10:49 PM)

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU (https://github.com/jdarpinian/chirpy), but I concluded that making something worth using for real tasks would require training end-to-end models. It's a really interesting problem I would love to tackle, but it's out of my budget for a side project.


Replies

coppsilgold (today at 6:06 AM)

Some of the best current voice tokenizers run at ~12 Hz; that's many more tokens than a regular LLM would use for ultimately the same content.
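To make the overhead concrete, here is a back-of-envelope comparison of audio tokens at a ~12 Hz frame rate versus text tokens for the same minute of speech. The speech rate and tokens-per-word figures are rough rule-of-thumb assumptions, not measurements of any particular model:

```python
# Rough comparison: audio tokens at ~12 Hz vs. text tokens for the same content.
# All rates below are illustrative assumptions.

audio_tokens_per_sec = 12      # assumed audio tokenizer frame rate
words_per_min = 150            # typical conversational speaking rate
text_tokens_per_word = 1.3     # common rule of thumb for BPE tokenizers

seconds = 60
audio_tokens = audio_tokens_per_sec * seconds                         # 720
text_tokens = (words_per_min / 60) * text_tokens_per_word * seconds   # 195.0

print(f"audio: {audio_tokens}, text: {text_tokens:.0f}, "
      f"ratio: {audio_tokens / text_tokens:.1f}x")
```

Under these assumptions a minute of speech costs roughly 3-4x the tokens of its transcript, and that multiplier compounds with context length.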

nicktikhonov (yesterday at 10:53 PM)

If you're of that opinion, you'll enjoy the new stuff coming out from NVIDIA:

https://research.nvidia.com/labs/adlr/personaplex/

rockwotj (today at 3:01 AM)

Fundamentally, "guessing when it's your turn" needs to be baked into the model. I think the full-duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037
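The core idea of full duplex is that every time step carries tokens for both speakers, so turn-taking is a learned behavior rather than a separate endpoint detector. The toy sketch below illustrates that stream layout only; the `Frame` class and silence-based policy are invented for illustration and are not Moshi's actual architecture:

```python
# Toy illustration of full-duplex framing (not Moshi's real model): each step
# holds a token for the user's stream AND the model's stream, so the model can
# speak, stay silent, or overlap without an external turn-taking component.

from dataclasses import dataclass

@dataclass
class Frame:
    t: int
    user_token: int    # codec token heard from the user this step
    model_token: int   # codec token the model emits this step (0 = silence)

def step(history: list, user_token: int) -> Frame:
    # A real model would run a transformer over the interleaved history;
    # this stand-in policy just stays silent while the user is "speaking"
    # (nonzero tokens) and replies during the pause.
    model_token = 0 if user_token != 0 else 1
    return Frame(t=len(history), user_token=user_token, model_token=model_token)

history = []
for tok in [5, 7, 0, 0]:   # user talks for two frames, then pauses
    history.append(step(history, tok))
print([f.model_token for f in history])  # [0, 0, 1, 1]
```

Because both streams advance in lockstep, there is no discrete "your turn / my turn" handoff for the system to guess at.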

russdill (today at 5:27 AM)

At least when running things locally, such a model completely blows up your latency.

com2kid (today at 1:07 AM)

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low-latency STT/ASR moves more slowly, LLMs move at a pretty good pace).

donpark (today at 1:39 AM)

But I've read somewhere that the KV cache for a speech-to-speech model explodes in size with each turn, which could make on-device full-duplex S2S unusable except for quick chats.
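A rough estimate shows why: at an audio frame rate the cache grows every second whether or not anyone is saying much. Every number below (layer count, head count, head dimension, frame rate) is an assumed illustrative config, not a measured model:

```python
# Rough KV-cache growth for a speech-to-speech model at a ~12 Hz frame rate.
# All model dimensions here are illustrative assumptions.

layers = 32
kv_heads = 8
head_dim = 128
bytes_per_val = 2        # fp16
tokens_per_sec = 12      # assumed audio tokenizer frame rate

def kv_cache_bytes(seconds: float) -> float:
    tokens = tokens_per_sec * seconds
    # x2 for keys and values
    return tokens * layers * kv_heads * head_dim * bytes_per_val * 2

for minutes in (1, 5, 30):
    gb = kv_cache_bytes(minutes * 60) / 1e9
    print(f"{minutes:>2} min of audio -> ~{gb:.2f} GB of KV cache")
```

Under these assumptions a half-hour conversation is already in the multi-gigabyte range for the cache alone, on top of the weights, which is tight on consumer devices; models with multiple codec streams per step would multiply this further.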
