modeless • yesterday at 10:49 PM • 6 replies

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

Replies

coppsilgold • today at 6:06 AM

Some of the best current voice tokenizers achieve ~12 Hz. That's still many more tokens than a regular LLM would use for ultimately the same content.
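
A rough sketch of that comparison, for anyone who wants numbers. Only the 12 Hz figure comes from the comment above; the speaking rate and the tokens-per-word average are assumptions picked for illustration, and the sketch counts one token per audio frame even though some codecs emit several codebook entries per frame.

```python
# Back-of-the-envelope comparison; every constant except the 12 Hz figure is an assumption.
AUDIO_TOKEN_RATE_HZ = 12.0   # audio tokenizer frame rate quoted in the thread
WORDS_PER_MINUTE = 150.0     # assumed conversational speaking rate
TEXT_TOKENS_PER_WORD = 1.3   # assumed average for an English subword tokenizer


def text_tokens_per_second(wpm: float, tokens_per_word: float) -> float:
    """Text tokens needed to write down one second of speech."""
    return wpm / 60.0 * tokens_per_word


def audio_to_text_ratio(audio_hz: float, wpm: float, tokens_per_word: float) -> float:
    """Audio tokens spent per text token for the same spoken content."""
    return audio_hz / text_tokens_per_second(wpm, tokens_per_word)


text_rate = text_tokens_per_second(WORDS_PER_MINUTE, TEXT_TOKENS_PER_WORD)
ratio = audio_to_text_ratio(AUDIO_TOKEN_RATE_HZ, WORDS_PER_MINUTE, TEXT_TOKENS_PER_WORD)
print(f"text: {text_rate:.2f} tokens/s, audio: {AUDIO_TOKEN_RATE_HZ:.1f} tokens/s, ratio: {ratio:.1f}x")
```

Under these assumptions the audio stream costs roughly 3.7x the tokens of the transcript, and that gap widens during pauses, where an audio tokenizer keeps emitting frames while the text stays empty.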

nicktikhonov • yesterday at 10:53 PM

If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:

https://research.nvidia.com/labs/adlr/personaplex/

rockwotj • today at 3:01 AM

Fundamentally, the "guessing when it's your turn" thing needs to be baked into the model. I think the full duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037

russdill • today at 5:27 AM

At least when running things locally, such a model completely blows up your latency.

com2kid • today at 1:07 AM

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each individual type of model is developing at a different rate (TTS moves really fast, low latency STT/ASR moved slower, LLMs move at a pretty good pace).
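
To make the "plug in new models" point concrete, here is a minimal sketch of a cascaded pipeline where each stage sits behind a small interface. All class and method names are hypothetical, and the stages are toy stand-ins rather than real STT, LLM, or TTS models.

```python
from dataclasses import dataclass
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    """Cascaded STT -> LLM -> TTS; each stage can be replaced independently."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Toy stand-ins so the sketch runs without any real model.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class EchoLLM:
    def reply(self, prompt: str) -> str:
        return f"You said: {prompt}"


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class LoudTTS:
    def synthesize(self, text: str) -> bytes:
        return text.upper().encode("utf-8")


pipeline = VoicePipeline(FakeSTT(), EchoLLM(), FakeTTS())
first = pipeline.respond(b"hello")

# Swap only the TTS stage; the STT and LLM stages are untouched.
pipeline.tts = LoudTTS()
second = pipeline.respond(b"hello")
print(first, second)
```

The trade-off is the one this thread is arguing about: the text boundary between stages is what makes swapping cheap, and it is also where prosody and turn-taking cues get dropped.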

donpark • today at 1:39 AM

But I've read somewhere that the KV cache for a speech-to-speech model explodes in size with each turn, which could make on-device full-duplex S2S unusable except for quick chats.
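
A rough way to size that, as a sketch only. The model shape below (layers, KV heads, head dimension, fp16 values) is hypothetical and not taken from any specific S2S model, and it assumes the model consumes one cache position per audio frame at 12 Hz for the whole session, silence included.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of a standard transformer KV cache for `tokens` cached positions."""
    # Factor of 2: one key vector and one value vector per position, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, for illustration only.
LAYERS = 32
KV_HEADS = 8
HEAD_DIM = 128
TOKEN_RATE_HZ = 12  # assumed: one cached position per audio frame

for minutes in (1, 5, 10, 30):
    tokens = TOKEN_RATE_HZ * 60 * minutes
    size_mib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**20
    print(f"{minutes:>2} min: {tokens:>6} positions, {size_mib:7.1f} MiB")
```

Under these assumptions the cache grows linearly with elapsed time, about 90 MiB per minute, so a 10 minute session needs around 900 MiB before counting weights or activations. Whether that is a dealbreaker on-device depends on the real model shape and on any cache eviction or compression, which this sketch ignores.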
