Hacker News

mgaudet today at 1:10 AM | 5 replies

Eep.

So, on my M1 Mac, I ran `uvx pocket-tts serve` and plugged in

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only

(Beginning of A Tale of Two Cities)

but the problem is that Javert skips over parts of sentences! E.g., it starts:

> "It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the spring of hope, it was the winter of despair, we had everything before us, ..."

Notice how it skips over "it was the age of foolishness," and "it was the season of Darkness,".

Which... Doesn't exactly inspire faith in a TTS system.

(Marius seems better; posted https://github.com/kyutai-labs/pocket-tts/issues/38)


Replies

vvolhejn today at 9:09 AM

Václav from Kyutai here. Thanks for the bug report! A workaround for now is to chunk the text into smaller parts, where the model is more reliable. We already do some chunking in the Python package. There is also a fancier way to do this chunking that ensures the stitched-together parts continue well (teacher forcing), but we haven't implemented that yet.
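To illustrate the chunking workaround, here is a minimal sketch that splits text at punctuation and packs the pieces into chunks of bounded length. It is not the chunking that the pocket-tts Python package does; `chunk_text` and `max_chars` are hypothetical names, and the 200-character default is an arbitrary assumption.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Pack punctuation-delimited pieces into chunks of roughly max_chars.

    Hypothetical helper, not part of pocket-tts. A single piece longer than
    max_chars is kept whole rather than split mid-clause.
    """
    # Split on whitespace that follows sentence or clause punctuation,
    # so the punctuation stays attached to the preceding piece.
    pieces = [p for p in re.split(r"(?<=[.!?;,])\s+", text.strip()) if p]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first piece of a new chunk.
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the audio concatenated. Splitting at commas keeps chunks short for long run-on sentences like the Dickens opening, at the cost of possibly less natural prosody across the joins.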

Paul_S today at 9:08 AM

All the models I tried have similar problems. When trying to batch a whole audiobook, the only way is to run it, then run a model to transcribe the audio and check that you get the same text.
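The comparison step of that round trip can be sketched with the standard library alone. The sketch assumes you already have the transcript as a string from whatever speech-to-text model you use (that part is not shown); `normalize` and `find_dropped` are hypothetical helper names.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase and split into words so punctuation and case are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_dropped(source: str, transcript: str) -> list[str]:
    """Return runs of source words that are missing or changed in the transcript."""
    src, hyp = normalize(source), normalize(transcript)
    matcher = difflib.SequenceMatcher(a=src, b=hyp, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": words only in the source; "replace": words that differ.
        if tag in ("delete", "replace"):
            dropped.append(" ".join(src[i1:i2]))
    return dropped


source = "it was the age of wisdom, it was the age of foolishness, it was the epoch of belief"
transcript = "It was the age of wisdom, it was the epoch of belief"
print(find_dropped(source, transcript))
```

With text this repetitive, the reported run can be shifted by a few words relative to the phrase a human would say was skipped, because several alignments are equally valid. Transcription errors will also show up as differences, so an empty result is a stronger signal than a non-empty one.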

sbarre today at 3:16 AM

Yeah, Javert mangled those sentences for me as well. It skipped whole parts and also moved words around:

- "its noisiest superlative insisted on its being received"

Win10 RTX 5070 Ti

small_scombrus today at 4:16 AM

Using your first text block, 'Eponine' skips "we had nothing before us" and doesn't speak the final "that some of its noisiest".

I wonder what's going wrong in there

memming today at 7:37 AM

Interesting; it skipped "we had everything before us," in my test. Yeah, not a good sign.