logicprog · last Sunday at 10:58 AM

I've been using this to try to make audiobooks out of various philosophy books I've been wanting to read, for accessibility reasons, and I ran into a critical problem: if the input text fed to Kokoro is too long, it starts skipping words in the middle or at the end, or fading out at the end. Since abogen chunks the text it feeds to Kokoro by sentence, sentences of arbitrary length are passed to Kokoro without any guard, and this produces unusable audiobooks for me. I'm working on "vibe coding" my own Kokoro-based tkinter personal GUI app for the same purpose that uses nltk and some regex magic for better splitting, along the lines of the sketch below.
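
Roughly the idea (the 300-character cap and the clause-break regex are placeholder values I'm still tuning, not anything abogen or Kokoro prescribe):

    import re
    import nltk

    nltk.download("punkt", quiet=True)  # sentence tokenizer models

    MAX_CHARS = 300  # placeholder cap; tune to what Kokoro handles cleanly
    CLAUSE_RE = re.compile(r"(?<=[,;:])\s+")  # split after clause punctuation

    def split_for_tts(text: str) -> list[str]:
        chunks = []
        for sentence in nltk.sent_tokenize(text):
            if len(sentence) <= MAX_CHARS:
                chunks.append(sentence)
                continue
            # Overlong sentence: fall back to clause boundaries,
            # then to hard word-wrapping as a last resort.
            for clause in CLAUSE_RE.split(sentence):
                while len(clause) > MAX_CHARS:
                    cut = clause.rfind(" ", 0, MAX_CHARS)
                    cut = cut if cut > 0 else MAX_CHARS
                    chunks.append(clause[:cut].strip())
                    clause = clause[cut:].strip()
                if clause:
                    chunks.append(clause)
        return chunks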


Replies

gavinray · last Sunday at 11:12 AM

I use the "kokoro-tts" CLI, which has better chunking/splitting.

https://github.com/nazdridoy/kokoro-tts

It generates a directory of audio files, along with a metadata file for ebook chapters.

You have to use m4b-tool to stitch the audio files together into an audiobook and include the chapter metadata, but it works great:

https://github.com/sandreas/m4b-tool
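
The merge step itself is basically a one-liner (the directory and file names here are placeholders; check the m4b-tool README for the chapter-metadata options):

    m4b-tool merge "./book-chapters/" --output-file="book.m4b"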

I've been meaning to write a post on this workflow because it's incredibly useful.

denizsafak · last Monday at 10:24 AM

Hey, can you share an example book or text so I can test it?

Regarding "abogen chunks the text it feeds to Kokoro by sentence": that's not quite correct. It actually splits subtitles by sentence, not the chunks sent to Kokoro.

This might be happening because the "Replace single newlines with spaces" option isn’t enabled. Some books require that setting to work correctly. Could you try enabling it and see if it fixes the issue?

ethan_smith · last Monday at 9:45 AM

You could try implementing a character-count limit per chunk instead of purely sentence-based splitting. A hybrid approach that breaks at sentence boundaries but enforces a maximum chunk size of ~150-200 characters would likely solve the word-skipping issue while maintaining natural speech flow; see the sketch below.
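
Something like this, roughly (the 200-character cap is just the ballpark above; a sketch, not abogen's actual code):

    import nltk

    nltk.download("punkt", quiet=True)  # sentence tokenizer models

    MAX_CHARS = 200  # upper bound per chunk, per the ballpark above

    def pack_sentences(text: str) -> list[str]:
        """Greedily pack whole sentences into chunks of at most MAX_CHARS."""
        chunks, current = [], ""
        for sentence in nltk.sent_tokenize(text):
            # Hard-split any single sentence that alone exceeds the cap.
            while len(sentence) > MAX_CHARS:
                cut = sentence.rfind(" ", 0, MAX_CHARS)
                cut = cut if cut > 0 else MAX_CHARS
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:cut].strip())
                sentence = sentence[cut:].strip()
            # Start a new chunk when adding the sentence would overflow.
            if current and len(current) + 1 + len(sentence) > MAX_CHARS:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
        return chunks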

RicoElectrico · last Sunday at 11:21 AM

I just can't stand how non-deterministic many deep learning TTSes are. At least the classical ones have predictable pronunciation, which can be worked around if needed.