Kokoro is fine for TTS, but it lacks emotion. For a model of this size, though, that is kind of a given.
I've played with ebook generation a bunch and found that (at least for English text) you need around 1B parameters to get something emotionally usable (Chatterbox is 0.5B, Orpheus is 3B).
Ironic given the name: kokoro is Japanese for heart or sentiment.