A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.
Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.
Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?
That, and also English words in the middle of a phrase in another language confuse them a lot.
Small models struggle with prosody due to limited capacity. This version does much better than the previous one and is the best among <25MB models. Kokoro is a really good model for its size, and it's competitive on Artificial Analysis too. I think by the next release we should have something Kokoro-quality but a fifth of the size. Adding control for rhythm seems to be quite important too, and we should start looking at that for other languages.