Kokoro is small and fast because all of the text -> phoneme conversion is done by “dumb code” (ordinary rule-based logic, no model), and only the phoneme -> sound part uses a neural net.
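
To make that split concrete, here is a rough sketch of the two-stage shape. This is not Kokoro's actual code: `TOY_LEXICON`, `text_to_phonemes`, `phonemes_to_audio`, and `synthesize` are all hypothetical names, the lexicon entries are only illustrative, and the "neural" stage is a stub that returns silence so the example runs without any model.

```python
import re

# Hypothetical toy lexicon. A real front end would use a much larger
# dictionary plus fallback rules; these two entries are illustrative only.
TOY_LEXICON = {
    "hello": "həˈloʊ",
    "world": "wɜːld",
}


def text_to_phonemes(text: str) -> str:
    """Stage 1, the "dumb code": plain lookups, no learned weights."""
    words = re.findall(r"[a-z']+", text.lower())
    # Unknown words are marked rather than guessed.
    return " ".join(TOY_LEXICON.get(w, f"<unk:{w}>") for w in words)


def phonemes_to_audio(phonemes: str, samples_per_symbol: int = 4) -> list:
    """Stage 2 stand-in. In a real system this is where the neural net
    would run; here it just returns silence of a predictable length."""
    return [0.0] * (len(phonemes) * samples_per_symbol)


def synthesize(text: str) -> list:
    """Chain the two stages: only the second would need a model."""
    return phonemes_to_audio(text_to_phonemes(text))


if __name__ == "__main__":
    print(text_to_phonemes("Hello, world!"))
    print(len(synthesize("Hello, world!")), "samples")
```

The point of the sketch is just that the model only ever sees phoneme symbols, so it never has to learn spelling, abbreviations, or number formatting; all of that stays in cheap deterministic code.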