The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed to reproduce the variation in intonation, so it sounds like a monotonous reading of a boring text.
I presume this is due to using the base model, and not the one tuned for more expressiveness.
edit: Or more likely, the demo not exposing the expressiveness controls.
The 1.7B model was much better than the 0.6B at ignoring slight background noise in the reference audio, though: the 0.6B would inject some of that noise into the generated audio, whereas the 1.7B would not.
Also, without FlashAttention it was dog slow on my 5090, running at 0.3x realtime with just 30% GPU usage, though I guess that's to be expected. There was no significant difference in generation speed between the two models.
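For reference, if the model loads through Hugging Face transformers (I haven't verified that this demo does), enabling FlashAttention is usually just a load-time argument; the model id below is a placeholder, not the actual repo:

    import torch
    from transformers import AutoModel

    # Sketch only: standard transformers arguments, not confirmed for this demo's loader.
    model = AutoModel.from_pretrained(
        "org/tts-model-1.7B",                     # placeholder model id
        torch_dtype=torch.bfloat16,               # half precision also helps throughput
        attn_implementation="flash_attention_2",  # needs `pip install flash-attn`
    )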
Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality.