I'm still looking for the "perfect" setup to clone my voice and use it locally to send voice replies in Telegram via openclaw. Does anyone have such a setup?
I want to be my own personal assistant...
EDIT: I can dedicate an RTX 3080 Ti to it.
You need to provide info on your hardware. Pocket-TTS does cloning on CPU, but for me it randomly outputs something pretty weird-sounding, mixed in with roughly 90% good outputs. So it hasn't been stable enough to run without checking the output. But maybe it depends on your voice sample.
Qwen 3 TTS is good for voice cloning but requires a GPU of some sort.
Try training a model with Piper. You will need to record a lot of utterances, but the results are pretty great and the output is a fast TTS model.
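For what it's worth, once a Piper voice is trained, the generation step could look roughly like the sketch below. It assumes the `piper` CLI and `ffmpeg` are installed and on your PATH; `my_voice.onnx` is a placeholder file name and the function names are made up for illustration. The conversion step is there because Telegram's Bot API documents voice messages as OGG files encoded with Opus. How the finished file gets handed to openclaw is not shown.

```python
import subprocess
from pathlib import Path


def piper_command(model_path: str, wav_path: str) -> list[str]:
    """Build the Piper CLI call; the text to speak is passed on stdin."""
    return ["piper", "--model", model_path, "--output_file", wav_path]


def opus_command(wav_path: str, ogg_path: str) -> list[str]:
    """Build the ffmpeg call that re-encodes the WAV as OGG/Opus."""
    return ["ffmpeg", "-y", "-i", wav_path, "-c:a", "libopus", ogg_path]


def synthesize_voice_note(text: str, model_path: str, out_dir: str = ".") -> Path:
    """Run both steps and return the path of the finished .ogg file.

    Assumption: model_path points at a Piper voice you trained yourself.
    """
    wav = str(Path(out_dir) / "reply.wav")
    ogg = Path(out_dir) / "reply.ogg"
    # Piper reads the text from stdin and writes a WAV file.
    subprocess.run(piper_command(model_path, wav), input=text.encode("utf-8"), check=True)
    # ffmpeg converts the WAV into the format used for voice notes.
    subprocess.run(opus_command(wav, str(ogg)), check=True)
    return ogg
```

Calling `synthesize_voice_note("On my way!", "my_voice.onnx")` would leave a `reply.ogg` in the current directory, which is the file you would then send as the voice reply.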
Isn't it just a matter of training a model on your voice recordings and then using it to generate audio clips from text?
Why not just send text replies? You can already do that.