Isn't it just a matter of training a model on your voice recordings and then using that model to generate audio clips from text?