Does anyone else find that there's a hard-to-pin-down lifelessness in the speech of these voice models?
Especially in the fruit pricing portion of the video for this model. Sounds completely normal but I can immediately tell it is ai. Maybe it's intonation or the overly stable rate of speech?
I think it's because they've crammed vision, audio, multiple voices, prosody control, multiple languages, etc. into just 30 billion parameters.
I think ChatGPT has the most lifelike speech with their voice models. They seem to have invested heavily in that area while other labs focused elsewhere.
I'm not convinced it's end-to-end multimodal. If it isn't, there's a separate speech synthesis stage, and some of what you're hearing would be a result of that. You could test by having it sing or do some accents, or have it talk back to you in an accent you give it.
> Sounds completely normal but I can immediately tell it is ai.
Maybe that's a good thing?
I'm perfectly ok with and would prefer an AI "accent".
IMHO it's not lifeless. It's just not overly emotional. I definitely prefer it that way. I do not want the AI to be excited. It feels so contrived.
On the video itself: Interesting, but "ideal" was pronounced wrong in German. For a promotional video, they should have checked that with native speakers. On the other hand, it's at least honest.