> and $0.24 per minute of audio output
That is substantially more expensive than TTS (text-to-speech), which is already quite expensive.
I agree. I'm wondering whether it's possible to disable streamed audio output and just get the text response event.
Fair, it wouldn't work well for on-demand generation in an app, but for ad-hoc cases like a voice-over it's not a huge expense.
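For a rough sense of scale, here's a back-of-the-envelope sketch that uses only the quoted $0.24 per minute output rate. The function name and the example durations are made up for illustration, and it ignores any input-audio or text-token charges.

```python
OUTPUT_RATE_PER_MINUTE = 0.24  # USD, the audio output rate quoted above

def audio_output_cost(minutes: float) -> float:
    """Estimated USD cost for `minutes` of generated audio (output only)."""
    return round(minutes * OUTPUT_RATE_PER_MINUTE, 2)

# Hypothetical durations: a 5-minute voice-over vs. an app
# generating 10,000 minutes of audio per month.
print(audio_output_cost(5))       # 1.2
print(audio_output_cost(10_000))  # 2400.0
```

So a one-off voice-over costs about a dollar, while the same rate at app scale runs into the thousands per month.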
If OpenAI decides to fully ignore ethics and dive deep into voice cloning, then all bets are off.