From the Realtime API blog post: https://openai.com/index/introducing-the-realtime-api/
> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.
As usual, OpenAI failed to emphasize the real game-changer feature at their Dev Day: audio output from the standard generation API.
This has major implications for text-to-speech apps, particularly if the audio output style is as steerable as in the GPT-4o voice demos.
> and $0.24 per minute of audio output
That is substantially more expensive than TTS (text-to-speech), which is already quite expensive.
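For anyone who wants to sanity-check the quoted numbers, here is a rough back-of-the-envelope sketch. It only uses the figures from the blog post quoted above; the function names are my own, and because the per-minute prices are stated as approximate, the implied token rates (roughly 600 audio tokens per minute in, 1,200 out) are estimates rather than documented values.

```python
# Figures quoted in the blog post (USD). Everything else here is illustrative.
AUDIO_INPUT_PER_1M_TOKENS = 100.0
AUDIO_OUTPUT_PER_1M_TOKENS = 200.0
AUDIO_INPUT_PER_MINUTE = 0.06    # "approximately", per the post
AUDIO_OUTPUT_PER_MINUTE = 0.24   # "approximately", per the post


def implied_tokens_per_minute(price_per_minute, price_per_1m_tokens):
    """Tokens per minute implied by a per-minute price and a per-token price."""
    return price_per_minute / price_per_1m_tokens * 1_000_000


def audio_cost(minutes, price_per_minute):
    """Estimated cost in USD for a given number of audio minutes."""
    return minutes * price_per_minute


if __name__ == "__main__":
    tokens_in = implied_tokens_per_minute(
        AUDIO_INPUT_PER_MINUTE, AUDIO_INPUT_PER_1M_TOKENS
    )
    tokens_out = implied_tokens_per_minute(
        AUDIO_OUTPUT_PER_MINUTE, AUDIO_OUTPUT_PER_1M_TOKENS
    )
    print(f"implied input tokens/min:  {tokens_in:.0f}")
    print(f"implied output tokens/min: {tokens_out:.0f}")
    print(f"one hour of audio output:  ${audio_cost(60, AUDIO_OUTPUT_PER_MINUTE):.2f}")
```

To compare against a TTS service, convert its published rate to a per-minute figure for your typical speaking pace and pass that to `audio_cost` as well.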