qwertox • 10/01/2024 • 3 replies

> The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.

This sounds really interesting, and I see great use cases for it. However, I'm wondering if the API provides a text transcription of both the input and output so that I can store the data directly in a database without needing to transcribe the audio separately.

Edit: Apparently it does.

It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (I guess a couple of them in real-time)

and `response.done` [1] with the response text.

[0] https://platform.openai.com/docs/api-reference/realtime-serv...

[1] https://platform.openai.com/docs/api-reference/realtime-serv...
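For anyone wanting to store these, here is a minimal sketch of pulling the text out of the two event types above and writing it to a database. The event names come from the docs linked above; the field layout (`transcript` on the input event, `response.output[].content[]` parts carrying `transcript` or `text`) is my assumption about the payload shape, and `extract_transcript`, `store_event`, and the `transcripts` table are hypothetical names made up for illustration. It works on already-received JSON messages, so the WebSocket handling is left out.

```python
import json
import sqlite3


def extract_transcript(event: dict):
    """Return (role, text) for events that carry a transcript, else None."""
    etype = event.get("type")
    if etype == "conversation.item.input_audio_transcription.completed":
        # Assumed field: "transcript" holds the transcribed user speech.
        return ("user", event.get("transcript", ""))
    if etype == "response.done":
        # Assumed shape: response.output[] items hold content[] parts, each
        # with a "transcript" (audio part) or "text" (text part) field.
        parts = []
        for item in event.get("response", {}).get("output", []):
            for part in item.get("content") or []:
                text = part.get("transcript") or part.get("text")
                if text:
                    parts.append(text)
        return ("assistant", " ".join(parts)) if parts else None
    return None


def store_event(conn: sqlite3.Connection, raw_message: str) -> bool:
    """Parse one raw JSON message and store it if it carries a transcript."""
    row = extract_transcript(json.loads(raw_message))
    if row is None:
        return False
    conn.execute("INSERT INTO transcripts (role, text) VALUES (?, ?)", row)
    return True


# Hypothetical storage: an in-memory SQLite table for the conversation log.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcripts (id INTEGER PRIMARY KEY, role TEXT, text TEXT)"
)
```

Calling `store_event(conn, message)` for every message read from the socket would leave one row per user utterance and one per completed response, and silently skip every other event type.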

Replies

bcherry • 10/01/2024

yes it transcribes inputs automatically, but not in realtime.

outputs are sent in text + audio but you'll get the text very quickly and audio a bit slower, and of course the audio takes time to play back. the text also doesn't currently have timing cues so it's up to you if you want to try to play it "in sync". if the user interrupts the audio, you need to send back a truncation event so it can roll its own context back, and if you never presented the text to the user you'll need to truncate your stored copy as well to ensure your storage isn't polluted with fragments the user never heard.
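a rough sketch of the interruption handling described above, under some assumptions: the truncation message is taken to be a `conversation.item.truncate` event with `item_id`, `content_index` and `audio_end_ms` fields (check the API reference for the exact shape), and since the text has no timing cues, the "heard" portion of the stored text is only estimated from the fraction of audio that was played. `build_truncate_event` and `approx_heard_text` are hypothetical helper names.

```python
import json


def build_truncate_event(item_id: str, played_ms: int) -> str:
    """Build the JSON message telling the server how much audio was played.

    Assumed event shape; the field names are not confirmed by this thread.
    """
    return json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": played_ms,
    })


def approx_heard_text(full_text: str, played_ms: int, total_ms: int) -> str:
    """Estimate the text the user heard before interrupting.

    A crude heuristic: keep the same fraction of words as the fraction of
    audio played. It assumes a roughly constant speaking rate.
    """
    if total_ms <= 0 or played_ms <= 0:
        return ""
    if played_ms >= total_ms:
        return full_text
    words = full_text.split()
    keep = int(len(words) * played_ms / total_ms)
    return " ".join(words[:keep])
```

on interruption you would send the output of `build_truncate_event` over the socket and store `approx_heard_text(...)` instead of the full response text. the word-fraction estimate will be off for responses with long pauses, so treat it as a placeholder until timing cues are available.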

pants2 • 10/01/2024

It's incredible that people are talking about the downfall of software engineering - now, at many companies, hundreds of call center roles will be replaced by a few engineering roles. With image fine-tuning, now we can replace radiologists with software engineers, etc. etc.

tough • 10/01/2024

saw the velvet show hn the other day, could be useful for storing these https://news.ycombinator.com/item?id=41637550
