Hacker News

OpenAI DevDay 2024 live blog

188 points by plurby | 10/01/2024 | 84 comments

Comments

qwertox | 10/01/2024

> The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

> Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context.

-

This sounds really interesting, and I see great use cases for it. However, I'm wondering whether the API provides a text transcription of both the input and the output, so that I can store the data directly in a database without needing to transcribe the audio separately.

-

Edit: Apparently it does.

It sends `conversation.item.input_audio_transcription.completed` [0] events when the input transcription is done (presumably several of them, in real time)

and `response.done` [1] with the response text. A rough sketch of wiring this up is below the links.

[0] https://platform.openai.com/docs/api-reference/realtime-serv...

[1] https://platform.openai.com/docs/api-reference/realtime-serv...
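A sketch of how this might look in practice (the event names are from the docs above, but the connection details and payload fields are my assumptions, so double-check against the reference):

    # Sketch only: listen on a Realtime API WebSocket and store transcripts.
    import asyncio, json, os
    import websockets  # pip install websockets

    def save_transcript(role, payload):
        print(role, payload)  # stand-in for a real database insert

    async def log_transcripts():
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
        headers = {
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        }
        # newer websockets versions name this argument additional_headers
        async with websockets.connect(url, extra_headers=headers) as ws:
            async for raw in ws:
                event = json.loads(raw)
                if event["type"] == "conversation.item.input_audio_transcription.completed":
                    save_transcript("user", event.get("transcript"))
                elif event["type"] == "response.done":
                    save_transcript("assistant", event.get("response"))

    asyncio.run(log_transcripts())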

siva7 | 10/01/2024

I've never seen a company publish consistently groundbreaking features at such speed. I really wonder how their teams work. It's unprecedented in what I've seen in 15 years of software.

ponty_rick | 10/01/2024

> 11:43 Fields are generated in the same order that you defined them in the schema, even though JSON is supposed to ignore key order. This ensures you can implement things like chain-of-thought by adding those keys in the correct order in your schema design.

Why not use an array of key-value pairs if you want to maintain ordering without breaking traditional JSON rules?

    [ {"key1": "value1"}, {"key2": "value2"} ]
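An array of single-key objects would preserve order too, but with the SDK's structured-outputs helper the declaration order of the fields already does the job. A minimal sketch (the model name and prompt are just placeholders):

    from openai import OpenAI
    from pydantic import BaseModel

    class Solution(BaseModel):
        reasoning: str  # declared first, so the model emits its chain of thought first
        answer: str     # generated after the reasoning

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "What is 17 * 23?"}],
        response_format=Solution,
    )
    print(completion.choices[0].message.parsed)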

serjester | 10/01/2024

The eval platform is a game changer.

It's nice to have a solution from OpenAI, given how much they use a variant of this internally. I've tried like 5 YC startups and I don't think anyone's really solved this.

There's the very real risk of vendor lock-in, but from quickly scanning the docs it seems like a pretty portable implementation.

alach11 | 10/01/2024

It's pretty amazing that they made prompt caching automatic. It's rare that a company gives a 50% discount without the customer explicitly requesting it! Of course... they might be retaining some margin, judging by their discount being 50% vs. Anthropic's 90%.
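Since the caching is prefix-based, the practical takeaway is to put the static part of the prompt first and the per-request part last. A minimal sketch, assuming a long shared system prompt (model name illustrative):

    from openai import OpenAI

    client = OpenAI()
    STATIC_SYSTEM_PROMPT = "You are a support bot. <long instructions and few-shot examples>"

    def answer(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable prefix: eligible for caching
                {"role": "user", "content": question},                # varying suffix
            ],
        )
        return resp.choices[0].message.content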

101008 | 10/01/2024

I understand the novelty of the Realtime API voice, and the technological achievement it is, but I don't see it from the product point of view. It looks like one of those startups finding a solution before knowing the problem.

The two examples shown at DevDay are the things I don't want to do in the future. I don't want to talk to anybody, and I don't want to wait for their answer in a human form. That's why I order my food through an app or WhatsApp, or why I prefer to buy my tickets online. In the rare case that I call to order food, it's because I have a weird question or a weird request (can I pick it up in X minutes? Can you prepare it in a different way?).

I hope we don't start seeing apps using conversations as interfaces, because it would be really horrible (leaving aside the fact that a lot of people don't know how to express themselves, plus different accents, noisy environments, etc.), while clicking or typing works almost the same for everyone (at least it's much more normalized than talking).

thenameless7741 | 10/01/2024

Blog updates:

- Introducing the Realtime API: https://openai.com/index/introducing-the-realtime-api/

- Introducing vision to the fine-tuning API: https://openai.com/index/introducing-vision-to-the-fine-tuni...

- Prompt Caching in the API: https://openai.com/index/api-prompt-caching/

- Model Distillation in the API: https://openai.com/index/api-model-distillation/

Docs updates:

- Realtime API: https://platform.openai.com/docs/guides/realtime

- Vision fine-tuning: https://platform.openai.com/docs/guides/fine-tuning/vision

- Prompt Caching: https://platform.openai.com/docs/guides/prompt-caching

- Model Distillation: https://platform.openai.com/docs/guides/distillation

- Evaluating model performance: https://platform.openai.com/docs/guides/evals

Additional updates from @OpenAIDevs: https://x.com/OpenAIDevs/status/1841175537060102396

- New prompt generator on https://playground.openai.com

- Access to the o1 model is expanded to developers on usage tier 3, and rate limits are increased (to the same limits as GPT-4o)

Additional updates from @OpenAI: https://x.com/OpenAI/status/1841179938642411582

- Advanced Voice is rolling out globally to ChatGPT Enterprise, Edu, and Team users. Free users will get a sneak peek of it (except in the EU).

superdisk | 10/01/2024

Holy crud, I figured they would guard this for a long time, and I was really salivating to make some stuff with it. The doors are wide open for all sorts of stuff now; Advanced Voice is the first feature since ChatGPT initially came out that really has my jaw on the floor.

N_A_T_E | 10/01/2024

I just need their API to be faster. 15-30 seconds per request using 4o-mini isn't good enough for responsive applications.

minimaxir | 10/01/2024

From the Realtime API blog post: https://openai.com/index/introducing-the-realtime-api/

> Audio in the Chat Completions API will be released in the coming weeks, as a new model `gpt-4o-audio-preview`. With `gpt-4o-audio-preview`, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.

> The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.
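Back of the envelope, those per-minute figures imply roughly 600 audio tokens per minute of input and 1,200 per minute of output:

    input:  $0.06/min ÷ ($100 / 1M tokens) = 600 tokens/min
    output: $0.24/min ÷ ($200 / 1M tokens) = 1,200 tokens/min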

As usual, OpenAI failed to emphasize the real game-changer feature at their DevDay: audio output from the standard generation API.

This has serious implications for text-to-speech apps, particularly if the audio output style is as steerable as in the gpt-4o voice demos.
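The Chat Completions variant isn't out yet, so this is only a guess at what usage will look like, extrapolating from current SDK conventions (the modalities and audio parameters are assumptions):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-audio-preview",
        modalities=["text", "audio"],               # ask for both text and audio back
        audio={"voice": "alloy", "format": "wav"},  # assumed voice/format options
        messages=[{"role": "user", "content": "Read this sentence aloud."}],
    )
    print(resp.choices[0].message)  # assumed to carry the audio alongside the text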

modeless | 10/01/2024

I didn't expect an API for Advanced Voice so soon. That's pretty great. Here's the thing I was really wondering about: audio is $0.06/min in, $0.24/min out. Can't wait to try some language-learning apps built with this. It'll also be fun for controlling robots.

jbaudanza | 10/02/2024

Interesting choice of a 24kHz sample rate for PCM audio. I wonder if the model was trained on 24kHz audio, rather than the usual 8/16kHz for ML models.
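If your capture pipeline runs at 16kHz, upsampling to 24kHz before sending is cheap. A naive linear-interpolation sketch, assuming mono PCM16 (a production app should use a proper polyphase resampler such as scipy.signal.resample_poly):

    import numpy as np

    def resample_pcm16(data: bytes, src_hz: int = 16000, dst_hz: int = 24000) -> bytes:
        # interpret the raw bytes as 16-bit mono samples
        x = np.frombuffer(data, dtype=np.int16).astype(np.float32)
        n_out = int(round(len(x) * dst_hz / src_hz))
        # fractional positions of the output samples within the source signal
        t_out = np.arange(n_out) * (src_hz / dst_hz)
        y = np.interp(t_out, np.arange(len(x)), x)  # linear interpolation
        return y.astype(np.int16).tobytes()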

sammyteee | 10/01/2024

Loving these live updates, keep em coming! Thanks Simon!

nielsole | 10/01/2024

> The first big announcement: a realtime API, providing the ability to use WebSockets to implement voice input and output against their models.

I guess this is using their "old" turn-based voice system?

og_kalu | 10/01/2024

Image output for 4o in the API would be very nice, but I'm not sure if that's at all in the cards.

Audio output is in the API now, but you lose image input. Why? That's a shame.

lysecret | 10/01/2024

Using structured outputs for generative UI is such a cool idea. Does anyone know of some cool web demos related to this?
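Not a web demo, but the core of the idea is small enough to sketch: have the model fill a component schema via structured outputs, then render whatever comes back. The component set and renderer here are made up for illustration:

    from typing import List, Literal
    from pydantic import BaseModel

    class Component(BaseModel):
        kind: Literal["heading", "paragraph", "button"]
        text: str

    class Page(BaseModel):
        components: List[Component]

    def render_html(page: Page) -> str:
        tag = {"heading": "h1", "paragraph": "p", "button": "button"}
        return "\n".join(f"<{tag[c.kind]}>{c.text}</{tag[c.kind]}>" for c in page.components)

    # Pass Page as response_format to client.beta.chat.completions.parse()
    # and feed the parsed result straight into render_html().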

hidelooktropic | 10/01/2024

Any word on increased weekly caps on o1 usage?

bigcat12345678 | 10/01/2024

Seems mostly standard items so far.