Hacker News

legohead · yesterday at 8:52 PM · 11 replies

The low latency is more of a pain point than a benefit, the way they've implemented it. In casual conversation we humans naturally pause, and GPT takes this as a sign you're "done" and starts blabbing away.

I've also gotten slower at finding the word I want as I've aged, and this fast-voice GPT just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.


Replies

zamadatix · yesterday at 9:04 PM

I think these are two different layers of "latency". The latency in the article refers to the transport of the audio stream itself, while the latency in your scenario is about how quickly the model starts responding inside the audio stream.

janalsncm · yesterday at 9:29 PM

I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought, which feels pretty unnatural, at least for me. If I’m searching for the right word, I want the opportunity to find it.

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

dtran · yesterday at 9:58 PM

This has more to do with Voice Activity Detection (VAD) than with the latency described in the article.
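For context, the crudest form of VAD is a simple energy gate over PCM frames. This is a toy sketch, far simpler than the model-based VAD production systems use, and the threshold value is an arbitrary assumption:

```python
import struct

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Energy-gate VAD: treat frames above the threshold as speech.

    Real systems use trained classifiers, not a fixed amplitude cutoff,
    which is why they still confuse thinking pauses with end-of-turn.
    """
    return frame_energy(frame) > threshold
```

The hard part isn't detecting silence; it's deciding whether a given silence means "done talking" or "searching for a word", which no energy threshold can tell you.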

wnmurphy · yesterday at 11:17 PM

Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Grok solves this with an optional push-to-talk mode, but that isn't hands-free and is thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.

saturdaysaint · yesterday at 9:24 PM

In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.

richardw · yesterday at 9:05 PM

Hard problem. I find myself adding in filler to stop the thing from jabbering.

I also think it spends most of its IQ on sounding good rather than on thinking about the problem. "Yeah, absolutely, I can see why you'd like to…" etc. This is likely because it's on a timer, and maybe voice is more expensive to process? Text responses spend more time on the task.

ericmcer · yesterday at 11:05 PM

Yeah, exactly -- you can't get a strong signal that a user is done speaking without some amount of "wait for 500ms of silence". You could kick off processing and abandon it if they continued talking, but that seems over-optimized.

1-2s replies feel natural, and like you pointed out, pausing for 2-3s mid-sentence is super normal.

throwuxiytayq · yesterday at 9:25 PM

With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it had already interrupted you.

The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.

But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with. It keeps repeating and rephrasing what I said, ends every single answer with a "hook" that makes the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and -- most importantly -- doesn't feel helpful for brainstorming. I was completely surprised by how bad it is in practice; this should be their killer app, but the model feels incredibly badly tuned.

MagicMoonlight · yesterday at 9:13 PM

It’s possible to change the amount of time it waits if you’re using the API.
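For the curious: OpenAI's Realtime API exposes a server-side VAD config on the session, including a silence duration. The field names below reflect the documented `turn_detection` settings as I recall them and may have changed, so treat this as a sketch and check the current docs:

```python
# Hedged sketch: a session.update payload for OpenAI's Realtime API
# that lengthens the pause before the model decides you're done.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,             # speech-probability cutoff
            "prefix_padding_ms": 300,     # audio kept before detected speech
            "silence_duration_ms": 1500,  # wait longer before ending the turn
        }
    },
}
```

You'd send this as a JSON event over the Realtime WebSocket connection; the consumer ChatGPT app exposes no equivalent knob, which is the complaint upthread.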
