logoalt Hacker News

dtranyesterday at 9:58 PM2 repliesview on HN

This has more to do with Voice Activity Detection (VAD) than the latency described in the article


Replies

lxgryesterday at 11:08 PM

That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable enough.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

show 1 reply
wnmurphyyesterday at 11:18 PM

Exactly. It's a tangent, but clearly a pain point for enough users.