Hard problem. I find myself adding in filler to stop the thing from jabbering.
I also think it spends most of its IQ on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer, and maybe voice is more expensive to process? Text responses spend more time on the task.
Fwiw you can prompt it to respond differently to you.
Their voice-capable model is several generations behind the state-of-the-art text-only one, as far as I know.
I don’t think it even has reasoning tokens, so it’s no surprise that it’s at most as smart as the “instant” models (i.e., not very).