I think it's just not a super smart model. They had to make a slight compromise to keep the latency low. The naturalness of the conversation that they did achieve is a great technical accomplishment with these types of constraints though.
For me, it said "are you comfortable sharing what that mark is on your forehead?" Or something like that. I said basically "I don't know maybe a wrinkle?". Lol. Kind of confirms for me why I should continue to avoid video chats. I did look like crap on general, really tired for one thing. And I am 46, so I have some wrinkles, although didn't know they were that obvious.
But a little bit of prompt guidance to avoid commenting on the visuals unless relevant would help. It's possible they actually deliberately put something in the prompt to ask it to make a comment just to demonstrate that it can see, since this is an important feature that might not be obvious otherwise.