"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?
So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.
However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.
Similar than a kid that never saw a banana, but his parent described it to him.
It's normal to have a training set and a validation set and I interpreted that to mean that these items weren't in the training set.
It probably gives them confidence that they can accurately see a thing even though they don't know what that thing is.
I could also imagine a lot of safety around leaving things outside of the current task alone so you might have to bend over backwards to get new objects worked on.