Even with search grounding, it scored 2.5/5 on a basic botanical benchmark. It would take the average human much longer to do a similar write-up, but with access to a search engine they would likely do better than 50% hallucination.
Even multimodal models are still really bad when it comes to vision. Their strength is still definitely language.