Even multimodal models are still really bad when it comes to vision. The strength is still definitely language.