Glad to see open source models catching up and treating vision as a first-class citizen (a.k.a. native multimodal agentic models). GLM and Qwen take a different approach, shipping a base model and a separate vision variant (glm-4.6 vs glm-4.6v).
I guess after Kimi K2.5, other vendors will go the same route?
Can't wait to see how this model performs on computer automation use cases like VITA AI Coworker.