KoolKat23 • 06/24/2025 • 2 replies

Actually very close to one I'd say.

It's a "vision-language-action" (VLA) model "built on the foundations of Gemini 2.0".

Since Gemini 2.0 natively supports language, audio and video, I suspect it has been adapted to handle "action" data natively too, perhaps only as an output added during fine-tuning rather than as a full input/output modality at the training stage (given its Gemini 2.0 foundation).

Natively multimodal LLMs are basically brains.

Replies

quantumHazer • 06/24/2025

> Natively multimodal LLMs are basically brains.

Absolutely not.


martythemaniak • 06/24/2025

OpenVLA is basically a slightly modified, fine-tuned Llama 2. I found the launch/intro talk by the lead author to be quite accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk
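
For a rough idea of what "action" as an output can look like in a fine-tuned LLM: as I understand the OpenVLA paper, each action dimension is discretized into 256 bins, and those bins are mapped onto rarely used tokens in the Llama tokenizer. Below is a minimal sketch of that idea, not OpenVLA's actual code. The vocabulary size, the [-1, 1] action range and the function names are my own assumptions for illustration.

```python
# Hypothetical sketch: map continuous robot actions to token ids and back.
# VOCAB_SIZE, LOW/HIGH and both function names are illustrative assumptions.

VOCAB_SIZE = 32000      # assumed Llama-2-style vocabulary size
N_BINS = 256            # one token id per action bin
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def action_to_tokens(action):
    """Discretize each action dimension and map it to one of the last N_BINS token ids."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        # Scale to [0, N_BINS] and clamp so that value == HIGH lands in the last bin.
        bin_index = min(int((clipped - LOW) / (HIGH - LOW) * N_BINS), N_BINS - 1)
        tokens.append(VOCAB_SIZE - N_BINS + bin_index)
    return tokens


def tokens_to_action(tokens):
    """Invert the mapping, returning the center of each bin."""
    width = (HIGH - LOW) / N_BINS
    first_action_token = VOCAB_SIZE - N_BINS
    return [LOW + (t - first_action_token + 0.5) * width for t in tokens]


if __name__ == "__main__":
    tokens = action_to_tokens([-1.0, 0.0, 1.0])
    print(tokens)
    print(tokens_to_action(tokens))
```

The round trip is lossy: each decoded value is only within half a bin width of the original, which is the price of treating actions as ordinary vocabulary tokens.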

