Hacker News

polskibus · 06/24/2025

What is the model architecture? I'm assuming it's far away from LLMs, but I'm curious to learn more. Can anyone provide links that describe VLA architectures?


Replies

KoolKat23 · 06/24/2025

Actually very close to one I'd say.

It's a "visual language action" VLA model "built on the foundations of Gemini 2.0".

As Gemini 2.0 has native language, audio and video support, I suspect it has been adapted to include native "action" data too, perhaps only via output fine-tuning rather than as an input/output modality at the training stage (given its Gemini 2.0 foundation).
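To make that concrete, here's a toy PyTorch sketch of the usual VLA trick: discretize robot actions into extra tokens and let the same autoregressive head predict them alongside text. This is purely illustrative, not Gemini's actual design; the class name, the backbone's call signature, and the binning scheme are all assumptions on my part.

    import torch.nn as nn

    class ToyVLAHead(nn.Module):
        """Toy sketch: bolt an action-token head onto a frozen multimodal backbone."""
        def __init__(self, backbone, hidden_dim, text_vocab_size, action_dims, action_bins):
            super().__init__()
            self.backbone = backbone          # pretrained VLM (hypothetical interface)
            for p in self.backbone.parameters():
                p.requires_grad = False       # freeze the foundation, tune only the head
            # Output vocabulary = text tokens + discretized action tokens
            # (each action dimension quantized into `action_bins` buckets).
            self.lm_head = nn.Linear(hidden_dim, text_vocab_size + action_dims * action_bins)

        def forward(self, vision_tokens, text_tokens):
            # Backbone fuses image and text tokens into one sequence of hidden states
            # (assumed shape: batch x seq x hidden_dim).
            hidden = self.backbone(vision_tokens, text_tokens)
            # Next-token logits now cover both language and action tokens, so the
            # same autoregressive decoding loop can emit robot commands.
            return self.lm_head(hidden)

For real architectural detail on this actions-as-tokens approach, the RT-2 and OpenVLA papers are good starting points.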

Natively multimodal LLMs are basically brains.
