Philpax • 02/20/2025 • 1 reply

Have a look at the post - it explains how it works. There are two models: a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model. The former produces a latent vector, which is then interpreted by the latter to drive the motors.
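A rough way to picture the two-rate split is a single loop that ticks at the fast rate and only refreshes the latent every so often. The sketch below is a toy under stated assumptions: `slow_model`, `fast_model`, and `run` are hypothetical stand-ins, the 8 Hz figure is just one value inside the quoted 7-9 Hz range, and the "models" are trivial arithmetic rather than anything from the post.

```python
# Toy two-rate control loop. All names and formulas here are illustrative
# stand-ins (assumptions), not the actual system described in the post.

FAST_HZ = 200  # visuomotor model rate, as quoted in the comment
SLOW_HZ = 8    # assumed value inside the quoted 7-9 Hz range


def slow_model(observation: float) -> list[float]:
    """Stand-in for the large vision-language model: returns a latent vector."""
    return [observation, observation * 0.5]


def fast_model(latent: list[float], observation: float) -> float:
    """Stand-in for the small visuomotor model: latent + observation -> motor command."""
    return sum(latent) - observation


def run(seconds: float = 1.0) -> tuple[int, int, list[float]]:
    steps_per_update = FAST_HZ // SLOW_HZ  # fast ticks between latent refreshes
    total_steps = int(seconds * FAST_HZ)
    latent = [0.0, 0.0]
    slow_calls = 0
    commands: list[float] = []
    for step in range(total_steps):
        observation = step / FAST_HZ  # fake sensor reading: elapsed time in seconds
        if step % steps_per_update == 0:
            latent = slow_model(observation)  # refreshed only occasionally
            slow_calls += 1
        commands.append(fast_model(latent, observation))  # runs on every tick
    return slow_calls, len(commands), commands


slow_calls, fast_calls, commands = run()
print(slow_calls, fast_calls)  # 8 200
```

Over one simulated second the slow stand-in runs 8 times and the fast one 200 times, with the fast one reusing the most recent latent in between. A real system would more plausibly run the two models asynchronously rather than in lockstep like this.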

Replies

NitpickLawyer • 02/20/2025

> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

Huh, an interesting approach. I wonder if something like this could be used for other things as well, like "computer use": the same concept, with a "large" model handling the goals and a "small" model handling clicking and stuff at much higher rates. That could be useful for games and things like that.
