Hacker News

NitpickLawyer, last Thursday at 3:55 PM

> a 7-9Hz 7B vision-language model, and a 200Hz 80M visuomotor model.

Huh, an interesting approach. I wonder if something like this could be used for other things as well, like "computer use", with the same concept of a "large" model handling the goals and a "small" model handling clicking and the like at much higher rates. That would be useful for games and similar tasks.


Replies

whatever1, last Thursday at 4:29 PM

This is typical in real-time applications. A supervisor tries to guess which operating region the system is currently in and then invokes the correct set of lower-level algorithms.
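A rough sketch of that supervisor pattern, for illustration only. Everything here is an assumption rather than anything from the quoted system: the `Supervisor` class, the `threshold` boundary, the two gains, and the 25:1 tick ratio (roughly 200 Hz against 8 Hz) are all hypothetical. The slow loop only decides which region the system is in; the fast loop runs every tick with whatever setting was chosen last.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Supervisor:
    """Slow loop: guesses the operating region and picks a low-level gain."""
    threshold: float = 1.0  # hypothetical boundary between "far" and "near"

    def select_gain(self, error: float) -> float:
        # Far from the target: aggressive gain. Near the target: gentle gain.
        return 0.5 if abs(error) > self.threshold else 0.1


def run(target: float, steps: int, fast_per_slow: int = 25) -> Tuple[float, int]:
    """Simulate `steps` fast ticks; the supervisor runs once per `fast_per_slow` ticks."""
    supervisor = Supervisor()
    state = 0.0
    gain = 0.0
    supervisor_calls = 0
    for tick in range(steps):
        if tick % fast_per_slow == 0:
            # Slow path: re-evaluate the region and swap the low-level setting.
            gain = supervisor.select_gain(target - state)
            supervisor_calls += 1
        # Fast path: proportional step using the most recently selected gain.
        state += gain * (target - state)
    return state, supervisor_calls


if __name__ == "__main__":
    final_state, calls = run(target=10.0, steps=200)
    print(f"final state: {final_state:.6f}, supervisor calls: {calls}")
```

The point of the split is that the fast loop never waits on the supervisor: it keeps acting on a slightly stale decision until the next slow update arrives.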