
vessenes · last Thursday at 7:58 PM

I was reading their site, and I too have some questions about this architecture.

I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input plus some input from the big model, and we know that the input from the big model only updates every 30 or 40 small-model frames.
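To make the question concrete, here's a toy sketch of the kind of two-rate setup I'm imagining: a slow planner writes a conditioning vector that gets cached and reused for ~40 control steps while a fast policy runs every step. All the module names and sizes are invented; this is just the shape of the thing, not their actual architecture.

```python
import torch
import torch.nn as nn

class SlowPlanner(nn.Module):          # stands in for the large model
    def __init__(self, obs_dim=512, plan_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, plan_dim))
    def forward(self, obs):
        return self.net(obs)           # "plan" embedding handed to the small model

class FastController(nn.Module):       # stands in for the small, high-rate policy
    def __init__(self, obs_dim=64, plan_dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + plan_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))
    def forward(self, obs, plan):
        return self.net(torch.cat([obs, plan], dim=-1))

planner, controller = SlowPlanner(), FastController()
plan = None
for step in range(120):
    rich_obs = torch.randn(1, 512)     # camera features etc. for the big model
    fast_obs = torch.randn(1, 64)      # proprioception etc. for the small model
    if step % 40 == 0:                 # big-model output refreshed every ~40 steps
        plan = planner(rich_obs)
    action = controller(fast_obs, plan)
```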

Like, do they just have the big model output random control tokens, embed those in the small model, and do gradient descent to find a good control 'language'? Do they train the small model on English tokens and have the big model output those? Custom coordinate tokens? (Probably.) Lots of interesting possibilities here.
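For the learned-control-language option, I'm picturing something like a small learned vocabulary at the interface: the big model emits logits over it, the small model embeds whatever comes out, and the whole thing trains end-to-end (straight-through argmax so gradients flow). Vocab size, dimensions, and names below are all made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, N_TOKENS, D = 64, 8, 128               # invented control vocabulary and token count

big_head = nn.Linear(512, N_TOKENS * VOCAB)   # big model's output head
tok_embed = nn.Embedding(VOCAB, D)            # small model's input embedding table

def control_tokens(big_hidden):
    logits = big_head(big_hidden).view(-1, N_TOKENS, VOCAB)
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(-1), VOCAB).float()
    st = hard + probs - probs.detach()        # straight-through: discrete forward, soft backward
    return st @ tok_embed.weight              # (batch, N_TOKENS, D) fed into the small model

emb = control_tokens(torch.randn(2, 512))
print(emb.shape)  # torch.Size([2, 8, 128])
```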

By the way, the dataset they describe was generated by a (presumably much larger) vision model tasked with writing task descriptions from successful videos.

So the pipeline is (rough code sketch after the list):

* Video of robot doing something

* (o1 or some other high-end model): "describe very precisely the task the robot was given"

* o1 output -> 7B model -> small model -> loss
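In code, the data flow I'm picturing looks roughly like the stub below. describe_task() stands in for the captioning step, the two policy stages are placeholders, and the loss is plain behavior cloning against the logged actions; only the wiring is the point, everything else is assumed.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    frames: list          # raw video frames of a successful rollout
    actions: list         # the robot actions actually executed

def describe_task(episode: Episode) -> str:
    # placeholder for "describe very precisely the task the robot was given"
    return "pick up the red block and place it in the bin"

def encode_task_7b(task_text: str) -> list:
    return [0.0] * 256    # stand-in for the 7B model's conditioning embedding

def small_model(frame, task_embedding) -> list:
    return [0.0] * 7      # stand-in for the high-rate policy's predicted action

def behavior_cloning_loss(pred, target) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def training_step(episode: Episode) -> float:
    task_text = describe_task(episode)        # caption the successful video
    task_emb = encode_task_7b(task_text)      # o1 output -> 7B model
    total = 0.0
    for frame, action in zip(episode.frames, episode.actions):
        pred = small_model(frame, task_emb)   # 7B conditioning -> small model
        total += behavior_cloning_loss(pred, action)   # -> loss
    return total
```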