>Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.
>SOTA typically refers to achieving the best performance
Multimodal Transformers are the best way to turn plain text instructions into embodied world behavior; it has nothing to do with being 'trendy'. A Vision Language Action model would probably have done much better, but the only real difference between that and the models trialed above is the training data. Same underlying technology.
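To make the "same technology, different training data" point concrete, here is a minimal toy sketch (hypothetical names, not any real VLA codebase): one stand-in Transformer class serves as either an "LLM" or a "VLA", and the only thing that changes is the output vocabulary the model is trained to emit.

```python
# Illustrative toy only: TinyTransformer is a placeholder, not a real model.
class TinyTransformer:
    """Stand-in for a decoder-only Transformer: maps token ids to token ids."""

    def __init__(self, vocab):
        # The output vocabulary is the only task-specific part here;
        # in a real model the architecture is identical either way.
        self.vocab = vocab

    def generate(self, prompt_tokens):
        # A real model would run attention + MLP layers; this is a placeholder.
        return [t % len(self.vocab) for t in prompt_tokens]


# An "LLM": vocabulary of text tokens.
llm = TinyTransformer(vocab=["the", "cat", "sat"])

# A "VLA": identical class, but a vocabulary of discretized robot actions.
vla = TinyTransformer(vocab=["move_arm(+1)", "move_arm(-1)", "grip()"])

assert type(llm) is type(vla)  # same architecture; only the data differs
```

The point of the sketch: nothing architectural distinguishes the two instances; training one on web text versus robot trajectories is what makes it a language model or an action model.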