The study assumes that the car or drone is being guided by a LLM. Is this a correct assumption? I would thought that they use custom AI for intelligence.
To the best of my knowledge every major autonomous vehicle and robotics company is integrating these LVLMs into their systems in some form or another, and an LVLM is probably what you're interacting with these days rather than an LLM. If it can generate images or read images, it is an LVLM.
The problem is no different from LLMs though, there is no generalized understanding and thus they can not differentiate the more abstract notion of context. As an easy to understand example: if you see a stop sign with a sticker that says "for no one" below you might laugh to yourself and understand that in context that this does not override the actual sign. It's just a sticker. But the L(V)LMs cannot compartmentalize and "sandbox" information like that. All information is equally processed. The best you can do is add lots of adversarial examples and hope the machine learns the general pattern but there is no inherent mechanism in them to compartmentalize these types of information or no mechanism to differentiate this nuance of context.
I think the funny thing is that the more we adopt these systems the more accurate the depiction of hacking in the show Upload[0] looks.
[0] https://www.youtube.com/watch?v=ziUqA7h-kQc
Edit:
Because I linked elsewhere and people seem to doubt this, here is Waymo a few years back talking about incorporating Gemini[1].
Also, here is the DriveLM dataset, mentioned in the article[2]. Tesla has mentioned that they use a "LLM inspired" system and that they approach the task like an image captioning task[3]. And here's 1X talking about their "world model" using a VLM[4].
I mean come on guys, that's what this stuff is about. I'm not singling these companies out, rather I'm using as examples. This is how the field does things, not just them. People are really trying to embody the AI and the whole point of going towards AGI is to be able to accomplish any task. That Genie project on the front page yesterday? It is far far more about robots than it is about videogames.
[1] https://waymo.com/blog/2024/10/introducing-emma/
[2] https://github.com/OpenDriveLab/DriveLM
Its an incorrect assumption, the inference speed and particularly the inference speed of the on-device LLMs with which AVs would need to be using is not compatible with the structural requirements of driving.