This is what VLA models are for; they'd work much better here. They'd need a bit of fine-tuning, but probably not much. There's a lot of literature out there on using VLAs to control drones.
I don't understand. Surely training an LSTM on sensor input is a more practical and reasonable approach than trying to get a text generator to speak commands to a drone.
This is neat! It's a bit amusing in that I worked on a somewhat similar project for my PhD thesis almost 10 years ago, although in that case we got it working on a real drone (heavily customized, based on a DJI Matrice) in the field, with only onboard compute. Back then it was just a fairly lightweight CNN for the perception, not that we could've gotten much more out of the Jetson TX2.
On the discussion of the right or wrong tool, I find it possible that the ability to reason towards a goal is more valuable in the long run than an intrinsic ability to achieve the same result. Or maybe a mix of both is the ideal.
I think it's fascinating work even if LLMs aren't the ideal tool for this job right now.
There were some experiments with embodied LLMs on the front page recently (e.g. basic robot body + task) and SOTA models struggled with that too. And of course they would - what training data is there for embodying a random device with arbitrary controls and feedback? They have to lean on the "general" aspects of their intelligence which is still improving.
With dedicated embodiment training and an even tighter/faster feedback loop, I don't see why an LLM couldn't successfully pilot a drone. I'm sure some will still fall off the rails, but software guardrails could help by preventing certain maneuvers.
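By guardrails I mean something as simple as a validator sitting between the model and the flight controller. Rough sketch only -- the command format and limits here are made up, not from any real autopilot API:

  MAX_SPEED, MIN_ALT, MAX_ALT = 2.0, 1.0, 30.0   # m/s, meters (made-up limits)

  def sanitize(cmd):
      """Clamp whatever the LLM proposes into a safe envelope."""
      cmd["vx"] = max(-MAX_SPEED, min(MAX_SPEED, cmd.get("vx", 0.0)))
      cmd["vy"] = max(-MAX_SPEED, min(MAX_SPEED, cmd.get("vy", 0.0)))
      cmd["alt"] = max(MIN_ALT, min(MAX_ALT, cmd.get("alt", MIN_ALT)))
      return cmd

The model can hallucinate all it wants; the worst it can do is a slow, low-altitude wander.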
Why would you want an LLM to fly a drone? Seems like the wrong tool for the job -- it's like saying "Only one power drill can pound roofing nails". Maybe that's true, but just get a hammer.
Interesting. In some benchmarks I even see flash outperforming thinking in general reasoning.
I am curious how these models would perform, and how much energy they'd take, detecting objects in semi-real time: SmolVLM2-500M, Moondream 0.5B/2B/2.5B, Qwen3-VL (3B) https://huggingface.co/collections/Qwen/qwen3-vl
I am sure this is already being worked on in Russia, Ukraine, and the Netherlands. A lot can go wrong with autonomous flying. One could load the VLM onto a high-end Android phone on the drone and have dual control.
> I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
> Only one could do it.
If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.
In a real-world test you'd give the LLM a fairly high-level tool call like GoTo(object); that tool then calls another program which identifies the objects in frame and uses standard routines to navigate to them.
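Something like this, roughly (all helper names are hypothetical stand-ins, not anything from the post -- detect_objects() would be whatever vision model you run, navigate_to() a conventional waypoint controller):

  def goto(target_label, camera, drone):
      # ask the vision model for labeled detections in the current frame
      detections = detect_objects(camera.frame())   # hypothetical detector call
      match = next((d for d in detections if d.label == target_label), None)
      if match is None:
          return "not visible"
      # project the detection into world coordinates and hand off to a
      # deterministic controller; the LLM never touches raw pixels
      waypoint = camera.project_to_world(match.bbox_center, match.depth)
      navigate_to(drone, waypoint)                  # hypothetical controller call
      return f"arrived at {target_label}"

The LLM only ever emits goto("creature"); everything below that line is ordinary deterministic code.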
At least he's not feeding real drones to the coyotes... oh, there's a link in the readme https://github.com/kxzk/tello-bench
LLMs are trained on text. Why would we expect them to understand a visual and tactile 3D world?
I can’t really take this too seriously. This seems to me to be a case of asking “can an LLM do X?” Instead, the question I’d like to see is: “I want to do X; is an LLM the right tool?”
But that said, I think the author missed something. LLMs aren’t great at this type of reasoning/state task, but they are good at writing programs. Instead of asking the LLM to search with a drone, it would be very interesting to know how they’d perform if you asked them to write a program to search with a drone.
This is more aligned with the strengths of LLMs, so I could see this as having more success.
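For instance, instead of "fly the drone step by step", prompt it with "write a search routine over this API". The sort of thing I'd expect it to emit (move_to() and visible_creatures() are hypothetical stand-ins, not the benchmark's actual tools):

  def lawnmower_search(world_size, altitude, step=5):
      """Sweep the map in strips, checking for creatures at each stop."""
      found = set()
      for i, x in enumerate(range(0, world_size, step)):
          cols = range(0, world_size, step)
          for z in (cols if i % 2 == 0 else reversed(cols)):
              move_to(x, altitude, z)               # blocking waypoint move
              found |= set(visible_creatures())     # detection tool call
              if len(found) >= 3:
                  return found
      return found

A boring exhaustive sweep like that would plausibly beat every model in the chart, which is sort of the point about picking the right tool.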
Gemini Flash beats Gemini Pro? How does that work?
Gemini Pro, like the other models, didn't even find a single creature.
This sounds like a good way to get your drone shot down by a Concerned Citizen or the military.
LLMs flying weaponized drones is exactly how it starts.
"drone"
Gemini 3 is the only model I've found that can reason spatially. The results here are consistent with my experiments putting LLM NPCs in simulated worlds.
I was surprised that most VLMs cannot reliably tell if a character is facing left or right; they will confidently lie no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.
That said, Qwen3-VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene and where they are (which you can project to 3D space if you are running a sim). But they're not good reasoning models, so don't ask them to think.
That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning. This basically turns 3D sims into 1D text sims operating on labels -- which is something that LLMs are good at.
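Concretely the pipeline looks something like this (helper names are hypothetical; grounded_detect() stands in for a grounded VLM like Qwen3-VL returning pixel boxes, reasoning_llm() for any text-only model):

  def describe_scene(frame, camera):
      # grounded VLM prepass: one (label, pixel bbox) pair per detected object
      detections = grounded_detect(frame)
      lines = []
      for label, bbox in detections:
          # project the box center into world space via the sim's camera
          x, y, z = camera.unproject(bbox_center(bbox))
          lines.append(f"{label} at x={x:.0f}, y={y:.0f}, z={z:.0f}")
      return "\n".join(lines)

  # the reasoning model only ever sees lines like "creature at x=12, y=3, z=40"
  # and picks the next action from that text
  action = reasoning_llm(system_prompt, describe_scene(frame, camera))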