> I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
> Only one could do it.
If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.
No science detected.
Without a comparison against a baseline such as a random policy (the null hypothesis being that the models do no better), this article is hogwash.
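
For what it's worth, a random-policy baseline is cheap to estimate. Here is a rough sketch of what I mean, not the article's actual environment: the grid size, step budget, detection radius, and every function name below are my own assumptions, picked only for illustration.

```python
import random

# The six axis-aligned moves a drone can make in a voxel grid (assumed action set).
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def random_policy_episode(rng, size=16, n_creatures=3, steps=200, radius=2):
    """Run one episode of a uniformly random walk; return how many creatures it found."""
    # Place the creatures in distinct cells of a size x size x size grid.
    cells = rng.sample(range(size ** 3), n_creatures)
    creatures = {(c // size ** 2, (c // size) % size, c % size) for c in cells}
    pos = tuple(rng.randrange(size) for _ in range(3))
    found = set()
    for _ in range(steps):
        # A creature counts as found when it is within `radius` cells on every axis.
        for c in creatures - found:
            if max(abs(a - b) for a, b in zip(pos, c)) <= radius:
                found.add(c)
        # Pick a random move and clamp the position to the grid bounds.
        move = rng.choice(MOVES)
        pos = tuple(min(size - 1, max(0, p + d)) for p, d in zip(pos, move))
    return len(found)


def baseline_find_rate(episodes=1000, seed=0, **kwargs):
    """Fraction of all creatures found by the random policy, averaged over episodes."""
    rng = random.Random(seed)
    n_creatures = kwargs.get("n_creatures", 3)
    total = sum(random_policy_episode(rng, **kwargs) for _ in range(episodes))
    return total / (episodes * n_creatures)


if __name__ == "__main__":
    print(f"random-policy find rate: {baseline_find_rate():.3f}")
```

Whatever number this gives under the real environment's parameters is the one the 1/6 figure should be compared against; if the two are close, the "successful" model has not shown much.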