> I try the LLMs every now and then, and they still make the same stupid hallucinations that ChatGPT did on day 1.
One of the tests I sometimes run on LLMs is a geometry puzzle:
You're on the equator facing south. You move forward 10,000 km along the surface of the Earth. You rotate 90° clockwise. You move another 10,000 km forward along the surface of the Earth. Rotate another 90° clockwise, then move another 10,000 km forward along the surface of the Earth.
Where are you now, and what direction are you facing?
They all used to get this wrong all the time. Now the best ones sometimes don't. (That said, the only one to succeed just now, as I write this comment, was DeepSeek; the first I saw succeed was one of ChatGPT's models, but that's now back to the usual error they all used to make.) Anecdotes are of course a bad way to study this kind of thing.
Unfortunately, so are the benchmarks, because the models have quickly saturated most of them, including traditional IQ tests (on the plus side, this has demonstrated that IQ tests are definitely a learnable skill, as LLMs lose 40-50 IQ points when going from public to private IQ tests) and stuff like the maths olympiad.
Right now, AFAICT the only open benchmarks are the METR time horizon metric, the ARC-AGI family of tests, and the "make me an SVG of ${…}" stuff inspired by Simon Willison's pelican on a bike.
Out of interest, was your intended answer "where you started, facing east"?
FWIW, Claude Opus 4.5 gets this right for me, assuming that is the intended answer. On request, it also gave me a Mathematica program which (after I fixed some trivial exceptions due to errors in units) informs me that, using the ITRF00 datum, the actual answer is 0.0177593 degrees north and 0.168379 degrees west of where you started (about 11.7 miles away from the starting point), and your rotation is 89.98 degrees rather than 90.
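For anyone who wants to check the idealized version without Mathematica, here is a minimal Python sketch. It assumes a perfect sphere (not the ITRF00 ellipsoid, so it won't reproduce the numbers above), treats "forward" as following a great circle, and takes "clockwise" to mean a right turn as seen from above. The function names and the 6371 km mean radius are my own choices for illustration.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def move(p, h, angle):
    """Advance `angle` radians along the great circle through unit
    position p in the direction of unit tangent (heading) h."""
    c, s = math.cos(angle), math.sin(angle)
    new_p = tuple(c * a + s * b for a, b in zip(p, h))
    new_h = tuple(-s * a + c * b for a, b in zip(p, h))
    return new_p, new_h

def turn_clockwise(p, h, angle):
    """Rotate heading h about the local vertical at p, turning right
    (clockwise as seen from above the surface)."""
    right = cross(h, p)  # e.g. facing north on the equator, this is east
    c, s = math.cos(angle), math.sin(angle)
    return tuple(c * a + s * b for a, b in zip(h, right))

def walk(leg_angle):
    """Run the puzzle with each leg spanning `leg_angle` radians of arc."""
    p = (1.0, 0.0, 0.0)    # on the equator at longitude 0
    h = (0.0, 0.0, -1.0)   # facing south
    p, h = move(p, h, leg_angle)
    h = turn_clockwise(p, h, math.pi / 2)
    p, h = move(p, h, leg_angle)
    h = turn_clockwise(p, h, math.pi / 2)
    p, h = move(p, h, leg_angle)
    return p, h

def describe(p, h):
    """Return (latitude, longitude, bearing) in degrees; bearing is
    measured clockwise from north."""
    lat = math.asin(max(-1.0, min(1.0, p[2])))
    lon = math.atan2(p[1], p[0])
    east = (-math.sin(lon), math.cos(lon), 0.0)
    north = (-math.sin(lat) * math.cos(lon),
             -math.sin(lat) * math.sin(lon),
             math.cos(lat))
    bearing = math.atan2(dot(h, east), dot(h, north))
    return math.degrees(lat), math.degrees(lon), math.degrees(bearing) % 360.0

if __name__ == "__main__":
    # Idealized puzzle: each leg is exactly a quarter of the circumference.
    print(describe(*walk(math.pi / 2)))
    # Literal 10,000 km legs on a sphere of radius 6371 km (assumed mean radius).
    print(describe(*walk(10_000 / 6371.0)))
```

With each leg set to exactly a quarter of the circumference, this ends up back at the start, facing east. With literal 10,000 km legs on a 6371 km sphere it lands close to, but not exactly on, the starting point, which is the same kind of small discrepancy the ellipsoid calculation shows.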
(ChatGPT 5.1 Thinking, for me, gets the wrong answer because it correctly gets near the South Pole and then follows a line of latitude 200 times round the South Pole for the second leg, which strikes me as a flatly incorrect interpretation of the words "move forward along the surface of the earth". Was that the "usual error they all used to make"?)