> We asked it [GPT-5.5] to assess whether each leader had made a falsifiable claim about the future as part of its main thesis. About 1,400 did. We then extracted those predictions, and asked the AI to mark out of ten both how contrarian the leader’s outlook was at the time and how accurate the prediction turned out to be. We ran those queries several times and took an average.
I understand why it wouldn’t be feasible for a human to do this, but I’m quite sceptical about an AI assessing how accurate predictions turned out to be, or how contrarian they were at the time. It seems like that would depend a lot on what sources it chooses, be liable to hallucination or getting poisoned by bad sources, etc. They also don’t mention whether they used an independent query for each prediction, or whether it was scoring several of them sequentially in the same session.
Given that LLMs can’t really distinguish prompt from instructions, etc., I’m sceptical that they can reason particularly well about things like how contrarian a view was at a particular point in time.
> We ran those queries several times and took an average.
The whole thing is bizarre. They could at least have drawn 100 samples and evaluated the model's responses on those. But no, they ran it a few times and hoped that the slight randomness in sampling magically resulted in a balanced assessment of accuracy and contrarianness, for a point in time where we don't even know how much training data there was, nor whether that data accurately reflects the opinions of that time. But hey, we've got a graph, so it must be true.
How can people be so stupid?
> cheap bioethanol ... [has] yet to bring about the breakthroughs we prophesied.
Oh.
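For what it's worth, the "draw 100 samples and evaluate" check is cheap to set up. A rough sketch, assuming you have the model's 0-10 scores keyed by some prediction id and a person willing to hand-grade a random subset. Every name here (`draw_sample`, `agreement`, the id scheme) is made up for illustration; none of it comes from the article.

```python
import random
from statistics import mean


def draw_sample(prediction_ids, k=100, seed=0):
    """Pick k prediction ids at random for hand grading (seeded so it is repeatable)."""
    rng = random.Random(seed)
    ids = sorted(prediction_ids)
    return rng.sample(ids, min(k, len(ids)))


def agreement(model_scores, human_scores):
    """Compare model and human 0-10 scores on the ids both have graded.

    Returns (mean absolute error, Pearson correlation).
    """
    ids = sorted(set(model_scores) & set(human_scores))
    if len(ids) < 2:
        raise ValueError("need at least two shared predictions")
    m = [model_scores[i] for i in ids]
    h = [human_scores[i] for i in ids]
    mae = mean(abs(a - b) for a, b in zip(m, h))
    mm, mh = mean(m), mean(h)
    cov = sum((a - mm) * (b - mh) for a, b in zip(m, h))
    var_m = sum((a - mm) ** 2 for a in m)
    var_h = sum((b - mh) ** 2 for b in h)
    if var_m == 0 or var_h == 0:
        # Correlation is undefined when one side gives every item the same score.
        return mae, float("nan")
    return mae, cov / (var_m * var_h) ** 0.5
```

A large error or a weak correlation on the hand-graded subset would tell you the averaged scores aren't worth plotting. Averaging repeated runs only smooths out sampling noise; it does nothing about a consistent bias, which is the thing this check is meant to catch.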
You can still judge whether it was correct or not, irrespective of contrarianness. That's still a good signal to measure.
I thought the same while reading this - I can easily imagine the AI letting hindsight colour "how contrarian the leader's outlook was at the time", much as people often do.
It would be interesting to do a similar analysis but push the AI to search the web for articles and other reporting written at the time, to at least partly correct for that bias.
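The key step there would be throwing away anything published after the leader before the model sees it. A minimal sketch of that filter, assuming the search step hands back articles with a known publication date; the `Article` type, the `contemporaneous` helper and the one-year window are all hypothetical choices of mine, not anything the article describes.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    title: str
    published: date


def contemporaneous(articles, leader_date, window_days=365):
    """Keep articles published within window_days up to and including leader_date."""
    earliest = leader_date - timedelta(days=window_days)
    return [a for a in articles if earliest <= a.published <= leader_date]
```

This doesn't remove hindsight baked into the model's training data, but it at least stops later reporting from leaking into the evidence it is asked to weigh.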