> I do think that SOTA LLMs like GPT-4o perform about as good as high school graduates in America with average intelligence.
Is this because the questions used in US high school exams are too simple, or because they follow patterns too similar to the training data? I tried really simple but novel questions that required true understanding of the underlying math concepts, and the results were consistently bad. I also tried questions at the level of high school entrance exams in China, and the results were equally bad. It was quite clear that the LLM didn't understand math. It could match some patterns, but that kind of pattern matching is only useful to students who are already skilled.
Which model? The field moves so fast it’s hard to validate statements like this without that info.
O1-preview?