The canonical example I use is how good are (philosophical) you at programming on a whiteboard given one shot and no tools? Vs at your computer given access to everything? So judging LLMs on that rubric seems as dumb as judging humans by that rubric.