It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
I found the summary above devoid of useful advice. What did you see as useful advice in it?
> if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).
If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
> It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
So go repeat the exercise yourself. I've already said this was a short-enough-to-post rollup of a much longer LLM assessment of the skills, and that while most of the points were fair, some were questionable. If you were doing this "for real" you'd need to assess the full response point by point and decide which points were valid.
> If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
What on earth are you on about? The whole point of the sentence you were replying to was that you can't blindly trust what comes out of them.