Because LLMs are bad at reviewing code for the same reasons they are bad at writing it? They get fooled by clean-looking syntax and take long descriptions and comments at face value without considering the greater context.
I don't know; I prompted Opus 4.5 with "Tell me the reasons why this report is stupid" on one of the example slop reports, and it returned a list of pretty good answers. [1]
Give it a presumption of guilt and tell it to make a list, and an LLM can do a pretty good job of judging crap. You could very easily rig up a system that generates this "why is it stupid" critique, grades each incoming report, and only lets humans see the ones that score better than a B+.
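For illustration only, here's a minimal Python sketch of that triage loop using the Anthropic SDK; the model id, prompts, and letter-grade rubric are placeholder assumptions of mine, not a tested pipeline:

    # Rough sketch of the triage idea: ask the model why the report is weak,
    # have it assign a letter grade, and only escalate high-scoring reports.
    # Model id and prompts are assumptions, not a production setup.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def critique(report: str) -> str:
        # Presumption-of-guilt pass: list reasons the report is bad.
        msg = client.messages.create(
            model="claude-opus-4-5",  # assumed model id
            max_tokens=1024,
            messages=[{"role": "user",
                       "content": f"Tell me the reasons why this report is stupid:\n\n{report}"}],
        )
        return msg.content[0].text

    def grade(report: str, critique_text: str) -> str:
        # Second pass: collapse the critique into a single letter grade.
        msg = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=10,
            messages=[{"role": "user",
                       "content": "Given this bug report and critique, grade the report's "
                                  "credibility as a single letter grade (A+ through F). "
                                  "Reply with only the grade.\n\n"
                                  f"Report:\n{report}\n\nCritique:\n{critique_text}"}],
        )
        return msg.content[0].text.strip()

    def should_escalate(report: str) -> bool:
        # Only let humans see reports that grade better than a B+.
        return grade(report, critique(report)) in {"A-", "A", "A+"}

If single letter grades turn out too noisy, the same two-pass structure works with a numeric rubric and a threshold instead.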
If you give them the right structure, I've found LLMs to be much better at judging things than at creating them.
Opus' verdict at the end:
"This is a textbook example of someone running a sanitizer, seeing output, and filing a report without understanding what they found."
1. https://claude.ai/share/8c96f19a-cf9b-4537-b663-b1cb771bfe3f