I suspect that LLMs are better at classifying novel vs junk papers than they are at creating novel papers themselves.
If so, I think the solution is obvious.
(But I remind myself that all complex problems have a simple solution that is wrong.)
> I suspect that LLMs are better at classifying novel vs junk papers than they are at creating novel papers themselves.
Doubt
LLMs are experts in generating junk. And generally terrible at anything novel. Classifying novel vs junk is a much harder problem.
Verification via LLM tends to break under quite small optimization pressure. For example I did RL to improve <insert aspect> against one of the sota models from one generation ago, and the (quite weak) learner model found out that it could emit a few nonsense words to get the max score.
That's without even being able to backprop through the annotator, and also with me actively trying to avoid reward hacking. If arxiv used an open model for review, it would be trivial for people to insert a few grammatical mistakes which cause them to receive max points.