These papers usually have poor stability to prompting and rerunning. It would be nice if we had some kind of meta-evaluation metric where rewriting the prompt conditions or varying the input params could be used to determine how stable a result is.
Regardless, it's definitely true that AI agents have different priorities from us. That's what alignment is about anyway.
It's probably because they care more about the headline than figuring anything out: https://github.com/kennethpayne01/project_kahn_public/issues...
So you create leading prompts like that, and re-run until you get a publishable session.