Another thing that makes the whole process very hard to verify is what I tried to address in a much older comment: whenever LLMs are used (regardless of the input dataset) by someone who is an expert in the domain (rather than a novice), how can one evaluate what was done by whom, or by what? Sure, there can be a positive result, e.g. a solution to a previously unsolved problem, but what does that say about the tool itself versus the user who, by definition of being an expert, is already up to date on the state of the art?