So have an AI with a 40% error rate judge an AI with a 40% error rate…
AGI is a complete no-go until a model can adjust its own weights on the fly, which requires some kind of negative feedback loop, which in turn requires a means of detecting failure.
Humans have pain receptors to provide negative feedback, and we can imagine that events such as driving into a parked car would be painful without having to experience them.
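The feedback loop described above can be sketched in a toy form: a "model" (here just a single linear weight) makes a prediction, a failure signal measures how wrong it was, and the weight is adjusted on the fly. This is a minimal illustration of the principle, not any real system's mechanism; all names are made up.

```python
# Toy online negative-feedback loop: the error signal is the "pain",
# and the model uses it to adjust its own weight after every prediction.

def feedback_loop(samples, lr=0.1):
    w = 0.0  # the model's single adjustable weight
    for x, target in samples:
        prediction = w * x
        error = target - prediction  # failure detection: how wrong were we?
        w += lr * error * x          # adjust the weight on the fly
    return w

# Repeated exposure to y = 2x drives w toward 2 without any offline retraining.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 50
w = feedback_loop(data)
```

The point of the sketch is that learning happens *during* operation, driven by detected failure, rather than in a separate, expensive training run.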
If current models could adjust their own weights to fix the famous "how many r's in strawberry" problem, then I would say we are on the right path.
However, the current solution is to detect the question and forward it to a function that determines the right answer, or to add more training data the next time the model is trained ($$$). Aka, cheating the test.
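The "detect and forward" workaround looks something like the following sketch. This is an illustration of the pattern, not any vendor's actual routing code; the function names and the regex are assumptions made up for the example.

```python
import re

def count_letter(word, letter):
    """Deterministic tool: count occurrences of a letter in a word."""
    return word.lower().count(letter.lower())

def model_generate(question):
    """Stand-in for the underlying model's (fallible) free-form answer."""
    return "2"  # the famously wrong guess

def answer(question):
    # Pattern-match the known-bad question shape and bypass the model
    # entirely, rather than the model fixing its own weights.
    m = re.search(r"how many (\w)'?s? in (\w+)", question.lower())
    if m:
        letter, word = m.groups()
        return str(count_letter(word, letter))
    return model_generate(question)
```

With this wrapper, `answer("How many r's in strawberry")` returns "3", but only because the question was intercepted; the model itself has learned nothing.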