Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
They aren't training new models for this. This is an agent harness for Opus 4.6.
They aren't training new models for this. This is an agent harness for Opus 4.6.