Internal evals. Big AI certainly has good proprietary training and eval data; it's one reason why their models are better.
Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.