I wonder if, at this point, they read up on what people benchmark with and specifically train it to do well at this task.