So we either need to generate benchmarks after the models finish training, or we need to keep the solutions to the benchmark problems closed source.