This made me compare the figures: did they accidentally switch them around, or are the Post-training Reasoning and Factuality scores actually significantly lower than the Pre-training ones?
Edit: Just noticed this note:
> Also note pre-training and post-training benchmarks are different, so scores are not comparable across plots.
The paper gives more details about the specific benchmarks and the scores obtained in them: https://arxiv.org/html/2512.14856v1#S4