Fair points, especially on GSM8K saturation and on Qwen possibly already sitting close to the solution. That said, even if this is mostly "last-mile alignment", the fact that it can be done with such a tiny signal is still interesting: it suggests the gap between capability and behavior might be much smaller (and cheaper to bridge) than we assume.