It worked for you because the paper runs the experiment without allowing the model to use any reasoning tokens, which is grossly misleading.