This seems to be incorporated into current LLM generations already -- when code execution is enabled, both GPT-5.x and Claude 4.x seem to automatically execute Python code to help with reasoning steps.
I remember seeing that GPT-5 had two Python tools defined in its leaked prompt; one of them would hide the output from the user-visible chain-of-thought UI.
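Roughly something like this, I'd guess (purely a sketch of the shape in a standard function-calling schema -- the tool names, descriptions, and fields here are hypothetical, not the actual leaked prompt):

```python
# Hypothetical sketch of two Python execution tools, one whose output is shown
# to the user and one whose output stays internal. Names and descriptions are
# guesses, not the real leaked prompt.
python_tools = [
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Run Python code; the result is shown to the user.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python_private",
            "description": "Run Python code for internal reasoning; the output "
                           "is not surfaced in the user-visible chain of thought.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    },
]
```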
Same with CoT prompting.
If you compare the output of a CoT prompt against a control prompt, the current generation of models will include the reasoning steps either way.
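Easy to check yourself with something like this (a minimal sketch using the OpenAI Python SDK; the model name is just a placeholder for whatever you're testing):

```python
# Send the same question with and without an explicit CoT cue and compare the
# outputs. Assumes the OpenAI Python SDK and an API key in the environment.
from openai import OpenAI

client = OpenAI()
question = (
    "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the "
    "ball. How much does the ball cost?"
)

for label, prompt in [
    ("control", question),
    ("cot", question + " Let's think step by step."),
]:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in the model you want to test
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```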
Yeah, this is honestly one of the coolest developments in newer models.
This was integrated into GPT-4 two years ago:
https://www.reddit.com/r/ChatGPT/comments/14sqcg8/anyone_els...