What about at inference time, i.e. in response to a query?
We let models write code and run it, which gives them a high chance of getting arithmetic right.
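A minimal sketch of that idea (not any vendor's API): ask the model to answer with Python, pull the code block out of its reply, and run it, trusting the interpreter rather than the model's mental arithmetic. `ask_model` here is a hypothetical stand-in for whatever chat call you use.

    import re
    import subprocess
    import sys

    def run_model_code(reply: str) -> str:
        """Extract the first fenced code block from a model reply and execute it."""
        match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
        if not match:
            return reply.strip()  # no code in the reply, fall back to the raw answer
        result = subprocess.run(
            [sys.executable, "-c", match.group(1)],
            capture_output=True, text=True, timeout=10,
        )
        return result.stdout.strip()

    # reply = ask_model("What is 123456789 * 987654321? Answer with Python code that prints it.")
    # print(run_model_code(reply))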
Solving the “crossing the river” problem by letting the model create and run a simulation would give a pretty high chance of getting it right.
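For example, here's the kind of simulation a model could write and run for itself: a brute-force BFS over states of the classic wolf/goat/cabbage version of the puzzle, so the answer comes out of search rather than pattern-matching.

    from collections import deque

    ITEMS = {"wolf", "goat", "cabbage"}

    def unsafe(bank):
        # A bank the farmer isn't on is unsafe if wolf+goat or goat+cabbage are left together.
        return {"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank

    def solve():
        start = (frozenset(ITEMS), "left")   # (items on the left bank, farmer's side)
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            (left, farmer), path = queue.popleft()
            if not left and farmer == "right":
                return path                  # everything is across
            here = left if farmer == "left" else ITEMS - left
            for cargo in list(here) + [None]:   # carry one item, or cross alone
                new_left = set(left)
                if cargo:
                    if farmer == "left":
                        new_left.discard(cargo)
                    else:
                        new_left.add(cargo)
                new_left = frozenset(new_left)
                unattended = new_left if farmer == "left" else ITEMS - new_left
                if unsafe(unattended):
                    continue
                state = (new_left, "right" if farmer == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [cargo or "nothing"]))

    print(solve())  # e.g. ['goat', 'nothing', 'wolf', 'goat', 'cabbage', 'nothing', 'goat']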
The newest Claude update comes with a Python sandbox built right into the API for exactly this reason.
https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...
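Roughly, using it through the SDK looks like the sketch below. The exact model name, beta flag, and tool type strings are assumptions from memory; check them against the linked docs before relying on this.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.beta.messages.create(
        model="claude-sonnet-4-20250514",           # assumed model name
        betas=["code-execution-2025-05-22"],        # assumed beta flag
        max_tokens=1024,
        tools=[{"type": "code_execution_20250522",  # assumed tool type
                "name": "code_execution"}],
        messages=[{
            "role": "user",
            "content": "A boat holds the farmer plus one of: wolf, goat, cabbage. "
                       "Write and run a simulation to find a safe crossing order.",
        }],
    )
    print(response.content)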