Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
They do, in the paper they mention they evaluate the LLM without tools
They do, in the paper they mention they evaluate the LLM without tools