I think this is still useful research that calls into question how “smart” these models are. If the model needs a separate tool to solve a problem, has the model really solved the problem, or just outsourced it to a harness that it’s been trained, via reinforcement learning, to call upon?
It has "outsourced" it to another component, sure, but does that matter?
What the user sees is the total behavior of the entire system, not whether the system has internal divisions and separations.
Does it matter whether the LLM can solve the problem itself or whether it knows to use a resource?
There’s plenty of math that I couldn’t even begin to solve without a calculator or other tool. That doesn’t mean I’m not solving math problems.
In woodworking, the advice is to let the tool do the work. Does someone using a power saw have less claim to having built something than a handsaw user? Does a CNC user not count as a woodworker because the machine is doing the part that would be hard or impossible for a human?