It's been clear for some time that model tool calling is heavily fit to a few common patterns, so it's unsurprising that when a tool looks the same or has the same name as a familiar one but works differently, the model falls back to its priors and causes problems.
Things are not quite AGI yet, which is why people are now saying that intelligence is the harness + model, because the harness makes up for limitations in generalization.