That's no surprise. When I was working on game theory and agent reasoning, I reached the same conclusion a year ago.
My conclusion was that context needs to be managed carefully for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process (rough sketch below).
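To make that concrete, here's a toy plan-then-execute sketch of what I mean. call_llm is a made-up placeholder, not any particular API, and the step format is assumed; it's an illustration of the shape, not a production pattern:

    # Minimal plan-then-execute sketch (illustrative only). `call_llm` is a
    # hypothetical stand-in for whatever model client you actually use.

    def call_llm(prompt: str) -> str:
        # Stub so the sketch runs standalone; swap in a real API call.
        return f"[model reply to: {prompt[:40]}...]"

    def plan(task: str) -> list[str]:
        # Ask for an explicit step list up front -- this is the "guardrail".
        raw = call_llm(f"Break this task into numbered steps:\n{task}")
        return [line.strip() for line in raw.splitlines() if line.strip()]

    def execute(task: str) -> list[str]:
        results: list[str] = []
        for step in plan(task):
            # Each step sees a trimmed context (task, step, last result)
            # rather than the full transcript -- the context-management half.
            prev = results[-1] if results else "(none)"
            results.append(call_llm(f"Task: {task}\nStep: {step}\nPrev: {prev}"))
        return results

The point is that the plan is fixed before any step runs, so the model can't wander mid-task, and each step's context stays small.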
That also feeds the discussion of general-purpose vs. workflow agent implementations: with a general-purpose agent, it's much harder to generalize all the components needed to structure effective ReAct patterns.
It's probably why workflow agents feel more reliable: they're built around structure, not just raw prediction.