"in this paper we primarily evaluate the LLM itself without external tool calls."
Maybe this is a factor?
No tools were used.