You had to go all the way and find it in the benchmark results that specifically stress test this.
You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools when I specifically said that it should be able to use tools. I'm not sure why are you present this as though it is a big gotcha.
I think my main point pretty much stands.
I found over 500 examples that fit your criteria. Embarrassing you were arguing in bad faith this whole time.