If you give it a benchmark script (or ask it to write one) so it has concrete numbers to go off of, it will do a better job.
I'm not saying these things don't hallucinate constantly; they do. But you can steer them toward better output by giving them better input.
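For what it's worth, the kind of benchmark script I mean can be tiny. Here's a rough sketch in Python using the standard `timeit` module. The two functions (`concat_loop`, `concat_join`) and the `bench` helper are made-up stand-ins, not anything from a real project; swap in whatever code you actually want optimized.

```python
import timeit


def concat_loop(n):
    # Hypothetical "before" version: builds a string with repeated +=.
    s = ""
    for i in range(n):
        s += str(i)
    return s


def concat_join(n):
    # Hypothetical "after" version: builds the same string with str.join.
    return "".join(str(i) for i in range(n))


def bench(fn, n=10_000, repeat=5, number=20):
    # Best-of-`repeat` wall time, in seconds, for `number` calls of fn(n).
    return min(timeit.repeat(lambda: fn(n), repeat=repeat, number=number))


if __name__ == "__main__":
    # Check both versions agree before comparing their speed.
    assert concat_loop(100) == concat_join(100)
    for fn in (concat_loop, concat_join):
        print(f"{fn.__name__}: {bench(fn):.4f} s")
```

Paste the printed timings back into the conversation along with the code, so any claimed speedup can be checked against a number instead of taken on faith.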