GPT-4o. I only tried a few samples on o1-preview, and the results were bad, though the sample was too small to be statistically significant.
Could you give an example?