How can we verify if you dont post the code?
I appreciate you testing, even though it's not a great comparison:
- My feedback cycle of LLM prompting forced me to be more explicit with each call, which benefited your prompt since I gave you exactly what to look for with fewer nuances.
- Maybe GPT 5.1 is old or kneecapped for newer versions of GPT
- Maybe Opus/Claud is just a way better model :P
Please post the code!
Edit: Regarding "exactly what to look for", when solving a new problem, rarely is all the nuance available for the first iteration.
I didn't prompt anything odd, just standard prompt "etiquette", actually I significantly prompted less than I would usually do, trying to do a simple prompt like you did.