I just tested this with Claude Code and Opus 4.6, with the following prompt:
"I have an arbitrary width rectangle that needs to be broken into smaller random width rectangles (maintaining depth) within a given min/max range. The solution needs to be highly performant from an algorithmic standpoint, well-tested using TDD and Red/Green testing, written in python, and not have any subtle errors."
It got the answer you ended up with (if I'm understanding you correctly) on the first attempt, in just over 2 minutes of work, and included a solid test suite covering edge cases and input validation.
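(For illustration only — this is not the code Claude produced, which isn't posted in this thread — here is a minimal sketch of one common approach to the problem in the prompt: pick a feasible piece count, then hand out the leftover width randomly while keeping every piece inside the min/max bounds. The function name, integer-width assumption, and overall structure are my own, not from either commenter.)

```python
import math
import random


def split_width(total, min_w, max_w, rng=None):
    """Split integer `total` into random integer pieces, each in [min_w, max_w]."""
    if min_w <= 0 or max_w < min_w:
        raise ValueError("require 0 < min_w <= max_w")
    rng = rng or random.Random()

    # Feasible piece counts n satisfy n * min_w <= total <= n * max_w.
    n_min = math.ceil(total / max_w)
    n_max = total // min_w
    if n_min > n_max:
        raise ValueError(f"no valid split of {total} with pieces in [{min_w}, {max_w}]")
    n = rng.randint(n_min, n_max)

    # Start every piece at min_w, then distribute the surplus piece by piece,
    # never exceeding max_w and always leaving enough headroom in the
    # remaining pieces to absorb whatever surplus is left.
    widths = [min_w] * n
    surplus = total - n * min_w
    headroom = max_w - min_w
    for i in range(n):
        later_headroom = (n - i - 1) * headroom
        extra = rng.randint(max(0, surplus - later_headroom),
                            min(surplus, headroom))
        widths[i] += extra
        surplus -= extra
    return widths  # sums to exactly `total`; O(n) time


if __name__ == "__main__":
    ws = split_width(100, 5, 20)
    assert sum(ws) == 100 and all(5 <= w <= 20 for w in ws)
    print(ws)
```

Note this sampler is O(n) but not uniform over all valid partitions (earlier pieces are drawn under slightly different constraints than later ones); the prompt didn't specify uniform randomness, so a simple feasible sampler like this is one plausible reading of "highly performant."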
How can we verify if you don't post the code?
I appreciate you testing, even though it's not a great comparison:
- My iterative feedback cycle of LLM prompting forced me to be more explicit with each call, which benefited your prompt: I had already distilled exactly what to look for, so far fewer nuances were left unstated.
- Maybe GPT 5.1 is simply older or kneecapped compared to newer GPT versions
- Maybe Opus/Claude is just a way better model :P
Please post the code!
Edit: Regarding "exactly what to look for": when solving a new problem, rarely is all the nuance available on the first iteration.