Honestly, I'm not even sure how much of the progress in the last 12 months was model improvement and how much was mainly harness improvement. It feels to me like I could’ve done the same stuff with 4 if I had been able to split every task into multiple subtasks with perfect prompts. So to me it could totally be that the recent improvements come from some inner harnessing going on, but then I ask myself: is it maybe the same with our own intelligence?