Nice!
I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised how poor it was, even at the first level.
I am happy to see another approach - and indeed, with much stronger results.
Yes, that was the post that inspired me to build this.
While I did implement a more comprehensive harness (with pathfinding tools etc.), the models themselves have improved significantly.