Hacker News

stared, today at 11:27 AM

Nice!

I remember "Baba is Eval" (https://fi-le.net/baba/), released 11 months ago, back when Claude Opus 4 was the strongest model. Back then, I was surprised by how poorly it performed, even on the first level.

I am happy to see another approach, and indeed one with much stronger results.


Replies

meffmadd, today at 12:09 PM

Yes, that was the post that inspired me to build this.

While I did implement a more comprehensive harness (with pathfinding tools, etc.), the models themselves have also improved significantly.