Hacker News

isomorphic_duck, today at 9:07 AM

No, the purpose was to create an (automated) test set in the first place. The author builds an LLM judge that scores the participating LLMs at test time. That would be why the author used the strongest model (Opus 4.7 at the time) as the judge.