
NitpickLawyer, today at 7:03 AM

This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.
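To make that loop concrete, here is a rough sketch of the sweep, not anything the benchmark itself ships. The `sweep` and `fake_evaluate` names, the argument lists, and the scoring are all made up for illustration; a real `evaluate` would run the benchmark against the fixed open model and return its score.

```python
from itertools import product


def sweep(evaluate, model, harnesses, prompts, strategies):
    """Score every harness/prompt/strategy combination against one fixed model."""
    results = []
    for harness, prompt, strategy in product(harnesses, prompts, strategies):
        # `evaluate` is a hypothetical callable: it should run the benchmark
        # for this configuration and return a numeric score.
        score = evaluate(model, harness, prompt, strategy)
        results.append((score, harness, prompt, strategy))
    # Best-scoring combination first.
    results.sort(key=lambda r: r[0], reverse=True)
    return results


def fake_evaluate(model, harness, prompt, strategy):
    # Placeholder scorer so the sketch runs; it is NOT a real benchmark call.
    return len(harness) + len(prompt) + len(strategy)
```

Because the model stays fixed, any movement in the ranking comes from the scaffolding around it, which is the point of the exercise.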