NitpickLawyer • today at 7:03 AM

This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.
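Rough sketch of what I mean, in Python. Everything here is made up for illustration: `run_benchmark` is a hypothetical stand-in that returns a fake, deterministic score, and the harness/prompt/strategy names are placeholders. Swap in a real eval run against a pinned open model.

```python
import itertools
import random


def run_benchmark(model, harness, prompt, strategy):
    """Hypothetical stand-in for a real benchmark run.

    Returns a fake score in [0, 1) that is deterministic per configuration,
    so the sweep below is reproducible. Replace with an actual eval call.
    """
    rng = random.Random(f"{model}|{harness}|{prompt}|{strategy}")
    return rng.random()


def sweep(model, harnesses, prompts, strategies):
    """Score every harness/prompt/strategy combination for one fixed model."""
    results = []
    for harness, prompt, strategy in itertools.product(harnesses, prompts, strategies):
        score = run_benchmark(model, harness, prompt, strategy)
        results.append(((harness, prompt, strategy), score))
    # Best configuration first.
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder names; the model stays fixed so only the scaffolding varies.
    ranking = sweep(
        model="some-open-model",
        harnesses=["harness-a", "harness-b"],
        prompts=["terse", "detailed"],
        strategies=["single-shot", "retry-with-feedback"],
    )
    for config, score in ranking:
        print(config, round(score, 3))
```

The point is just that the model is held constant, so any score change comes from the scaffolding around it. A real run would also want multiple seeds per configuration before reading anything into the ranking.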
