I'm def working on benchmarks for how much my own general harness improves task performance vs. the same model in a commodity setup. It's hard to do!
I will say that my current harness, https://github.com/cartazio/oh-punkin-pi, is a testbed for a bunch of 2nd-gen harness tech, largely optimized for reasoning LLMs only. The next one after this harness is gonna be epic.