
Topfi • yesterday at 10:58 PM • 0 replies

Great effort and a bit closer to my private evals than DeepSWE. I greatly appreciate the focus on false negatives and false positives, along with simply being far more focused on actual, mergeable-quality output over plain passing. Could see a lot of others adopt your list of metrics as a basis; they are very well defined and give solid coverage of everything one should want out of provided code, not just one or two narrow targets. Will incorporate a lot of these ideas into my own tests and polish some other parts where I had, somewhat unintentionally, already gone in a roughly similar direction.
