Hacker News

eranation today at 5:16 AM

Ran it on a subset of 10 of the 50 PRs in this benchmark: https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the precision causes the F1 to tank (~20%; if this holds on the full 50-PR sample, it would put it almost last, even lower than Kilo+Grok)
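For anyone checking the arithmetic, here is a minimal sketch of how the F1 figure follows from the two numbers above. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the function names are made up for illustration, and the inputs are the rounded ~12% and ~74% from this 10-PR subset, not the benchmark's raw counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positives_per_true_positive(precision: float) -> float:
    """Number of false positives reported for each correct finding."""
    if precision <= 0:
        raise ValueError("precision must be greater than zero")
    return (1 - precision) / precision


# Rounded figures from the 10-PR subset (approximate, not raw counts).
precision, recall = 0.12, 0.74

print(f"F1 = {f1_score(precision, recall):.1%}")
print(f"False positives per true positive = "
      f"{false_positives_per_true_positive(precision):.1f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low precision dominates even though recall is high. At ~12% precision, that also works out to roughly seven false positives for every correct finding.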


Replies

akie today at 5:47 AM

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

isabellehue today at 8:07 AM

[flagged]