Ran it on a subset of 10 of the 50 PRs in this benchmark

eranation • today at 5:16 AM

Ran it on a subset of 10 of the 50 PRs in this benchmark https://codereview.withmartian.com

- very good recall (~74%, i.e. it found a lot of the golden issues)

- not so good precision (~12%, i.e. lots of false positives)

- the precision causes the F1 to tank (~20%; if this holds on the full 50-PR sample, it would put it almost last, even below Kilo+Grok)
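
For anyone checking the arithmetic: F1 is the harmonic mean of precision and recall, so a low precision drags it down even when recall is high. A minimal sketch using the approximate figures quoted above (the function names are illustrative, and the inputs are the rounded ~12% and ~74%, not exact benchmark values):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positives_per_true_positive(precision: float) -> float:
    """How many false findings accompany each correct one at this precision."""
    return (1 - precision) / precision


# Approximate figures from the comment above (assumed, rounded values).
precision = 0.12
recall = 0.74

f1 = f1_score(precision, recall)
fp_ratio = false_positives_per_true_positive(precision)

print(f"F1 = {f1:.3f}")  # F1 = 0.207
print(f"false positives per true finding = {fp_ratio:.1f}")  # 7.3
```

With these inputs F1 comes out at roughly 0.21, consistent with the ~20% reported, and a precision of 12% works out to about seven false positives for every real issue found.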

akie • today at 5:47 AM

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

isabellehue • today at 8:07 AM

[flagged]
