Hacker News

lern_too_spel, today at 4:43 AM

This was an open-book test. The real problem with this study is that winning the most head-to-head preference tests is not the right metric. It doesn't matter much if two answers are both right and one is written a little better than the other. It matters quite a lot if one answer is right and the other is wrong.
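To make the distinction concrete, here is a rough sketch with entirely made-up numbers (none of this is from the study or the prior work). The `Comparison` record and both helper functions are hypothetical names used only for illustration. It shows how system A can win most head-to-head preferences while being wrong more often than system B.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """One hypothetical head-to-head trial between systems A and B."""
    a_correct: bool    # was A's answer right?
    b_correct: bool    # was B's answer right?
    a_preferred: bool  # did the grader prefer A's answer?


def win_rate_a(trials):
    """Fraction of trials in which A's answer was preferred."""
    return sum(t.a_preferred for t in trials) / len(trials)


def accuracy(trials, system):
    """Fraction of trials in which the given system ('a' or 'b') was right."""
    flags = [t.a_correct if system == "a" else t.b_correct for t in trials]
    return sum(flags) / len(flags)


# Invented data, chosen only to illustrate the argument:
trials = (
    # both right, A merely better written, so A is preferred
    [Comparison(True, True, True)] * 6
    # both wrong, A still preferred on style
    + [Comparison(False, False, True)] * 1
    # A wrong, B right, B preferred
    + [Comparison(False, True, False)] * 3
)

print(f"A win rate: {win_rate_a(trials):.2f}")
print(f"A accuracy: {accuracy(trials, 'a'):.2f}")
print(f"B accuracy: {accuracy(trials, 'b'):.2f}")
```

In this toy data A wins 7 of 10 preference tests but is right in only 6 of 10 trials, while B is right in 9 of 10. The three trials where exactly one answer is right all go to B, and those are the ones that matter.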

The authors point out that this other metric (whether an answer is right or wrong) was computed in prior work, but they incorrectly dismiss it as a worse metric than win percentage in head-to-head comparisons. The cited prior work shows that the models fare poorly on that metric: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938