
ALittleLight, today at 3:47 AM

The paper says each professor made a median of 200 comparisons. It also says they used only 2 models, because more models would require more comparisons, and that they chose Google models because Google was branded/advertised as education-focused. Where other models show up elsewhere, it's because they extended the main idea to those models, with LLMs as judges instead of human professors.


Replies

godelski, today at 3:50 AM

Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.

But is it a surprise law professors aren't great statisticians?
