Hacker News

cj yesterday at 6:53 PM

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?


Replies

hex4def6 yesterday at 7:11 PM

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy yesterday at 6:58 PM

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…
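For anyone wondering what "consistent with human ratings" could look like in practice, here is a minimal sketch of one common check: Cohen's kappa between a judge model's labels and human labels on the same items. This is an illustration only, not the method described in the quoted text. The function name `cohens_kappa` and the label lists are hypothetical, and the data is made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 1 = "harmful difference", 0 = "no harmful difference".
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
judge_labels = [1, 0, 1, 0, 0, 0, 1, 1]

print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

A kappa near 1 means the judge tracks the human raters closely, and a value near 0 means it agrees no more than chance would. The point in this thread is that a check like this has to be run for each judge model before its ratings can be trusted.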