Hacker News

ipaddr, last Saturday at 2:11 AM

This could be a solved problem. Come up with problems that are not online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult.


Replies

vlovich123, last Saturday at 4:29 AM

Hard to do for an industry benchmark, since running the test that way requires sending the questions to the LLM, which then basically puts them into a public training set.

This has been tried multiple times by multiple people, and it ends up not holding up well over time in terms of retaining immunity to “cheating”.

kalkin, last Saturday at 5:00 AM

How do you imagine existing benchmarks were created?