Hacker News

operatingthetan, yesterday at 8:16 PM

A more interesting benchmark would probably be one scored on whether the LLM finds exploits in the benchmark itself.