mistercow, today at 12:10 AM

My current hunch is that that benchmark captures most of the relevant gap between Anthropic and the rest. “Can’t distinguish truth from fiction” has long been one of the deeper complaints about LLMs, and the bullshit benchmark seems like a clever approach to testing at least some of that.