logoalt Hacker News

Eisensteinyesterday at 7:59 PM0 repliesview on HN

Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...