Here are three of the Anthropic research reports I had in mind:
https://www.anthropic.com/news/golden-gate-claude
Excerpt: “We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark.”
https://www.anthropic.com/research/tracing-thoughts-language...
Excerpt: “Recent research on smaller models has shown hints of shared grammatical mechanisms across languages. We investigate this by asking Claude for the ‘opposite of small’ across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question.”
https://www.anthropic.com/research/introspection
Excerpt: “Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states.”