Here are three of the Anthropic research reports I had in mind:

tkgally • today at 1:51 AM

https://www.anthropic.com/news/golden-gate-claude

Excerpt: “We found that there’s a specific combination of neurons in Claude’s neural network that activates when it encounters a mention (or a picture) of this most famous San Francisco landmark.”

https://www.anthropic.com/research/tracing-thoughts-language...

Excerpt: “Recent research on smaller models has shown hints of shared grammatical mechanisms across languages. We investigate this by asking Claude for the ‘opposite of small’ across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question.”

https://www.anthropic.com/research/introspection

Excerpt: “Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states.”
