Hacker News

semiquaver · yesterday at 10:44 PM

This capability was mentioned several times in a recent article about Anthropic; glad to see they are releasing this to the public! Feels like a meaningful step forward in interpretability. I never understood why people seem to believe the answer when they ask an AI "why did you do that?"


Replies

zozbot234 · yesterday at 11:02 PM

It's not really a capability; it's more like a very costly hack, and they make that very clear in the paper. Training two models (an encoder and a decoder) just to explain a single layer at a time is not that sensible. It's neat that you can generate so much readable text about how the LLM decodes partial input, and I suppose it gives you some extra debugging ability, but that's all there is to it.
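For a rough sense of the per-layer setup being described: a minimal sketch, assuming the encoder/decoder pair works something like a small autoencoder trained to reconstruct one layer's activations (the dimensions, data, and training loop here are purely illustrative, not the paper's actual method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one hidden layer's activations: 256 samples, 32 dims
# (purely synthetic data for illustration)
acts = rng.normal(size=(256, 32))

# One linear encoder/decoder pair, trained for this single layer only
d_hidden, d_code = 32, 8
W_enc = rng.normal(scale=0.1, size=(d_hidden, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_hidden))

def loss(acts, W_enc, W_dec):
    recon = acts @ W_enc @ W_dec
    return float(np.mean((recon - acts) ** 2))

lr = 0.05
initial = loss(acts, W_enc, W_dec)
for _ in range(300):
    code = acts @ W_enc                      # encode this layer's activations
    recon = code @ W_dec                     # decode back to activation space
    err = recon - acts                       # reconstruction error
    W_dec -= lr * code.T @ err / len(acts)   # gradient step on decoder
    W_enc -= lr * acts.T @ (err @ W_dec.T) / len(acts)  # gradient step on encoder

final = loss(acts, W_enc, W_dec)
print(initial, final)
```

The point of the sketch is the cost structure the comment complains about: every layer you want to explain needs its own trained pair like this, on top of the base model.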
