Hacker News

armchairhacker last Sunday at 7:56 PM

Any suggestions from this literature?


Replies

libraryofbabel last Sunday at 11:40 PM

The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.