Any suggestions from this literature?
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded inside an LLM's activations (e.g., their work on extracting interpretable features with sparse autoencoders).
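Not from those papers directly, but here's a minimal sketch of the underlying idea that a concept can correspond to a direction in activation space. Everything below (the dimensions, the synthetic "activations", the `concept_shift` vector) is made up for illustration; real interpretability work operates on actual model hidden states and uses richer techniques (like sparse autoencoders) than a single linear direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: d-dimensional "activations" for
# inputs that do / don't express some concept. In real work these would
# come from a model's residual stream; here they're Gaussian clusters.
d = 64
concept_shift = rng.normal(size=d)               # hypothetical concept direction
pos = rng.normal(size=(200, d)) + concept_shift  # concept present
neg = rng.normal(size=(200, d))                  # concept absent

# Difference-of-means probe: the simplest "concept = direction" estimate.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Threshold halfway between the mean projections of the two training sets.
threshold = ((pos @ direction).mean() + (neg @ direction).mean()) / 2

# Score held-out activations by projecting onto the direction.
test_pos = rng.normal(size=(50, d)) + concept_shift
test_neg = rng.normal(size=(50, d))
scores = np.concatenate([test_pos, test_neg]) @ direction
labels = np.array([1] * 50 + [0] * 50)

preds = (scores > threshold).astype(int)
print(f"probe accuracy: {(preds == labels).mean():.2f}")
```

If a simple probe like this separates the two classes well, that's evidence the concept is (at least partly) linearly readable from the activations, which is the kind of question that interpretability line of work digs into much more carefully.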