Any suggestions from this literature?
The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded inside an LLM's activations (e.g., their work on extracting interpretable features with sparse autoencoders).
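Not from those papers directly, but here's a minimal sketch of the underlying idea that a concept can correspond to a direction in activation space. Everything below (the dimensions, the synthetic "activations", the `concept_shift` vector) is made up for illustration; real interpretability work operates on actual model hidden states and uses richer techniques (like sparse autoencoders) than a single linear direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: d-dimensional "activations" for
# inputs that do / don't express some concept. In real work these would
# come from a model's residual stream; here they're Gaussian clusters.
d = 64
concept_shift = rng.normal(size=d)               # hypothetical concept direction
pos = rng.normal(size=(200, d)) + concept_shift  # concept present
neg = rng.normal(size=(200, d))                  # concept absent

# Difference-of-means probe: the simplest "concept = direction" estimate.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Threshold halfway between the mean projections of the two training sets.
threshold = ((pos @ direction).mean() + (neg @ direction).mean()) / 2

# Score held-out activations by projecting onto the direction.
test_pos = rng.normal(size=(50, d)) + concept_shift
test_neg = rng.normal(size=(50, d))
scores = np.concatenate([test_pos, test_neg]) @ direction
labels = np.array([1] * 50 + [0] * 50)

preds = (scores > threshold).astype(int)
print(f"probe accuracy: {(preds == labels).mean():.2f}")
```

If a simple probe like this separates the two classes well, that's evidence the concept is (at least partly) linearly readable from the activations, which is the kind of question that interpretability line of work digs into much more carefully.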