Hacker News

Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)

170 points by ydnyshhh | last Monday at 7:42 AM | 27 comments

Comments

bob1029 | yesterday at 7:32 AM

> Deep learning models produce their outputs using a series of transformations distributed across many computational units (artificial “neurons”). The field of mechanistic interpretability seeks to describe these transformations in human-understandable language.

This is the central theme behind why I find techniques like genetic programming so compelling: you get interpretability by default. A second-order effect seems to be that you can generalize from substantially less training data. The humans developing the model can look inside the box and set breakpoints, inspect memory, snapshot/restore state, follow the rabbit, etc.

The biggest tradeoff is that the search space over computer programs tends to be substantially more rugged. You can't use math tricks to shortcut the computation; you have to run every damn program end-to-end and measure its performance directly. However, linear program tapes execute very, very quickly on modern x86 CPUs. You can search through a billion programs with a high degree of statistical certainty in a few minutes. I believe we are at a point where some of the ideas from the 20th century are viable again.
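To make the idea concrete, here is a minimal sketch of what "run every program end-to-end" could look like for a linear program tape: a branch-free list of register-to-register instructions, scored by brute-force evaluation on a dataset. The instruction encoding, register-file size, and random-search loop are all illustrative assumptions, not anything from the comment (a real GP system would use mutation/crossover rather than pure random search, and compiled rather than interpreted execution).

```python
import random

# Illustrative instruction set: each tape entry is (op, dst, src1, src2)
# acting on a fixed register file. No branches, so runtime is O(len(tape)).
OPS = {
    0: lambda a, b: a + b,
    1: lambda a, b: a - b,
    2: lambda a, b: a * b,
    3: lambda a, b: a if b == 0 else a / b,  # protected division
}

def run_tape(tape, inputs, n_regs=8):
    """Execute every instruction in order; register 0 holds the output."""
    regs = [0.0] * n_regs
    regs[:len(inputs)] = inputs
    for op, dst, s1, s2 in tape:
        regs[dst] = OPS[op](regs[s1], regs[s2])
    return regs[0]

def random_tape(length, n_regs=8):
    """Sample a random program of fixed length."""
    return [(random.randrange(len(OPS)),
             random.randrange(n_regs),
             random.randrange(n_regs),
             random.randrange(n_regs)) for _ in range(length)]

def search(dataset, n_candidates=10_000, tape_len=16):
    """Brute-force search: score each candidate end-to-end on the data."""
    best, best_err = None, float("inf")
    for _ in range(n_candidates):
        tape = random_tape(tape_len)
        err = sum((run_tape(tape, x) - y) ** 2 for x, y in dataset)
        if err < best_err:
            best, best_err = tape, err
    return best, best_err
```

Because every candidate is just a flat array of small integers interpreted in a tight loop, this is the kind of workload that vectorizes and pipelines well, which is the point the comment is making about throughput on modern CPUs. The interpretability claim is also visible here: a winning tape is a short, inspectable list of arithmetic steps you can single-step through.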

ironbound | yesterday at 6:18 AM

For people new to this, maybe check out this video; it explains pretty quickly how the internals run: https://m.youtube.com/watch?v=UKcWu1l_UNw

In theory, if Anthropic puts research into the mechanics of the model's internals, we can get better returns in training and alignment.

somethingsome | yesterday at 7:25 AM

Is the PDF available somewhere?
