Hacker News

PaulDavisThe1st | last Tuesday at 4:50 PM | 1 reply

Thanks for that. I've read the two Lindsey papers before. I think these are all interesting, but they are also what used to be called "just-so stories". That is, they describe a way of understanding what the LLM is doing, but do not actually describe what the LLM is doing.

And this is OK and still quite interesting - we do it to ourselves all the time. Often it's the only way we have of understanding the world (or ourselves).

However, in the case of LLMs, which are tools that we have created from scratch, I think we can require a higher standard.

I don't personally think that any of these papers suggest that LLMs manipulate concepts. They do suggest that the internal representation after training is highly complex (superposition, in particular), and that when inputs are presented, it isn't unreasonable to talk about the observable behavior as if it involved represented concepts. It is a useful stance to take, similar to Dennett's intentional stance.

However, while this may turn out to be how a lot of human cognition works, I don't think it is the significant part of what is happening when we actively reason. Nor do I think it corresponds to what most people mean by "manipulate concepts".

The LLM, despite the presence of "features" that may correspond to human concepts, is relentlessly forward-driving: given these inputs, what is my output? Look at the description in the 3rd paper of the arithmetic example. This is not "manipulating concepts" - it's a trick that often gets to the right answer (just like many human tricks used for arithmetic, only somewhat less reliable). It is extremely different, however, from "rigorous" arithmetic - the stuff you learned somewhere between the ages of 5 and 12, perhaps - that always gives the right answer and involves no pattern matching, no inference, no approximations. The same thing can be said, I think, about every other example in all 4 papers, to some degree or another.

What I do think is true (and very interesting) is that it seems somewhere between possible and likely that a lot more human cognition than we've previously suspected uses similar mechanisms as these papers are uncovering/describing.


Replies

famouswaffles | last Tuesday at 6:58 PM

>That is, they describe a way of understanding what the LLM is doing, but do not actually describe what the LLM is doing.

I’m not sure what distinction you’re drawing here. A lot of mechanistic interpretability work is explicitly trying to describe what the model is doing in the most literal sense we have access to: identifying internal features/circuits and showing that intervening on them predictably changes behavior. That’s not “as-if” gloss; it’s a causal claim about internals.

If your standard is higher than “we can locate internal variables that track X and show they causally affect outputs in X-consistent ways,” what would count as “actually describing what it’s doing”?
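
To make the shape of that claim concrete, here's a deliberately toy sketch in numpy (mine, not from the papers): a hand-built two-layer network, a hypothetical hidden direction standing in for a probed feature, and an edit along it. Every name, weight, and number is invented; only the form of the argument is the point.

    # Toy sketch of an interventionist interpretability claim. Nothing is
    # trained here; we hand-build a tiny "model" and then edit its hidden
    # activations along one direction to see the output move.
    import numpy as np

    rng = np.random.default_rng(0)

    d_in, d_hid = 8, 16
    W1 = rng.normal(size=(d_hid, d_in))
    w_out = rng.normal(size=d_hid)

    def forward(x, edit=None):
        h = np.tanh(W1 @ x)          # hidden activations
        if edit is not None:
            h = edit(h)              # causal intervention on internals
        return w_out @ h             # scalar stand-in for "behavior"

    x = rng.normal(size=d_in)

    # Pretend probing had identified this direction as tracking feature X.
    feature_dir = W1[:, 0] / np.linalg.norm(W1[:, 0])

    ablate = lambda h: h - (h @ feature_dir) * feature_dir  # remove the feature
    boost  = lambda h: h + 3.0 * feature_dir                # add more of it

    print("baseline:", forward(x))
    print("ablated :", forward(x, ablate))
    print("boosted :", forward(x, boost))
    # The published claims have this shape: edits along the identified
    # direction shift behavior in the predicted way, and real work also
    # runs controls (random directions, unrelated features).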

>However, in the case of LLMs, which are tools that we have created from scratch, I think we can require a higher standard.

This is backwards. We don’t “create them from scratch” in the sense relevant to interpretability. We specify an architecture template and a training objective, then we let gradient descent discover a huge, distributed program. The “program” is not something we wrote or understand. In that sense, we’re in a similar epistemic position as neuroscience: we can observe behavior, probe internals, and build causal/mechanistic models, without having full transparency.

So what does “higher standard” mean here, concretely? If you mean “we should be able to fully enumerate a clean symbolic algorithm,” that’s not a standard we can meet even for many human cognitive skills, and it’s not obvious why that should be the bar for “concept manipulation.”
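
To be concrete about which part we actually author: in a toy next-token setup, the lines a human writes are the template and the loss, and the weights that fall out of the loop are the part nobody wrote. Everything below is invented for illustration (a bigram model on fake tokens), not a claim about any real LLM.

    # The division of labor described above, in miniature: we supply an
    # architecture template and an objective; gradient descent supplies W.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = 5
    data = rng.integers(0, vocab, size=1000)      # fake token stream

    W = np.zeros((vocab, vocab))                  # the learned "program"

    def loss_and_grad(W):
        prev, nxt = data[:-1], data[1:]
        logits = W[prev]                          # template: logits = W[prev_token]
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
        loss = -np.log(p[np.arange(len(nxt)), nxt]).mean()
        p[np.arange(len(nxt)), nxt] -= 1.0        # dL/dlogits (softmax minus one-hot)
        g = np.zeros_like(W)
        np.add.at(g, prev, p / len(nxt))
        return loss, g

    for step in range(200):                       # the part we do NOT write is W itself
        loss, g = loss_and_grad(W)
        W -= 1.0 * g

    print("final loss:", loss)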

>I don't personally think that any of these papers suggest that LLMs manipulate concepts. They do suggest that the internal representation after training is highly complex (superposition, in particular), and that when inputs are presented, it isn't unreasonable to talk about the observable behavior as if it involved represented concepts. It is useful stance to take, similar to Dennett's intentional stance.

You start with “there is no representation of a concept,” but then concede “features that may correspond to human concepts.” If those features are (a) reliably present across contexts, (b) abstract over surface tokens, and (c) participate causally in producing downstream behavior, then that is a representation in the sense most people mean in cognitive science. One of the most frustrating things about these sorts of discussions is the meaningless semantic games and goalpost shifting.

>The LLM, despite the presence of "features" that may correspond to human concepts, is relentlessly forward-driving: given these inputs, what is my output?

Again, that’s a description of the objective, not the internal computation. The fact that the training loss is next-token prediction doesn’t imply the internal machinery is only “token-ish.” Models can and do learn latent structure that’s useful for prediction: compressed variables, abstractions, world regularities, etc. Saying “it’s just next-token prediction, therefore no real concepts” is like saying “humans are just maximizing inclusive genetic fitness, therefore no real concepts.” Goal ≠ mechanism.
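
The standard way to cash out "latent structure" is a probe: fit a simple read-out on hidden activations for a property that never appears as a token, and check whether it's linearly decodable (then, ideally, intervene on it). Here's a schematic of just the probing step on synthetic activations I made up; real work does this on actual model internals.

    # Schematic of a linear probe: is a latent variable z (never a token)
    # linearly readable from "hidden states"? Everything here is synthetic.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 2000, 64
    z = rng.integers(0, 2, size=n)                # latent property
    direction = rng.normal(size=d)
    acts = rng.normal(size=(n, d)) + np.outer(2 * z - 1, direction)  # fake activations

    # Logistic-regression probe trained by plain gradient descent.
    w, b = np.zeros(d), 0.0
    for _ in range(500):
        p = 1 / (1 + np.exp(-(acts @ w + b)))
        grad_w = acts.T @ (p - z) / n
        grad_b = (p - z).mean()
        w -= 0.5 * grad_w
        b -= 0.5 * grad_b

    acc = ((acts @ w + b > 0) == z).mean()
    print(f"probe accuracy: {acc:.2f}")  # far above chance -> z is linearly present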

> Look at the description in the 3rd paper of the arithmetic example. This is not "manipulating concepts" - it's a trick that often gets to the right answer

Two issues:

1. “Heuristic / approximate” doesn’t mean “not conceptual.” Humans use heuristics constantly, including in arithmetic. Concept manipulation doesn’t require perfect guarantees; it requires that internal variables encode and transform abstractions in ways that generalize.

2. Even if a model is using a “trick,” it can still be doing so by operating over internal representations that correspond to quantities, relations, carry-like states, etc. “Not a clean grade-school algorithm” is not the same as “no concepts.”

>Rigorous arithmetic… always gives the right answer and involves no pattern matching, no inference…

“Rigorous arithmetic” is a great example of a reliable procedure, but reliability doesn’t define “concept manipulation.” It’s perfectly possible to manipulate concepts using approximate, distributed representations, and it’s also possible to follow a rigid procedure with near-zero understanding (e.g., executing steps mechanically without grasping place value).

So if the claim is “LLMs don’t manipulate concepts because they don’t implement the grade-school algorithm,” that’s just conflating one particular human-taught algorithm with the broader notion of representing and transforming abstractions.
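
For what it's worth, the contrast you're drawing is easy to write down. A minimal sketch (my own toy heuristic, not the mechanism described in the paper): the grade-school carry procedure next to a crude leading-digit approximation. Nothing about either snippet settles whether the system running it "has" the concept of number.

    # A rigid procedure that is always exact, beside a heuristic that
    # merely tends to land close. Both are toy illustrations of the
    # distinction under discussion.
    def grade_school_add(a: str, b: str) -> str:
        """Digit-by-digit addition with carries, as taught in school."""
        a, b = a[::-1], b[::-1]
        out, carry = [], 0
        for i in range(max(len(a), len(b))):
            da = int(a[i]) if i < len(a) else 0
            db = int(b[i]) if i < len(b) else 0
            carry, digit = divmod(da + db + carry, 10)
            out.append(str(digit))
        if carry:
            out.append(str(carry))
        return "".join(reversed(out))

    def rough_add(a: str, b: str) -> int:
        """Heuristic: keep only each operand's leading digit, then add."""
        lead = lambda s: int(s[0]) * 10 ** (len(s) - 1)
        return lead(a) + lead(b)

    print(grade_school_add("36", "59"))  # 95, always exact
    print(rough_add("36", "59"))         # 80, merely in the neighborhood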
