>Even if interpretability of specific models or features within them is an open area of research, the mechanics of how LLMs work to produce results are observable and well-understood, and methods to understand their fundamental limitations are pretty solid these days as well.
If you train a transformer on (only) lots and lots of addition pairs, e.g. '38393 + 79628 = 118021', and nothing else, the transformer will, during training, discover an algorithm for addition and employ it in service of predicting the next token, which in this case is the sum of the two numbers.
We know this because of tedious interpretability research, the very limited problem space, and the fact that we knew exactly what to look for.
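If it helps to picture the setup, here's a rough sketch of it (PyTorch, made-up hyperparameters, a model far too small and a run far too short to actually learn the algorithm; it's only the shape of the experiment, not the actual interpretability work):

```python
import random
import torch
import torch.nn as nn

# Toy version of the setup: a small causal transformer trained on nothing but
# strings like "38393 + 79628 = 118021", predicting the next character.
VOCAB = "0123456789+= "                      # space doubles as padding
stoi = {c: i for i, c in enumerate(VOCAB)}
MAXLEN = 22                                  # length of "99999 + 99999 = 199998"

def make_example():
    a, b = random.randint(0, 99999), random.randint(0, 99999)
    s = f"{a} + {b} = {a + b}"
    return s + " " * (MAXLEN - len(s))

def encode(s):
    return torch.tensor([stoi[c] for c in s])

class TinyLM(nn.Module):
    def __init__(self, d=64, heads=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), d)
        self.pos = nn.Embedding(MAXLEN, d)
        layer = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, len(VOCAB))

    def forward(self, x):                    # x: (batch, seq)
        seq = x.size(1)
        h = self.emb(x) + self.pos(torch.arange(seq, device=x.device))
        # causal mask: each position can only attend to earlier positions
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=causal))

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(1000):
    batch = torch.stack([encode(make_example()) for _ in range(64)])
    logits = model(batch[:, :-1])            # predict token t+1 from tokens <= t
    loss = loss_fn(logits.reshape(-1, len(VOCAB)), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(step, round(loss.item(), 3))
```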
Alright, let's leave addition aside (SOTA LLMs are, after all, trained on much more) and think about another question. Any other question at all. How about something like:
"Take a capital letter J and a right parenthesis, ). Take the parenthesis, rotate it counterclockwise 90 degrees, and put it on top of the J. What everyday object does that resemble?"
What algorithm does GPT or Gemini or whatever employ to answer this and similar questions correctly? It's certainly not the one it learnt for addition. Do you know? No. Do the creators at OpenAI or Google know? Not at all. Can you or they find out right now? Also no.
Let's revisit your statement.
"the mechanics of how LLMs work to produce results are observable and well-understood".
Observable, I'll give you that, but how on earth can you look at the above and sincerely call that 'well-understood'?
From Gemini: "When you take those two shapes and combine them, the resulting image looks like an umbrella."
It's pattern matching, likely from typography texts and descriptions of umbrellas. My understanding is that the model can attempt some permutations in its thinking, and eventually one permutation's tokens catch enough attention to drive an answer: once it is attending to "everyday object", "arc", and "hook", it will reply with "umbrella".
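If "catch enough attention" sounds hand-wavy, this is the mechanism I'm gesturing at, bare scaled dot-product attention over a handful of tokens. Random made-up vectors, so it says nothing about any real model's weights:

```python
import torch
import torch.nn.functional as F

# Each query token ends up with a probability distribution (attention weights)
# over the other tokens; its output is a weighted mix of their values.
torch.manual_seed(0)
tokens = ["everyday", "object", "arc", "hook", "J", ")"]
d = 8
q = torch.randn(len(tokens), d)              # queries, one per token
k = torch.randn(len(tokens), d)              # keys
v = torch.randn(len(tokens), d)              # values

scores = q @ k.T / d ** 0.5                  # query/key similarity
weights = F.softmax(scores, dim=-1)          # each row sums to 1
out = weights @ v                            # per-token weighted mix of values

print(dict(zip(tokens, weights[0].tolist())))  # how much "everyday" attends to each token
```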
Why am I confident that it's not actually doing spatial reasoning? At least in the case of Claude Opus 4.6, it also confidently replies "umbrella" even when you tell it to put the parenthesis under the J, complete with a handy diagram that clearly proves itself wrong: https://claude.ai/share/497ad081-c73f-44d7-96db-cec33e6c0ae3. Here's me specifically asking about the three key points above: https://claude.ai/share/b529f15b-0dfe-4662-9f18-97363f7971d1
I feel like I have a pretty good intuition for what's happening here, based on my understanding of the underlying mathematical mechanics.
Edit: I poked at it a little longer and was able to get some more specific matches to source material tying the concept of an umbrella to drawings made from the letter J: https://claude.ai/share/f8bb90c3-b1a6-4d82-a8ba-2b8da769241e