logoalt Hacker News

ainiriandtoday at 2:34 PM14 repliesview on HN

Are not LLMs supposed to just find the most probable word that follows next like many people here have touted? How this can be explained under that pretense? Is this way of problem solving 'thinking'?


Replies

dilaptoday at 3:10 PM

That description is really only fair for base models†. Something like Opus 4.6 has all kinds of other training on top of that which teach it behaviors beyond "predict most probable token," like problem-solving and being a good chatbot.

(†And even then is kind of overly-dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting] your model will have to in fact get good at thinking like a mathematician. In general "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)

show 2 replies
throw310822today at 4:02 PM

> just find the most probable word that follows next

Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.

This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique, there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predicting what intelligence would do, which is the same as being intelligent.

show 1 reply
sega_saitoday at 4:09 PM

In some sense that is still correct, i.e. the words are taken from some probability distribution conditional on previous words, but the key point is that probability distribution is not just some sort of average across the internet set of word probabilities. In the end this probability distribution is really the whole point of intelligence. And I think the LLMs are learning those.

IgorPartolatoday at 2:42 PM

In some cases solving a problem is about restating the problem in a way that opens up a new path forward. “Why do planets move around the sun?” vs “What kind of force exists in the world that makes planets tethered to the sun with no visible leash?” (Obviously very simplified but I hope you can see what I am saying.) Given that a human is there to ask the right questions it isn’t just an LLM.

Further, some solutions are like running a maze. If you know all the wrong turns/next words to say and can just brute force the right ones you might find a solution like a mouse running through the maze not seeing the whole picture.

Whether this is thinking is more philosophical. To me this demonstrates more that we are closer to bio computers than an LLM is to having some sort of divine soul.

show 1 reply
tux3today at 3:03 PM

>Are not LLMs supposed to just find the most probable word that follows next like many people here have touted?

The base models are trained to do this. If a web page contains a problem, and then the word "Answer: ", it is statistically very likely that what follows on that web page is an answer. If the base model wants to be good at predicting text, at some point learning the answer to common question becomes a good strategy, so that it can complete text that contains these.

NN training tries to push models to generalize instead of memorizing the training set, so this creates an incentive for the model to learn a computation pattern that can answer many questions, instead of just memorizing. Whether they actually generalize in practice... it depends. Sometimes you still get copy-pasted input that was clearly pulled verbatim from the training set.

But that's only base models. The actual production LLMs you chat with don't predict the most probable word according to the raw statistical distribution. They output the words that RLHF has rewarded them to output, which includes acting as an assistant that answers questions instead of just predicting text. RLHF is also the reason there are so many AI SIGNS [1] like "you're absolutely right" and way more use of the word "delve" than is common in western English.

[1]: https://en.wikipedia.org/wiki/WP:AISIGNS

qseratoday at 3:44 PM

Yes, that is exactly what they do.

But that does not mean that the results cannot be dramatic. Just like stacking pixels can result in a beautiful image.

vjerancrnjaktoday at 5:23 PM

No. There is good signal in IMO gold medal performance.

These models actually learn distributed representations of nontrivial search algorithms.

A whole field of theorem provingaftwr decades of refinements couldn’t even win a medal yet 8B param models are doing it very well.

Attention mechanism, a bruteforce quadratic approach, combined with gradient descent is actually discovering very efficient distributed representations of algorithms. I don’t think they can even be extracted and made into an imperative program.

adamtaylor_13today at 4:27 PM

That's the way many people reduce it, and mathematically, I think that's true. I think what we fail to realize is just far that will actually take you.

"just the most probable word" is a pretty powerful mechanism when you have all of human knowledge at your fingertips.

I say that people "reduce it" that way because it neatly packs in the assumption that general intelligence is something other than next token prediction. I'm not saying we've arrived at AGI, in fact, I do not believe we have. But, it feels like people who use that framing are snarkily writing off something that they themselves to do not fully comprehend behind the guise of being "technically correct."

I'm not saying all people do this. But I've noticed many do.

crocowhiletoday at 3:36 PM

Those people still exist? I only know one guy who is still fighting those windmills

show 2 replies
wrsh07today at 3:28 PM

Imagine training a chess bot to predict a valid sequence of moves or valid game using the standard algebraic notation for chess

Great! It will now correctly structure chess games, but we've created no incentive for it to create a game where white wins or to make the next move be "good"

Ok, so now you change the objective. Now let's say "we don't just want valid games, we want you to predict the next move that will help that color win"

And we train towards that objective and it starts picking better moves (note: the moves are still valid)

You might imagine more sophisticated ways to optimize picking good moves. You continue adjusting the objective function, you might train a pool of models all based off of the initial model and each of them gets a slightly different curriculum and then you have a tournament and pick the winningest model. Great!

Now you might have a skilled chess-playing-model.

It is no longer correct to say it just finds a valid chess program, because the objective function changed several times throughout this process.

This is exactly how you should think about LLMs except the ways the objective function has changed are significantly significantly more complicated than for our chess bot.

So to answer your first question: no, that is not what they do. That is a deep over simplification that was accurate for the first two generations of the models and sort of accurate for the "pretraining" step of modern llms (except not even that accurate, because pretraining does instill other objectives. Almost like swapping our first step "predict valid chess moves" with "predict stockfish outputs")

noslenwerdnatoday at 5:30 PM

I find this kind of reduction silly.

All your brain is doing is bouncing atoms off each other, with some occasionally sticking together, how can it be really thinking?

See how silly it sounds?

esafaktoday at 3:03 PM

Are you feigning ignorance? The best way to answer a question, like completing a sentence, is through reasoning; an emergent behavior in complex models.

adampunktoday at 4:15 PM

Thinking is a big word that sweeps up a lot of different human behavior, so I don't know if it's right to jump to that; HOWEVER, explanations of LLMs that depend heavily on next-token prediction are defunct. They stopped being fundamentally accurate with the rise of massive reinforcement learning and w/ 'reasoning' models the analogy falls apart when you try to do work with it.

Be on the lookout for folks who tell you these machines are limited because they are "just predicting the next word." They may not know what they're talking about.