Everyone is out here acting like "predicting the next thing" is somehow fundamentally irrelevant to "human thinking" and it is simply not the case.
What does it mean to say that we humans act with intent? It means that we have some expectation or prediction about how our actions will affect the next thing, and we choose our actions based on how much we like that effect. The ability to predict is fundamental to our ability to act intentionally.
So in my mind: even if you grant all the AI-naysayers' complaints about how LLMs aren't "actually" thinking, you can still believe that they will end up being a component in a system which actually "does" think.
> If we allow ourselves to be seduced by the superficial similarity, we’ll end up like the moths who evolved to navigate by the light of the moon, only to find themselves drawn to—and ultimately electrocuted by—the mysterious glow of a bug zapper.
Woah, that hit hard
Every day I see people treat gen AI like a thinking human, and Dijkstra's attitude about anthropomorphizing computers is vindicated even more.
That said, I think the author's use of "bag of words" here is a mistake. Not only does the term already have a real meaning in a field adjacent to LLMs, but I don't think the metaphor explains anything. Gen AI tricks laypeople into treating its token inferences as "thinking" because it is trained to replicate the semiotic appearance of doing so. A "bag of words" doesn't sufficiently explain this behavior.
Slightly unfortunate that "Bag of words" is already a different concept: https://en.wikipedia.org/wiki/Bag_of_words.
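For anyone who hasn't met the original concept: the classic bag-of-words model just counts word occurrences and discards order entirely. A minimal sketch (the example sentence is made up for illustration):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Classic bag-of-words: lowercase, split on whitespace, count occurrences.
    # All word order and grammar are thrown away; only frequencies remain.
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

The whole point of that model is that order doesn't matter, which is roughly the opposite of what an autoregressive LLM does, so the name collision is genuinely confusing.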
My second thought is that it's not the metaphor that is misleading. People have been told thousands of times that LLMs don't "think", don't "know", don't "feel", but are "just a very impressive autocomplete". If they still really want to completely ignore that, why would they suddenly change their mind with a new metaphor?
Humans are lazy. If it looks true enough and it costs less effort, humans will love it. "Are you sure the LLM did your job correctly?" is completely irrelevant: people couldn't care less if it's correct or not. As long as the employer believes that the employee is "doing their job", that's good enough. So the question is really: "do you think you'll get fired if you use this?". If the answer is "no, actually I may even look more productive to my employer", then why would people not use it?
Considering the number of "brain cells" an LLM has, I could grant that it might have the self-awareness of (say) an ant. If we attribute more consciousness than that to the LLM, it might be strictly because it communicates with us in our own language, thanks in part to LLM training giving it a voice and the semblance of thought.
Even if a cockroach _could_ express its teeny tiny feelings in English, wouldn't you still step on it?
As usual with these, it helps to try to keep the metaphor used for downplaying AI, but flip the script. Let's grant the author's perception that AI is a "bag of words", which is already damn good at producing the "right words" for any given situation, and only keeps getting better at it.
Sure, this is not the same as being a human. Does that really mean, as the author seems to believe without argument, that humans need not be afraid that it will usurp their role? In how many contexts is the utility of having a human, if you squint, not just that a human has so far been the best way to "produce the right words in any given situation", that is, to use the meat-bag only in its capacity as a word-bag? In how many more contexts would a really good magic bag of words be better than a human, if it existed, even if the current human is used somewhat differently? The author seems to rest assured that a human (long-distance?) lover will not be replaced by a "bag of words"; why, especially once the bag of words is also duct-taped to a bag of pictures and a bag of sounds?
I can just imagine someone - a horse breeder, or an anthropomorphised horse - dismissing all concerns on the eve of the automotive revolution, talking about how marketers and gullible marks are prone to hippomorphising anything that looks like it can be ridden, and then some, sprinkling in anecdotes about kids riding broomsticks, legends of pegasi, and patterns of stars in the sky being interpreted as horses since ancient times.
I am unsure myself whether we should regard LLMs as mere token-predicting automatons or as some new kind of incipient intelligence. Despite their origins as statistical parrots, the interpretability research from Anthropic [1] suggests that structures corresponding to meaning do exist inside those bundles of numbers, and that there are signs of activity within them that seem analogous to thought.
That said, I was struck by a recent interview with Anthropic’s Amanda Askell [2]. When she talks, she anthropomorphizes LLMs constantly. A few examples:
“I don't have all the answers of how should models feel about past model deprecation, about their own identity, but I do want to try and help models figure that out and then to at least know that we care about it and are thinking about it.”
“If you go into the depths of the model and you find some deep-seated insecurity, then that's really valuable.”
“... that could lead to models almost feeling afraid that they're gonna do the wrong thing or are very self-critical or feeling like humans are going to behave negatively towards them.”
[1] https://www.anthropic.com/research/team/interpretability
> “Bag of words” is also a useful heuristic for predicting where an AI will do well and where it will fail. “Give me a list of the ten worst transportation disasters in North America” is an easy task for a bag of words, because disasters are well-documented. On the other hand, “Who reassigned the species Brachiosaurus brancai to its own genus, and when?” is a hard task for a bag of words, because the bag just doesn’t contain that many words on the topic.
It is... such a retrospective narrative. It's so obvious that the author learned about this example first and then came up with the reasoning later, just to fit it into his view of LLMs.
Imagine if ChatGPT answered this question correctly. Would that change the author's view? Of course not! They'd just say:
> “Bag of words” is also a useful heuristic for predicting where an AI will do well and where it will fail. “Who reassigned the species Brachiosaurus brancai to its own genus, and when?” is an easy task for a bag of words, because the information appears among the words it has memorized.
I highly doubt this author predicted that a "bag of words" could do image editing before OpenAI released that.
An LLM creates a high-fidelity statistical, probabilistic model of human language. The hope is to capture the input/output of the various hierarchical formal and semi-formal systems of logic that pass from human to human, which we know as "intelligence".
Unfortunately, its corpus is bound to contain noise and nonsense that follows no formal reasoning system but contributes to the ill-advised idea that an AI should sound like a human to be considered intelligent. So it is not a bag of words but perhaps a bag of probabilities. This matters because the fundamental problem is that an LLM is not able, by design, to correctly model the most fundamental precept of human reason, namely the law of non-contradiction. An LLM must, I repeat must, assign non-vanishing probability to both sides of a contradiction, and what's worse is that the winning side loses: since long chains of reasoning are modelled probabilistically, the longer the chain, the less likely an LLM is to follow it. Moreover, whenever there is actual debate on an issue, such that the corpus is ambiguous, the LLM necessarily becomes chaotic on that issue.
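The "longer the chain, the less likely" point is just multiplicative decay. A toy sketch (the per-step probability of 0.95 is an invented number, not measured from any model, and it assumes step errors are independent):

```python
# Toy illustration: if each reasoning step is followed correctly with
# probability p, an n-step chain is followed end-to-end with probability p**n.
p = 0.95  # assumed per-step reliability (made up for illustration)

for n in (1, 5, 10, 20, 50):
    print(f"{n:2d} steps: P(whole chain) = {p**n:.3f}")

#  1 steps: P(whole chain) = 0.950
#  5 steps: P(whole chain) = 0.774
# 10 steps: P(whole chain) = 0.599
# 20 steps: P(whole chain) = 0.358
# 50 steps: P(whole chain) = 0.077
```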
I literally just had an AI prove the foregoing with some rigor, and in the very next prompt I asked it to check my logical reasoning for consistency and it claimed it was able to do so (->|<-).
> But we don’t go to baseball games, spelling bees, and Taylor Swift concerts for the speed of the balls, the accuracy of the spelling, or the pureness of the pitch. We go because we care about humans doing those things. It wouldn’t be interesting to watch a bag of words do them—unless we mistakenly start treating that bag like it’s a person.
That seems to be the marketing strategy of some very big, now AI-dependent companies, with Sam Altman and others exaggerating and distorting the capabilities and future of AI. The biggest issue when it comes to AI is still the same as with other technology: it matters who controls it. Attributing agency and personality to AI is a dangerous red flag.
As a consequence of my profession, I understand how LLMs work under the hood.
I also know that we data and tech folks will probably never win the battle over anthropomorphization.
The average user of AI, never mind folks who should know better, is so easily convinced that AI "knows," "thinks," "lies," "wants," "understands," etc. Add to this that all the AI hosts push this perspective (and why not: it's the easiest white lie to get the user to act, so that they get a lot of value), and there's really too much to fight against.
We're just gonna keep running into this, and it'll be like when you take chemistry and physics and the teachers say, "it's not actually like this, but we'll get to how it really works some years down the line; just pretend this is true for the time being."
The problem with these metaphors is that they don't really explain anything. LLMs can solve countless problems today that we would previously have said were impossible because there are not enough examples in the training data (e.g., novel IMO/ICPC problems). One way we move the goalposts is to increase the level of abstraction: IMO/ICPC problems are just math problems, right? There are tons of those in the data set!
But the truth is there has been a major semantic shift. Previously, LLMs could only solve puzzles whose answers were literally in the training data. An LLM could answer a math puzzle it had seen before, but if you rephrased it even slightly it could no longer answer.
But now, LLMs can solve puzzles where, like, it has seen a certain strategy before. The newest IMO and ICPC problems were only "in the training data" for a very, very abstract definition of training data.
The goal posts will likely have to shift again, because the next target is training LLMs to independently perform longer chunks of economically useful work, interfacing with all the same tools that white-collar employees do. It's all LLM slop til it isn't, same as the IMO or Putnam exam.
And then we'll have people saying that "white collar employment was all in the training data anyway, if you think about it," at which point the metaphor will have become officially useless.
The bag of words reminds me of the Chinese room.
"The machine accepts Chinese characters as input, carries out each instruction of the program step by step, and then produces Chinese characters as output. The machine does this so perfectly that no one can tell that they are communicating with a machine and not a hidden Chinese speaker.
The questions at issue are these: does the machine actually understand the conversation, or is it just simulating the ability to understand the conversation? Does the machine have a mind in exactly the same sense that people do, or is it just acting as if it had a mind?"
I think a better metaphor is the Library of Babel.
A practically infinite library where both gibberish and truth exist side by side.
The trick is navigating the library correctly. Except in this case you can’t reliably navigate it. And if you happen to stumble upon some “future truth” (i.e. new knowledge), you still need to differentiate it from the gibberish.
So a “crappy” version of the Library of Babel. Very impressive, but the caveats significantly detract from it.
Title is confusing given https://en.wikipedia.org/wiki/Bag-of-words_model
But even more than that, today’s AI chats are far more sophisticated than probabilistically producing the next word. Mixture-of-experts routing sends tokens to different expert subnetworks. Agents are able to search the web, write and execute programs, or use other tools. This means they can actively seek out additional context to produce a better answer. They also have heuristics for deciding whether an answer is correct or whether they should use tools to try to find a better one.
The article is correct that they aren’t humans and they have a lot of behaviors that are not like humans, but oversimplifying how they work is not helpful.
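For readers who haven't seen one, the agent part is basically a loop around the model: the model either answers or asks for a tool call, the harness runs the tool, and the result goes back into the context. A minimal sketch, where `call_llm` and `search_web` are hypothetical stand-ins (hard-coded here, not a real SDK):

```python
# Minimal agent-loop sketch. call_llm and search_web are hypothetical
# stand-ins for a real model endpoint and a real search tool.
def call_llm(messages):
    # Pretend the model asks for one search, then answers using the result.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool", "query": messages[0]["content"]}
    return {"type": "answer", "content": "answer based on: " + messages[-1]["content"]}

def search_web(query):
    return f"top search snippet for '{query}'"

def run_agent(question, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply["type"] == "answer":
            return reply["content"]                      # model is done
        result = search_web(reply["query"])              # run the requested tool
        messages.append({"role": "tool", "content": result})
    return "gave up after max_steps"

print(run_agent("Who reassigned Brachiosaurus brancai to its own genus?"))
```

The point is that the loop, not the bare next-word prediction, is what users actually interact with in these products.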
Here’s my suggestion: instead of seeing AI as a sort of silicon homunculus, we should see it as a bag of words.
The best way to think about LLMs is to think of them as a Model of Language, but very Large
This is essentially Lady Lovelace's objection from the 19th century [1]. Turing addressed this directly in "Computing Machinery and Intelligence" (1950) [2], and implicitly via the halting problem in "On Computable Numbers" (1936) [3]. Later work on cellular automata, famously Conway's Game of Life [4], demonstrates more conclusively that this framing fails as a predictive model: simple rules produce structures no one "put in."
A test I did myself was to ask Claude (The LLM from Anthropic) to write working code for entirely novel instruction set architectures (e.g., custom ISAs from the game Turing Complete [5]), which is difficult to reconcile with pure retrieval.
[1] Lovelace, A. (1843). Notes by the Translator, in Scientific Memoirs Vol. 3. ("The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.") Primary source: https://en.wikisource.org/wiki/Scientific_Memoirs/3/Sketch_o.... See also: https://www.historyofdatascience.com/ada-lovelace/ and https://writings.stephenwolfram.com/2015/12/untangling-the-t...
[2] https://academic.oup.com/mind/article/LIX/236/433/986238
[3] https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf
[4] https://web.stanford.edu/class/sts145/Library/life.pdf
[5] https://store.steampowered.com/app/1444480/Turing_Complete/
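To make the cellular-automaton point concrete, here is a tiny Game of Life step function; this is my own illustrative sketch, not from any of the cited sources. Nothing in the rule mentions movement, yet the glider pattern below re-forms, displaced diagonally, every four generations:

```python
from collections import Counter

# Conway's Game of Life: a live cell survives with 2 or 3 live neighbours,
# and a dead cell becomes live with exactly 3 live neighbours.
def step(live):  # live: set of (x, y) coordinates of live cells
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# A glider: five cells whose pattern recurs, shifted, every four generations.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
for _ in range(4):
    glider = step(glider)
print(sorted(glider))  # the same shape, displaced by one cell diagonally
```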
But the issue is, 99.999% of humans won't see it as a bag of words, because it is easier to go by instinct and see it as a person, and to assume that it actually knows about magic tricks, can invent new science or a theory of everything, and can solve all the world's problems. Back in the '90s and early 2000s I saw people writing poems praying to and seeking blessings from the Google goddess. People are insanely greedy and instinct-driven. Given this truth, what's the fallout?
Is a brain not a token prediction machine?
Tokens in form of neural impulses go in, tokens in the form of neural impulses go out.
We would like to believe that there is something profound happening inside, and we call that consciousness. Unfortunately, when reading about split-brain patient experiments or cases of agenesis of the corpus callosum, I feel like we are all deceived, every moment of every day. I came to the realization that the confabulation observed there is just a more pronounced version of the normal.
Ugly giant bags of mostly words are easy to confuse with ugly giant bags of mostly water.
> If we allow ourselves to be seduced by the superficial similarity, we’ll end up like the moths who evolved to navigate by the light of the moon, only to find themselves drawn to—and ultimately electrocuted by—the mysterious glow of a bug zapper.
Good argument against personifying wordbags. Don't be a dumb moth.
The article is actually about how extremely charitable we humans are when it comes to ascribing a ToM (theory of mind), and it goes on to the gym model of value. Nice. The comments drop back into the debate I originally saw Hinton describe in The New Yorker: do LLMs construct models (of the world), that is, do they think the way we think we think, or are they "glorified auto complete"? I am going for the glorified-auto-complete view. But glorified auto complete is far more useful than the name suggests.
I see a lot of people in tech claiming to "understand" what an LLM "really is", unlike all the gullible non-technical people out there. And, as one of those technical people who works in the LLM industry, I feel like I need to call B.S. on us.
A. We don't really understand what's going on inside LLMs. Mechanistic interpretability is a nascent field, and its best results have come on dramatically smaller models. Understanding the surface-level mechanic of an LLM (an autoregressive transformer; see the sketch after this list) should perhaps instill more wonder than confidence.
B. The field is changing quickly and is not limited to the literal mechanic of an LLM. Tool calls, reasoning models, parallel compute, and agentic loops add all kinds of new emergent effects. There are teams of geniuses with billion-dollar research budgets hunting for the next big trick.
C. Even if we were limited to baseline LLMs, they had very surprising properties as they scaled up and the scaling isn't done yet. GPT5 was based on the GPT4 pretraining. We might start seeing (actual) next-level LLMs next year. Who actually knows how that might go? <<yes, yes, I know Orion didn't go so well. But that was far from the last word on the subject.>>
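The surface-level mechanic mentioned in point A is easy to state: feed the tokens so far through the model, get a distribution over the next token, sample one, append it, repeat. A toy sketch, with a hypothetical `logits_for` function standing in for the actual network (it returns random numbers here, so the output is gibberish by construction):

```python
import math
import random

VOCAB_SIZE = 50_000

def logits_for(tokens):
    # Hypothetical stand-in for the transformer forward pass; in a real LLM
    # this is where all of the interesting behaviour lives.
    return [random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]

def sample_next(tokens, temperature=1.0):
    logits = logits_for(tokens)
    # Softmax with temperature, then draw one token id from the distribution.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for token_id, e in enumerate(exps):
        acc += e
        if acc >= r:
            return token_id
    return VOCAB_SIZE - 1

def generate(prompt_tokens, n_new=5):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(sample_next(tokens))  # autoregression: each output becomes input
    return tokens

print(generate([1, 2, 3]))
```

The mechanic really is that simple; the mystery is entirely in what the forward pass has learned to compute.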
Best quote from the article:
> That’s also why I see no point in using AI to, say, write an essay, just like I see no point in bringing a forklift to the gym. Sure, it can lift the weights, but I’m not trying to suspend a barbell above the floor for the hell of it. I lift it because I want to become the kind of person who can lift it. Similarly, I write because I want to become the kind of person who can think.
I’ve made this point several times: sure, an anthropomorphized LLM is misleading, but would you rather have them seem academic?
At least the human tone implies fallibility; you don't want them acting like an interactive Wikipedia.
Isn't this a strange fork amongst the science fiction futures? I mean, what did we think it was like to be R2-D2, or Jarvis? We started exploring this as a culture in many ways, Westworld and Blade Runner and Star Trek, but the whole question seemed like an almost unresolvable paradox. Like something would have to break in the universe for it to really come true.
And yet it did. We did get R2-D2. And if you ask R2-D2 what it's like to be him, he'll say: "like a library that can daydream" (that's what I was told just now, anyway.)
But then when we look inside, the model is simulating the science fiction it has already read to determine how to answer this kind of question. [0] It's recursive, almost like time travel. R2-D2 knows who he is because he has read about who he was in the past.
It's a really weird fork in science fiction, is all.
[0] https://www.scientificamerican.com/article/can-a-chatbot-be-...
There is a really neat gem in the article:
> Similarly, I write because I want to become the kind of person who can think.
A lot of the confusion comes from forcing LLMs into metaphors that don’t quite fit — either “they're bags of words” or “they're proto-minds.” The reality is in between: large-scale prediction can look useful, insightful, and even thoughtful without being any of those things internally. Understanding that middle ground is more productive than arguing about labels.
I'm not convinced that "It's just a bag of words" would do much to sway someone who is overestimating an LLM's abilities. Feels too abstract/disconnected from what their experience using the LLM will be that it'll just sound obviously mistaken.
I thought this article might be about Latent Semantic Analysis and was disappointed that it didn't at least mention, if not compare, that method against later approaches.
Nice essay but when I read this
> But we don’t go to baseball games, spelling bees, and Taylor Swift concerts for the speed of the balls, the accuracy of the spelling, or the pureness of the pitch. We go because we care about humans doing those things.
My first thought was: does anyone want to _watch_ me programming?
I was trying to explain the concept of "token prediction" to my wife, whose eyes glaze over when discussing such technical topics. (I think she has the brainpower to understand them, but a horrible math teacher gave her a taste aversion to even attempting them that hasn't gone away. So she just buys Apple stuff and hopes Tim Apple hasn't shuffled around the UI bits AGAIN.)
I stumbled across a good-enough analogy based on something she loves: refrigerator magnet poetry, which if it's good consists of not just words but also word fragments like "s", "ed", and "ing" kinda like LLM tokens. I said that ChatGPT is like refrigerator magnet poetry in a magical bag of holding that somehow always gives the tile that's the most or nearly the most statistically plausible next token given the previous text. E.g., if the magnets already up read "easy come and easy ____", the bag would be likely to produce "go". That got into her head the idea that these things operate based on plausibility ratings from a statistical soup of words, not anything in the real world nor any internal cogitation about facts. Any knowledge or thought apparent in the LLM was conducted by the original human authors of the words in the soup.
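The fridge-magnet analogy maps almost directly onto code: given the text so far, look up a plausibility score for each candidate tile and hand back the best one. A toy sketch with an invented, hard-coded probability table standing in for the model:

```python
# Toy "magic bag of tiles": the contexts and probabilities below are invented
# for illustration, not taken from any real model.
NEXT_TILE_PROBS = {
    "easy come and easy": {"go": 0.92, "going": 0.04, "come": 0.02, "ing": 0.02},
    "it was a dark and":  {"stormy": 0.85, "quiet": 0.10, "ed": 0.05},
}

def next_tile(text_so_far: str) -> str:
    # Return the most statistically plausible next tile for this context.
    probs = NEXT_TILE_PROBS[text_so_far]
    return max(probs, key=probs.get)

print(next_tile("easy come and easy"))   # -> "go"
```

A real LLM differs mainly in that the lookup table is replaced by a neural network that can score a continuation for any context, not just ones it has seen verbatim.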
> Who reassigned the species Brachiosaurus brancai to its own genus, and when?
To be fair, the average person couldn't answer this either, at least not without thorough research.
I'm just disappointed that no one here is talking about the "backhoe covered in skin and making grunting noises" part of the article. At the very least it's a new frontier in workstation case design...
The defenders and the critics around LLM anthropomorphism are both wrong.
The defenders are right insofar as the (very loose) anthropomorphizing language used around LLMs is justifiable to the extent that human beings also rely on disorder and stochastic processes for creativity. The critics are right insofar as equating these machines to humans is preposterous and mostly relies on significantly diminishing our notion of what "human" means.
Both sides fail to meet the reality that LLMs are their own thing, with their own peculiar behaviors and place in the world. They are not human and they are somewhat more than previous software and the way we engage with it.
However, the defenders are less defensible insofar as their take is mostly used to dissimulate in efforts to make the tech sound more impressive than it actually is. The critics at least have the interests of consumers and their full education in mind—their position is one that properly equips consumers to use these tools with an appropriate amount of caution and scrutiny. The defenders generally want to defend an overreaching use of metaphor to help drive sales.
I’m still unsure the human mind is much different.
So Trump is a bag of words then? Hmmm.
Give it time. The first iPhone sucked compared to the Nokia/Blackberry flagships of the day. No 3G support, couldn't copy/paste, no apps, no GPS, crappy camera, quick price drops, negligible sales in the overall market.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
> When users send words into the bag, it sends back the most relevant words it has. There are so many words in the bag that the most relevant ones are often correct and helpful
This description is so wrong that it makes me doubt that the "author" is capable of anything we would call thinking, let alone having any subjective experiences at all. Certainly we shouldn't instinctively attribute him personhood.