As someone who works in this area, I find everything in this article to be misleading nonsense; it gets even the most basic CS101 concepts wrong. The level of confusion here is astounding.
> People were saying that this meant that the AI winter was over
The last AI winter was over 20 years ago. Transformers came during an AI boom.
> First time around, AI was largely symbolic
Neural networks were already hot and the state of the art across many disciplines when Transformers came out.
> The other huge problem with traditional AI was that many of its algorithms were NP-complete
Algorithms are not NP-complete. That's a type error. Problems can be NP-complete, not algorithms.
> with the algorithm taking an arbitrarily long time to terminate
This has no relationship to NP-completeness at all. NP-complete problems are decidable: every one of them has an algorithm that always terminates, even if only a brute-force search in exponential time. Non-termination is a different issue entirely.
> but I strongly suspect that 'true AI', for useful definitions of that term, is at best NP-complete, possibly much worse
I think the author means that "true AI" should return answers quickly and with high accuracy? That's a statement with no relationship to NP-completeness at all.
> For the uninitiated, a transformer is basically a big pile of linear algebra that takes a sequence of tokens and computes the likeliest next token
This is wrong on many levels. A Transformer is not a linear network; linear networks are well characterized, and they aren't powerful enough to do much. It's the non-linearities in the Transformer that allow it to work. And only decoders compute a distribution over the next token.
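To make this concrete, here's a minimal single-head decoder block in plain NumPy (a toy sketch: random weights, causal mask omitted, layer-norm placement simplified). The matrix multiplications are the linear algebra; the softmax, GELU, and layer norm are the non-linearities that make it more than that:

```python
import numpy as np

def softmax(x, axis=-1):
    # Non-linear: exponentiation followed by normalization.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # Non-linear activation used in the MLP sub-layer (tanh approximation).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # Non-linear: each row is divided by its own standard deviation.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def decoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # Self-attention: the matmuls are linear, the softmax is not.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    x = x + softmax(scores) @ v @ Wo
    # Position-wise MLP with a non-linear activation.
    x = x + gelu(layer_norm(x) @ W1) @ W2
    return layer_norm(x)

T, d = 8, 16  # toy sequence length and model width
x = np.random.randn(T, d)
Ws = [np.random.randn(*s) / np.sqrt(s[0]) for s in
      [(d, d), (d, d), (d, d), (d, d), (d, 4 * d), (4 * d, d)]]
y = decoder_block(x, *Ws)  # (T, d): same shape in, same shape out
```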
> More specifically, they are fed one token at a time, which builds an internal state that ultimately guides the generation of the next token
Totally wrong; this describes an RNN, and removing exactly this bottleneck is why Transformers killed RNNs. A Transformer is given all of the context tokens simultaneously and then produces the next token; an RNN has to consume tokens one at a time and cannot process them in parallel. This is just the wrong mental model of what a Transformer is.
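A rough sketch of the difference, again in toy NumPy (single head, weights omitted): self-attention scores every position against every earlier position in one matrix product, with a causal mask hiding future tokens, whereas an RNN is forced into a sequential loop:

```python
import numpy as np

T, d = 8, 16                       # toy sequence length and model width
x = np.random.randn(T, d)          # embeddings for ALL T tokens at once
q = k = v = x                      # single-head self-attention, projections omitted

# Every position attends to every earlier position in ONE matrix product;
# the causal mask only prevents looking at future tokens.
scores = q @ k.T / np.sqrt(d)                     # (T, T), all pairs at once
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v                                    # (T, d): all positions in parallel

# An RNN, by contrast, is stuck with a sequential loop over positions:
# h = h0
# for t in range(T):
#     h = rnn_cell(x[t], h)   # step t cannot start until step t-1 finishes
```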
> This sounds bizarre and probably impossible, but the huge research breakthrough was figuring out that, by starting with essentially random coefficients (weights and biases) in the linear algebra, and during training back-propagating errors, these weights and biases could eventually converge on something that worked.
Again, totally wrong. Gradient descent dates back to Cauchy in the mid-1800s, and backprop to the 1960s and 70s, so neither was the key breakthrough of Transformers. The actual breakthrough was the architecture itself: self-attention, which lets the whole sequence be processed in parallel during training.
> This inner loop isn't Turing-complete – a simple program with a while loop in it is computationally more powerful. If you allow a transformer to keep generating tokens indefinitely this is probably Turing-complete, though nobody actually does that because of the cost.
This isn't what Turing-completeness means. And by definition no practical computer is a Turing machine, simply because a TM requires an unbounded tape; our actual machines are all roughly linear bounded automata. The interesting part is that this distinction doesn't really tell us anything useful in practice.
> Transformers also solved scaling, because their training can be unsupervised
Unsupervised methods predate Transformers by decades and were already in wide use (e.g. word2vec embeddings in NLP) by the time Transformers came out.
> In practice, the transformer actually generates a number for every possible output token, with the highest number being chosen in order to determine the token.
Greedy decoding isn't the default in most applications; models usually sample from the output distribution, typically with a temperature and a top-p or top-k cutoff.
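For anyone unfamiliar, the common alternative to taking the argmax is sampling from the softmax distribution, usually with a temperature; a toy NumPy sketch (the logits below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])   # one score per vocabulary token (made up)

# Greedy decoding: always take the single highest-scoring token.
greedy_token = int(np.argmax(logits))

# Temperature sampling: rescale the logits, softmax, then draw from the distribution.
temperature = 0.8
probs = np.exp(logits / temperature)
probs /= probs.sum()
sampled_token = int(rng.choice(len(logits), p=probs))
```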
> The problem with this approach is that the model will always generate a token, regardless of whether the context has anything to do with its training data.
Absolutely not. We have end-of-sequence (EOS) tokens exactly for this: the model can emit one and stop generating.
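Concretely, a generation loop stops as soon as the model emits that token; a toy sketch, with `next_token_logits` as a made-up stand-in for a real model's forward pass and a hypothetical EOS id:

```python
import numpy as np

EOS_ID = 0                       # hypothetical id of the end-of-sequence token
MAX_NEW_TOKENS = 256             # safety cap, not the normal stopping condition
rng = np.random.default_rng(0)

def next_token_logits(context):
    # Stand-in for a real model's forward pass; returns random scores here.
    return rng.normal(size=100)  # pretend vocabulary of 100 tokens

def generate(context):
    out = list(context)
    for _ in range(MAX_NEW_TOKENS):
        logits = next_token_logits(out)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = int(rng.choice(len(probs), p=probs))
        if token == EOS_ID:      # the model itself decides to stop here
            break
        out.append(token)
    return out
```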
I got tired of reading at this point. This is drivel by someone who has no clue what's going on.
> This isn't what Turing-completeness is. And by definition all practical computing is not a Turing Machine, simply because TMs require an infinite tape.
I think you are being too triggered and entitled in your nit-picking. It's obvious that an infinite tape can't exist in a potentially finite universe, but for practical purposes in CS, Turing-completeness means the logic is expressive enough to emulate a Turing machine, regardless of tape size.