ozgung • yesterday at 10:38 PM

I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there are no word-associated tokens inside a Transformer. Tokens != word parts. An LLM does not perform language processing inside the Transformer blocks, and a Vision Transformer does not perform image processing there either. Words and pixels are only relevant at the input. I think this misunderstanding was a root cause of people underestimating what these models can do.
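
To make that concrete, here is a rough sketch of single-head self-attention in plain Python. It is only an illustration, not how any particular model implements it: the helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`), the sizes, and the random inputs are all made up for the example. The thing to notice is that the function only ever sees rows of floats. Nothing in it knows whether a row came from a subword ID or an image patch.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or embedded inputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model = 4, 8  # arbitrary sizes chosen for the example

# x stands in for 4 token vectors. They could be embedded text tokens or
# projected image patches; the attention code cannot tell the difference.
x = random_matrix(seq_len, d_model, rng)
w_q = random_matrix(d_model, d_model, rng)
w_k = random_matrix(d_model, d_model, rng)
w_v = random_matrix(d_model, d_model, rng)

out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over the 4 positions, and `out` has the same shape as `x`. Words or pixels would only show up in whatever code produced `x` in the first place.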
