I picked the “Attention is All You Need” example at the top, and wow, it is not great!
It didn’t take long to find hallucinations and a general lack of intelligence:
> For each word, we compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I give out?).
What? That’s the worst description of a key-value relationship I’ve ever read, unhelpful for understanding what the equation is doing, and just wrong.
> Attention(Q, K, V) = softmax( Q·Kᵀ / √dk ) · V
> 3 Mask (Optional) Block future positions in decoder
Not present in this equation, and also not a great description of masking in a Transformer decoder (this architecture is not an RNN).
> 5 × V Weighted sum of values = output
Nope!
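For anyone following along, here is a minimal sketch of what the quoted equation computes, in plain Python with no libraries. The names `softmax`, `attention`, and the `causal` flag are my own for illustration, not from the paper or the linked page. The mask is not part of the equation as written; I am assuming the usual approach of setting disallowed scores to negative infinity before the softmax.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    causal=True is the optional mask (an assumption of this sketch):
    position i may only attend to positions j <= i.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot product of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide future positions before the softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Multiply the row of weights by V, one output column at a time.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out
```

Calling `attention(Q, K, V, causal=True)` with equal-length `Q` and `K` applies the mask; leaving `causal` off gives the equation exactly as quoted.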
https://nowigetit.us/pages/f4795875-61bf-4c79-9fbe-164b32344...
LLMs, even the best ones, are still hit or miss wrt quality. Constantly improving, though.
I see more confusion from Opus 4.x about how to weight the different parts of a paper by importance than I see flat-out incorrect hallucinations. But those still happen.