There’s Grouped Query Attention as well, a different activation function, and a bunch of not very interesting norms stuff. But yeah, you’re right - still very similar overall.