I haven't read enough about it to understand what's going on, but the development of multimodal models has also felt like a major step. Being able to paste an image into a chat and have the model "understand" it about as well as it understands language is very powerful.