The existence of SIMD has knock-on effects on the design of the execution units and the FPUs, though, since it's usually the only way to fully utilize them for floating-point/arithmetic workloads. And newer SIMD extensions like AVX/AVX2 have a pretty big effect on the whole CPU design; it was widely reported that Intel and AMD went to a lot of trouble to make them viable, even though most software probably isn't even compiled with AVX support enabled.
Also, SIMD is just one example. Modern DMA controllers are probably another good one, but I know less about them (although I did try some weird things with the one in the Raspberry Pi). Or niche OS features like shared memory: pipes are usually all you need, and they don't break the multitasking paradigm, but in the few cases where shared memory is needed it speeds things up tremendously, since the data no longer has to be copied through the kernel.
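
To make the shared memory point a bit more concrete, here's a minimal sketch using Python's standard `multiprocessing.shared_memory` module (Python 3.8+). The function name `write_then_read` is just something I made up for illustration, and to keep it short both handles live in the same process; in real use the second handle would be opened by a different process that was given the block's name.

```python
from multiprocessing import shared_memory


def write_then_read(payload: bytes) -> bytes:
    """Write payload into a shared block, then read it back via a second handle.

    Hypothetical helper for illustration only; payload must be non-empty
    because a shared memory block cannot have size 0.
    """
    # Create a new named block big enough for the payload.
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # Writing is a plain memory copy into the mapped buffer.
        shm.buf[:len(payload)] = payload

        # Attach a second handle by name, the way another process would.
        peer = shared_memory.SharedMemory(name=shm.name)
        try:
            # Slice to the payload length: the OS may round the block size up.
            return bytes(peer.buf[:len(payload)])
        finally:
            peer.close()
    finally:
        # Release our mapping and remove the block from the system.
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(write_then_read(b"hello from shared memory"))
```

With a pipe, the same bytes would be written into a kernel buffer and then read back out on the other side; here both handles map the same memory, so nothing is copied until the final `bytes(...)` call. The catch is that you're on your own for synchronization, which is a big part of why pipes are the sane default.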