4 lanes of SIMD (as in, say, SSE) are not necessarily 4x faster, because memory access gets in the way. Sometimes the speedup is better than 4x, and often it's less.
PSHUFB (SSSE3) wins when the access pattern is unpredictable, since it works as a 16-entry table lookup that stays entirely in registers. I don't remember how much it typically wins, though.
PMOVMSKB can replace several conditional branches (up to 16 in SSE2 for byte operands) with a single one, which is a win for branch prediction.
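To make the PMOVMSKB point concrete, here is a rough sketch of a byte search that does 16 comparisons per loop iteration but takes only one branch on the result. The function name find_byte_sse2 is made up for illustration, the length is assumed to be a multiple of 16 to keep it short, and __builtin_ctz assumes GCC or Clang.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: index of the first byte equal to `needle`, or -1.
   Assumption: len is a multiple of 16 (no scalar tail in this sketch). */
static long find_byte_sse2(const unsigned char *buf, size_t len,
                           unsigned char needle)
{
    assert(len % 16 == 0);
    const __m128i pattern = _mm_set1_epi8((char)needle);

    for (size_t i = 0; i < len; i += 16) {
        /* Unaligned load of 16 bytes. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* PCMPEQB: each byte becomes 0xFF where equal, 0x00 otherwise. */
        __m128i eq = _mm_cmpeq_epi8(chunk, pattern);
        /* PMOVMSKB: gather the top bit of each byte into a 16-bit mask. */
        int mask = _mm_movemask_epi8(eq);
        /* One branch covers all 16 comparisons. */
        if (mask != 0) {
            /* Lowest set bit = first matching byte (GCC/Clang builtin). */
            return (long)i + __builtin_ctz((unsigned)mask);
        }
    }
    return -1;
}
```

The scalar version would branch once per byte; here the branch is taken once per 16 bytes and is only mispredicted around the block that contains the match.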
PMADDWD is in SSE2, and it does eight 16-bit multiplies, not 4 (adjacent products are then summed into four 32-bit results). There's also SSE4.1 FP rounding that doesn't require changing the rounding mode, the weird string instructions in SSE4.2, and non-temporal moves and prefetching in some cases.
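As an illustration of the PMADDWD part, here is a minimal sketch of a 16-bit dot product where each _mm_madd_epi16 call performs eight multiplies and four pairwise adds. The name dot_i16_sse2 is hypothetical, n is assumed to be a multiple of 8, and the running sums are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dot product of two int16_t arrays.
   Assumptions: n is a multiple of 8, and no 32-bit overflow occurs. */
static int32_t dot_i16_sse2(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 8 == 0);
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PMADDWD: 8 signed 16-bit multiplies, adjacent products added,
           giving 4 signed 32-bit partial sums. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }

    /* Horizontal sum of the four 32-bit lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```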
The cool thing with SIMD is that it puts a lot less stress on the CPU's memory access prediction and branch prediction, not only on the ALUs. So when you optimize with it, unrelated parts of your code can end up running faster too.