The high arithmetic throughput of GPUs is of course SIMD-based as well. They just tend to have an ISPC-style compilation model that doesn't expose the SIMD lanes in the source code (whereas on the CPU side, even after decades, compilers make very little use of SIMD).
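To make the "ISPC-style" point concrete, here is a toy sketch in Python. It is only a model, not how any real compiler or GPU works internally. The kernel is written as if it handled a single scalar element, and a hypothetical `run_gang` helper stands in for the compiler and hardware that map it onto lanes. The gang width of 8 is an assumption, and the branch is handled the way SIMD hardware typically handles divergence: both sides are computed and a per-lane mask picks the result.

```python
GANG_WIDTH = 8  # assumed lane count, purely illustrative


def kernel(x):
    """What the programmer writes: scalar-looking code, no lanes in sight."""
    if x < 0:
        return -x * 2.0
    return x + 1.0


def run_gang(values):
    """Toy model of how the same kernel runs across SIMD lanes.

    Each chunk of GANG_WIDTH elements is one "vector". Both sides of the
    branch are evaluated for every lane, then a mask selects per lane.
    """
    out = []
    for start in range(0, len(values), GANG_WIDTH):
        lanes = values[start:start + GANG_WIDTH]
        mask = [x < 0 for x in lanes]            # per-lane branch condition
        then_side = [-x * 2.0 for x in lanes]    # executed for all lanes
        else_side = [x + 1.0 for x in lanes]     # executed for all lanes
        out.extend(t if m else e
                   for m, t, e in zip(mask, then_side, else_side))
    return out
```

Calling `run_gang(data)` gives the same results as `[kernel(x) for x in data]`; the lanes only exist in the execution model, not in the source the programmer wrote.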
It's SIMD-based at the lowest level, but GPUs also rely on very heavy hardware multithreading to hide memory access latency: each compute unit (AMD) or streaming multiprocessor (NVIDIA) keeps many hardware threads resident at once (the threads are called, AIUI, "wavefronts" or "warps"). Recent SPARC CPUs have 8-way hardware multithreading on each core; GPUs can easily go even higher than that.
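A back-of-the-envelope sketch of why the thread count matters, in Python. The model is deliberately crude and entirely an assumption: each hardware thread alternates between `work_cycles` of arithmetic and a stall of `mem_latency` cycles, and the core can switch threads for free. The function names and the cycle counts in the usage note are made up for illustration, not measured from any real chip.

```python
import math


def warps_needed(mem_latency, work_cycles):
    """Threads needed so the ALUs never idle under the toy model.

    While one thread waits mem_latency cycles, the others must supply
    that many cycles of work, work_cycles at a time.
    """
    return 1 + math.ceil(mem_latency / work_cycles)


def utilization(n_threads, mem_latency, work_cycles):
    """Fraction of cycles the ALUs are busy with n_threads resident."""
    return min(1.0, n_threads * work_cycles / (work_cycles + mem_latency))
```

With an assumed 400-cycle memory latency and 20 cycles of work between loads, `warps_needed(400, 20)` comes out to 21, while `utilization(8, 400, 20)` is only about 0.38. Under those made-up numbers, 8-way multithreading would leave the ALUs idle most of the time, which is the gap the much deeper multithreading on GPUs is meant to close.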