I agree! I think people are used to comparing against single-threaded, non-vectorized code, which uses something like 0.1% of a modern CPU's compute power.
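For a rough sense of where a figure like that comes from, here is a back-of-the-envelope sketch. The core counts, SIMD widths, and FMA unit counts are hypothetical illustrative values, not measurements of any particular chip, and it assumes the scalar code retires one floating-point op per cycle and ignores clock speed and memory bandwidth entirely.

```python
def scalar_fraction(cores, simd_lanes, fma_units, flops_per_fma=2):
    """Fraction of peak FLOPs/cycle reached by one thread doing one scalar op per cycle.

    All parameters are illustrative assumptions, not vendor specifications.
    """
    # Peak per cycle: every core, every SIMD lane, every FMA unit, 2 FLOPs per FMA.
    peak_per_cycle = cores * simd_lanes * fma_units * flops_per_fma
    return 1.0 / peak_per_cycle


# Hypothetical 16-core part with 8 fp32 lanes (AVX2-style) and 2 FMA units per core.
desktop = scalar_fraction(cores=16, simd_lanes=8, fma_units=2)

# Hypothetical 64-core part with 16 fp32 lanes (AVX-512-style) and 2 FMA units per core.
server = scalar_fraction(cores=64, simd_lanes=16, fma_units=2)

print(f"desktop: {desktop:.3%}")
print(f"server:  {server:.3%}")
```

Under those assumptions the two cases land at roughly 0.2% and 0.02%, so 0.1% is the right order of magnitude.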
Where the balance tilts all the way back towards GPUs is the tensor units running at reduced precision...