Hacker News

dzaima · last Saturday at 4:12 PM

That's because the things GPUs do just aren't what CPUs do. GPUs don't have to deal with ad-hoc 10..100-char strings. They don't have to deal with 20-iteration loops with necessarily-serial dependencies between iterations. They don't have to deal with parallelizing mutable hashmap probing operations.

Indeed what GPUs do is good for what GPUs do. But we have a tool for doing things that GPUs do well - it's called, uhh, what's it, uhh... oh yeah, GPUs. Copying that into CPUs is somewhere between just completely unnecessary, and directly harmful to things that CPUs are actually supposed to be used for.

The GPU approach has pretty big downsides for anything other than the most embarrassingly-parallel code on massive inputs. Anything non-trivial (sorting, prefix sum) will typically require log(n) iterations, and somewhere between twice as much and O(n*log(n)) memory access, compared to the CPU SIMD approach of doing a single pass with some shuffles (even summing requires silly things like using memory for an accumulator instead of just using vector registers). GPUs handle this by trading off memory latency for more bandwidth, but any CPU that did that would go right in the trash, because it would utterly kill scalar code performance.
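The work difference can be sketched in plain C: a serial scan is one pass with one accumulator, while a GPU-style scan (this sketch follows the Hillis-Steele scheme) needs log2(n) full sweeps, each touching the whole array:

```c
#include <string.h>

#define N 8  /* scan_sweeps assumes n <= N */

/* CPU-style inclusive prefix sum: one serial pass, one accumulator. */
static void scan_serial(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) { acc += in[i]; out[i] = acc; }
}

/* GPU-style (Hillis-Steele) scan: log2(n) full sweeps, each reading and
 * writing the whole array -- O(n*log(n)) memory traffic vs. O(n) above. */
static void scan_sweeps(const int *in, int *out, int n) {
    int tmp[N];
    memcpy(out, in, n * sizeof *in);
    for (int d = 1; d < n; d *= 2) {       /* one "kernel launch" per sweep */
        for (int i = 0; i < n; i++)        /* all lanes run concurrently on a GPU */
            tmp[i] = (i >= d) ? out[i] + out[i - d] : out[i];
        memcpy(out, tmp, n * sizeof *out);
    }
}
```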


Replies

smallmancontrov · last Saturday at 4:23 PM

This reads like you haven't tried CUDA. The whole point of CUDA is that your CUDA code has single-threaded semantics. The problem you assumed it has is the problem it doesn't have, and the fact that it doesn't have this problem is the reason why it works and the reason why it wins.

EDIT: ok, now you're talking about the hardware difference between CPUs and GPUs. This is relevant for the types of programs each can accelerate -- barrel processors are uniquely suited to embarrassingly parallel problems, obviously -- but it is not relevant to the question of "how to write code that is generic across block size, boundary conditions, and thread divergence." CUDA figured this out, non-embarrassingly-parallel programs still have this problem, and they should copy what works. The best time to copy what works was 20 years ago, but the second best time is today.
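What "single-threaded semantics, generic over block size and boundary conditions" looks like can be sketched in plain C (the function names here are made up; in real CUDA the body would be a `__global__` kernel and the outer loop would be the hardware grid running every index concurrently):

```c
/* Kernel body written with single-threaded semantics: plain scalar code
 * for one index i, with a bounds check so it stays correct for any block
 * size and ragged array edges (the standard CUDA "if (i < n)" idiom). */
static void saxpy_body(int i, int n, float a, const float *x, float *y) {
    if (i < n)                  /* boundary condition handled per element */
        y[i] = a * x[i] + y[i];
}

/* "Launch": grid*block may overshoot n; the per-element guard absorbs it.
 * On a GPU every i runs concurrently; here a plain loop stands in. */
static void saxpy_launch(int grid, int block, int n, float a,
                         const float *x, float *y) {
    for (int i = 0; i < grid * block; i++)
        saxpy_body(i, n, a, x, y);
}
```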
