Hacker News

dzaima · last Saturday at 5:00 PM

I haven't used CUDA (I don't have an Nvidia GPU), but I've looked at example code before, and it doesn't look any simpler than CPU SIMD for anything non-trivial.

And please, please, please don't put locks in something operating over a 20-element array; I'm pretty damn sure that's simply a suboptimal approach in any scenario. Even if those are just "software" locks that force serialized computation without actually emitting any memory atomics or extra instructions, needing them at all hints at awful approaches like log(n) loops over an n=20 array, or in-memory accumulators, or something similarly bad.

As an extreme case of something I've had to do in CPU SIMD that I don't think would be sane any other way:

How would I implement, in CUDA, code that:

- does elementwise 32-bit integer addition of two input arrays into a third array (which may alias one of the inputs),

- checks each addition for overflow,

- on any overflow, early-exits (ideally, to avoid useless work) and reports in some way how much was processed,

such that further code can redo the addition with a wider result type and compute the full final wider result array, even where some of the original inputs are no longer available because the input overlapped the output (which is fine, since for those elements the already-computed results can be used directly)?

This is a pretty trivial CPU SIMD loop of maybe a dozen intrinsics (easily doable even via any of the generalized arch-independent SIMD libraries!), but I'm pretty sure anything CUDA-like would require a ton of synchronization, probably force early-exiting in much larger blocks, and probably have to return a bitmask of which threads wrote their results; the SIMD loop, by contrast, trivially guarantees that the processed and unprocessed elements are split exactly where the loop stopped.
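To make that concrete, here's a minimal sketch of the loop in portable C; the function name and 4-wide chunking are my own choices, and real code would use intrinsics directly (each chunk maps onto a handful of SSE2 operations: _mm_add_epi32, _mm_xor_si128, _mm_and_si128, _mm_movemask_ps):

```c
#include <stddef.h>
#include <stdint.h>

// Returns how many elements were written to out; if the result is < n,
// some addition at or past that index overflowed, and out is untouched
// from that index onward. out may alias a or b.
size_t add_i32_checked(int32_t *out, const int32_t *a,
                       const int32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {            // one "SIMD vector" per iteration
        int32_t r[4];
        uint32_t ovf = 0;
        for (size_t k = 0; k < 4; k++) {
            // wrapping add, done in unsigned to avoid UB on overflow
            r[k] = (int32_t)((uint32_t)a[i + k] + (uint32_t)b[i + k]);
            // signed overflow iff both inputs share a sign the result lacks:
            // (a ^ r) & (b ^ r) has its sign bit set in an overflowing lane
            ovf |= (uint32_t)((a[i + k] ^ r[k]) & (b[i + k] ^ r[k])) >> 31;
        }
        if (ovf) return i;                  // early exit; chunk not stored
        for (size_t k = 0; k < 4; k++) out[i + k] = r[k];
    }
    for (; i < n; i++) {                    // scalar tail, same overflow rule
        int32_t r = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
        if (((a[i] ^ r) & (b[i] ^ r)) < 0) return i;
        out[i] = r;
    }
    return n;
}
```

The returned count is the exact split point: everything before it is done, everything from it onward is left for the wider-result pass.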

(For addition specifically you can also undo the addition to recover the input array, but that gets much worse for multiplication, where the inverse is division. And perhaps in the CUDA approach you might also want to split the work into separately checking for overflow and writing the results, but that's an inefficient two passes over memory just to split out a store of something the first pass already computed.)
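For what it's worth, the undo-the-addition trick works even in lanes that wrapped, since two's-complement add and subtract are exact inverses modulo 2^32; a one-liner sketch (name is hypothetical):

```c
#include <stdint.h>

// Recover an original input lane from a stored (possibly wrapped) sum:
// wrapping subtraction exactly undoes the wrapping addition, even in
// lanes where the addition overflowed.
static inline int32_t undo_add_i32(int32_t stored_sum, int32_t other_input) {
    return (int32_t)((uint32_t)stored_sum - (uint32_t)other_input);
}
```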