Hacker News

kmeisthax · yesterday at 3:04 AM

Threads.

It's less a difference in instruction set capability and more a difference in mentality.

Like, for SIMD, you have to say "ok, we're working in vector land now" and start doing vector loads into vector registers to do vector ops on them. Otherwise, the standard variables your program uses are scalars and you get less parallelism. On a GPU this is flipped: the regular registers are vectors, and the scalar ones (if you have any) are the weird ones. Because of this, the code you write is (more or less) scalar code where everything just so happens to magically get done sixteen times at once.
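
To make that concrete, here's a rough sketch of the two styles doing the same elementwise add (the function names are made up; the CPU side uses SSE intrinsics, the GPU side is a CUDA kernel):

    #include <immintrin.h>   // x86 SSE intrinsics for the CPU side

    // CPU SIMD: you explicitly enter "vector land" -- vector loads into vector
    // registers, a vector op, a vector store. (Remainder handling omitted.)
    void add_simd(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
    }

    // GPU SIMT: the code reads as scalar; the hardware runs it across many
    // lanes at once, each lane picking its own element.
    __global__ void add_simt(const float* a, const float* b, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];   // plain scalar-looking add
    }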

As you can imagine, this isn't foolproof, and there's a lot of other things that have to change about GPU programming for it to be viable. Like, conditional branching has to be scalar, since the instruction pointer register is still scalar. But you can have vectors of condition flags (aka "predicates"), and make all the operations take a predicate register to tell which specific lanes should and shouldn't execute. Any scalar conditional can be compiled into predicates, so long as you're OK with having to chew through all the instructions on both branches[0].

[0] A sufficiently smart shader compiler could check if the predicate is all-false or all-true and do a scalar jump over the instructions that won't execute. Whether or not that's a good idea is another question.
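
A hand-written CUDA sketch of both ideas, per-lane predication plus the footnote's uniformity check (the kernel itself is illustrative; `__activemask`, `__all_sync`, and `__any_sync` are real warp-vote intrinsics, the rest is made up for the example):

    __global__ void predicated(const float* x, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        bool p = x[i] > 0.0f;            // per-lane predicate ("condition flag")
        unsigned mask = __activemask();  // which lanes of this warp are still active

        if (__all_sync(mask, p)) {
            // Footnote [0] in practice: the predicate is uniformly true across the
            // warp, so one (effectively scalar) branch skips the "else" work.
            out[i] = x[i] * 2.0f;
        } else if (!__any_sync(mask, p)) {
            // Uniformly false: skip the "then" work instead.
            out[i] = -x[i];
        } else {
            // Mixed predicate: both sides get chewed through, and each lane keeps
            // the result its own predicate selects -- what a plain if/else lowers
            // to when lanes within a warp disagree.
            float then_val = x[i] * 2.0f;
            float else_val = -x[i];
            out[i] = p ? then_val : else_val;
        }
    }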


Replies

jabl · yesterday at 5:45 AM

One way to think of SIMT is that instead of vector instructions you have a 'fork' instruction that turns on vector mode, where the scalar instructions execute on all the vector lanes. Your SIMT code must then include a 'lane index' variable somewhere (in CUDA it's more complicated, with blocks, warps, etc., but in principle that's just a more detailed way of doing lane indexing) so that all the threads work on different data. Traditionally there is a shared program counter (PC) (in reality, GPUs have something like per-warp PCs, so you still have multiple PCs), and in case of divergent control flow lanes are masked off (though post-Volta Nvidia hardware has per-lane PCs). Finally, when you're done with your parallel algorithm, you execute a 'join' instruction which blocks until all the lanes have reached that point, then turns off all the lanes except one, so you're back in scalar mode.
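
A sketch of that mental model in CUDA terms, where the kernel launch plays the role of the 'fork', the thread index is the lane index, and a block-wide barrier stands in for the 'join' (kernel and variable names are illustrative):

    // "Fork": the kernel launch turns on N logical lanes, each running this same
    // scalar-looking code with a different lane index. Assumes blockDim.x == 256.
    __global__ void scale_then_sum(const float* in, float* block_sums, int n) {
        __shared__ float partial[256];

        int lane = blockIdx.x * blockDim.x + threadIdx.x;   // the "lane index" variable
        partial[threadIdx.x] = (lane < n) ? in[lane] * 2.0f : 0.0f;

        // "Join"-like barrier: wait until every lane in the block reaches this
        // point before anyone reads the shared results.
        __syncthreads();

        if (threadIdx.x == 0) {            // back to "scalar mode": one lane carries on
            float s = 0.0f;
            for (int j = 0; j < blockDim.x; ++j) s += partial[j];
            block_sums[blockIdx.x] = s;
        }
    }

    // Host-side "fork": scale_then_sum<<<num_blocks, 256>>>(d_in, d_sums, n);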

Now whether this is actually how the hardware operates or whether the compiler in the GPU driver turns the SIMT code into something like SIMD code for the actual HW is another question.

HelloNurse · yesterday at 10:20 AM

The different "mentality" is whether sequential execution or lockless parallelism prevails. SIMD instructions in a CPU are very small islands of parallelism: a strictly limited number of elementary jobs done simultaneously within a single opcode, with trivial synchronization (the next opcode simply executes next). SIMT jobs on a GPU can be arbitrary in number and arbitrarily long and complex; you don't know whether they are waiting, running, or done without explicit synchronization primitives, and in-order execution doesn't extend beyond the boundary of a shader.
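
On the host side, that difference shows up as explicit synchronization around the GPU work; roughly like this (illustrative CUDA host code, assuming `d_data` already lives on the device):

    #include <cuda_runtime.h>

    __global__ void long_job(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * data[i] + 1.0f;   // stand-in for arbitrary per-thread work
    }

    void run(float* d_data, int n) {
        // The launch is asynchronous: after this line the GPU threads may be
        // waiting, running, or done -- the host can't tell without asking.
        long_job<<<(n + 255) / 256, 256>>>(d_data, n);

        // Explicit synchronization primitive: block until the GPU work is finished.
        cudaDeviceSynchronize();

        // A CPU SIMD add, by contrast, is simply "done" when the next opcode runs;
        // the parallelism never escapes the single instruction.
    }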

taktoa · yesterday at 3:30 AM

I think what you're describing is SPMD, which is a compilation strategy, not a hardware architecture. I'm not sure, but I think SIMT is SIMD with multiple program counters (one per N lanes) to enable some limited control-flow divergence between lane groups.
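
To make the granularity point concrete, here's a CUDA sketch assuming 32-lane warps (the kernel name and the specific branches are illustrative):

    __global__ void branch_granularity(const float* x, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int warp_id = threadIdx.x / 32;     // 32 lanes share a warp

        if (warp_id % 2 == 0) {
            // Warp-uniform branch: every lane in the warp agrees, so nothing is
            // masked off and neither path is wasted -- the "limited divergence
            // between lane groups" that per-warp PCs allow cheaply.
            out[i] = x[i] + 1.0f;
        } else {
            out[i] = x[i] - 1.0f;
        }

        if (threadIdx.x % 2 == 0) {
            // Lane-divergent branch: lanes within one warp disagree, so the warp
            // serializes through both paths with masking/predication.
            out[i] *= 2.0f;
        } else {
            out[i] *= 0.5f;
        }
    }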
