One way to think of SIMT is that instead of vector instructions you have a 'fork' instruction which turns on vector mode, where the scalar instructions execute on all of the vector lanes. Your SIMT code must then include a 'lane index' variable somewhere so that all the threads work on different data (in CUDA it's more elaborate, with blocks, warps and so on, but in principle that's just a more detailed way of doing lane indexing). Traditionally there is a shared program counter (PC); in reality GPUs have something like per-warp PCs, so you still have multiple PCs. In case of divergent control flow, lanes that didn't take the current path are masked off (though post-Volta Nvidia hardware has per-lane PCs). Finally, when you're done with your parallel algorithm, you execute a 'join' instruction which blocks until all the lanes have reached that point, and then turns off all the lanes except one, so you're back in scalar mode.
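To make that concrete, here's a minimal CUDA sketch of the model (the kernel name `scale` and the data are purely illustrative): the kernel launch plays the role of the 'fork', `cudaDeviceSynchronize()` on the host plays the role of the 'join', the block/thread indices are just a structured way of computing the lane index, and the `if` branches are where divergence and lane masking come in.

```cuda
#include <cstdio>

__global__ void scale(float *data, int n) {
    // The 'lane index' variable: blocks/warps/threads are just a more
    // structured way of computing it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // divergent branch: lanes past n are masked off
        if (data[i] < 0.0f)      // more divergence: only some lanes take this path
            data[i] = -data[i];
        data[i] *= 2.0f;         // lanes reconverge here
    }
}

int main() {
    const int n = 1024;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // unified memory, for brevity
    for (int i = 0; i < n; i++)
        data[i] = (i % 2 ? -1.0f : 1.0f) * i;

    scale<<<(n + 255) / 256, 256>>>(data, n);      // the 'fork': lanes turn on
    cudaDeviceSynchronize();                       // the 'join': wait for all lanes

    printf("data[3] = %f\n", data[3]);             // -3 -> abs -> 3 -> *2 -> 6
    cudaFree(data);
    return 0;
}
```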
Now, whether this is actually how the hardware operates, or whether the compiler in the GPU driver turns the SIMT code into something more like SIMD code for the actual hardware, is another question.