vpcmpestri xmm2, xmm3, BYTEWISE_CMP
test cx, 0x10 ; if(rcx != 16)
I see this test/cmp after the instruction all the time and I don't understand it. pcmpestri sets ZF if edx < 16 and SF if eax < 16, and it sets CF whenever a match was found, which is exactly the ecx != 16 condition. It is already giving you the necessary status. Also, testing a 16-bit sub-register of the larger register is very slow and is a pipeline hazard. You've got this monster of an instruction, and then people place all this paranoid slowness around it. Am I reading the x86 manual wrong?
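To make the flag claim concrete, here is a rough sketch using the SSE4.2 C intrinsics, which expose the same flags as separate calls (`_mm_cmpestrc` for CF, `_mm_cmpestrz` for ZF, `_mm_cmpestrs` for SF). Treat the details as assumptions: BYTEWISE_CMP is not defined in the snippet above, so I am guessing it means "first mismatching byte", and `cmp_result` and `compare_block` are names made up for this illustration. The `target` attribute is GCC/Clang-specific.

```c
#include <assert.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpestri, _mm_cmpestrc, _mm_cmpestrz, _mm_cmpestrs */

/* Assumed meaning of BYTEWISE_CMP: compare unsigned bytes pairwise and
   report the lowest index where the two operands differ. */
#define BYTEWISE_CMP (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | \
                      _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper type: the index result plus the three flags. */
typedef struct {
    int index; /* what the instruction leaves in ECX (16 = nothing found) */
    int cf;    /* CF: 1 if the result mask was non-zero                  */
    int zf;    /* ZF: 1 if the second operand's length (EDX) is < 16     */
    int sf;    /* SF: 1 if the first operand's length (EAX) is < 16      */
} cmp_result;

/* Both pointers must reference 16 readable bytes; la and lb are the
   explicit lengths that the instruction takes in EAX and EDX. */
__attribute__((target("sse4.2")))
static cmp_result compare_block(const unsigned char *a, int la,
                                const unsigned char *b, int lb)
{
    assert(la >= 0 && la <= 16 && lb >= 0 && lb <= 16);

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    cmp_result r;
    r.index = _mm_cmpestri(va, la, vb, lb, BYTEWISE_CMP);
    r.cf    = _mm_cmpestrc(va, la, vb, lb, BYTEWISE_CMP);
    r.zf    = _mm_cmpestrz(va, la, vb, lb, BYTEWISE_CMP);
    r.sf    = _mm_cmpestrs(va, la, vb, lb, BYTEWISE_CMP);
    return r;
}
```

Under this assumed mode, `cf` is 1 exactly when `index != 16`, so in assembly a `jc`/`jnc` straight after the instruction would carry the same information as the extra test. A compiler will usually fold the repeated intrinsic calls into a single instruction when optimizing, though that is not guaranteed.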
I'm not sure what Visual Studio has done over the years, but I remember decompiling Gearbox's utilities .dll from James Bond 007 Nightfire (2002), and it appeared to have a bunch of string manipulation functions written using these instructions.