Here's a non-parallel and unoptimized implementation of that operation in Go:
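func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
	var dst [64]uint8
	for j := 0; j < 64; j++ {
		i := idx[j]
		// Bit 6 of the index selects the source table: 0 picks a, 1 picks b.
		src := a
		if i&0b0100_0000 != 0 {
			src = b
		}
		// Bits 0-5 select the byte within that table; bit 7 is ignored.
		dst[j] = src[i&0b0011_1111]
	}
	return dst
}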
Basically, for a lookup table of 8-bit values, you need only one instruction per 128 bytes of table to perform up to 64 lookups simultaneously.
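To make the "per 128 bytes of table" part concrete, here is a rough sketch of how a 256-byte table might be handled in the same scalar style: one modeled permute per 128-byte half, then a per-lane select on bit 7 of the index. The names `permutex2var` and `lookup256` are hypothetical helpers made up for this illustration, and the final select loop is only a stand-in for what would presumably be a mask blend on real hardware (something like `_mm512_movepi8_mask` followed by `_mm512_mask_blend_epi8`).

```go
package main

import "fmt"

// permutex2var is a scalar model of the two-source byte permute shown above.
func permutex2var(a, idx, b [64]uint8) [64]uint8 {
	var dst [64]uint8
	for j, i := range idx {
		src := a
		if i&0b0100_0000 != 0 {
			src = b
		}
		dst[j] = src[i&0b0011_1111]
	}
	return dst
}

// lookup256 (hypothetical helper) performs 64 lookups into a 256-byte table.
// It runs one modeled permute per 128-byte half of the table, then picks the
// result from the upper half for every lane whose index has bit 7 set.
func lookup256(table [256]uint8, idx [64]uint8) [64]uint8 {
	// Split the table into four 64-byte "registers".
	var t [4][64]uint8
	for k := range t {
		copy(t[k][:], table[64*k:64*(k+1)])
	}

	lo := permutex2var(t[0], idx, t[1]) // covers table[0..127]
	hi := permutex2var(t[2], idx, t[3]) // covers table[128..255]

	var dst [64]uint8
	for j, i := range idx {
		if i&0b1000_0000 != 0 {
			dst[j] = hi[j]
		} else {
			dst[j] = lo[j]
		}
	}
	return dst
}

func main() {
	var table [256]uint8
	for i := range table {
		table[i] = uint8(255 - i)
	}
	var idx [64]uint8
	for j := range idx {
		idx[j] = uint8(j * 4)
	}
	out := lookup256(table, idx)
	fmt.Println(out[0], out[1], out[63]) // prints "255 251 3"
}
```

Under these assumptions, a 256-byte table costs two permutes plus one select for 64 lookups. Note that both permutes use the same index vector unchanged, since the permute ignores bit 7.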