Hacker News

kbolino, yesterday at 10:36 PM

Here's a non-parallel and unoptimized implementation of that operation in Go:

  func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
    var dst [64]uint8
    for j := 0; j < 64; j++ {
      i := idx[j]
      src := a
      if i&0b0100_0000 != 0 {
        src = b
      }
      dst[j] = src[i&0b0011_1111]
    }
    return dst
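  }

Basically, for a lookup table of 8-bit values, a single instruction performs up to 64 lookups simultaneously against 128 bytes of table (the two 64-byte inputs a and b). Bit 6 of each index picks the input and the low 6 bits pick the byte within it. A larger table needs one such instruction for each 128 bytes of table.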
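
To make the "for each 128 bytes of table" part concrete, here is a rough sketch of a 256-byte table lookup built from two calls to the same scalar model. The names permutex2var and lookup256 are made up for this illustration, and the final per-byte selection on bit 7 is my assumption about how the two partial results would be combined; on real hardware that step would presumably be a masked blend rather than a loop.

```go
package main

import "fmt"

// permutex2var is the same scalar model as above, repeated under a
// hypothetical shorter name so this sketch compiles on its own.
func permutex2var(a, idx, b [64]uint8) [64]uint8 {
	var dst [64]uint8
	for j := range dst {
		i := idx[j]
		src := &a
		if i&0b0100_0000 != 0 {
			src = &b
		}
		dst[j] = src[i&0b0011_1111]
	}
	return dst
}

// lookup256 (hypothetical helper) looks up 64 byte indices in a 256-byte
// table: one permute per 128-byte half, then a per-byte choice on bit 7.
func lookup256(table [256]uint8, idx [64]uint8) [64]uint8 {
	// Split the table into four 64-byte chunks, one per vector register.
	var t [4][64]uint8
	for k := range t {
		copy(t[k][:], table[64*k:64*(k+1)])
	}

	// The permute only looks at the low 7 bits of each index, so both
	// calls can reuse idx unchanged.
	lo := permutex2var(t[0], idx, t[1]) // entries 0..127
	hi := permutex2var(t[2], idx, t[3]) // entries 128..255

	// Assumed combining step: bit 7 of the index picks the half.
	var dst [64]uint8
	for j := range dst {
		if idx[j]&0b1000_0000 != 0 {
			dst[j] = hi[j]
		} else {
			dst[j] = lo[j]
		}
	}
	return dst
}

func main() {
	var table [256]uint8
	for i := range table {
		table[i] = uint8(255 - i) // arbitrary example table
	}
	var idx [64]uint8
	for j := range idx {
		idx[j] = uint8(j * 4) // indices 0, 4, ..., 252
	}
	got := lookup256(table, idx)
	fmt.Println(got[0], got[1], got[63]) // prints "255 251 3"
}
```

So a full 256-entry byte table costs two permutes plus one selection step for every 64 lookups, under the assumptions above.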