I'm just happy that finally, with the popularity of zen4 and 5 chips, AVX512 is around ~20% of the running hardware in the steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but its going the right direction.
Compared to the weird, lumpy lego set of avx1/2, avx512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.
Personal example: The double width byte shuffles (_mm512_permutex2var_epi8) that takes 128 bytes as input in two registers. I had a critical inner loop that uses a 256 byte lookup table; running an upper/lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on zen5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.
Could you please elaborate on your example? Thanks.