Here is an example using google highway: https://godbolt.org/z/Y8vsonTb8
See how the code has only been written once, but multiple versions of the same functions where generated targeting different hardware features (e.g. SSE, AVX, AVX512). Then `HWY_DYNAMIC_DISPATCH` can be used to dynamically call the fastest one matching your CPU at runtime.
Nice. First time that I saw this dynamic dispatch was in FFTW.
Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.
I only code for x86 with vectorclass library, so I never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or could you often find optimization opportunities if you targeted a particular architecture?