Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.
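For anyone wondering why the cost nearly vanishes after the first call, here is a minimal sketch of the general function-pointer technique. It is not Highway's actual implementation: `SumScalar`, `SumWide`, `CpuHasWideVectors`, `g_sum`, and `g_resolve_count` are all hypothetical names, the two kernels are plain C++ stand-ins for per-target builds, and the AVX2 check assumes GCC or Clang on x86.

```cpp
#include <cstddef>

// Hypothetical kernels: stand-ins for per-target builds of one function.
static float SumScalar(const float* data, std::size_t n) {
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) total += data[i];
  return total;
}

static float SumWide(const float* data, std::size_t n) {
  // Four independent accumulators, mimicking what a vector build would do.
  float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    for (int k = 0; k < 4; ++k) acc[k] += data[i + k];
  }
  float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
  for (; i < n; ++i) total += data[i];  // leftover elements
  return total;
}

// Assumption: GCC/Clang on x86; every other build falls back to scalar.
static bool CpuHasWideVectors() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

using SumFn = float (*)(const float*, std::size_t);

static float SumResolve(const float* data, std::size_t n);

// The pointer starts at the resolver, so only the first call pays for
// CPU detection.
static SumFn g_sum = SumResolve;
static int g_resolve_count = 0;  // only here to show the resolver runs once

static float SumResolve(const float* data, std::size_t n) {
  ++g_resolve_count;
  // Not thread-safe as written; a real library would use an atomic pointer.
  g_sum = CpuHasWideVectors() ? SumWide : SumScalar;
  return g_sum(data, n);
}

// Public entry point: after the first call this is one indirect call.
float Sum(const float* data, std::size_t n) { return g_sum(data, n); }
```

Calling `Sum` twice on the same array runs `SumResolve` only once; every later call is a single indirect jump, which is why the overhead is close to zero as long as each call does a reasonable amount of work.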
I only code for x86 with the vectorclass library, so I've never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or would you often find optimization opportunities if you targeted a particular architecture?
You can go quite far with such libraries if you only perform data-parallel numerics on the CPU. However, if you work on complex algorithms or exotic data structures, there's almost always more to gain from avoiding them and writing specialized code for each platform of interest.