I've long been surprised there isn't more multiversion stuff built right into every language compiler; I would have thought Intel would be very motivated to get more binaries lighting up the features they add to their expensive top line CPUs.
But yeah no, on the whole cost of the checks and duplicated binary size aren't seen as worth it, so instead it's piecemeal implementations mostly in numeric packages like eigen and lapack.
> so instead it's piecemeal implementations mostly in numeric packages like eigen and lapack.
Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.
Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.