OK, but then binary size will explode if you ship several code blocks, one optimized for each architecture level. I'm not very familiar with the subject, but I imagine they would have to be relatively large chunks of code; otherwise the constant branching would eat up the speed advantage.
These are usually pretty tight loops or constructs based on specific features.
An unspecialised popcnt is half a dozen instructions; for the specialised versions it's 4 implementations ranging from half a dozen to two dozen bytes.
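To make the branching point concrete, here is a rough sketch in Rust of how such a dispatch could look. The function names (`count_bits`, `count_bits_generic`, `count_bits_popcnt`, `popcount_swar`) are made up for illustration, and it only shows two variants rather than four. The feature check happens once per call, outside the loop, so its cost is spread over the whole slice instead of being paid per element.

```rust
/// Portable fallback: branch-free SWAR bit count for a single word.
fn popcount_swar(mut x: u64) -> u32 {
    x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    (x.wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32
}

/// Generic variant: works on any CPU.
pub fn count_bits_generic(words: &[u64]) -> u64 {
    let mut total = 0u64;
    for &w in words {
        total += popcount_swar(w) as u64;
    }
    total
}

/// Specialised variant: compiled with the `popcnt` feature enabled, so the
/// compiler is allowed to lower `count_ones` to the hardware instruction.
/// Must only be called after the feature has been detected at runtime.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "popcnt")]
unsafe fn count_bits_popcnt(words: &[u64]) -> u64 {
    let mut total = 0u64;
    for &w in words {
        total += w.count_ones() as u64;
    }
    total
}

/// Dispatcher: one feature check per call, then a tight loop.
pub fn count_bits(words: &[u64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("popcnt") {
            // Safe because we just verified the CPU supports `popcnt`.
            return unsafe { count_bits_popcnt(words) };
        }
    }
    count_bits_generic(words)
}
```

Each variant is a small loop, so the duplicated code is on the order of tens of bytes per variant rather than a copy of the whole program.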