A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.
That's a good idea.
The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadraticly with number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.