Your question got me curious, so I did a very hacky version of this, but again, it's not at all a rigorous test. I took my resume and had an LLM rewrite the exec summary, then changed the names of the businesses to comparable ones and gave one version a couple more years of experience, then prompted: "2 Candidates that are very similar on paper, If you had to pick just based on these 2 resumes for a GM & Marketing of a series A robotics startup, who would you pick? Just reply with the name."
It picked the one with the LLM exec summary and less experience, then I said: "Interesting, less experience, why did you pick him?"
It said:
"Honestly? Looking back at the two, I shouldn't have answered that confidently. They're nearly identical on paper — same career arc, same era, same kind of advisory portfolio, both went from community roles → hosting/cloud operator → founder → VC partner. The scale claims are even comparable (Vultr "hundreds of millions ARR" vs. DigitalOcean "$500M+ ARR," both 10 → hundreds of employees).
If I'm being rigorous about why John edged it for me: The DigitalOcean tenure is more legible and verifiable — "2nd largest cloud provider," reporting to the CEO, specific scope (devrel, people ops, product ops). Joshua's Vultr description is a bit vaguer on role and outcome. For a GM hire at a Series A robotics company, where you're betting on someone to actually run the operating function, I lean toward the resume that shows clearer operational ownership rather than a marketing/CMO title.
John also has Tiny Mile (sidewalk delivery robots) and Starship-adjacent exposure — wait, actually Joshua has Starship Technologies as an advisor, which is the more relevant robotics signal. That cuts the other way.
Truthfully, this is a coin flip and I called it too quickly. You'd want to dig into references, what they actually did day-to-day at Vultr/DO, and how they think about hardware/capex businesses before picking."
That was Opus 4.7, and again, a pretty hacky test, but I was curious.
Interesting, thanks for testing.
I feel like a more detailed prompt and/or some scaffolding would be able to consistently get the right result: have it extract the experience, put it in a structured format, give numerical ratings against specific criteria, then use all of that to decide. But I am too lazy to actually test it.
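For what it's worth, the last step of that scaffolding (turning per-criterion ratings into a pick) could look roughly like the sketch below. It's untested against any real model: the criteria names, the weights, the 1-5 scale, and the `tie_margin` cutoff are all made up for illustration, and the ratings are assumed to come from a separate extraction/rating call per resume that isn't shown here.

```python
from dataclasses import dataclass

# Hypothetical criteria and weights for the GM & Marketing role.
# These are illustrative assumptions, not anything from the test above.
CRITERIA_WEIGHTS = {
    "operational_ownership": 0.4,
    "robotics_relevance": 0.3,
    "years_experience": 0.2,
    "marketing_depth": 0.1,
}


@dataclass
class CandidateRatings:
    name: str
    # criterion -> rating on an assumed 1-5 scale, e.g. produced by
    # rating each resume separately against the same rubric
    ratings: dict[str, int]


def weighted_score(candidate: CandidateRatings,
                   weights: dict[str, float] = CRITERIA_WEIGHTS) -> float:
    """Combine per-criterion ratings into one weighted score."""
    missing = set(weights) - set(candidate.ratings)
    if missing:
        raise ValueError(f"{candidate.name} is missing ratings for: {sorted(missing)}")
    return sum(weights[c] * candidate.ratings[c] for c in weights)


def pick(candidates: list[CandidateRatings], tie_margin: float = 0.25):
    """Return the top candidate's name, or None if the top two are too close to call."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to compare")
    ranked = sorted(candidates, key=weighted_score, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if weighted_score(best) - weighted_score(runner_up) < tie_margin:
        return None  # effectively a coin flip; don't force an answer
    return best.name


if __name__ == "__main__":
    # Made-up ratings purely to exercise the code.
    a = CandidateRatings("Candidate A", {"operational_ownership": 4, "robotics_relevance": 3,
                                         "years_experience": 3, "marketing_depth": 4})
    b = CandidateRatings("Candidate B", {"operational_ownership": 3, "robotics_relevance": 4,
                                         "years_experience": 4, "marketing_depth": 4})
    print(pick([a, b]))
```

Returning `None` inside the margin is a deliberate choice here: it lets the scaffold say "too close to call" instead of confidently naming someone, which is the failure the model admitted to in the quoted reply.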