Anecdata, sample size of one:
When I was looking for my next role after being laid off, I didn't get much of a response with my human, hand-made resume, despite my experience.
Just for kicks, I asked ChatGPT to "analyze my resume and give it a score for what percentile it was in", then I asked it to revise the resume to score as high as possible.
I still tweaked and fact-checked it, but after I started sending that version out, I got a much higher hit rate than before.
But who knows; maybe the market changed, it was a better time of year, etc.
I still had to pass interviews and prove my worth. But it probably helped me get my foot in the door.
Intuitively this feels obvious. Content generated by the model is shaped by its training, so when the model reads that content back, it resonates with the same training and gets a positive evaluation as a result.
Human when preparing a CV: "Make my CV more professional"
LLM many days later presenting a report to HR: "This CV is really professional"
There's probably more to it than that of course.
But it justifies my personal policy of using a different LLM family for code review tasks than for code generation tasks. To avoid the "marking your own homework" problem.
Without our consent, a third party is being introduced between people. The models become the arbiters of who does and does not get a job. It feels problematic.
I suspect the entire industry uses "auto-raters", where an agent instance is used to score the agent's output. The idea is similar in intent to using adversarial networks to train image generation, minus the human labelers. Raising the auto-rater's scores then becomes the metric teams optimize, and it is no wonder the end result is an agent that scores its own generated content the highest.
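Concretely, the loop I imagine looks something like this (a minimal sketch; llm_complete() and the model name are made-up placeholders, not any real API):

```python
# Minimal sketch of an auto-rater loop. llm_complete() and "agent-model"
# are hypothetical stand-ins for a real completion API and model.

def llm_complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up to a real completion API")

def generate_candidates(task: str, n: int = 4) -> list[str]:
    return [llm_complete("agent-model", f"Solve this task:\n{task}") for _ in range(n)]

def auto_rate(task: str, output: str) -> float:
    # The rater is an instance of the same model family as the generator:
    # exactly the setup that invites self-preference bias.
    verdict = llm_complete(
        "agent-model",
        f"Task: {task}\nOutput: {output}\n"
        "Score this output from 0 to 10. Reply with a number only.",
    )
    return float(verdict.strip())

def best_output(task: str) -> str:
    # Optimizing against the auto-rater selects for whatever the model
    # already considers "good", including its own stylistic tics.
    return max(generate_candidates(task), key=lambda c: auto_rate(task, c))
```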
Timely topic for me. My CV had grown to 7 pages, and I kept reading everywhere that it should be no more than 2, so I asked Gemini to rewrite it. Took a lot of time, because Gemini loves to exaggerate everything, but I'm quite happy with the result.
The first couple of recruiters I sent it to preferred my old 7 page CV. I guess they're not using enough AI yet.
I think resumes will eventually (or have already) become obsolete in tech. The SNR is so low, they offer very thin filtering value.
Even the small parts of a resume that are "hard signal", like GPA, certifications, and prior roles, don't translate into performance in the initial screening interview.
This is why I think the industry sorely needs examination consortia.
Rather than trying to guess capability from the name of the university a candidate went to, leading tech companies would create standardized tests in various fields, and your test scores would form your "resume", so that developers could focus on improving their scores rather than wasting time on resume/application/repetitive-screening toil.
This may lead to some interesting gamesmanship. For instance, if I am applying to a company, and I know they use a certain applicant tracking system, and I know that ATS uses a certain model provider for its filter, I should then use that model to write the version of my resume I send to the company.
The uncomfortable part is that this is probably rational behavior for both sides.
Employers use models to filter resumes, candidates optimize resumes for those models, and suddenly the resume is no longer written for a human at all.
That's what people on both sides have been doing for at least a couple of years already.
Recruiters scan resumes for the best match with LLMs; candidates use the same LLMs (there are only about three of them) to tweak their resumes for a better match. I don't know what research you need to see why that makes sense.
When classifying resumes, it is better to use the LLM as a feature extractor: think of 10-20 features you base your decision on, and have the LLM extract them. The LLM only needs to do the lower-level task of question answering. Then you fit a classical ML model (xgboost, for example) on the extracted features, based on the company's triage data points. This way you don't rely on the biases in the model; you decide what criteria to use and how to judge cases, without retraining the LLM. The feature extractor is generic, and the actual triage model is a toy you can retrain in seconds on new data points. It is also much more explainable: you can see how each feature influences decisions.
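A rough sketch of that pipeline, with llm_complete() as a made-up stand-in for whatever completion API you use, and an illustrative feature list:

```python
import json
from xgboost import XGBClassifier

def llm_complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up to a real completion API")

# Narrow, answerable questions; this list is illustrative.
FEATURES = [
    "years_of_relevant_experience",  # numeric
    "has_required_degree",           # 0 or 1
    "has_led_a_team",                # 0 or 1
    "num_matching_keywords",         # numeric
]

def extract_features(resume_text: str) -> list[float]:
    # The LLM only does low-level question answering, returning JSON.
    prompt = (
        f"For this resume, answer each of {FEATURES} with a number. "
        f"Reply as a JSON object keyed by those names.\n\nResume:\n{resume_text}"
    )
    answers = json.loads(llm_complete("extractor-model", prompt))
    return [float(answers[f]) for f in FEATURES]

def train_triage(resumes: list[str], labels: list[int]) -> XGBClassifier:
    # Fit the cheap, explainable triage model on the company's own
    # hire/no-hire labels, not on the LLM's opinion of quality.
    X = [extract_features(r) for r in resumes]
    model = XGBClassifier(n_estimators=100, max_depth=3)
    model.fit(X, labels)
    return model  # retrains in seconds; feature importances are inspectable
```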
Further, LLMs consistently think LLM-written content is "good".
Ask an LLM to write a design doc for you, wait until you get one that's very bad, send it to other LLMs for feedback, and they will typically have good things to say.
Compare that to a very well-written document you have: they will typically have a lot more bad things to say, even if the premise is solid.
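A crude harness for checking this would be something like the following (llm_complete() is a placeholder, not a real API):

```python
# Ask several models to critique a known-bad LLM-generated doc and a
# known-good human-written one, then compare how many problems each lists.

def llm_complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up to a real completion API")

CRITIQUE_PROMPT = (
    "Review this design document. List its problems, one per line. "
    "If there are none, reply NONE.\n\n{doc}"
)

def count_criticisms(model: str, doc: str) -> int:
    reply = llm_complete(model, CRITIQUE_PROMPT.format(doc=doc)).strip()
    return 0 if reply == "NONE" else len(reply.splitlines())

def compare(models: list[str], llm_doc: str, human_doc: str) -> dict:
    # The claim predicts count_criticisms on the LLM doc comes out lower
    # than on the human doc, even when the LLM doc is objectively worse.
    return {
        m: {"llm_doc": count_criticisms(m, llm_doc),
            "human_doc": count_criticisms(m, human_doc)}
        for m in models
    }
```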
Someone should study this.
LLMs clearly have a lot of value. But IMO this is very interesting and points to a weakness whose full ramifications are not yet clear.
I suspect LLMs also have a major bias toward code they themselves write.
Take something universally considered well written, like Redis, and feed it to an LLM for feedback. It'll probably find plenty to pick apart (and a lot of it may be flat-out wrong).
Feed the same LLM some clearly garbage LLM-generated repository. Does it respond the way it does with design docs? Does it treat natural language differently from code, so that it's only susceptible to its own style in prose and not in logical code? Or does it have the same problem?
Has anyone done this?
I suspect this is more a function of the corporate sanitization of language within the models. When I have passed my resume through the models for refinement, they often sanitize some of the more easygoing or simpler wording, expand the vocabulary, make it denser, and use more corpo-speak in the bullets and formatting.
Each model likely has its own biases about what constitutes correct corporate speak, and it chooses the resumes that best fit those. Ultimately, I suspect it's the model saying "this grammar, syntax structure, and formatting is most aligned with correct corporate language, so flag it as high quality".
Seems kinda obvious, given that most large recruiting firms/HR departments use algorithms to analyze resumes, and AI-written versions likely do a better job of hitting the keywords and structure those algorithms/LLMs pick up on...
You'll find the same is true if you have two different LLMs independently come up with a plan for an implementation, then ask each of them to say which of the two designs/plans is the best. They're much more likely to favor the plan generated by the same model than the one from the other model. I'm sure, internally, this somehow makes sense, but it's worth thinking about if you're doing the whole "ask N models to vote on/rate N plans to find the best" charade.
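If you want to test this yourself, a rough harness looks something like this (llm_complete() and the model names are hypothetical placeholders):

```python
from itertools import permutations

def llm_complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up to real APIs")

def preference_test(task: str, model_a: str, model_b: str) -> dict:
    plans = {m: llm_complete(m, f"Write an implementation plan for: {task}")
             for m in (model_a, model_b)}
    results = {}
    for judge in (model_a, model_b):
        votes = {model_a: 0, model_b: 0}
        # Present the plans in both orders to control for position bias,
        # and never reveal which model wrote which.
        for first, second in permutations((model_a, model_b)):
            answer = llm_complete(
                judge,
                "Which plan is better? Reply with 1 or 2 only.\n\n"
                f"Plan 1:\n{plans[first]}\n\nPlan 2:\n{plans[second]}",
            )
            votes[first if answer.strip() == "1" else second] += 1
        results[judge] = votes
    # Self-preference shows up as each judge's votes skewing toward its own plan.
    return results
```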
Well yeah, LLMs generate resumes (and other text) that they judge as superior to alternative plausible texts. Why would that judgement change just because a different instance hasn't seen it before? To anthropomorphize it, it's like having a hiring manager write a resume, get amnesia, and then have to judge it among other resumes.
Does anyone know of any HR departments actually using LLMs for scoring, selection, extraction, classification or any real use cases? I'm curious to hear about it and how they are using it.
I just assumed they do, and got Copilot to rewrite my profile on the internal HR system. I also got a job spec benchmarked higher by having Copilot write it with that exact aim given in the prompt.
At this point, all of this is becoming almost comedic.
> As artificial intelligence (AI) tools become widely adopted, large language models (LLMs) are increasingly involved ... [in] ... decision-making processes
That's the problem right there.
Disclaimer: not a lawyer, but studying towards the CIPP/E.
You'd make no friends doing it, but as I understand it, for those who have GDPR as a statutory right, under "[Article 22 - Automated individual decision-making, including profiling][0]" you can request to know whether your CV was screened by AI and what (and this is key) "meaningful human interaction" led to that decision. Technically this falls under a data subject access request, so a response is mandatory (though who is really going to enforce that; the ICO / <insert your data protection agency here> probably isn't). Companies can't just smash a button and claim meaningful interaction; it has to be, well, meaningful, and smashing a "nope" button obviously isn't.
If it turns out that it was only AI that screened it you can request a human review. Do not hold your breath.
Again, you'd make no friends doing it, but sooner or later a test case will emerge to generate some case law around "AI said no", because employment, or the lack of it because an AI said no, has a significant impact on a human.
This means that LLM human resource departments will only hire LLMs. Which is kind of beautiful.
HR departments aren't using LLMs to select candidates for jobs are they?
Very interesting.
The only test that has worked 100% of the time for me is to read the candidate's code. Two hours is enough to precisely estimate a candidate's qualities as a software developer. I never understood why companies waste time with tests and quizzes: since it is so easy for me, it should be just as easy for other software developers. Of course, a candidate may be a jerk or unfit for other reasons, but ranking them on a software-developer hot-or-not scale is not very difficult.
Reading only the abstract: LLMs prefer output of their own generation over that of humans or even other models.
This is a very good reason to avoid using model-generated data to train future models. We'd be deepening this bias by continuing to do that, essentially forcing society to reshape its output using LLMs to increase engagement. This feels like a form of enshittification that doesn't just touch one product but all of society.
Will people snap over this?
My new CV contains 37 em-dashes.
"I'm not just good, I'm amazing"
This is extremely obvious to anyone who's read other papers. There are tons of papers showing LLMs prefer their own outputs. It's a big enough problem that, in papers, the LLM-as-judge has to be a different LLM from the one you are testing.
Repeat after me: it makes no sense to try to prompt a language-prediction engine to display good judgment.
As always, XKCD is prescient here: https://xkcd.com/2237/
I wonder if this extends to training models on new content as well. Are we creating a cyclical consumption-and-training loop in which models being trained are more likely to pick up on and reference content created by themselves or by other LLMs than content created by humans?
AIs interviewing AIs... lol
Another way to phrase this might be that LLMs make better resumes, no?
Easy, then: apply N times, each time with a resume generated by a different LLM.
No human is going to notice anyway. Or add an (N+1)th resume written by yourself, in which you describe your strategy, just in case.
Pretty straightforward, IMO. The model looks for particular qualities in a given resume, and strives to ensure the qualities it looks for are present in the resumes it creates. Humans do the exact same thing (unless forced by something like DEI, etc., to do otherwise), so I see nothing noteworthy here.
Even if we take this to be true, I'm not sure that it really matters?
It's comparing two resumes with the same information and picking one of the two. That's obviously a situation that would never occur in actual hiring. This doesn't demonstrate anything indicating that LLMs would incorrectly prefer LLM-written resumes in the real world.
It'd be interesting to do the same thing but with two resumes that are almost identical: one slightly better (an extra year of experience, or a specific mention of a skill relevant to the role), and the slightly worse one written by an LLM. If the reviewing LLM picks the worse one in that case, you've potentially established a bias that would matter. As it stands, this experiment just seems contrived and pointless.
I'll copy what I wrote on LinkedIn (note: I read roughly 25 pages, which is half the paper, and read it quickly)[0]:
"If I read the paper correctly, they don’t actually show that LLMs prefer resumes they generate.
Their actual method seems to be taking a human-written resume, deleting the executive summary, having an LLM rewrite the executive summary based on the rest of the resume, and then having another LLM rate the executive summary without the rest of the resume.
That’s likely to massively overstate any real impact, if you can even rely on it capturing a real effect.
I really wonder if I read that correctly, because I can’t come up with a justification for that study design."
[0] I couldn't help but mildly copy-edit before pasting here.
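In code form, the design I think I read would look roughly like this (every name here is hypothetical, and again, I may well have misread the paper):

```python
# Sketch of the study design as paraphrased above; all functions are
# placeholders, and this may not match what the paper actually did.

def llm_complete(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up to a real completion API")

def delete_executive_summary(resume: str) -> str:
    raise NotImplementedError("strip the summary section from the resume")

def rate_summary(human_resume: str) -> float:
    body = delete_executive_summary(human_resume)
    # One LLM rewrites the summary from the rest of the resume...
    summary = llm_complete("model-a", f"Write an executive summary for:\n{body}")
    # ...and another LLM rates that summary *without* the rest of the
    # resume, which is the step that seems hard to justify.
    score = llm_complete(
        "model-b",
        f"Rate this executive summary from 0 to 10, number only:\n{summary}",
    )
    return float(score.strip())
```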