Yuck, this is going to really harm scientific research.
There is already a problem with papers falsifying data/samples/etc, LLMs being able to put out plausible papers is just going to make it worse.
On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
The ironic part about these hallucinations is that a research paper includes a literature review because the goal of the research is to be in dialogue with prior work, to show a gap in the existing literature, and to further the knowledge that this prior work has built.
By using an LLM to fabricate citations, authors are moving away from this noble pursuit of knowledge built on the "shoulders of giants" and showing that, behind the curtain, output volume is what really matters in modern US research communities.
NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying; see the full article from Fortune for a statement from them: https://archive.ph/yizHN
> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”
Wow! They're literally submitting references to papers by Firstname Lastname, John Doe and Jane Smith and nobody is noticing or punishing them.
I was getting completely AI-generated reviews for a WACV publication back in 2024. The area chairs are so overworked that authors don't have much recourse, which sucks but is also really hard to handle unless more volunteers step up to the plate to help organize the conference.
(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)
As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary box, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths box, another complete and distinct review in the weaknesses box, and a fourth complete review in the closing thoughts box. Each of these four reviews was slightly different, and they contradicted each other.
The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."
They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.
Getting your first NeurIPS paper published as a PhD student is extremely lucrative.
Most big tech PhD intern job postings have NeurIPS/ICML/ICLR/etc. first author paper as a de facto requirement to be considered. It's like getting your SAG card.
If you get one of these internships, it effectively doubles or triples your salary that year right away. You will make more in that summer than your PhD stipend. Plus you can now apply in future summers and the jobs will be easier to get. And it sets your career on a good path.
A conservative estimate of the discounted cash value of a student's first NeurIPS paper would certainly be five figures. It's potentially much higher depending on how you think about it, considering potential path dependent impacts on future career opportunities.
We should not be surprised to see cheating. Nonetheless, it's really bad for science that these attempts get through. I also expect some people did make legitimate mistakes letting AI touch their .bib.
There are a lot of good arguments in this thread about incentives: extremely convincing ones about why current incentives lead to exactly this behaviour, and also about why creating better incentives is a very hard problem.
If we grant that good carrots are hard to grow, what's the argument against leaning into the stick? Change university policies and processes so that getting caught fabricating data or submitting a paper with LLM hallucinations is a career ending event. Tip the expected value of unethical behaviours in favour of avoiding them. Maybe we can't change the odds of getting caught but we certainly can change the impact.
This would not be easy, but maybe it's more tractable than changing positive incentives.
The innumeracy is load-bearing for the entire media ecosystem. If readers could do basic proportional reasoning, half of health journalism and most tech panic coverage would collapse overnight.
GPTZero of course knows this. "100 hallucinations across 53 papers at prestigious conference" hits different than "0.07% of citations had issues, compared to unknown baseline, in papers whose actual findings remain valid."
Could you run a similar analysis for pre-2020 papers? It'd be interesting to know how prevalent making up sources was before LLMs.
The reviewing process for top AI conferences has been broken as hell for several years now, due to too many submissions and too few reviewers (to the point that Master's students are reviewing the papers). It was only a matter of time before these conferences filled up with AI-written papers.
I wrote before about my embarrassing time with ChatGPT during a period (https://news.ycombinator.com/item?id=44767601). I decided to go back through those old 4o chats with 5.2 Pro extended thinking, and the reply was pretty funny because it first slightly ridiculed me, heh. But what it showed was: basically I would say "what 5 research papers from any area of science speak to these ideas," and it would find 1 and invent 4 if it didn't know 4 others, and not tell me. Then I'd keep working with it and it would invent what it thought might be in those papers along the way, making up new papers to cite in its own work to make that work look valid, lol. Anyway, I'm a moron, sure, and no real harm came of it for me; I'm just still slightly shook that I let it happen.
Getting papers published is now more about embellishing your CV versus a sincere desire to present new research. I see this everywhere at every level. Getting a paper published anywhere is a checkbox in completing your resume. As an industry we need to stop taking this into consideration when reviewing candidates or deciding pay. In some sense it has become an anti-signal.
With regard to confabulating (hallucinating) sources, or anything else, it is worth noting this is a first class training requirement imposed on models. Not models simply picking up the habit from humans.
When training a student, normally we expect a lack of knowledge early, and reward self-awareness, self-evaluation and self-disclosure of that.
But from the very first epoch of a model training run, when the model has all the ignorance of a dropped plate of spaghetti, we optimize the network to respond to information as anything from a typical human to an expert would, without any base of understanding.
So the training practice for models is inherently extreme enforced “fake it until you make it”, to a degree far beyond any human context or culture.
(Regardless, humans need to verify, not to mention read, the sources they cite. But it will be nice when models can be trusted to accurately assess what they know and don't know too.)
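To make the point concrete, here is a toy sketch (my own illustration, not anyone's actual training pipeline) of why the standard next-token objective never rewards "I don't know": cross-entropy is minimized only by piling probability onto the reference token, and there is no abstain option for the gradient to favor.

    import torch
    import torch.nn.functional as F

    vocab_size = 8
    logits = torch.zeros(vocab_size, requires_grad=True)  # an "ignorant" model: uniform over the vocab
    target = torch.tensor(3)  # the expert-written ground-truth token

    # Standard next-token loss: the only way to reduce it is to assert the target confidently.
    loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
    loss.backward()

    # The update raises the target logit and lowers every other one; there is no
    # token or loss term for "unsure", so confident assertion is the only behavior
    # the objective can produce.
    print(logits.grad)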
I don't understand: why aren't there automated tools to verify that citations exist? Citations follow a structured style (APA, MLA, Chicago), and paper metadata is available via e.g. a web search, even if the paper contents are not.
I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers
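For what it's worth, the basic version of such a tool is very little code. Below is a rough sketch (my own, assuming Crossref's public metadata API and an arbitrary 0.85 title-similarity cutoff) that just asks whether anything resembling a cited title exists at all; a miss is a flag for a human to check, not proof of fabrication.

    import requests
    from difflib import SequenceMatcher

    def citation_seems_real(cited_title: str, min_similarity: float = 0.85) -> bool:
        # Ask Crossref for the closest bibliographic matches to the cited title.
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": cited_title, "rows": 5},
            timeout=10,
        )
        resp.raise_for_status()
        for item in resp.json()["message"]["items"]:
            for found_title in item.get("title", []):
                similarity = SequenceMatcher(
                    None, cited_title.lower(), found_title.lower()
                ).ratio()
                if similarity >= min_similarity:
                    return True  # a close-enough real title exists
        return False  # nothing similar found -- flag for a human to verify

    print(citation_seems_real("Attention Is All You Need"))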
Why focus on hallucinations/LLMs and not on the authors? There are rules for submitting papers.
If I drop a loaded gun and it fires, killing someone, we don't go after the gun's manufacturer in most cases.
AI might just extinguish the entire paradigm of publish or perish. The sheer volume of papers makes it nearly impossible to properly decide which papers have merit, which are non-replicable and suspect, and which are just a desperate rush to publish. The entire practice needs to end.
Which is worse:
a) p-hacking and suppressing null results
b) hallucinations
c) falsifying data
Would be cool to see an analysis of this
This is awful but hardly surprising. Someone mentioned reproducible code with the papers - but there is a high likelihood of the code being partially or fully AI generated as well. I.e. AI generated hypothesis -> AI produces code to implement and execute the hypothesis -> AI generates paper based on the hypothesis and the code.
Also: there were 15,000 submissions rejected at NeurIPS; it would be very interesting to see what % of those were partially or fully AI generated/hallucinated. Are the ratios comparable?
Didn't know the L in Samuel L Jackson was for LeCun.
It would be ironic if the very detection of hallucinations contained hallucinations of its own.
This is mostly an ad for their product. But I bet you can get pretty good results with a Claude Code agent using a couple simple skills.
Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.
And this is the tip of the iceberg, because these are the easy to check/validate things.
I'm sure plenty of more nuanced facts are also entirely without basis.
As long as these sorts of papers serve more important purposes for the careers of the authors than anything related to science or discovery of knowledge, then of course this happens and continues.
The best possible outcome is that these two purposes are disconflated, with follow-on consequences for the conferences and journals.
Implicitly this makes sense but the amount cited in this article is still hard for me to grasp. Wow.
This is going to be a huge problem for conferences. While journals have a longer time to get things right, as a conference reviewer (for IEEE conferences) I was often asked to review 20+ papers in a short time to determine who gets a full paper, who gets to present just a poster, etc. There was normally a second round, but often these would just look at submissions near the cutoff margin in the rankings. Obvious slop can be quickly rejected, but it will be easier to sneak things in.
You'll find out that top CS conferences were never that scientific if you actually go to the papers' GitHub repos and run their code.
We have the h-index and such; can we have something similar that goes down when you pull stunts like these? Preferably link it to people's ORCID iDs.
It is very concerning that these hallucinations passed through peer review. It's not like peer review is a fool-proof method or anything, but the fact that reviewers did not check the references and notice the clearly bogus ones is alarming, and could be a sign that the article authors weren't the only ones using LLMs in the process...
We've been talking for years about a "crisis of reproducibility" and the incentive to crank out high volumes of low-quality research. We now have a tool that brings the cost of producing plausible-looking research down to zero. So of course we're going to see that tool abused on a galactic scale.
But here's the thing: let's say you're an university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. It happened a bunch of times.
In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".
This suggests that nobody was screening these papers in the first place—so is it actually significant that people are using LLMs in a setting without meaningful oversight?
These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).
Clearly there is some demand for those papers, and research, to exist. Good opportunity to fill the gaps.
I am wondering if we are going to reach hallucination collapse sooner than we reach AGI.
How you know whether it's really real is whether they clearly state the FPR and compare against a pre-LLM baseline.
But I saw it in Apple News, so MISSION ACCOMPLISHED!
The downstream effects of this are extremely concerning. We have already seen the damage caused by human-written research that was later retracted, like the "research" on vaccines causing autism.
As we get more and more papers citing information that was hallucinated in the first place, we have a major reliability problem. What is worse, people who did not use AI at all will be caught in the crossfire, since they will end up referencing information that was incorrect to begin with.
There needs to be a serious amount of education done on what these tools can and cannot do and importantly where they fail. Too many people see these tools as magic since that is what the big companies are pushing them as.
Beyond that, we need actual repercussions for publishing work created by an LLM without validating it (or just ban it outright, though I guess that ship has sailed), or it will just keep happening. We can't just ignore it and hope it won't be a problem.
And yes, humans can make mistakes too. The difference is accountability, and the ability to actually be unsure about something, which makes you question yourself and go validate.
A lot of research in AI/ML seems to me to be "fake it and never make it". Literally it's all about optics, posturing, connections, publicity. Lots of bullshit and little substance. This was true before AI slop, too. But the fact that AI slop can make it past review really showcases how much a paper's acceptance hinges on things other than the substance and results of the paper.
I even know PIs who got fame and funding based on some research direction that supposedly is going to be revolutionary. Except all they had were preliminary results that from one angle, if you squint, you can envision some good result. But then the result never comes. That's why I say, "fake it, and never make it".
Given that many of these detections are being made from references, I don't understand why we're not using automatic citation checkers.
Just ask authors to submit their bib file so we don't need to do OCR on the PDF. Flag the unknown citations and ask reviewers to verify their existence. Then contact the authors and ban them if they can't produce the cited work.
This is low hanging fruit here!
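To illustrate how low the fruit hangs, here is a sketch of the flagging step, assuming authors do submit a .bib file. The regex field extraction and the use of the doi.org resolver are my simplifications; a real checker would parse BibTeX properly and also cross-check titles, authors, and venues.

    import re
    import requests

    DOI_FIELD = re.compile(r'doi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)

    def flag_unresolvable_dois(bib_path: str) -> list[str]:
        # Pull every doi = {...} field out of the submitted .bib file.
        with open(bib_path, encoding="utf-8") as f:
            dois = DOI_FIELD.findall(f.read())
        flagged = []
        for doi in dois:
            # doi.org redirects registered DOIs and returns 404 for unknown ones.
            resp = requests.head(f"https://doi.org/{doi.strip()}",
                                 allow_redirects=False, timeout=10)
            if resp.status_code == 404:
                flagged.append(doi)  # reviewers should ask the authors about this one
        return flagged

    # Example (hypothetical filename): print(flag_unresolvable_dois("submission.bib"))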
Detecting slop where the authors do vet their citations is much harder. The big problem with all the review rules is that they have no teeth. If it were up to me, we'd review in the open, or at least like ICLR. Publish the list of known bad actors and let us look at the network. The current system is too protective of egregious violations like plagiarism. Authors can get detected at one conference, pull the paper, and submit to another, rolling the dice. We can't allow that to happen, and we should discourage people from associating with these con artists.
AI is certainly a problem in the world of scientific review, but it's far from the only one, and I'm not even convinced it's the biggest. The biggest is just that reviewers are lazy and/or not qualified to review the works they're assigned. It takes at least an hour to properly review a paper in your niche, much more when it's outside it. We're overworked as it is, with 5+ papers to review, not to mention all the time we have to spend reworking our own papers that were rejected by the slot machine. We could do much better if we dropped this notion of conference/journal prestige and focused on the quality of the works and reviews.
Addressing those issues also addresses the AI issues because, frankly, *it doesn't matter if the whole work was done by AI, what matters is if the work is real.*
Is there a comparison to rate of reference errors in other forums?
Jamie, bring up their nationalities.
What's wild is so many of these are from prestigious universities. MIT, Princeton, Oxford and Cambridge are all on there. It must be a terrible time to be an academic who's getting outcompeted by this slop because somebody from an institution with a better name submitted it.
The problem isn’t scale.
The problem is consequences (lack of).
Doing this should get you barred from research. It won’t.
All papers proven to have used an LLM for anything beyond writing improvement should be automatically retracted.
What if they would only accept handwritten papers? Basically the current system is beyond repair, so may as well go back to receiving 20 decent papers instead of 20k hallucinated ones.
They will turn it into a party drug.
This is not the AI future we dreamed of, or feared.
No surprises. Machine learning has, at least since 2012, been the go-to field for scammers and grifters. Machine learning, and technology in general, is basically a few real ideas, a small number of honest hard workers, and then millions of fad chasers and scammers.
I spot-checked one of the flagged papers (from Google, co-authored by a colleague of mine).
The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. And this citation was mentioned in the background section of the paper, and not fundamental to the validity of the paper. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).
I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.