We already learned how to defeat this from SEO spammers and citation farmers: by building networks that cross-reference and corroborate one another’s fake stories.
We’re already at a point where much of the academic research you find in online databases can’t be trusted without vetting by trustworthy real-world institutions and experts in the relevant fields. How is an LLM supposed to do this kind of vetting without the help of human curators?
If all the LLM training teams have to stop indiscriminate crawling and fall back on human curation and data labeling, then the poisoners will have won.