If you are putting something out for free for anyone to see and link and copy, why is LLM training on it a problem? How’s that different from someone archiving it in their RSS reader or it being archived by any number of archive sites?
If you don’t want to give it away openly, publish it as a book or an essay in a paid publication.
It's important to consider others' perspectives, even if they're inaccurate. When I suggested "why not write a blog?" to a relative who is into niche bug photography and collecting, they told me they didn't want their writing, and especially their photos, to be used for training. Honestly, they have valid points and an accurate framing of what will happen: it will likely get ingested eventually. I think they overestimate the importance of their work a tad, but they still seemed to have a pretty accurate gauge of the likely outcomes. Let me flip the question: why shouldn't they be able to choose "not for training use" even if they put it up publicly?
This is not an answer to your question, but one issue is that if you write about some niche sort of thing (as you do, on a self-hosted blog) that no one else is really writing about, the LLM will take it as the sole source on the topic and serve up its take almost word for word.
That's clearly plagiarism, but it's also interesting to me because there's really no way for the user querying their favorite AI chatbot to know whether the answer is actually true.
I can see a few ways this could be abused.
The problem is that LLM “summaries” do not cite sources. They also don’t distinguish between summarizing and quoting directly; that “summary” is often text lifted verbatim from something someone wrote. LLMs don’t cite in either case. It’s a clear case of plagiarism, but tech companies are being allowed to get away with it.
Publishing in a paid publication is not a solution because tech companies are scraping those too. It’s absolutely criminal. As an individual, I would be in clear violation of the law if I took text someone else wrote (even if that text was in the public domain) and presented it as my own without attribution.
From an academic perspective, LLM summaries also undermine the purpose of having clear and direct attribution for ideas. Citing sources not only makes clear who said what; it also allows the reader to know who is responsible for faulty knowledge. I’ve already seen this in my line of work, where LLMs have significantly boosted incorrect data. The average reader doesn’t know this data is incorrect and in fact can’t verify any of the data because there is no attribution. This could have serious consequences in areas like medicine.