anonymous908213 • yesterday at 11:39 PM • 2 replies

My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.

[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.

Replies

zythyx • today at 12:08 AM

Also, I imagine that most of those 10k downloads are from AI trainers who are just speedrunning through Kaggle to grab absolutely anything to train their AI on. There are definitely other, more 'known' ways to obtain these books than finding them as random text files in an AI dataset.

selridge • today at 12:24 AM

Why did you think that?
