Hacker News

nickpsecurity, last Saturday at 4:02 AM

More than anything, they need to match and then exceed Singapore's text and data mining exception for copyrighted works. I'll be happy to tell them how, since I wrote several versions of it while trying to balance all sides.

The minimum, though, is that all copyrighted works the supplier has legal access to can be copied, transformed arbitrarily, and used for training. And they can share those works and their transformed versions with anyone else who already has legal access to that data. And no contract, including terms of use, can override that. And they can freely scrape it, though maybe with daily limits imposed to avoid destructive scraping.

That might be enough to collect, preprocess, and share datasets like The Pile, RefinedWeb, and uploaded content the host shares (e.g., The Stack, YouTube). We can do a lot with big models trained that way. We can also synthesize other data from them with less risk.