Hacker News

nickpsecurity, yesterday at 9:16 PM

Some projects refuse for copyright reasons. Back when GPT4 was new, I dug into pretraining reports for nearly all models.

Every one (IIRC) was breaking copyright by sharing third-party works in datasets without permission. Some were trained on patent filings, which makes patent infringement highly likely. Many were breaking EULAs (contract law) by scraping. Some outputs were verbatim reproductions of copyrighted works, too, which could get someone sued if they published them.

So, I warned people to stay away from AI until (a) training on copyrighted/patented works was legal in all those circumstances, (b) the outputs had no liability, and (c) users of a model could know this by looking at the pretraining data. There are no GPT3- or Claude-level models produced that way.

On a personal level, I follow Jesus Christ, who paid for my sins with His life. We're to be obedient to God's law. One part of that is to submit to authority (aka don't break man's law). I don't know that I can use AI outputs if the models were illegally trained, or if using them is like fencing stolen goods. That's another reason I want the pretraining to be legal, either by mandate or by using only permissible works.

Note: If your country is in the Berne Convention, it might apply to you, too.


Replies

hirako2000, yesterday at 10:15 PM

Not sure we need to invoke Jesus to agree with the liability concerns.

user34283, today at 8:44 AM

Complete non-issue in my experience.

Using it daily since GPT-4, I have not once encountered a scenario where the output was both complex enough and close enough to a verbatim copy to warrant such concerns.

Generally, it would seem statistically unlikely for a model to reconstruct a copyrighted work; rather, the output should be a probabilistic average. Snippets are typically too common and too short to be protected by copyright. Copyright challenges are likely to fail on the "substantial similarity" test.

I understand plaintiffs would need to show that code is virtually identical, not just similar, and that these parts represent a "substantial" portion of the original work's creative value.