> In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models.
Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?
Did you pirate this movie? No, I did not. It is fair use, because this movie is nothing more than a statistical correlation to my dopamine production.
> Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?
It makes some sense, yeah. There's also precedent in Google scanning massive numbers of books without reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. NVIDIA's argument here is that they're not reproducing the works and they're not providing the works to other people; they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.
I don't see how they get around "procuring them" from dubious third-party sources, but oh well. The only certain thing is that our current laws didn't cover this, and it's probably too late now.
It does make sense. It’s controversial. Your memory stores things in the same way, so what NVIDIA does here is no different: the AI doesn’t actually copy any of the books. Calling training illegal is similar to calling reading a book and remembering it illegal.
Our copyright laws are nowhere near detailed enough to address this, so there is indeed a logical and technical inconsistency here.
I can definitely see these laws evolving into something human-centric, where it’s permissible for a human to do something but not for an AI.
What is consistent is that obtaining the books was probably illegal. But if, say, NVIDIA bought one Kindle copy of each book from Amazon and scraped everything for training, that would fall into the grey zone.
It's not settled law as it pertains to LLMs, but, yes, creating a "statistical summary" of a book (consider, e.g., a concordance of Joyce's "Ulysses") is generally protected as fair use. However, illegally accessing pirated books to create that concordance is still illegal.
Copyright laws are so undefined and NVIDIA's lawyers so plentiful that the statement works in their favor. You're allowed to copy part of a work in many cases; the easiest example is that you can quote a line from a book in a review. The line is fuzzy.
Who cares? Only Disney had the money to fight them.
Everything else will be slurped up for and with AI and be reused.
When you're responsible for 4% of the global GDP, they let you do it.
Books are databases, chars their elements. We have copyright for databases in the EU :)
The chicken is trying to become the egg.
Cory Doctorow has a quite good explanation of what copyright laws cover and should (and should not) cover: https://www.theguardian.com/us-news/ng-interactive/2026/jan/...
It seems so: stealing copyrighted content is only illegal if you do it to read it or to allow others to read it. Stealing it to create slop is legal.
(The difference is that the first use allows ordinary people to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing.)
Yes, it's been discussed many times before. All the corporations training LLMs must have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ("Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ) concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits, even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.