mplanchard • today at 2:59 PM • 2 replies • view on HN

I don’t buy this argument. The tokens are useless without their context, which provides the probability distributions needed to make them useful. Sure, you MIGHT not be able to get the book word for word, but it’s impossible to make a useful model without the whole book and all of the artistry that went into it to guide the tokens in their expected output.

Fair use generally does not cover commercial use, which this clearly is, and is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”.

Replies

Vvector • today at 3:36 PM

"Commercial use" is only one part of the four-factor fair use test. For example, commercial parody is generally considered fair use. Look at Spaceballs, which directly transforms Star Wars.

This is all new territory. We don't have court-settled law yet.

samatman • today at 3:35 PM

It's more complicated than that. Quite a bit more.

Commercial use counts _against_ a fair use defense, but is not dispositive: it's not accurate at all to say it "generally does not cover" commercial use. This is the "purpose and character" test, one of four in contemporary (United States) fair use doctrine.

Purpose and character also includes the degree to which a use is _transformative_. It's clear that the degree to which a training run mulching texts "transforms" them is very high. This counts toward a fair use finding for purpose and character.

> is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”

The "amount and substantiality" test. Your case for "all of it" can't possibly be sustained: the models aren't big enough. It's amount _and_ substantiality: this has come up in the publication of concordances, where a relatively large amount of a copyrighted work appears, but it's chopped up and ordered in a way which is no longer substantially the same. Courts have ruled that this kind of text is fair use, pretty consistently. It's not an LLM, of course, but those have yet to be ruled on.

Also worth knowing that courts have never accepted reading or studying a work as incorporation, and are unlikely to change course on the question. It's taken for granted that anyone is allowed to read a copyrighted work in as much detail as they wish, in the course of producing another one. Model training isn't reading either, but the question is to what degree it resembles study. I'd say, more than not.

Specifically:

> it’s impossible to make a useful model without the whole book and all of the artistry that went into it

Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.

"Effect upon the work's value" is probably the most interesting one. For some things, extreme, for others, negligible. I suspect this is the one courts are going to spend the most time on as all of these questions are litigated.

Ultimately, model training is highly out-of-distribution for the common law questions involving fair use. It was not anticipated by statute, to put it mildly. The best solution to that kind of dilemma is more statute, and we'll probably see that, but I don't think you'll be happy with the result, given what I'm replying to. Just a guess on my part.
