
visarga, today at 12:46 AM

This sounds pretty damning. Why don't they implement an n-gram-based Bloom filter to ensure they don't replicate expression too close to the protected IP they trained on? Almost any random 10-word n-gram is unique on the internet.
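A minimal sketch of what that check could look like (the filter sizing, the hashing scheme, and the placeholder corpus are purely illustrative, not a description of any lab's actual pipeline):

    import hashlib
    from math import ceil, log

    class BloomFilter:
        """Minimal Bloom filter: k salted hashes over a fixed-size bit array."""

        def __init__(self, n_items: int, fp_rate: float = 1e-6):
            # Standard sizing formulas for a Bloom filter.
            self.m = ceil(-n_items * log(fp_rate) / (log(2) ** 2))  # number of bits
            self.k = max(1, round(self.m / n_items * log(2)))       # number of hashes
            self.bits = bytearray(ceil(self.m / 8))

        def _positions(self, item: str):
            for i in range(self.k):
                h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(h[:8], "little") % self.m

        def add(self, item: str) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: str) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def ngrams(text: str, n: int = 10):
        words = text.lower().split()
        return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

    # Offline: index every 10-word n-gram of the protected corpus.
    protected_corpus = ["... protected documents would go here ..."]
    index = BloomFilter(n_items=1_000_000)
    for doc in protected_corpus:
        for gram in ngrams(doc):
            index.add(gram)

    # At generation time: flag candidate output that reproduces any indexed n-gram.
    def overlaps_protected(candidate: str) -> bool:
        return any(gram in index for gram in ngrams(candidate))

A Bloom filter can give false positives but never false negatives, so the worst case is occasionally suppressing an output that was actually fine, never letting an indexed n-gram slip through.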

Alternatively, they could train on synthetic data such as summaries and QA pairs extracted from protected sources, so the model gets the ideas separated from their original expression. Since it never saw the originals, it can't regurgitate them.
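As a rough sketch of that extraction step, assuming a placeholder llm_generate function standing in for whatever model actually does the paraphrasing (the prompts and record format are my own invention):

    # Placeholder: stands in for whatever model or endpoint does the extraction.
    def llm_generate(prompt: str) -> str:
        raise NotImplementedError("plug in a real generation backend here")

    SUMMARY_PROMPT = (
        "Summarize the key facts and ideas in the passage below in your own words. "
        "Do not quote the passage.\n\nPassage:\n{passage}"
    )
    QA_PROMPT = (
        "Write three question/answer pairs covering the ideas in the passage below, "
        "phrased entirely in your own words.\n\nPassage:\n{passage}"
    )

    def make_synthetic_records(protected_passages):
        """Yield paraphrased training records derived from protected passages.
        Only these records, never the originals, go into the training set."""
        for passage in protected_passages:
            yield {"kind": "summary",
                   "text": llm_generate(SUMMARY_PROMPT.format(passage=passage))}
            yield {"kind": "qa",
                   "text": llm_generate(QA_PROMPT.format(passage=passage))}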


Replies

soulofmischief, today at 2:23 AM

The idea of applying clean-room design to model training is interesting: have a "dirty model" and a "clean model", where the dirty model touches restricted content and the clean model works only with the dirty model's output.

However, besides the way this sidesteps the fact that current copyright law violates the constitutional rights of US citizens, I imagine there is a very real risk of the clean model losing the fidelity of insight that the dirty model develops by having access to the base training data.
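A structural sketch of that split, with class names I've made up and the actual description/training logic left as stubs:

    from dataclasses import dataclass

    @dataclass
    class DirtyModel:
        """Trained with access to restricted content; never shipped."""

        def describe(self, doc: str) -> str:
            # Return a paraphrased description of the document's ideas.
            raise NotImplementedError

    @dataclass
    class CleanModel:
        """Never touches restricted content, only the dirty model's output."""

        def train(self, records: list[str]) -> None:
            raise NotImplementedError

    def clean_room_pipeline(restricted_corpus: list[str],
                            dirty: DirtyModel, clean: CleanModel) -> CleanModel:
        # The only thing crossing the boundary is the dirty model's paraphrased output.
        derived = [dirty.describe(doc) for doc in restricted_corpus]
        clean.train(derived)
        return clean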

empiko, today at 11:21 AM

IMO they just don't have any idea which data are actually copyrighted and are too lazy to invest in solving the problem.

stubish, today at 7:43 AM

Even if the output is blocked, if it can be demonstrated that the copyrighted material is still in the model, then you become liable for distribution and/or duplication without a license.

Training on synthetic data is interesting, but how do you generate the synthetic data? Is it turtles all the way down?

orbital-decay, today at 3:05 AM

That would reduce the training quality immensely. Besides, any generalist model really needs to remember facts and texts verbatim to stay useful, not just generalize. There's no easy way around that.

apical_dendrite, today at 2:29 AM

I'm assuming that the goal of the Bloom filter is to prevent the model from producing output that infringes copyright, rather than to hide that the text is in the training data.

In that case, the model would lose the ability to provide relatively brief quotes from copyrighted sources in its answers, which is a really helpful feature when doing research. A brief quote from a copyrighted text, particularly for a transformative purpose like commentary, is perfectly fine under copyright law.

isodev, today at 1:08 AM

But that would only hide the problem; it doesn't resolve the fact that models do, in fact, violate copyright.
