Hacker News

findjashua yesterday at 12:36 AM (5 replies)

providers' ToS explicitly state whether or not any data provided is used for training purposes. the usual pattern i've seen is that they retain the right to use data from free tiers, but almost never from paid tiers


Replies

torginus yesterday at 9:57 AM

I bet companies are circumventing this in a way that allows them to derive almost all the benefit from your data, yet makes it very hard to build a case against them.

For example, in RL, you have a train set and a test set, which the model never sees but which is used to validate it. Why not put proprietary data in the test set?

I'm pretty sure 99% of ML engineers would say this would constitute training on your data, but this is an argument you could drag out in courts forever.

Or alternatively - it's easier to ask for forgiveness than permission.

I've recently had an apocalyptic vision that one day we'll wake up and find that AI companies have produced an AI copy of every piece of software in existence - AI Windows, AI Office, AI Photoshop, etc.

sotrusting yesterday at 12:42 AM

Right, so totally cool to ignore the law but our TOS is a binding contract.

Oarch yesterday at 1:47 AM

Given the conduct we've seen to date, I'd trust them to follow the letter - but not the spirit - of IP law.

There may very well be clever techniques that don't require directly training on the users' data. Perhaps generating a parallel paraphrased corpus as they serve user queries - one which they CAN train on legally.

The amount of value unlocked by stealing practically ~everyone's lunch makes me not want to put that past anyone who's capable of implementing such a technology.

bdangubic yesterday at 1:30 AM

it is amazing that in almost 2026 there is anyone believing this… amazing

GCUMstlyHarmls yesterday at 12:41 AM

I wonder how much wiggle room there is to collect now (to provide the service, context history, etc.), then later anonymise (somehow, to some level) and then train on it?

Also, I wonder if the ToS covers "queries & interaction" vs "uploaded data" - I could imagine some tricky language in there that says "we won't use your Word document, but we may at some time use the queries you put against it", not as raw corpus but as a second layer examining what tools/workflows to expand/exploit.
