Hacker News

Oarch · yesterday at 12:24 AM

Earlier this year I thought that rare proprietary knowledge and IP was a safe haven from AI, since LLMs can only scrub public data.

Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.


Replies

findjashua · yesterday at 12:36 AM

Providers' ToS explicitly state whether or not submitted data is used for training. The usual pattern I've seen is that they retain the right to use the data on free tiers, but it's almost never the case for paid tiers.

phendrenad2 · yesterday at 1:39 AM

Ironically (for you), Copilot is the one provider that is doing a good job of provably NOT training on user data. The rest are not up to speed on that compliance angle, so many companies ban them (of course, people still use them).

matt-p · yesterday at 1:12 AM

Even if they were doing this (I highly doubt it), so much would be lost to distillation that I'm not convinced much would actually get in, apart from perhaps internal codenames or the like, which would be obvious.

gaigalas · yesterday at 12:46 AM

What kind of rare proprietary knowledge?

Aurornis · yesterday at 12:31 AM

Using an LLM on data does not ingest that data into the training corpus. LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.
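To make the distinction concrete, here's a minimal sketch (a toy linear "model" standing in for an LLM; all names are illustrative, not any real API): inference is a read-only pass over frozen weights, while learning requires a separate, explicit update step that nobody runs on your prompts by default.

```python
# Toy illustration: inference reads the weights; nothing about the
# input is ever written back into them. Training is a distinct step.
weights = [0.5, -1.2, 0.3]  # "frozen" parameters

def infer(inputs):
    # Forward pass: read-only use of the weights.
    return sum(w * x for w, x in zip(weights, inputs))

def train_step(inputs, target, lr=0.01):
    # Only this explicit gradient step mutates the parameters.
    error = infer(inputs) - target
    for i, x in enumerate(inputs):
        weights[i] -= lr * error * x

before = list(weights)
infer([1.0, 2.0, 3.0])   # "using the model on your data"
assert weights == before  # weights untouched: nothing was "learned"
```

Whether a provider ever feeds your transcripts into a later training run is a policy question about what they do with stored logs, not a property of the inference call itself.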

None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
