Hacker News

Oarch · yesterday at 12:24 AM

Earlier this year I thought that rare proprietary knowledge and IP was a safe haven from AI, since LLMs can only scrub public data.

Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.


Replies

findjashua · yesterday at 12:36 AM

Providers' ToS explicitly state whether or not submitted data is used for training. The usual pattern I've seen is that they retain the right to use the data on free tiers, but it's almost never the case for paid tiers.

phendrenad2 · yesterday at 1:39 AM

Ironically (for you), Copilot is the one provider that is doing a good job of provably NOT training on user data. The rest are not up to speed on that compliance angle, so many companies ban them (of course, people still use them).

matt-p · yesterday at 1:12 AM

Even if they were doing this (I highly doubt it), so much would be lost to distillation that I'm not convinced much would actually get in, apart from perhaps internal codenames or the like, which would be obvious.

gaigalas · yesterday at 12:46 AM

What kind of rare proprietary knowledge?

Aurornis · yesterday at 12:31 AM

Using an LLM on data does not ingest that data into the training corpus. LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.
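To make the distinction concrete, here's a minimal sketch (a toy linear "model" standing in for an LLM; all names are illustrative, not any real API): inference is a read-only pass over frozen weights, while learning requires a separate, explicit update step that nobody runs on your prompts by default.

```python
# Toy illustration: inference reads the weights; nothing about the
# input is ever written back into them. Training is a distinct step.
weights = [0.5, -1.2, 0.3]  # "frozen" parameters

def infer(inputs):
    # Forward pass: read-only use of the weights.
    return sum(w * x for w, x in zip(weights, inputs))

def train_step(inputs, target, lr=0.01):
    # Only this explicit gradient step mutates the parameters.
    error = infer(inputs) - target
    for i, x in enumerate(inputs):
        weights[i] -= lr * error * x

before = list(weights)
infer([1.0, 2.0, 3.0])   # "using the model on your data"
assert weights == before  # weights untouched: nothing was "learned"
```

Whether a provider ever feeds your transcripts into a later training run is a policy question about what they do with stored logs, not a property of the inference call itself.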

None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.
