Earlier this year I thought that rare proprietary knowledge and IP were a safe haven from AI, since LLMs can only scrape public data.
Then it dawned on me how many companies are deeply integrating Copilot into their everyday workflows. It's the perfect Trojan Horse.
Ironically (for you), Copilot is the one provider doing a good job of provably NOT training on user data. The rest aren't up to speed on that compliance angle, so many companies ban them (of course, people still use them anyway).
Even if they were doing this (and I highly doubt it), so much would be lost to distillation that I'm not convinced much would actually get in, apart from perhaps internal codenames or the like, which would be obvious anyway.
Using an LLM on data does not ingest that data into the training corpus. Contrary to what a lot of people assume, LLMs don’t “learn” from the information they operate on at inference time; the weights stay frozen.
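To make that concrete, here's a minimal sketch (assuming PyTorch, with a tiny nn.Linear standing in for an LLM): a forward pass over your data leaves the weights untouched, and they only change when someone explicitly runs a separate training step.

    import torch
    import torch.nn as nn

    # Stand-in for an LLM: any nn.Module behaves the same way here.
    model = nn.Linear(16, 16)
    before = [p.detach().clone() for p in model.parameters()]

    # "Using" the model on your data: a forward pass (inference only).
    with torch.no_grad():
        _ = model(torch.randn(4, 16))

    # Weights are identical after inference.
    print(all(torch.equal(b, p) for b, p in zip(before, model.parameters())))  # True

    # Learning only happens via an explicit training step (backward + optimizer):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    opt.step()

    # Now the weights have actually changed.
    print(all(torch.equal(b, p) for b, p in zip(before, model.parameters())))  # False

Sending data to a hosted model is only the first half of that snippet; whether the provider ever runs the second half on your data later is a policy question, not something the model does on its own.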
None of the mainstream paid services ingest operating data into their training sets. You'll find plenty of conspiracy theories claiming that companies say one thing and secretly steal your data anyway, of course.
Providers' ToS explicitly state whether any data you provide is used for training. The usual pattern I've seen is that they retain the right to use data from free tiers, but almost never from paid tiers.