logoalt Hacker News

palmotealast Thursday at 10:59 PM1 replyview on HN

> Email seems like not only a pretty terrible training data set, since most of it is marketing spam with dubious value

Google probably even has an advantage there: filter out everything except messages sent from valid gmail account to valid gmail account. If you do that you drop most of the spam and marketing, and have mostly human-to-human interactions. Then they have their spam filters.


Replies

Terr_last Friday at 12:04 AM

I'd upgrade that "probably" leak to "will absolutely" leak, albeit with some loss of fidelity.

Imagine industrial espionage where someone is asking the model to roleplay a fictional email exchange between named corporate figures in a particular company.