many people tend to overlook how little information is needed for successful de-anonymization.
i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in various technology that enhances/enables various techniques.
i think the age of (pseduo-)anonymous internet browsing will be over soon. certainly within my lifetime (and im not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
Throwaway accounts using "clever" turns of phrase can often be anonymized by double click, right-clicking -> googling their witty pun and seeing their the sole instance elsewhere, on Twitter, Facebook, etc
If I see a couple words I dont know in a row, I can infer a posters real name.
Id be more specific but any example is doxxing, literally so
That's a great background paper on the Netflix attack, we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform peoples Reddit comments into movie reviews with an LLM and then see if LLMs are better than naraynan purely on movie reviews. LLMs are still much better (getting about 8% but the average person only had 2.5 movies and 48% only shared one movie, so very difficult to match)