Hacker News

ziddoap 01/21/2025

>Seems pretty handwavy.

It has a whole Wikipedia article and everything.

https://en.wikipedia.org/wiki/De-anonymization#Re-identifica...

>Can you describe concretely how this would work?

Here's one of the earlier papers I remember off-hand, demonstrating one methodology. New statistical techniques (and improvements to existing ones) have appeared in the ~18 years since this was published. Not to mention there is significantly more data to work with now.

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

From the Wiki I linked:

"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."
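To make the "four points" claim a bit more concrete, here is a minimal sketch of how one might estimate that kind of uniqueness on synthetic data. It is not the method from the study quoted above; the function names (`make_traces`, `unicity`), the uniform random traces, and all parameters are assumptions made purely for illustration.

```python
import random


def make_traces(num_users, num_cells, num_hours, points_per_user, seed=0):
    """Build synthetic traces: each user gets a set of (cell, hour) points.

    Assumption: points are drawn uniformly at random, which real mobility
    data is not. This only illustrates the counting argument.
    """
    rng = random.Random(seed)
    traces = []
    for _ in range(num_users):
        trace = set()
        while len(trace) < points_per_user:
            trace.add((rng.randrange(num_cells), rng.randrange(num_hours)))
        traces.append(frozenset(trace))
    return traces


def unicity(traces, k, seed=1):
    """Fraction of users whose trace is the only one containing k of their own points."""
    rng = random.Random(seed)
    unique = 0
    for trace in traces:
        # The adversary knows k points drawn from this user's trace.
        known = set(rng.sample(sorted(trace), k))
        # Count every trace (including the user's own) consistent with them.
        matches = sum(1 for other in traces if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(traces)


traces = make_traces(num_users=500, num_cells=20, num_hours=24, points_per_user=30)
one_point = unicity(traces, k=1)
four_points = unicity(traces, k=4)
print(f"unique with 1 point:  {one_point:.2%}")
print(f"unique with 4 points: {four_points:.2%}")
```

In this toy setup a single coarse point is shared by many users and almost never isolates anyone, while four points together almost always do. The real-world numbers depend heavily on how varied people's movements actually are.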

Point being that operational security is hard, and it takes a lot less to "slip up" and accidentally reveal yourself than most people think. Obtaining a location within 250 miles (or whatever) can be a key piece of information that leads to other dots being connected.

Other examples (albeit with less explanation) include police takedowns of prolific CSAM producers by gathering bits and pieces of information over time, culminating in enough to make an identification.


Replies

gruez 01/21/2025

>"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

> [...]

>"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."

The only reason those two attacks work is that you have access to a bunch of uncorrelated data points: ratings for various shows and their dates, and cellphone movement patterns. It's unclear how you could extend this to some guy you're trying to dox on Signal. The geo info is relatively coarse and stays static, so singling out one person is going to be difficult. To put it another way, "guy was vaguely near New York on these dates" doesn't narrow down the search by much. That's going to be true for millions of people.
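A rough way to see the difference between repeating one coarse, static fact and combining independent ones is to track how the candidate set shrinks. This is only a sketch: the population, the attribute names (`region`, `hobby`, `employer`), and their values are all invented, and real attributes are rarely this evenly spread.

```python
def narrow(population, observations):
    """Filter candidates by each observation in turn; return survivors and set sizes."""
    remaining = set(population)
    sizes = [len(remaining)]
    for attribute, value in observations:
        remaining = {u for u in remaining if population[u].get(attribute) == value}
        sizes.append(len(remaining))
    return remaining, sizes


# Hypothetical population of 1000 users with three made-up attributes.
population = {
    i: {"region": i % 4, "hobby": i % 7, "employer": i % 11}
    for i in range(1000)
}

# Seeing the same coarse region three times adds nothing after the first sighting.
_, static_sizes = narrow(population, [("region", 0)] * 3)

# Three different facts about the same person compound quickly.
_, mixed_sizes = narrow(
    population, [("region", 0), ("hobby", 0), ("employer", 0)]
)

print(static_sizes)  # [1000, 250, 250, 250]
print(mixed_sizes)   # [1000, 250, 36, 4]
```

The static case stalls at everyone in the region, which matches the "millions of people near New York" point. The candidate set only collapses once the location is joined with other, independent facts.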
