> I think artificial and data-less identifiers are the better means of identification that takes ...

mkleczek • yesterday at 5:21 PM • 2 replies

> I think artificial and data-less identifiers are the better means of identification that takes into account that things change. They don't have to be the identifier you present to the world, but having them is very useful.

If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The whole discussion is about externally visible identifiers (i.e. identifiers visible to external software, potentially used as persistent long-term references to your data).

> E.g. phone numbers are semi-common identifiers now, but phone numbers change owners for reasons outside of your control. If you use them as an internal identifier, changing them between accounts gets very messy because now you don't have an identifier for the person who used to have that phone number.

Introducing surrogate keys (regardless of whether UUIDs or anything else) does not solve any problem in reality. When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me. Surrogate keys don't help here at all. You either have to be able to solve this issue in the database or you need to have an oracle (i.e. a person) who must decide ad hoc which piece of data is identified by the information I provided.

The key issue here is that you try to model identifiable "entities" in your data model, while it is much better to model "captured information".

So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.
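To make the "captured information" idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table name `captured_contact`, its columns, and the helper `facts_for_phone` are made up for illustration; nothing here is a prescribed schema. Each row is a fact ("at this time we saw this name with this phone number") and there is no person table and no surrogate key.

```python
import sqlite3

# Hypothetical schema: every row is a captured fact, not an entity.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE captured_contact (
        captured_at TEXT NOT NULL,  -- ISO-8601 timestamp of the capture
        name        TEXT NOT NULL,  -- name given at that time
        phone       TEXT NOT NULL   -- phone number given at that time
    )
""")
conn.executemany(
    "INSERT INTO captured_contact VALUES (?, ?, ?)",
    [
        ("2023-01-10T09:00:00", "Y", "Z"),
        ("2024-06-02T14:30:00", "Y", "A"),
    ],
)


def facts_for_phone(conn, phone):
    """Return every fact captured for a given phone number, oldest first."""
    return conn.execute(
        "SELECT captured_at, name, phone FROM captured_contact "
        "WHERE phone = ? ORDER BY captured_at",
        (phone,),
    ).fetchall()
```

Any conclusion such as "these two rows describe the same person" then has to be inferred from the facts at query time rather than read off a key.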

Replies

dpark • yesterday at 8:53 PM

> So in your example there is no "person" identified by "phone number" but rather "at timestamp X we captured information about a person at the time named Y and using phone number Z". Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

This is so needlessly complex that you contradicted yourself immediately. You claim there is no “person” identified but immediately say you have information “about a person”. The fact that you can assert that the information is about a person means that you have identified a person.

Clearly tying data to the person makes things so much easier. I feel like attempting to do what you propose is begging to mess up GDPR erasure.

> “So I got a request from a John Doe to erase all data we recorded for them. They identified themselves by mailing address and current phone number. So we deleted all data we recorded for that phone number.”

> “Did you delete data recorded for their previous phone number?”

> “Uh, what?”

The stubborn refusal to create a persistent identifier makes your job harder, not easier.

everforward • yesterday at 6:04 PM

> If the only reason you need a surrogate key is to introduce indirection in your internal database design then sequence numbers are enough. There is no need to use UUIDs.

The UUID would be an example of an external key (e.g. to make it harder to crawl keys). This article mentions a few reasons why you may later decide there are better external keys.
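A rough sketch of that split, assuming a hypothetical `account` table: the sequence number stays internal, and a random UUID is the only identifier handed to the outside world. The names `public_id`, `create_account`, and `find_internal_id` are invented for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        id        INTEGER PRIMARY KEY,   -- internal sequence number, never exposed
        public_id TEXT NOT NULL UNIQUE,  -- external identifier, hard to enumerate
        email     TEXT NOT NULL
    )
""")


def create_account(conn, email):
    """Insert an account and return only its external identifier."""
    public_id = str(uuid.uuid4())  # random (version 4) UUID
    conn.execute(
        "INSERT INTO account (public_id, email) VALUES (?, ?)",
        (public_id, email),
    )
    return public_id


def find_internal_id(conn, public_id):
    """Map an external identifier back to the internal key, or None."""
    row = conn.execute(
        "SELECT id FROM account WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0] if row else None
```

If you later decide there is a better external key, you can change or add to `public_id` without touching any foreign keys that point at `id`.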

> When I come to you and say "My name is X, this is my phone number, this is my e-mail, I want my GDPR records deleted", you still need to be able to find all data that is related to me.

How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key? All 3 of those are pretty routine to change. I've changed my email and phone number a few times, and if I got married my name might change as well.

> Once you start thinking about your database as structured storage of facts that you can use to infer conclusions, there is much less need for surrogate keys.

I think that spirals into way more complexity than you're thinking. You get those timestamped records about "we got info about person named Y with phone number Z", and then person Y changes their phone number. Now you're going to start getting records from person named Y with phone number A, but it's the same account. You can record "person named Y changed their phone number from Z to A", and now your queries have to be temporal (i.e. know when that person had what phone number). You could back-update all the records to change Z to A, but that breaks some things (e.g. SMS logs will show that you sent a text to a number that you didn't send it to).
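Here is roughly what "temporal" means for a query, as a sketch with an invented `phone_assignment` table keyed on name (since there is no surrogate key in this model). The validity columns and the `phone_at` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE phone_assignment (
        name       TEXT NOT NULL,
        phone      TEXT NOT NULL,
        valid_from TEXT NOT NULL,  -- inclusive, ISO-8601 date
        valid_to   TEXT            -- exclusive; NULL means still current
    )
""")
conn.executemany(
    "INSERT INTO phone_assignment VALUES (?, ?, ?, ?)",
    [
        ("Y", "Z", "2023-01-01", "2024-06-01"),
        ("Y", "A", "2024-06-01", None),
    ],
)


def phone_at(conn, name, when):
    """Return the phone number recorded for `name` on date `when`, or None."""
    row = conn.execute(
        "SELECT phone FROM phone_assignment "
        "WHERE name = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR ? < valid_to)",
        (name, when, when),
    ).fetchone()
    return row[0] if row else None
```

Every lookup that touches the phone number now has to carry a date, including things like "which number did that SMS log entry actually go to".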

Worse yet, neither names nor phone numbers uniquely identify a person, so it's entirely possible to have records saying "person named Y and phone number Z" that refer to different people if a phone number transfers from a John Doe to a different person named John Doe.

I don't doubt you could do it, but I can't imagine it being worth it. I can't imagine a way to do it that doesn't either a) break records by backdating information that wasn't true back then, or b) require repeated/recursive querying that will hammer the DB (e.g. if someone has had 5 phone numbers, how do you get all the numbers they've had without pulling the latest one to find the last change, then the one before that, and so on). Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".
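A sketch of the difference, using the `phone_number_changes` table from the query above. The column names other than `user_id`, the sample rows, and both helper functions are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE phone_number_changes (
        user_id    INTEGER NOT NULL,  -- surrogate key
        old_number TEXT NOT NULL,
        new_number TEXT NOT NULL,
        changed_at TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO phone_number_changes VALUES (?, ?, ?, ?)",
    [
        (42, "111", "222", "2021-03-01"),
        (42, "222", "333", "2022-07-15"),
        (42, "333", "444", "2024-01-09"),
        (7, "900", "901", "2023-05-05"),
    ],
)


def numbers_by_user_id(conn, user_id):
    """With a surrogate key: one query returns the whole history."""
    rows = conn.execute(
        "SELECT old_number, new_number FROM phone_number_changes "
        "WHERE user_id = ? ORDER BY changed_at",
        (user_id,),
    ).fetchall()
    if not rows:
        return []
    return [rows[0][0]] + [new for _, new in rows]


def numbers_by_walking_back(conn, current_number):
    """Without one: follow the chain backwards, one query per change."""
    numbers = [current_number]
    while True:
        row = conn.execute(
            "SELECT old_number FROM phone_number_changes "
            "WHERE new_number = ? ORDER BY changed_at DESC LIMIT 1",
            (numbers[-1],),
        ).fetchone()
        # Stop at the start of the chain, or if a reused number would loop.
        if row is None or row[0] in numbers:
            break
        numbers.append(row[0])
    return list(reversed(numbers))
```

Both return the same list for this toy data, but the second version issues a query per change and silently assumes a number never passes between two different people, which is exactly the case that goes wrong in practice.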
