vintermann • yesterday at 12:02 PM • 20 replies

A prime example of premature optimization.

Permanent identifiers should not carry data. This is like the cardinal sin of data management. You always run into situations where the thing you thought would never change ("surely this never changes, so it's safe to squeeze into the ID to save a lookup") changes anyway. Then people suddenly find out they have a new gender identity, and they need a new final digit in their ID numbers too.

Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!

Replies

ralferoo • yesterday at 5:10 PM

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

To me, what your example really shows is the problem with incorrect default values, not a problem with encoding data into a key per se. If they'd chosen a non-date for unknown values, maybe 00 or 99 for day or month components, then the issue you described would disappear.

But in any case, the intention for encoding a timestamp into a UUID isn't for any implied meaning. It's to guarantee uniqueness, with the side effect that IDs are more or less monotonically increasing. Whether this is actually desirable depends on your application, but generally if the application is as an indexed key for insertion into a database, it's usually more useful for performance than a fully random ID as it avoids rewriting lots of leaf nodes of B-trees. If you insert a load of such keys, they form a cluster on one side of the tree that can then rebalance with only the top levels needing to be rewritten.
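A minimal sketch of the timestamp-prefixed layout being discussed, assuming the UUIDv7 bit layout (48-bit millisecond timestamp, 4 version bits, 12 random bits, 2 variant bits, 62 random bits). The function name `uuid7_sketch` is made up for illustration and is not a library API:

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value: millisecond timestamp first, random bits after."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (unix_ms & ((1 << 48) - 1)) << 80  # timestamp in the top 48 bits
    value |= 0x7 << 76                         # version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                        # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_700_000_000_000)
later = uuid7_sketch(1_700_000_000_001)
# The timestamp prefix dominates the comparison, so IDs sort by creation time.
print(earlier < later)
```

Because the timestamp occupies the most significant bits, keys generated later compare greater, which is what puts new inserts on one side of the index.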

tacone • yesterday at 12:23 PM

Fantastic real-life example. Italian PNs also carry the gender, which is something you can change surgically, and you'll eventually run into the issue when operating at scale.

I don't agree with the absolute statement, though. Permanent identifiers should not generally carry data. There are situations where you want to have a way to reconcile, you have space or speed constraints, so you may accept the trade-off, md5 your data and store it in a primary index as a UUID. Your index will fragment and thus you will vacuum, but life will still be good overall.
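One way to sketch the "md5 your data and store it as a UUID" idea is Python's `uuid.uuid3`, which is the standard MD5-based name UUID; that is a substitution for hashing by hand, not necessarily what the commenter had in mind. `APP_NAMESPACE`, `content_id`, and the sample fields are hypothetical:

```python
import uuid

# Hypothetical namespace for this application; any fixed UUID works.
APP_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def content_id(*fields):
    """Derive a deterministic, MD5-based UUID from a record's natural fields."""
    canonical = "\x1f".join(str(f) for f in fields)  # unit separator between fields
    return uuid.uuid3(APP_NAMESPACE, canonical)


a = content_id("alice@example.com", "2024-01-01")
b = content_id("alice@example.com", "2024-01-01")
c = content_id("bob@example.com", "2024-01-01")
# Same input gives the same key, which is what makes reconciliation possible.
print(a == b, a == c)
```

The keys are effectively random with respect to insertion order, which is where the index fragmentation mentioned above comes from.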

barrkel • yesterday at 1:25 PM

UUID v7 just has a bias in its generation; it isn't carrying information. You're not going to try and extract a timestamp from a UUID.

Choosing random vs time-biased UUIDs is not a decision to shave off ms that you will regret.

Most likely they will be a decision that shaves off seconds (yes, really - especially when you consider locality effects) and you'll regret nothing.
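The locality effect can be illustrated with a deliberately crude model: treat the sorted keys as fixed-size leaf pages and count how many distinct pages a batch of inserts lands on. This ignores page splits and caching, and `PAGE_SIZE` and the key counts are arbitrary assumptions:

```python
import bisect
import random

PAGE_SIZE = 100  # hypothetical number of keys per leaf page


def leaf_pages_touched(existing_sorted, new_keys):
    """Count distinct leaf pages that receive at least one of the new keys."""
    touched = set()
    last = len(existing_sorted) - 1
    for key in new_keys:
        pos = min(bisect.bisect_left(existing_sorted, key), last)
        touched.add(pos // PAGE_SIZE)
    return len(touched)


rng = random.Random(0)

# Random keys: existing and new keys are spread over the whole key space.
random_existing = sorted(rng.getrandbits(64) for _ in range(100_000))
random_new = [rng.getrandbits(64) for _ in range(1_000)]

# Time-ordered keys: every new key is larger than all existing ones.
ordered_existing = list(range(100_000))
ordered_new = list(range(100_000, 101_000))

random_pages = leaf_pages_touched(random_existing, random_new)
ordered_pages = leaf_pages_touched(ordered_existing, ordered_new)
print(random_pages, ordered_pages)
```

In this model the time-ordered batch always lands on the final page, while the random batch is scattered across a large share of the index.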

hnfong • yesterday at 12:56 PM

The curious thing about the article is that it's definitely premature optimization for smaller databases, but when the database gets to the scale where these optimizations start to matter, you actually don't want to do what they suggest.

Specifically, if your database is small, the performance impact is probably not very noticeable. And if your database is large (e.g. to the extent that primary keys can't fit within a 32-bit int), then you're actually going to have to think about sharding and making the system more distributed... and that's where UUIDs work better than auto-incrementing ints.

mkleczek • yesterday at 12:31 PM

This is actually a very deep and interesting topic. Stripping information from an identifier disconnects a piece of data from the real world, which means we can no longer match them. But such a connection is the sole purpose of keeping the data in the first place. So, what happens next is that the real world tries to adjust and the "data-less" identifier becomes a real-world artifact. The situation becomes the same but worse (e.g. you don't exist if you don't remember your social security ID). In extreme cases people are tattooed with their numbers.

The solution is not to come up with yet another artificial identifier but to come up with better means of identification taking into account the fact that things change.

hyperpape • yesterday at 12:37 PM

Your comment is sufficiently generic that it’s impossible to tell what specific part of the article you’re agreeing with, disagreeing with, or expanding upon.

GuB-42 • yesterday at 6:49 PM

I don't think the timestamped UUIDs are "carrying data", it is just a heuristic to improve lookup performance. If the timestamp is wrong, it will just run as slow as the non-timestamped UUID.

If you take the gender example, for 99% of people, it is male/female and it won't change, and you can use that for load balancing. But if you later find out that the gender is not the one you expect for that bucket, no big deal, it will cause a branch misprediction, but instead of happening 50% of the time as it would with a random value, it will only happen 1% of the time: a significant speedup with no loss in functionality.

danudey • yesterday at 11:23 PM

Perhaps you can clarify something for me, because I think I'm missing it.

> Norwegian PNs have your birth date (in DDMMYY format) as the first six digits

So presumably the format is DDMMYYXXXXX (for some arbitrary number of X's), where the XXX represents e.g. an automatically incrementing number of some kind?

Which means that if it's DDMMYYXXX then you can only have 1000 people born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000 people born on DDMMYY.

So in order for there to be so many such entries in common that people are denied use of their actual birthday, then one of the following must be true:

1. The XXX counter must be extremely small, in order for it to run out as a result of people 'using up' those Jan 1 dates each year

2. The number of people born on Jan 1 or immigrating to Norway without knowledge of their birthday must be colossal

If it was just DDMMXXXXX (no year) then I can see how this system would fall apart rapidly, but when you're dealing with specifically "people born on Jan 1 2014 or who immigrated to Norway and didn't know their birthday and were born on/around 2014 so that was the year chosen" I'm not sure how that becomes a sufficiently large number to cause these issues. Perhaps this only occurs in specific years where huge numbers of poorly-documented refugees are accepted?

(Happy to be educated, as I must be missing something here)
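The arithmetic in this question can be written out as a small sketch. The counter widths and the per-day counts below are purely illustrative assumptions, not the real Norwegian layout, which may reserve some of the trailing digits for other purposes:

```python
def ids_per_date(counter_digits):
    """Number of distinct identifiers available for one DDMMYY prefix."""
    return 10 ** counter_digits


def slots_left(counter_digits, born_that_day, defaulted_to_that_day):
    """Remaining identifiers for a date; a negative result means the series ran out."""
    return ids_per_date(counter_digits) - born_that_day - defaulted_to_that_day


print(ids_per_date(3), ids_per_date(5))
# Hypothetical counts: 150 real births plus 900 people defaulted to the same date.
print(slots_left(3, born_that_day=150, defaulted_to_that_day=900))
```

With a three-digit counter, a few hundred defaulted registrations for one birth year are enough to exhaust a date; with five digits the same numbers would be harmless.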

benterix • yesterday at 4:17 PM

Your comment is valid but is not related to the article.

PunchyHamster • yesterday at 11:16 PM

The cause is more just "not having enough bits". A UUID is 128 bits. You're not running out even if you use part of it for a timestamp; the random part will be big enough.

Like, it's a valid complaint, just not for the discussion at hand.

Also, we do live in reality, and while having an entirely random one might be perfect from a theory-of-data standpoint, in reality having it prefixed by a date has many performance advantages.

> Permanent identifiers should not carry data. This is like the cardinal sin of data management

As long as you don't use the data and have actual fields for what's also encoded in the UUID, there is absolutely nothing wrong with it, provided there is enough of the random part to get around artifacts in real-life data.
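Whether "enough of the random part" remains can be estimated with the usual birthday-bound approximation. This sketch assumes the UUIDv7 layout, which leaves up to 74 random bits beside the 48-bit millisecond timestamp; the workload of one million IDs in a single millisecond and the 10-bit comparison field are hypothetical:

```python
import math


def collision_probability(n_ids, random_bits):
    """Birthday-bound approximation: chance that any two of n_ids random values collide."""
    pairs = n_ids * (n_ids - 1) / 2
    return -math.expm1(-pairs / 2 ** random_bits)


# One million IDs generated within the same millisecond, 74 random bits each.
p_uuid = collision_probability(1_000_000, 74)
# For contrast: 100 IDs drawn from a hypothetical 10-bit random field.
p_small = collision_probability(100, 10)
print(p_uuid, p_small)
```

The first probability is vanishingly small, while the second is close to certainty, which is the "not having enough bits" point in numbers.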

cycomanic • yesterday at 6:52 PM

Like the other poster said, this is a problem with default values, not with encoding the birthday into the personnummer.

I think it also is important to remember the purpose of specific numbers. For instance I would argue a PN without the birthday would be strictly worse. With the current system (I only know the Swedish one, but assume it's the same) I only have to remember 4 digits (because the number is bdate + 4 unique digits). If we instead used completely random numbers I would have to remember at least an 8-digit number (and likely to be future-proof you'd want at least 9 digits). Sure, that's fine for myself (although I suspect some people already struggle with it), but then I also have to remember the numbers for my 2 kids and my partner, and things quickly become annoying. Especially because one doesn't use the numbers often enough that it becomes easy, but still often enough that it becomes annoying to look up, especially when one doesn't always carry their phone with them.

oncallthrow • yesterday at 12:15 PM

It sounds to me like you’re just arguing for premature optimization of another kind (specifically, prematurely changing your entire architecture for edge cases that probably won’t ever happen to you).

mzi • yesterday at 6:33 PM

> Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right?

I guess that Norway has solved it in the same or similar way as Sweden? So a person is identified by the PNR, and those systems that need to track a person across several PNRs (government agencies) use a PRI. And a PRI is just the first PNR assigned to a person with a 1 inserted in the middle. If that PRI is occupied, use a 2, and so on.

PRI could of course have been a UUID instead.

maxbond • yesterday at 11:26 PM

> Permanent identifiers should not carry data.

Do you have the same criticism for serial identifiers? How about hashes? What about the version field in UUIDs?

sgarland • yesterday at 2:04 PM

> Permanent identifiers should not carry data.

Did you read the article? He doesn’t recommend natural keys, he recommends integer-based surrogates.

> A prime example of premature optimization.

Disagree. Data is sticky, and PKs especially so. Moreover, if you’re going to spend time optimizing anything early on, it should be your data model.

> Don't make decisions you will regret just to shave off a couple of milliseconds!

A bad PK in some databases (InnoDB engine, SQL Server if clustered) can cause query times to go from sub-msec to tens of msec quite easily, especially with cloud solutions where storage isn’t node-local. I don’t just mean a UUID; a BIGINT PK on a 1:M can destroy your latency for the simple reason of needing to fetch a separate page for every record. If instead the PK is a composite of (<linked_id>, id) - e.g. (user_id, id) - where id is a monotonic integer, you’ll have WAY better data locality.

Postgres suffers a different but similar problem with its visibility map lookups.
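The data-locality argument for a composite key can be sketched with a toy model of a clustered table, where physical row order follows the primary key. `PAGE_SIZE`, `USERS`, and the round-robin insert pattern are assumptions made for illustration:

```python
PAGE_SIZE = 100   # hypothetical rows per page
USERS = 1_000     # hypothetical number of parent rows

# Child rows arrive interleaved across users; each row is (user_id, id).
rows = [(i % USERS, i) for i in range(100_000)]


def pages_read(physical_order, user_id):
    """Distinct pages holding one user's rows, given the clustered row order."""
    return len({pos // PAGE_SIZE
                for pos, (uid, _) in enumerate(physical_order)
                if uid == user_id})


clustered_by_id = sorted(rows, key=lambda r: r[1])  # PK = (id)
clustered_by_user = sorted(rows)                    # PK = (user_id, id)

print(pages_read(clustered_by_id, 42), pages_read(clustered_by_user, 42))
```

In this model, fetching one user's 100 rows touches 100 pages when clustered by `id` alone and a single page when clustered by `(user_id, id)`.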

scottlamb • yesterday at 4:13 PM

> Permanent identifiers should not carry data.

I think you're attacking a straw man. The article doesn't say "instead of UUIDv4 primary keys, use keys such as birthdays with exposed semantic meaning". On the contrary, they have a section about how to use sequence numbers internally but obfuscated keys externally. (Although I agree with dfox's and formerly_proven's comments [1, 2] that the XOR method they proposed for this is terrible. Reuse of a one-time pad is probably the most basic textbook example of bad cryptography. They referred to the values as "obfuscated" so they probably know this. They should have just gone with a better method instead.)

[1] https://news.ycombinator.com/item?id=46272985

[2] https://news.ycombinator.com/item?id=46273325
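The weakness of reusing one XOR key can be shown in a few lines. This sketch assumes the scheme is a plain XOR of the internal ID with a fixed secret; the article's exact method is not quoted in this thread, and `SECRET_KEY` and the sample IDs are made up:

```python
SECRET_KEY = 0x5DEECE66D  # hypothetical fixed key, reused for every ID


def obfuscate(internal_id):
    """XOR with a fixed key; applying it twice returns the original."""
    return internal_id ^ SECRET_KEY


a, b = 1001, 1002
pub_a, pub_b = obfuscate(a), obfuscate(b)

# The key cancels out, so two public IDs leak the XOR of the internal ones.
leaked_difference = pub_a ^ pub_b

# One known (internal, public) pair reveals the key and therefore every other ID.
recovered_key = pub_a ^ a
recovered_b = pub_b ^ recovered_key
print(leaked_difference == (a ^ b), recovered_key == SECRET_KEY, recovered_b == b)
```

Anyone who learns a single internal ID, for example their own sequence number, can recover the key and decode every other public ID.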

asah • yesterday at 9:56 PM

counterpoint: IRL, data values in a system like PostgreSQL are padded to word boundaries so either you're wasting bits or "carrying data."

Traubenfuchs • yesterday at 3:21 PM

> Norwegian PNs have your birth date

Same with Austrian social security numbers, which, in some cases, don't contain the person's birth date and in some cases don't contain any existing date at all.

Yet many websites enforce a valid date and pull the person's birthdate from it...

oblio • yesterday at 1:02 PM

> Well, wrong, since although the date doesn't change.

Someone should have told Julius Caesar and Gregory XIII that :-p
