Avoid UUIDv4 Primary Keys

264 points • by pil0u • today at 10:08 AM • 272 comments • view on HN

Comments

A prime example of premature optimization.

Permanent identifiers should not carry data. This is like the cardinal sin of data management. You always run into situations where the thing you thought, "surely this never changes, so it's safe to squeeze into the ID to save a lookup". Then people suddenly find out they have a new gender identity, and they need a last final digit in their ID numbers too.

Even if nothing changes, you can run into trouble. Norwegian PNs have your birth date (in DDMMYY format) as the first six digits. Surely that doesn't change, right? Well, wrong, since although the date doesn't change, your knowledge of it might. Immigrants who didn't know their exact date of birth got assigned 1. Jan by default... And then people with actual birthdays on 1 Jan got told, "sorry, you can't have that as birth date, we've run out of numbers in that series!"

Librarians in the analog age can be forgiven for cramming data into their identifiers, to save a lookup. When the lookup is in a physical card catalog, that's somewhat understandable (although you bet they could run into trouble over it too). But when you have a powerful database at your fingertips, use it! Don't make decisions you will regret just to shave off a couple of milliseconds!

➕ show 6 replies

stickfigure • today at 3:51 PM

This is incredibly database-specific. In Postgres random PKs are bad. But in distributed databases like Cockroach, Google Cloud Datastore, and Spanner it is the opposite - monotonic PKs are bad. You want to distribute load across the keyspace so you avoid hot shards.

➕ show 4 replies

can3p • today at 6:41 PM

> For many business apps, they will never reach 2 billion unique values per table, so this will be adequate for their entire life. I’ve also recommended always using bigint/int8 in other contexts.

I'm sure every dba has a war story that starts with similar decision in the past

iainctduncan • today at 5:43 PM

Counterargument... I do technical diligence so I talk to a lot of companies at points of inflection, and I also talk to lots who are stuck.

The ability to rapidly shard everything can be extremely valuable. The difference between "we can shard on a dime" and "sharding will take a bunch of careful work" can be expensive If the company has poor margins, this can be the difference between "can scale easily" and "we're not getting this investment".

I would argue that if your folks have the technical chops to be able to shard while avoiding surrogate guaranteed unique keys, great. But if they don't.... a UUID on every table can be a massive get-out-of-jail free card and for many companies this is much, much important than some minor space and time optimizations on the DB.

Worth thinking about.

➕ show 1 reply

benterix • today at 10:50 AM

The article sums up some valid arguments against UUIDv4 as PKs but the solution the author provides on how to obfuscate integers is probably not something I'd use in production. UUIDv7 still seems like a reasonable compromise for small-to-medium databases.

➕ show 2 replies

gwbas1c • today at 2:16 PM

I work on an application where we encrypt the integer primary key and then use the bytes to generate something that looks like a UUID.

In our case, we don't want database IDs in an API and in URLs. When IDs are sequential, it enables things like dictionary attacks and provides estimates about how many customers we have.

Encrypting a database ID makes it very obvious when someone is trying to scan, because the UUID won't decrypt. We don't even need a database round trip.

➕ show 2 replies

phendrenad2 • today at 6:57 PM

You probably don't want integer primary keys, and you probably don't want UUID primary keys. You probably want something in-between, depending on your use case. UUID is one extreme on this spectrum, which tries to solve all of the problems, including ones you might not have.

➕ show 1 reply

dotancohen • today at 10:56 AM

From the fine article:

  > Random values don’t have natural sorting like integers or lexicographic (dictionary) sorting like character strings. UUID v4s do have "byte ordering," but this has no useful meaning for how they’re accessed.

Might the author mean that random values are not sequential, so ordering them is inefficient? Of course random values can be ordered - and ordering by what he calls "byte ordering" is exactly how all integer ordering is done. And naive string ordering too, like we would do in the days before Unicode.

➕ show 6 replies

maxnoe • today at 7:00 PM

Would this argument still apply if I need to store a uuidv4 anyway in the table?

And I'd likely want a unique constraint on that?

ivan_gammel • today at 12:09 PM

The is article is about a solution in search of a problem, a classic premature optimization issue. UUIDv4 is perfectly fine for many use cases, including small databases. Performance argument must be considered when there’s a problem with performance on the horizon. Other considerations may be and very often superior to that.

xandrius • today at 10:51 AM

To summarise the article: in PG, prefer using UUIDv7 over UUIDv4 as they have slightly better performance.

If you're using latest version of PG, there is a plugin for it.

That's it.

➕ show 2 replies

K0nserv • today at 11:58 AM

An additional thing I learned when I worked on a ulid alternative over the weekend[0] is: Postgres's internal Datum type is at most 64 bits which means every uuid requires heap allocation[1] (at least until we get 128 bit machines).

0: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g

1: https://github.com/postgres/postgres/blob/master/src/backend...

stevefan1999 • today at 1:44 PM

Try Snowflake ID then: https://en.wikipedia.org/wiki/Snowflake_ID

➕ show 1 reply

aynyc • today at 3:07 PM

I've seen this type of advice a few times now. Now I'm not a database expert by any stretch of imagination, but I have yet to see UUID as primary key in any of the systems I've touched.

Are there valid reasons to use UUID (assuming correctly) for primary key? I know systems have incorrectly expose primary key to the public, but assuming that's not the concern. Why use UUID over big-int?

➕ show 4 replies

hippo22 • today at 4:23 PM

The author should include benchmarks otherwise, saying that UUIDs “increase latency” is meaningless. For instance, how much longer does it take to insert a UUID vs. an integer? How much longer does scanning an index take?

➕ show 1 reply

mcny • today at 10:59 AM

Postgresql 18 released in September and has uuidv7

https://www.postgresql.org/docs/current/functions-uuid.html

FancyFane • today at 5:35 PM

Even MySQL benefits from these changes as well. What we're really discussing is random primary key inserts (UUIDv4) vs incrementing primary key inserts (UUIDv6 or v7).

PlanetScale wrote up a really good article on why incrementing primary keys are better for performance when compared to randomly inserted primary keys; when it comes to b-tree performance. https://planetscale.com/blog/btrees-and-database-indexes

whateveracct • today at 6:07 PM

Sometimes its nice for your PK to be uniformly distributed. As a reader, even if it hurts as a writer. For instance, you can easily shard queries and workloads.

> the impact to inserts and retrieval of individual items or ranges of values from the index.

Classic OLTP vs OLAP.

grugdev42 • today at 12:14 PM

A much simpler solution is to keep your tables as they are (with an integer primary key), but add a non sequential public identifier too.

id => 123, public_id => 202cb962ac59075b964b07152d234b70

There are many ways to generate the public_id. A simple MD5 with a salt works quite well for extremely low effort.

Add a unique constraint on that column (which also indexes it), and you'll be safe and performant for hundreds of millions of rows!

Why do we developers like to overcomplicate things? ;)

➕ show 2 replies

dfox • today at 11:17 AM

> Creating obfuscated values using integers

While that is often neat solution, do not do that by simply XORing the numbers with constant. Use a block cipher in ECB mode (If you want the ID to be short then something like NSA's Speck comes handy here as it can be instantiated with 32 or 48 bit block).

And do not even think about using RC4 for that (I've seen that multiple times), because that is completely equivalent to XORing with constant.

bux93 • today at 11:21 AM

Long article about why not to use UUIDv4 as Primary Keys, but.. Who is doing so? And why are they doing that? How would you solve their requirements? Just throwing out "you can use UUIDv7" doesn't help with, e.g., the size they take up.

Aren't people using (big)ints are primary keys, and using UUIDs as logical keys for import/export, solving portability across different machines?

➕ show 1 reply

dimitrisnl • today at 11:10 AM

Noob question, but why no use ints for PK, and UUIDs for a public_id field?

➕ show 5 replies

Waterluvian • today at 6:15 PM

Any decent resources with benchmark data on Postgres insertion, indexing, retrieve, etc. for UUID vs. integer based PKs?

mgoetzke • today at 3:06 PM

That depends a lot on many factors and thus I dont like generic statements like that which tend to be more focused on a specific database pattern. That said everyone should indeed be aware of the potential tradeoffs.

And of course we could come up with many ways to generate our own ids and make them unique, but we have the following requirements.

- It needs to be a string (because we allow composing them to 'derive' keys) - A client must be able to create them (not just a server) without risk for collisions - The time order of keys must not be guessable easily (as the id is often leaked via references which could 'betray' not just the existence of a document, but also its relative creation time wrt others). - It should be easy to document how any client can safely generate document ids.

The lookup performance is not really such a big deal for us. Where it is we can do a projection into a more simple format where applicable.

jwr • today at 10:47 AM

"if you use PostgreSQL"

(in the scientific reporting world this would be the perennial "in mice")

➕ show 3 replies

hk1337 • today at 12:13 PM

My biggest thing for UUIDs is don’t UUID everything. Most things should be okay with just regular integers as PKs.

BartjeD • today at 12:13 PM

Personally my approach has been to start with big-ints and add a GUID code field if it becomes necessary. And then provide imports where you can match objects based on their code, if you ever need to import/export between tenants, with complex object relationships.

But that also adds complexity.

deathanatos • today at 4:08 PM

> Are UUIDs secure?

> Misconceptions: UUIDs are secure

> One misconception about UUIDs is that they’re secure. However, the RFC describes that they shouldn’t be considered secure “capabilities.”

> From RFC 41221 Section 6 Security Considerations:

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

This is just wrong, and the citation doesn't support it. You're not guessing a 122-bit long random identifier. What's crazy is that the article, immediately prior to this, even cites the very math involved in showing exactly how unguessable that is.

… the linked citation (to §4.4, which is different from the in-prose citation) is just about how to generate a v4, and completely unrelated to the claim. The prose citation to §6 is about UUIDs generally: the statement "Do not assume that [all] UUIDs are hard to guess" is not logically inconsistent with properly-generated UUIDv4s being hard to guess. A subset of UUIDs have security properties, if the system generating & using them implements those properties, but we should not assume all UUIDs have that property.

Moreover, replacing an unguessable UUID with an (effectively random) 32-bit integer does make it guessable, and the scheme laid out seems completely insecure if it is to be used in the contexts one finds UUIDv4s being an unguessable identifier.

The additional size argument is pretty weak too; at "millions of rows", a UUID column is consuming an additional ~24 MiB.

reactordev • today at 11:53 AM

I fun trick I did was generate UUID-like ids. We all can identify a UUIDv4 most of the time by looking at one. "Ah, a uuid" we say to ourselves. A little over a decade ago I was working on a massive cloud platform and rather than generate string keys like the author above suggested (int -> binary -> base62 str) we opted for a more "clever" approach.

The UUID is 128bits. The first 64bits are a java long. The last 64bits are a java long. Let's just combine the Tenant ID long with a Resource ID long to generate a unique id for this on our platform. (worked until it didn't).

➕ show 1 reply

kaladin_1 • today at 12:22 PM

I really hoped the author would discuss alternatives for distributed databases that writes in parallel. Sequential key would be atrocious in such circumstance this could kill the whole gain of distributed database as hotspots would inevitably appear.

I would like to hear from others using, for example, Google Spanner, do you have issues with UUID. I don't for now, most optimizations happen at the Controller level, data transformation can be slow due to validations. Try to keep service logic as straightforward as possible.

Lucasoato • today at 11:11 AM

Hi, a question for you folks. What if I don’t like to embed timestamp in uuid as v7 do? This could expose to timing attacks in specific scenarios.

Also is it necessary to show uuid at all to customers of an API? Or could it be a valid pattern to hide all the querying complexity behind named identifiers, even if it could cost a bit in terms of joining and indexing?

The context is the classic B2B SaaS, but feel free to share your experiences even if it comes from other scenarios!

➕ show 1 reply

scary-size • today at 11:11 AM

This reminds me about this old gist for generating Firebase-like "push IDs" [1]. Those have some nicer properties.

[1] https://gist.github.com/mikelehen/3596a30bd69384624c11

mkleczek • today at 1:15 PM

That's really an important deficiency of Postgres.

Hash index is ideally suited for UUIDs but for some reason Postgres hash indexes cannot be unique.

mexicocitinluez • today at 1:06 PM

You'll have to rip the ability to generate unique numbers from quite literally anywhere in my app and save them without conflict from my cold, dead hands.

The ability to know ahead of time what a primary key will be (in lieu of persisting it first, then returning) opened up a whole new world of architecting work in my app. It made a lot of previous awkward things feel natural.

nesarkvechnep • today at 3:07 PM

If we embraced REST, as Roy Fielding envisioned it, we wouldn't have this, and all similar, conversations. REST doesn't expose identifier, it only exposes relationships. Identifiers are an implementation details.

cebert • today at 11:26 AM

Using UUIDs as primary keys in non-relational databases like DynamoDB is valid and doesn’t raise the concerns mentioned in the article.

p2detar • today at 1:35 PM

Another interesting article from Feb-2024 [0] where the cost of inserting a uuid7() and a bigint is basically the same. To me it wasn't quite clear what the problem with the buffer cache is but the author makes it much more clear than OP's article:

> We need to read blocks from the disk when they are not in the PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to inspect the contents of the buffer cache. This is where the big difference between uuidv4 and uuidv7 becomes clear. Because of the lack of data locality in uuidv4 data, the primary key index is consuming a huge amount of the buffer cache in order to support new data being inserted – and this cache space is no longer available for other indexes and tables, and this significantly slows down the entire workload.

0 - https://ardentperf.com/2024/02/03/uuid-benchmark-war

old8man • today at 12:52 PM

[dead]

sneak • today at 4:32 PM

This is why ULID exists and why I use them in my ext_id columns. For the actual relational IDs internal to the db I use smaller/faster data types.

raxxorraxor • today at 11:58 AM

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities

The issue is that is true for more or less all capability URLs. I wouldn't recommend UUIDs per se here, probably better to just use a random number. I have seen UUIDs for this in practice though and these systems weren't compromised because of that.

I hate the tendency that password recovery flows for example leave the URL valid for 5 minutes. Of course these URLs need to have a limited life time, but mail isn't a real time communication medium. There is very little security benefit from reducing it from 30 minutes to 5 minutes for example. You are not getting "securer" this way.

henning • today at 4:28 PM

UUIDs make enumeration attacks harder and also prevent situations where seeing a high valid ID value lets you estimate how much money a private company is earning if they charge based on the object the ID is associated with. If you can sample enough object ID values and see when the IDs were created, you could reverse engineer their ARR chart and see whether they're growing or not which many companies want to avoid.

socketcluster • today at 11:13 AM

My advice is: Avoid Blanket Statements About Any Technology.

I'm tired of midwit arguments like "Tech X is N% faster than tech Y at performing operation Z. Since your system (sometimes) performs operation Z, it implies that Tech X is the only logical choice in all situations!"

It's an infuriatingly silly argument because operation Z may only represent about 10% of the total CPU usage of the whole system (averaged out)... So what is promoted as a 50% gain may in fact be a 5% gain when you consider it in the grand scheme of things... Negligible. If everyone was looking at this performance 'advantage' rationally; nobody would think it's worth sacrificing important security or operational properties.

I don't know what happened to our industry; we're supposed to be intelligent people but I see developers falling for these obvious logical fallacies over and over.

I remember back in my day, one of the senior engineers was discussing upgrading a python system and stated openly that the new version of the engine was something like 40% slower than the old version but he didn't even have to explain himself why upgrading was still a good decision; everybody in the company knew he was only talking about the code execution speed and everybody knew that this was a small fraction of the total.

Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for a lot of situations but you don't have to start a cult preaching the gospel of The One True UUID to justify your favorite project's decisions.

I do find it kind of sly though how the community decided to make this UUIDv7 instead of creating a new standard for it.

The whole point of UUID was to leverage the properties of randomness to generate unique IDs without requiring coordination. UUIDv7 seems to take things in a philosophically different path. People chose UUID for scalability and simplicity (both of which you get as a result of doing away with the coordination overhead), not for raw performance...

That's the other thing which drives me nuts; people who don't understand the difference between performance and scalability. People foolishly equate scalability with parallelism or concurrency; whereas that's just one aspect of it; scalability is a much broader topic. It's the difference between a theoretical system which is fast given a certain artificially small input size and one which actually performs better as the input size grows.

Lastly; no mention is made about the complex logic which has to take place behind the scenes to generate UUIDv7 IDs... People take it for granted that all computers have a clock which can produce accurate timestamps where all computers in the world are magically in-sync... UUIDv7 is not simple; it's very complicated. It has a lot of additional complexity and dependencies compared to UUIDv4. Just because that complexity is very well hidden from most developers, doesn't mean it's not there and that it's not a dependency... This may become especially obvious as we move to a world of robotics and embedded systems where cheap microchips may not have enough Flash memory to hold the code for the kinds of programs required to compute such elaborate IDs.

➕ show 1 reply

cruffle_duffle • today at 3:45 PM

I never understood the arguments against using using globally unique ids. For example how it somehow messes up indexes. I’m not a CS major but those are typically b-trees are they not? If you have a primary key whose generation is truly random such that each number is equally likely, then that b-tree is always going to be balanced.

Yes there are different flavors of generating them with their own pros and cons, but at the end of the day it’s just so much more elegant than some auto incrementing crap your database creates. But that is just semantic, you can always change the uuid algorithm for future keys. And honestly if you treat the uuid as some opaque entity (which you should), why not just pick the random one?

And I just thought of the argument that “but what if you want to sort the uuid…” say it’s used for a list of stories or something? Well, again… if you treat the uuid as opaque why would you sort it? You should be sorting on some other field like the date field or title or something. UUIDs are opaque, damn it. You don’t sort opaque data. “Well they get clustered weird” say people. Why are you clustering on a random opaque key? If you need certain data to be clustered, then do it on the right key (user_id field did your data was to be clustered by user, say)

Letting the client generate the primary keys is really liberating. Not having to care about PK collisions or leaking information via auto incrementing numbers is great!

In my opinion uuid isn’t used enough!

➕ show 1 reply

sobakistodor • today at 12:07 PM

[dead]

oldpersonintx2 • today at 10:24 AM

[dead]

waynenilsen • today at 10:59 AM

What kills me is I can’t double click the thing to select it.

➕ show 1 reply

alt Hacker News

Avoid UUIDv4 Primary Keys

Comments