Every GitHub object has two IDs

163 points • by dakshgupta • yesterday at 3:52 PM • 45 comments

Comments

The newer global node IDs (which can be forced via the 'X-Github-Next-Global-ID' header [1]) have a prefix indicating the "type" of object delimited by an underscore, then a base64 encoded msgpack payload. For most objects it contains just a version (starting at 0) followed by the numeric "databaseId" field, but some are more complex.

For example, my GitHub user [2] has the node ID "U_kgDOAAhEkg". Users are "U_" and then the remaining data decodes to: [0, 541842] which matches the numeric ID for my user accepted by the REST API [3].
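To make the decoding concrete, here is a minimal Python sketch using only the standard library. It assumes the payload is URL-safe base64 with padding stripped, and it handles only the small msgpack subset needed here (fixarray, fixstr, and unsigned integers). The names `decode_node_id` and `_unpack` are made up for illustration; this is not GitHub's code and the format is not a supported interface.

```python
import base64
import struct

# msgpack tags for big-endian unsigned integers of 1, 2, 4 and 8 bytes
_UINT_FORMATS = {0xCC: ">B", 0xCD: ">H", 0xCE: ">I", 0xCF: ">Q"}


def _unpack(buf: bytes, pos: int = 0):
    """Decode one msgpack value at buf[pos]; return (value, next_pos)."""
    tag = buf[pos]
    if tag <= 0x7F:  # positive fixint
        return tag, pos + 1
    if 0x90 <= tag <= 0x9F:  # fixarray, length in the low 4 bits
        items = []
        pos += 1
        for _ in range(tag & 0x0F):
            item, pos = _unpack(buf, pos)
            items.append(item)
        return items, pos
    if 0xA0 <= tag <= 0xBF:  # fixstr, length in the low 5 bits
        end = pos + 1 + (tag & 0x1F)
        return buf[pos + 1:end].decode("utf-8"), end
    if tag in _UINT_FORMATS:
        fmt = _UINT_FORMATS[tag]
        (value,) = struct.unpack_from(fmt, buf, pos + 1)
        return value, pos + 1 + struct.calcsize(fmt)
    raise ValueError(f"unsupported msgpack tag 0x{tag:02x}")


def decode_node_id(node_id: str):
    """Split 'PREFIX_payload' and decode the payload (assumed format)."""
    prefix, sep, payload = node_id.partition("_")
    if not sep:
        raise ValueError("no type prefix; probably a legacy node ID")
    padded = payload + "=" * (-len(payload) % 4)  # restore stripped padding
    value, _ = _unpack(base64.urlsafe_b64decode(padded))
    return prefix, value


print(decode_node_id("U_kgDOAAhEkg"))  # ('U', [0, 541842])
```

The payload bytes are `92 00 ce 00 08 44 92`: a two-element array holding the version 0 and the 32-bit integer 0x00084492, which is 541842.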

You shouldn't rely on any of this implementation of course, instead just directly query the "databaseId" field from the GraphQL API where you need interoperability. And in the other direction the REST API returns the "node_id" field for the GraphQL API.
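As a rough sketch of the supported route, the snippet below builds (but does not send) a GraphQL request asking for `databaseId` on a user node. The function name `build_database_id_request` and the token placeholder are hypothetical, and the inline fragment on `User` is only one example; other types would need their own fragment.

```python
import json

GRAPHQL_URL = "https://api.github.com/graphql"

# Ask the API for the numeric ID instead of decoding the node ID by hand.
QUERY = """
query($id: ID!) {
  node(id: $id) {
    ... on User { databaseId }
  }
}
"""


def build_database_id_request(node_id: str, token: str):
    """Return (url, headers, body) for a POST; sending is left to the caller."""
    headers = {
        "Authorization": f"bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"query": QUERY, "variables": {"id": node_id}}).encode("utf-8")
    return GRAPHQL_URL, headers, body
```

The returned pieces can be handed to whatever HTTP client you already use.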

For folks who find this interesting, you might also like [4] which details GitHub's ETag implementation for the REST API.

[1] https://docs.github.com/en/graphql/guides/migrating-graphql-...
[2] https://api.github.com/user/541842
[3] https://gchq.github.io/CyberChef/#recipe=Find_/_Replace(%7B'...
[4] https://github.com/bored-engineer/github-conditional-http-tr...

haileys • today at 1:18 AM

> That repository ID (010:Repository2325298) had a clear structure: 010 is some type enum, followed by a colon, the word Repository, and then the database ID 2325298.

It's a classic length prefix. Repository has 10 chars, Tree has 4.
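A small sketch of that reading, assuming the digits before the colon are a decimal length for the type name and everything after the name is the database ID. It works on the already-decoded string quoted above; `parse_legacy_id` is a hypothetical helper, not anything GitHub ships.

```python
def parse_legacy_id(decoded: str) -> tuple[str, str]:
    """Split '<len>:<TypeName><id>' into (type_name, database_id)."""
    length_text, sep, rest = decoded.partition(":")
    if not sep or not length_text.isdigit():
        raise ValueError("expected '<length>:<type><id>'")
    length = int(length_text)  # "010" parses as 10
    if len(rest) < length:
        raise ValueError("type name is shorter than its length prefix")
    return rest[:length], rest[length:]


print(parse_legacy_id("010:Repository2325298"))  # ('Repository', '2325298')
```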


siralonso • today at 2:44 AM

I wouldn't decode them like this, it's fragile, and global node IDs are supposed to be opaque in GraphQL.

I see that GitHub exposes a `databaseId` field on many of their types (like PullRequest) - is that what you're looking for? [1]

Most GraphQL APIs that serve objects that implement the Node interface just base-64-encode the type name and the database ID, but I definitely wouldn't rely on that always being the case. You can read more about global IDs in GraphQL in the spec in [2].

[1] https://docs.github.com/en/graphql/reference/objects#pullreq...
[2] https://graphql.org/learn/global-object-identification/


ezyang • today at 1:33 AM

I just want to point out that Opus 4.5 actually knows this trick and will write the code to decode the IDs if it is working with GitHub's API lol

zzo38computer • today at 3:10 AM

I had seen GitHub node IDs but had not used them or tried to decode them (though I could see they seem to be base64), since I only used the REST API, which reports node IDs but does not accept them as input.

It looks like a good explanation of the node IDs, though. However, like another comment says, you should not rely on the format of node IDs.

phibz • today at 1:52 AM

Database design guidance typically recommends giving out opaque natural keys and keeping your monotonically increasing integer IDs secret, for internal use only.


bastawhiz • today at 3:38 AM

1. The list of "scopes" is the object hierarchy that owns the resource. That lets you figure out which shard a resource should be in. You want all the resources for the same repository on the same shard, otherwise if you simply hash the id, one shard going down takes down much of your service since everything is spread more or less uniformly across shards.

2. The object identifier is at the end. That should be strictly increasing, so all the resources for the same scope are ordered in the DB. This is one of the benefits of uuid7.

3. The first element is almost certainly a version. If you do a migration like this, you don't want to rule out doing it again. If you're packing bits, it's nearly impossible to know what's in the data without an identifier, so without the version you might not be able to know whether the id is new or old.

Another commenter mentioned that you should encrypt this data. Hard pass! Decrypting each id is decidedly slower than b64 decode. Moreover, if you're picking apart IDs, you're relying on an interface that was never made for you. There's nothing sensitive in there: you're just setting yourself up for a possible (probable?) world of pain in the future. GitHub doesn't have to stop you from shooting your foot off.

Moreover, encrypting the contents of the ID makes them sort randomly. This is to be avoided: it means similar/related objects are not stored near each other, and you can't do simple range scans over your data.

You could decrypt the ids on the way in and store both the unencrypted and encrypted versions in the DB, but why? That's a lot of complexity, effort, and resources to stop randos on the Internet from relying on an internal, non-sensitive data format.

As for the old IDs that are still appearing, they are almost certainly:

1. Sharded by their own id (i.e., users are sharded by user id, not repo id), so you don't need additional information. Use something like rendezvous hashing to choose the shard.

2. Sharded before the new id format was developed, and it's just not worth the trouble to change.
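For readers unfamiliar with rendezvous (highest-random-weight) hashing, here is a minimal sketch: score every shard against the key and pick the highest score. The shard names, the key format, and the choice of SHA-256 are all assumptions for illustration; nothing here reflects how GitHub actually places data.

```python
import hashlib


def pick_shard(key: str, shards: list[str]) -> str:
    """Return the shard with the highest hash score for this key."""

    def score(shard: str) -> int:
        digest = hashlib.sha256(f"{shard}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(shards, key=score)


shards = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names
print(pick_shard("user:541842", shards))
```

The useful property is that removing a shard only moves the keys that were assigned to it; every other key keeps its highest-scoring shard.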

chatmasta • today at 1:04 AM

> Somewhere in GitHub's codebase, there's an if-statement checking when a repository was created to decide which ID format to return.

I doubt it. That's the beauty of GraphQL — each object can store its ID however it wants, and the GraphQL layer encodes it in base64. Then when someone sends a request with a base64-encoded ID, there _might_ be an if-statement (or maybe it just does a lookup on the ID). If anything, the if-statement happens _after_ decoding the ID, not before encoding it.

There was never any if-statement that checked the time — before the migration, IDs were created only in the old format. After the migration, they were created in the new format.

naikrovek • today at 4:11 AM

Remember this article when you get upset that your own customers have come to rely on behavior that you told them explicitly not to rely on.

If it is possible to figure something out, your customers will eventually figure it out and rely on it.


csomar • today at 2:37 AM

This makes no sense. I am developing a product in this space (https://codeinput.com) and the GitHub API and GraphQL are a badly entangled mess, but you don't trick your way around IDs.

There is actually a documented way to do it: https://docs.github.com/en/graphql/guides/using-global-node-...

Same for URLs: you are supposed to get them directly from GitHub, not construct them yourself, as the format can change and then you find yourself playing a refactor cat-and-mouse game.

Best you can do is an hourly/daily cache for the values.

agwa • today at 1:31 AM

> GitHub's migration guide tells developers to treat the new IDs as opaque strings and treat them as references. However it was clear that there was some underlying structure to these IDs as we just saw with the bitmasking

Great, so now GitHub can't change the structure of their IDs without breaking this person's code. The lesson is that if you're designing an API and want an ID to be opaque you have to literally encrypt it. I find it really demoralizing as an API designer that I have to treat my API's consumers as adversaries who will knowingly and intentionally ignore guidance in the documentation like this.
