This seems like a tragedy of the commons -- GitHub is free after all, and it has all of these great properties, so why not? -- but this kind of decision making occurs whenever externalities are present.
My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spend one hour making my app one second faster for my million users, I save roughly 277 user hours per year (1,000,000 seconds ≈ 277 hours). But since user hours are an externality, such optimization never gets done.
Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
One of these is not like the others...
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal: you can have a package index using git while the packages are zip/tar/etc. archives; a package index not using git while each package is cloned from a git repository; both the index and the packages as git repositories; neither using git; or even no package index at all (AFAIK that's the case for Go).
I’m building a Cargo/uv for C. Good article. I’ve thought about this problem very deeply.
Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.
Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.
But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.
Conan’s solution is, I think, just about the only way. It trades perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets to be too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it (rough sketch below).
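To make that concrete, here's a rough sketch of the idea in plain Python -- not Conan's actual recipe API, just the shape of "one manifest per version range, with a little per-version logic":

```python
# Sketch of the idea (not Conan's actual API): one manifest covers a whole
# version range, with a little conditional logic per version, and a coarse
# map from version ranges to manifests when a release diverges too much.

def manifest_3x(version: str) -> dict:
    """Single manifest shared by every 3.x release."""
    minor = int(version.split(".")[1])
    config = {"requires": ["zlib/1.3"], "shared": False}
    if minor >= 12:            # config field that only exists from 3.12 on
        config["uses_new_build_flag"] = True
    return config

# If 3.13 rewrites everything, remap that version range to a new manifest.
MANIFESTS = {
    (3, 0, 3, 12): manifest_3x,
    (3, 13, 3, 99): lambda v: {"requires": ["zlib/1.3"], "rewritten": True},
}

def resolve(version: str) -> dict:
    major, minor = (int(x) for x in version.split(".")[:2])
    for (lo_maj, lo_min, hi_maj, hi_min), fn in MANIFESTS.items():
        if (lo_maj, lo_min) <= (major, minor) <= (hi_maj, hi_min):
            return fn(version)
    raise KeyError(version)

print(resolve("3.11"))   # {'requires': ['zlib/1.3'], 'shared': False}
print(resolve("3.12"))   # same manifest, plus uses_new_build_flag
```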
I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
Do the easy thing while it works, and when it stops working, fix the problem.
Julia does the same thing, and from the Rust numbers in the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).
It works fine; Julia has some heuristics to avoid re-downloading it too often.
But more importantly, there's a simple path to improvement. The top Registry.toml [1] has a path to each package, and once downloading everything proves unsustainable you can just download that one file and use it to fetch the rest as needed (rough sketch after the link). I don't think this is a difficult problem.
[1] https://github.com/JuliaRegistries/General/blob/master/Regis...
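Something along these lines (Python 3.11+ for tomllib; the raw-GitHub URL and the per-package Versions.toml layout are assumptions based on how the General registry is laid out, so treat this as illustrative):

```python
# Sketch of the "download the index file, fetch the rest lazily" approach.
import tomllib
import urllib.request

RAW = "https://raw.githubusercontent.com/JuliaRegistries/General/master"

def load_index() -> dict:
    # One small-ish file instead of the whole registry repo.
    with urllib.request.urlopen(f"{RAW}/Registry.toml") as resp:
        return tomllib.loads(resp.read().decode())

def fetch_package_versions(index: dict, pkg_name: str) -> str:
    # Find the package entry by name, then fetch only its Versions.toml.
    for entry in index["packages"].values():
        if entry["name"] == pkg_name:
            url = f"{RAW}/{entry['path']}/Versions.toml"
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode()
    raise KeyError(pkg_name)

index = load_index()
print(fetch_package_versions(index, "JSON")[:200])
```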
The other conclusion to draw is "Git is a fantastic choice of database for starting your package manager, almost all popular package managers began that way."
I think there's a form of survivorship bias at work here. To use the example of Cargo, if Rust had never caught on, and thereby gotten popular enough to inflate the git-based index beyond reason, then it would never have been a problem to use git as the backing protocol for the index. Likewise, we can imagine innumerable smaller projects that successfully use git as a distributed delta-updating data distribution protocol, and never happen to outgrow it.
The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.
“It never works out” - hmm, seems like it worked out just fine. It worked great to get the operation off the ground, and when scale became an issue it was solvable by moving to something else. It served its purpose; sounds like it worked out to me.
It’s always humbling when you go on the front page of HN and see an article titled “the thing you’re doing right now is a bad idea and here’s why”
This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database.
In this particular case it just doesn’t make a ton of sense to change course. I’m a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no-brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.
I host my own code repository using Forgejo. It's not public. In fact, it's behind mutual TLS, like all the services I host. Reason? I don't want to deal with bots and other security risks that come with opening a port to the world.
Turns out Go modules will not accept a package hosted on my Forgejo instance, because the server asks for a client certificate. There are ways to make go get use SSH, but even with that approach the repository needs to be accessible over HTTPS. In the end, I cloned the repository and used it in my project with a replace directive. It's really annoying.
The facts are interesting, but the conclusion is a bit strange. These package managers have succeeded because git is better for the low-trust model and GitHub has been hosting infra for free that no one in their right mind would provide for the average DB.
If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
It's not just package managers that do this - a lot of smaller projects crowdsource data in git repositories. Most of these don't reach the scale where the technical limitations become a problem.
Personally my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.
There are a few other small problems - but it's interesting to see that so many other projects do this.
I ended up working on an open source software library to help in these cases: https://www.datatig.com/
Here's a write-up of an introductory talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scale point to future versions of this talk, with a link to this post.
So what's the answer then? That's the question I wanted answered after reading this article. With no experience with git or package management, would a local client-side SQLite database and something similar on the server do?
GitHub is intoxicatingly free hosting, but Git itself is a terrible database. Why not maintain an _actual_ database on GitHub, with tagged releases?
SQLite data is paged, so you can get away with fetching only the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
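For anyone curious what that looks like under the hood, here's a minimal sketch of the page-fetching idea. The URL is a placeholder; the linked post wires this into a sql.js VFS so the query planner decides which pages actually get fetched:

```python
# SQLite files are split into fixed-size pages, so a client can pull
# individual pages over HTTP Range requests instead of the whole file.
import struct
import urllib.request

DB_URL = "https://example.com/packages.sqlite3"  # hypothetical static host

def fetch_range(url: str, start: int, length: int) -> bytes:
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# The 100-byte SQLite header starts the file; bytes 16-17 hold the page size.
header = fetch_range(DB_URL, 0, 100)
assert header[:16] == b"SQLite format 3\x00"
(page_size,) = struct.unpack(">H", header[16:18])
page_size = 65536 if page_size == 1 else page_size

# A full client would now fetch only the pages a query touches, e.g. page N:
n = 5
page_n = fetch_range(DB_URL, (n - 1) * page_size, page_size)
print(page_size, len(page_n))
```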
Admittedly, I try to stay away from database design whenever possible at work (everything database-related is legacy for us). But the way the term is being used here kinda makes me wonder: do modern SQL databases have enough security features and permissions management in place that you could just directly expose your database to the world with a "guest" user that can only make incredibly specific queries?
Cut out the middle man, directly serve the query response to the package manager client.
(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all.)
> Auto-updates now run every 24 hours instead of every 5 minutes
What the... why would you run an auto-update every 5 minutes?
The Nixpkgs example is not like the others, because it is source code.
I don't get what is so bad about shallow clones either. Why should they be so performance-sensitive?
What made git special & powerful from the start was its data model: Like the network databases of old, but embedded in a Merkle tree for independent evolution and verifiability.
Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
I am not sure this is necessarily a git issue so much as a GitHub issue. Just look at the AUR of Arch Linux, which works perfectly.
The issues with using Git for Nix seem to entirely be issues with using GitHub for Nix, no?
Alternatively: Downloading the entire state of all packages when you care about just one, it never works out.
O(1) beats O(n) as n gets large.
The Cargo example at the top is striking. Whenever I publish a crate, and it blocks me until I write `--allow-dirty`, I am reminded that there is a conflation between Cargo/crates.io and Git that should not exist. I will write `--allow-dirty` because I think these are two separate functionalities that should not be coupled. Crates.io should not know about or care about my project's Git usage or lack thereof.
If we stopped using VCS to fetch source files, we would lose the ability to get the exact commit (understand: a version that has nothing to do with the underlying VCS) of these files. Git, Mercurial, SVN... GitHub, Bitbucket... it does not matter. Absolutely nobody will be building downloadable versions of their source files, hosted on who knows how "prestigious" domains, by copying them to another location just to serve the *exact same content* that GitHub and the like already provide.
This entire blog is just a waste of time for anyone reading it.
Why don't we have a P2P transfer platform for this (modulo security)?
People who put off learning SQL end up using anything other than a database as their database.
Maybe I'm misreading the article, but isn't every example about the downside of using GitHub as a database host, not the downside of using git as a database?
Like, yes, you should host your own database. This doesn't seem like an argument against that database being git.
Git commits will have a hash and each file will have a hash, which means that locking is unnecessary for read access. (This is also true of Fossil, although Fossil does have locking since it uses SQLite.)
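To make the content-addressing point concrete, a blob's ID is just a SHA-1 over a tiny header plus the file contents, which is why readers can verify what they fetch without coordinating with writers. A sketch mirroring `git hash-object`:

```python
# Compute a git blob ID by hand: sha1("blob <size>\0" + contents).
import hashlib

def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
# (same as `echo hello | git hash-object --stdin`)
```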
The other stuff mentioned in the article seems to be valid criticisms.
> Grab’s engineering team went from 18 minutes for go get to 12 seconds after deploying a module proxy. That’s not a typo. Eighteen minutes down to twelve seconds.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file.
I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?
Uncertain if this is OT, but given that the CCC is a politically inspired organization, I hope not:
One thing that still seems absent is awareness of the complete takeover of "gadgets" in schools. Schools these days, as early as primary school, shove screens in front of children. They're expected to look at them, and "use" them for various activities, including practicing handwriting. I wish I was joking [1].
I see two problems with this.
The first is that these devices are engineered to be addictive by way of constant notifications and distractions, while learning requires long, sustained focus. There's a lot of data showing that under certain common circumstances, you learn worse from a screen than from paper.
The second is that it implicitly trains children to expect that anything has to be done through a screen connected to a closed point-and-click platform. (Uninformed) people will say "people who work with computers make money, so I want my child to have an iPad". But interacting with a closed platform like an iPad removes possibilities and puts the interaction "on rails". You don't learn to think, explore, and learn from mistakes; instead you learn to use the app that's put in front of you. This in turn reinforces the "computer says no" [2] approach to understanding the world.
I think this is a matter of civil rights and freedom, but sadly I don't often see "civil rights" organizations talk about this. I think I heard Stallman say something along these lines once, but other than that I don't see campaigns anywhere.
And this, my friends, is the reason why focusing (only) on CPU cycles and memory hierarchies is insufficient when thinking about the performance of a system. Yes, they are important. But no amount of low-level optimization will get you out of the hole that a wrong choice of algorithm and/or data structure may have dug you into.
Though not GitHub, it's worth mentioning Hugging Face, which also uses git but manages large files with their(?) Xet protocol. https://huggingface.co/docs/hub/en/xet/index
GitLab employee here. We completed the move away from Gollum years ago (see https://gitlab.com/groups/gitlab-org/-/epics/2381).
It looks like that doc https://docs.gitlab.com/development/wikis/ was outdated - it has since been fixed to no longer mention Gollum.
I think git is overkill, and probably a database is as well.
I quite like the Hackage index, which is an append-only tar file. Incremental updates are trivial using HTTP range requests, which makes hosting it trivial as well (sketch below).
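Roughly what the client side looks like, assuming the index is served at hackage.haskell.org/01-index.tar and the server honors Range requests (both true last I checked, but treat them as assumptions):

```python
# Incrementally update an append-only index: ask the server for everything
# past what we already have and append it to the local copy.
import os
import urllib.error
import urllib.request

INDEX_URL = "https://hackage.haskell.org/01-index.tar"
LOCAL = "01-index.tar"

def update_index() -> int:
    have = os.path.getsize(LOCAL) if os.path.exists(LOCAL) else 0
    req = urllib.request.Request(INDEX_URL, headers={"Range": f"bytes={have}-"})
    try:
        with urllib.request.urlopen(req) as resp, open(LOCAL, "ab") as f:
            if have and resp.status != 206:
                raise RuntimeError("server ignored the Range request")
            new = resp.read()
            f.write(new)
            return len(new)
    except urllib.error.HTTPError as e:
        if e.code == 416:   # Range Not Satisfiable: already up to date
            return 0
        raise

print(f"fetched {update_index()} new bytes")
```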
I like Go, but its dependency management is weird and seems to be centered around GitHub a lot.
As far as I know, Nixpkgs doesn't use git as a package database. The package definitions are stored and developed in git, but the channels certainly are not.
Successful things often have humble origins; it's a feature, not a bug.
For every project that managed to outgrow ext4/git, there were a hundred that were well served and never needed to over-invest in something else.
The article's conclusion is just... not good. There are many benefits to using Git as a backend: you can point your project at any commit as a version, which makes testing fixes or changes in libs super easy; it has built-in integrity control; and technically (sadly not in practice) you could just sign commits and use that to verify whether a package is authentic.
It being suboptimal bandwidth-wise is frankly just a technical hurdle to get over, with benefits well worth the drawback.
The nix CLI almost exclusively pulls GitHub repos as zipballs. Not perfect, but certainly far faster than a real git clone.
I understand the article is concerning RFC2789, in cloning whole indexes for lang indexes, but /cargo/src shallow-clones need another layer, where tertiary compilation or decompression takes place in mutex libraries, whether its SSL certificate is dependent on HTTP fetch.
I'd add Ruby's Gemfile git dependencies to the list called out here as well. Bundler supports git repos, but in general it's a bad idea unless you are diligent with git tag use and disallow git tag mutability, which also assumes you have complete control of your git dependencies...
One of the first things I did at my current place of employment was to detangle the mess of Gemfile git dependencies and get them to adopt semver and an actual package repo. There were so many footguns with git dependencies in Ruby that we were getting taken down by friendly fire on the daily...
What is the alternative?
"Use a database" isn't actionable advice because it's not specific enough
Sounds like it worked pretty well several times? But yeah it does not scale forever.
The article lists Git-based wiki engines as a bad usage of Git. Can anybody recommend alternatives? I want something that can be self-hosted, is easily modified by text editors, and has individual page history, preferably with Markdown.
Why not use SQLite, then, as the database for package managers? A local copy could be replicated easily with delta fetches.
So… what we need is a globally distributed set of git seeders for all open-source GitHub content, then?
Seems possible if every git client is also a torrent client.
I want to take a quick detour here if anyone is knowledgeable about this topic.
> The hosting problems are symptoms. The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
Does this mean mbox is inherently superior to maildir? I really like the idea of maildir because there is nothing to compact, but if we assume we never delete emails (on the local machine, anyway), does that mean mbox or something similar is preferable to maildir?
Worst thing is when you’re in an office and your PC, along with other PCs, pulls from git unauthenticated - then you get hit with API limits.
I love this write-up. As a non-expert user of package managers I can quickly understand a set of patterns that have been deeply considered and carefully articulated. Thanks for taking the time to write up your observations!