This seems like a tragedy of the commons -- GitHub is free after all, and it has all of these great properties, so why not? -- but this kind of decision making occurs whenever externalities are present.
My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery and not user interaction time. Yet if I spend one hour making my app one second faster for my million users, I save roughly 277 user hours per year (1,000,000 seconds ≈ 277 hours). But since user hours are an externality, such optimization never gets done.
Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
One of these is not like the others...
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal: you can have a package index using git while the packages are zip/tar/etc. archives; a package index not using git while each package is cloned from a git repository; both the index and the packages as git repositories; neither using git; or even no package index at all (AFAIK that's the case for Go).
I’m building a Cargo/uv for C. Good article. I’ve thought about this problem very deeply.
Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.
Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.
But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.
Conan’s solution is, I think, just about the only way. It trades perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets to be too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it (rough sketch below).
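To make that concrete, here's a rough sketch of the idea in plain Python -- not Conan's actual recipe API, just the shape of "one manifest per version range, with a little per-version logic":

```python
# Sketch of the idea (not Conan's actual API): one manifest covers a whole
# version range, with a little conditional logic per version, and a coarse
# map from version ranges to manifests when a release diverges too much.

def manifest_3x(version: str) -> dict:
    """Single manifest shared by every 3.x release."""
    minor = int(version.split(".")[1])
    config = {"requires": ["zlib/1.3"], "shared": False}
    if minor >= 12:            # config field that only exists from 3.12 on
        config["uses_new_build_flag"] = True
    return config

# If 3.13 rewrites everything, remap that version range to a new manifest.
MANIFESTS = {
    (3, 0, 3, 12): manifest_3x,
    (3, 13, 3, 99): lambda v: {"requires": ["zlib/1.3"], "rewritten": True},
}

def resolve(version: str) -> dict:
    major, minor = (int(x) for x in version.split(".")[:2])
    for (lo_maj, lo_min, hi_maj, hi_min), fn in MANIFESTS.items():
        if (lo_maj, lo_min) <= (major, minor) <= (hi_maj, hi_min):
            return fn(version)
    raise KeyError(version)

print(resolve("3.11"))   # {'requires': ['zlib/1.3'], 'shared': False}
print(resolve("3.12"))   # same manifest, plus uses_new_build_flag
```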
I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
Do the easy thing while it works, and when it stops working, fix the problem.
Julia does the same thing, and from the Rust numbers in the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).
It works fine; Julia has some heuristics to avoid re-downloading it too often.
But more importantly, there's a simple path to improvement. The top Registry.toml [1] has a path to each package, and once downloading everything proves unsustainable you can just download that one file and use it to fetch the rest as needed (rough sketch after the link). I don't think this is a difficult problem.
[1] https://github.com/JuliaRegistries/General/blob/master/Regis...
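Something along these lines (Python 3.11+ for tomllib; the raw-GitHub URL and the per-package Versions.toml layout are assumptions based on how the General registry is laid out, so treat this as illustrative):

```python
# Sketch of the "download the index file, fetch the rest lazily" approach.
import tomllib
import urllib.request

RAW = "https://raw.githubusercontent.com/JuliaRegistries/General/master"

def load_index() -> dict:
    # One small-ish file instead of the whole registry repo.
    with urllib.request.urlopen(f"{RAW}/Registry.toml") as resp:
        return tomllib.loads(resp.read().decode())

def fetch_package_versions(index: dict, pkg_name: str) -> str:
    # Find the package entry by name, then fetch only its Versions.toml.
    for entry in index["packages"].values():
        if entry["name"] == pkg_name:
            url = f"{RAW}/{entry['path']}/Versions.toml"
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode()
    raise KeyError(pkg_name)

index = load_index()
print(fetch_package_versions(index, "JSON")[:200])
```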
The other conclusion to draw is "Git is a fantastic choice of database for starting your package manager, almost all popular package managers began that way."
I think there's a form of survivorship bias at work here. To use the example of Cargo, if Rust had never caught on, and thereby gotten popular enough to inflate the git-based index beyond reason, then it would never have been a problem to use git as the backing protocol for the index. Likewise, we can imagine innumerable smaller projects that successfully use git as a distributed delta-updating data distribution protocol, and never happen to outgrow it.
The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.
“It never works out” - hmm, seems like it worked out just fine. It worked great to get the operation off the ground, and when scale became an issue it was solvable by moving to something else. It served its purpose; sounds like it worked out to me.
It’s always humbling when you go on the front page of HN and see an article titled “the thing you’re doing right now is a bad idea and here’s why”
This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database.
In this particular case it just doesn’t make a ton of sense to change course. I’m a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no-brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.
I host my own code repository using Forgejo. It's not public. In fact, it's behind mutual TLS, like all the services I host. Reason? I don't want to deal with bots and other security risks that come with opening a port to the world.
Turns out Go modules will not accept a package hosted on my Forgejo instance, because the server asks for a client certificate. There are ways to make go get use SSH, but even with that approach the repository needs to be accessible over HTTPS. In the end, I cloned the repository and used it in my project with a replace directive. It's really annoying.
The facts are interesting, but the conclusion is a bit strange. These package managers have succeeded because git is better for the low-trust model and GitHub has been hosting infra for free that no one in their right mind would provide for the average DB.
If it didn't work we would not have these massive ecosystems upsetting GitHub's freemium model, but anything at scale is naturally going to have consequences and features that aren't so compatible with the use case.
It's not just package managers that do this - a lot of smaller projects crowdsource data in git repositories. Most of these don't reach the scale where the technical limitations become a problem.
Personally my view is that the main problem when they do this is that it gets much harder for non-technical people to contribute. At least that doesn't apply to package managers, where it's all technical people contributing.
There are a few other small problems - but it's interesting to see that so many other projects do this.
I ended up working on an open source software library to help in these cases: https://www.datatig.com/
Here's a write-up of an introductory talk about it: https://www.datatig.com/2024/12/24/talk.html I'll add the scale point to future versions of this talk, with a link to this post.
So what's the answer then? That's the question I wanted answered after reading this article. With no experience with git or package management, would a local client-side SQLite database and something similar on the server do?
GitHub is intoxicatingly free hosting, but Git itself is a terrible database. Why not maintain an _actual_ database on GitHub, with tagged releases?
SQLite data is paged, so you can get away with fetching only the pages you need to resolve your query.
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
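For anyone curious what that looks like under the hood, here's a minimal sketch of the page-fetching idea. The URL is a placeholder; the linked post wires this into a sql.js VFS so the query planner decides which pages actually get fetched:

```python
# SQLite files are split into fixed-size pages, so a client can pull
# individual pages over HTTP Range requests instead of the whole file.
import struct
import urllib.request

DB_URL = "https://example.com/packages.sqlite3"  # hypothetical static host

def fetch_range(url: str, start: int, length: int) -> bytes:
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# The 100-byte SQLite header starts the file; bytes 16-17 hold the page size.
header = fetch_range(DB_URL, 0, 100)
assert header[:16] == b"SQLite format 3\x00"
(page_size,) = struct.unpack(">H", header[16:18])
page_size = 65536 if page_size == 1 else page_size

# A full client would now fetch only the pages a query touches, e.g. page N:
n = 5
page_n = fetch_range(DB_URL, (n - 1) * page_size, page_size)
print(page_size, len(page_n))
```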
Admittedly, I try to stay away from database design whenever possible at work (everything database-related is legacy for us). But the way the term is being used here kinda makes me wonder: do modern SQL databases have enough security features and permissions management in place that you could just directly expose your database to the world with a "guest" user that can only make incredibly specific queries?
Cut out the middle man, directly serve the query response to the package manager client.
(I do immediately see issues stemming from the fact that you can't leverage features like edge caching this way, but I'm not really asking if it's a good solution, I'm more asking if it's possible at all.)
> Auto-updates now run every 24 hours instead of every 5 minutes
What the... why would you run an auto-update every 5 minutes?
The Nixpkgs example is not like the others, because it is source code.
I don't get what is so bad about shallow clones either. Why should they be so performance-sensitive?
What made git special & powerful from the start was its data model: Like the network databases of old, but embedded in a Merkle tree for independent evolution and verifiability.
Scaling that data model beyond projects the size of the Linux kernel was not critical for the original implementation. I do wonder if there are fundamental limits to scaling the model for use cases beyond “source code management for modest-sized, long-lived projects”.
I am not sure this is necessarily a git issue so much as a GitHub issue. Just look at the AUR of Arch Linux, which works perfectly.
The issues with using Git for Nix seem to entirely be issues with using GitHub for Nix, no?
Alternatively: Downloading the entire state of all packages when you care about just one, it never works out.
O(1) beats O(n) as n gets large.
The Cargo example at the top is striking. Whenever I publish a crate, and it blocks me until I write `--allow-dirty`, I am reminded that there is a conflation between Cargo/crates.io and Git that should not exist. I will write `--allow-dirty` because I think these are two separate functionalities that should not be coupled. Crates.io should not know about or care about my project's Git usage or lack thereof.
If we stopped using VCS to fetch source files, we would lose the ability to get the exact commit (understand: a version that has nothing to do with the underlying VCS) of these files. Git, Mercurial, SVN... GitHub, Bitbucket... it does not matter. Absolutely nobody will be building downloadable versions of their source files, hosted on who knows how "prestigious" domains, by copying them to another location just to serve the *exact same content* that GitHub and the like already provide.
This entire blog is just a waste of time for anyone reading it.
Why don't we have a P2P transfer platform for this (modulo security)?
People who put off learning SQL end up using anything other than a database as their database.
Maybe I'm misreading the article, but isn't every example about the downside of using GitHub as a database host, not the downside of using git as a database?
Like, yes, you should host your own database. This doesn't seem like an argument against that database being git.
Git commits will have a hash and each file will have a hash, which means that locking is unnecessary for read access. (This is also true of Fossil, although Fossil does have locking since it uses SQLite.)
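To make the content-addressing point concrete, a blob's ID is just a SHA-1 over a tiny header plus the file contents, which is why readers can verify what they fetch without coordinating with writers. A sketch mirroring `git hash-object`:

```python
# Compute a git blob ID by hand: sha1("blob <size>\0" + contents).
import hashlib

def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
# (same as `echo hello | git hash-object --stdin`)
```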
The other stuff mentioned in the article seems to be valid criticisms.
> Grab’s engineering team went from 18 minutes for go get to 12 seconds after deploying a module proxy. That’s not a typo. Eighteen minutes down to twelve seconds.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies. Cloning entire repositories to get a single file.
I have also had inconsistent performance with go get. Never enough to look closely at it. I wonder if I was running into the same issue?
Uncertain if this is OT, but given that the CCC is a politically inspired organization, I hope not:
One thing that still seems absent is awareness of the complete takeover of "gadgets" in schools. Schools these days, as early as primary school, shove screens in front of children. They're expected to look at them, and "use" them for various activities, including practicing handwriting. I wish I was joking [1].
I see two problems with this.
The first is that these devices are engineered to be addictive by way of constant notifications and distractions, while learning requires long, sustained focus. There's a lot of data showing that under certain common circumstances, you learn worse from a screen than from paper.
The second is that it implicitly trains children to expect that anything has to be done through a screen connected to a closed point-and-click platform. (Uninformed) people will say "people who work with computers make money, so I want my child to have an iPad". But interacting with a closed platform like an iPad removes possibilities and puts the interaction "on rails". You don't learn to think, explore, and learn from mistakes; instead you learn to use the app that's put in front of you. This in turn reinforces the "computer says no" [2] approach to understanding the world.
I think this is a matter of civil rights and freedom, but sadly I don't often see "civil rights" organizations talk about this. I think I heard Stallman say something along these lines once, but other than that I don't see campaigns anywhere.
And this, my friends, is the reason why focusing (only) on CPU cycles and memory hierarchies is insufficient when thinking about the performance of a system. Yes, they are important. But no amount of low-level optimization will get you out of the hole that a wrong choice of algorithm and/or data structure may have dug you into.
Though not GitHub, it's worth mentioning Hugging Face, which also uses git but manages large files with their(?) Xet protocol. https://huggingface.co/docs/hub/en/xet/index
GitLab employee here. We completed the move away from Gollum years ago (see https://gitlab.com/groups/gitlab-org/-/epics/2381).
It looks like that doc https://docs.gitlab.com/development/wikis/ was outdated - it has since been fixed to no longer mention Gollum.
I think git is overkill, and probably a database is as well.
I quite like the Hackage index, which is an append-only tar file. Incremental updates are trivial using HTTP range requests, which makes hosting it trivial as well (sketch below).
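Roughly what the client side looks like, assuming the index is served at hackage.haskell.org/01-index.tar and the server honors Range requests (both true last I checked, but treat them as assumptions):

```python
# Incrementally update an append-only index: ask the server for everything
# past what we already have and append it to the local copy.
import os
import urllib.error
import urllib.request

INDEX_URL = "https://hackage.haskell.org/01-index.tar"
LOCAL = "01-index.tar"

def update_index() -> int:
    have = os.path.getsize(LOCAL) if os.path.exists(LOCAL) else 0
    req = urllib.request.Request(INDEX_URL, headers={"Range": f"bytes={have}-"})
    try:
        with urllib.request.urlopen(req) as resp, open(LOCAL, "ab") as f:
            if have and resp.status != 206:
                raise RuntimeError("server ignored the Range request")
            new = resp.read()
            f.write(new)
            return len(new)
    except urllib.error.HTTPError as e:
        if e.code == 416:   # Range Not Satisfiable: already up to date
            return 0
        raise

print(f"fetched {update_index()} new bytes")
```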
I like Go, but its dependency management is weird and seems to be centered around GitHub a lot.
As far as I know, Nixpkgs doesn't use git as a package database. The package definitions are stored and developed in git, but the channels certainly are not.
Successful things often have humble origins; it's a feature, not a bug.
For every project that managed to outgrow ext4/git, there were a hundred that were well served and never needed to over-invest in something else.
The article's conclusion is just... not good. There are many benefits to using Git as a backend: you can point your project at any commit as a version, which makes testing fixes or changes in libs super easy; it has built-in integrity control; and technically (sadly not in practice) you could just sign commits and use that to verify whether a package is authentic.
It being suboptimal bandwidth-wise is frankly just a technical hurdle to get over, with benefits well worth the drawback.
The nix CLI almost exclusively pulls GitHub repos as zipballs. Not perfect, but certainly far faster than a real git clone.
I understand the article is concerning RFC2789, in cloning whole indexes for lang indexes, but /cargo/src shallow-clones need another layer, where tertiary compilation or decompression takes place in mutex libraries, whether its SSL certificate is dependent on HTTP fetch.
I'd add Ruby's Gemfile git dependencies to the list called out here as well. Bundler supports git repos, but in general it's a bad idea unless you are diligent with git tag use and disallow git tag mutability, which also assumes you have complete control of your git dependencies...
One of the first things I did at my current place of employment was to detangle the mess of Gemfile git dependencies and get them to adopt semver and an actual package repo. There were so many footguns with git dependencies in Ruby that we were getting taken down by friendly fire on the daily...
What is the alternative?
"Use a database" isn't actionable advice because it's not specific enough
Sounds like it worked pretty well several times? But yeah it does not scale forever.
The article lists Git-based wiki engines as a bad usage of Git. Can anybody recommend alternatives? I want something that can be self-hosted, is easily modified by text editors, and has individual page history, preferably with Markdown.
Why not use SQLite, then, as the database for package managers? A local copy could be replicated easily with delta fetches.
So… what we need is a globally distributed set of git seeders for all open-source GitHub content, then?
Seems possible if every git client is also a torrent client.
I want to take a quick detour here if anyone is knowledgeable about this topic.
> The hosting problems are symptoms. The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
Does this mean mbox is inherently superior to maildir? I really like the idea of maildir because there is nothing to compact, but if we assume we never delete emails (on the local machine, anyway), does that mean mbox or something similar is preferable to maildir?
Worst thing is when you’re in an office and your PC, along with other PCs, pulls from git unauthenticated - then you get hit with API limits.
I love this write-up. As a non-expert user of package managers I can quickly understand a set of patterns that have been deeply considered and carefully articulated. Thanks for taking the time to write up your observations!