At this point, the question is: why keep files as blobs in the first place? If a revision control system stores ASTs instead, all the work happens at the AST level. One can then run SQL-level queries to see what is changing where. Like
- do any concurrent branches touch this function?
- what new uses did this function accrete recently?
- did we create any actual merge conflicts?
Almost LSP-level querying, involving versions and branches.
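To make the kind of query listed above concrete, here is a minimal sketch using an in-memory SQLite table. The `entity_changes` schema, the entity naming scheme, and the helper names are all made up for illustration; none of this is Beagle's actual storage layout.

```python
import sqlite3

# Hypothetical schema: one row per (branch, commit, entity) change.
# This is an illustration only, not how Beagle stores anything.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_changes (branch TEXT, commit_id TEXT, entity TEXT, kind TEXT)"
)
conn.executemany(
    "INSERT INTO entity_changes VALUES (?, ?, ?, ?)",
    [
        ("main", "c1", "src/io.c#read_header", "modified"),
        ("feature-a", "c2", "src/io.c#read_header", "modified"),
        ("feature-b", "c3", "src/io.c#write_header", "added"),
    ],
)


def branches_touching(entity):
    """Do any concurrent branches touch this function?"""
    rows = conn.execute(
        "SELECT DISTINCT branch FROM entity_changes WHERE entity = ? ORDER BY branch",
        (entity,),
    ).fetchall()
    return [row[0] for row in rows]


def contended_entities():
    """Entities changed on more than one branch: candidates for real conflicts."""
    rows = conn.execute(
        "SELECT entity FROM entity_changes "
        "GROUP BY entity HAVING COUNT(DISTINCT branch) > 1 ORDER BY entity"
    ).fetchall()
    return [row[0] for row in rows]
```

With blobs, answering either question means parsing and diffing every branch first; with entity-level storage it is a lookup.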
Beagle is a revision control system like that [1]. It is quite early stage, but the surprising finding is: instead of being a depository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open, on the assumption that a lot of things can be built on top of it. Essentially, it is a key-value db: keys are URIs and values are BASON (binary mergeable JSON) [2]. Can't be more open than that.
[1]: https://github.com/gritzko/librdx/tree/master/be
[2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md
This is the right question. Storing ASTs directly would make all of this native instead of layered on top.
The pragmatic reason weave works at the git layer: adoption. Getting people to switch merge drivers is hard enough; getting them to switch VCS is nearly impossible. So weave parses the three file versions on the fly during merge, extracts entities, resolves per-entity, and writes back a normal file that git stores as a blob. You get entity-level merging without anyone changing their workflow.
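Roughly, the "resolve per-entity" step is a three-way merge keyed by entity name instead of by line. A simplified sketch, assuming each version has already been parsed into a dict of entity name to source text (`merge_entities` is a hypothetical helper, not weave's actual code, and parsing and write-back are left out):

```python
def merge_entities(base, ours, theirs):
    """Three-way merge over {entity_name: source_text} dicts.

    Returns (merged, conflicts). A missing key means the entity does not
    exist in that version, so additions and deletions fall out of the
    same comparison as modifications.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o  # both sides agree (including both deleting it)
        elif o == b:
            result = t  # only theirs changed this entity
        elif t == b:
            result = o  # only ours changed this entity
        else:
            conflicts.append(name)  # both changed it, differently
            result = o  # keep ours as a placeholder for manual resolution
        if result is not None:
            merged[name] = result
    return merged, conflicts
```

Two branches editing different functions in the same file never conflict here, which is the main win over line-based merging.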
But you're pointing at the ceiling of that approach. A VCS that stores ASTs natively could answer "did any concurrent branches touch this function?" as a query, not as a computation. That's a fundamentally different capability. Beagle looks interesting, will dig into the BASON format.
We built something adjacent with sem (https://github.com/ataraxy-labs/sem) which extracts the entity dependency graph from git history. It can answer "what new uses did this function accrete" and "what's the blast radius of this change" but it's still a layer on top of git, not native storage.
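For a rough idea of what a "blast radius" query looks like once you have a dependency graph: invert the uses edges and walk outward from the changed entity. This is a generic sketch with a hypothetical `blast_radius` helper and a hand-written graph, not sem's actual implementation or API.

```python
from collections import deque


def blast_radius(uses, changed):
    """Return every entity that depends, directly or transitively, on `changed`.

    `uses` maps each entity to the set of entities it calls or references.
    """
    # Invert the graph: callee -> set of callers.
    used_by = {}
    for caller, callees in uses.items():
        for callee in callees:
            used_by.setdefault(callee, set()).add(caller)

    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, ()):
            if dependent != changed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

For example, with `uses = {"main": {"load"}, "load": {"parse"}, "test_parse": {"parse"}}`, a change to `parse` reaches `load`, `main`, and `test_parse`.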
Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best-effort attempt to parse, but it can't in general because of the preprocessor.
Well, I'll be diving in. Thank you for sharing. Same for Weave.
How do you get blob file writes fast?
I built lix [0] which stores ASTs instead of blobs.
Direct AST writing works for apps that are “ast aware”. And I can confirm, it works great.
But all software just writes bytes atm.
The binary -> parse -> diff is too slow.
The parse and diff steps need to get out of the hot path. That semi-defeats the idea of a VCS that stores ASTs, though.
[0] https://github.com/opral/lix