Hacker News

b1temy today at 2:53 PM

Nice work, it's always interesting to see how someone would design their own VCS from scratch, and to see whether they run into the same problems existing implementations ran into in the past, and whether they naturally reach the same solutions.

The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.

I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git` directory, and only updates it if it senses the file changed (Hm... since it can't detect this by re-hashing the file and comparing it with a known value without paying the cost it is trying to avoid, perhaps by the timestamp the file was last edited?).
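To make the caching idea concrete, here is a minimal sketch in Python (not tvc's Rust, and not git's actual code). The `StatCache` class and its fields are hypothetical names for illustration. It assumes a file whose modification time and size are unchanged still has the same contents, and only re-hashes when either differs.

```python
import hashlib
import os


def hash_file(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


class StatCache:
    """Hypothetical cache: path -> (mtime_ns, size, digest)."""

    def __init__(self):
        self.entries = {}
        self.rehashed = 0  # number of real hash computations, for illustration

    def digest(self, path):
        st = os.stat(path)
        key = (st.st_mtime_ns, st.st_size)
        cached = self.entries.get(path)
        if cached is not None and cached[:2] == key:
            # Metadata unchanged: trust the cached digest (an assumption).
            return cached[2]
        self.rehashed += 1
        d = hash_file(path)
        self.entries[path] = (key[0], key[1], d)
        return d
```

The trade-off is the one raised in the reply below: an edit that keeps the same size and lands within the filesystem's timestamp granularity would be missed, so a real tool needs some fallback for that case.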

> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets

This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/
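A toy illustration of why collisions matter for a content-addressed store, in Python. Everything here is hypothetical: `weak_hash` deliberately keeps only one byte of SHA-256 so a clash can be brute-forced instantly, and `put` is a stand-in for an object store, not git's or tvc's. SHAttered was a collision attack in which both files were crafted. This sketch instead brute-forces a match for a fixed file, which is only feasible because the hash is truncated.

```python
import hashlib


def weak_hash(data: bytes) -> str:
    # Deliberately weak: only the first byte (two hex chars) of SHA-256.
    return hashlib.sha256(data).hexdigest()[:2]


def find_collision(target: bytes) -> bytes:
    """Brute-force a different input with the same weak hash as target."""
    want = weak_hash(target)
    i = 0
    while True:
        candidate = b"evil-%d" % i
        if weak_hash(candidate) == want:
            return candidate
        i += 1


store = {}


def put(data: bytes) -> str:
    """Store data under its weak hash; an existing key is left untouched."""
    key = weak_hash(data)
    store.setdefault(key, data)
    return key
```

Once two different contents share a key, the store can only hold one of them, so whoever asks for "the file with this hash" has no guarantee about which content they get.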

I've not personally heard of the process of hashing also being known as digesting, though I don't doubt that it is the case. I'm mostly familiar with the resulting hash being referred to as the message digest. Perhaps it's to differentiate between the verb 'hash' (the process of hashing) and the noun 'hash' (the result of hashing). And naming the function `sha256::try_digest` makes it more explicit that it is returning the hash/digest. But that is a bit of a reach; perhaps they are just synonyms to be used interchangeably, as you said.
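For what it's worth, Python's standard `hashlib` happens to follow this naming split: you feed data to a hash object and then ask it for the digest. This is only an illustration from another language, not a claim about how the Rust `sha256` crate is organized internally.

```python
import hashlib

# The hash object does the hashing (the verb).
hasher = hashlib.sha256()
hasher.update(b"hello ")
hasher.update(b"world")

# The digest is the result (the noun), here as a hex string.
digest = hasher.hexdigest()

# Hashing the whole input in one call yields the same digest.
one_shot = hashlib.sha256(b"hello world").hexdigest()
```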

On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.

--

Also, obligatory mention of jujutsu/jj, since it seems to always come up when a VCS is discussed on HN.


Replies

TonyStr today at 3:46 PM

You are completely right about tvc ls recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only reliable way to verify a file's contents would be to generate a hash.

In my lazy implementation, I don't even check if the hashes match; the program reads, compresses and tries to write the unchanged files. This is an obvious area to improve performance on. I've noticed that git speeds up object lookups by generating two-letter directories from the first two letters in hashes, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.
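A rough sketch of both ideas in Python: the two-character directory fan-out, and skipping the compress-and-write step when the object already exists. The names `object_path` and `store_blob` are made up for this example, and it hashes the raw bytes with SHA-256 as a simplifying assumption. It is not tvc's or git's actual object format.

```python
import hashlib
import os
import zlib


def object_path(objects_dir, hex_digest):
    """Split the digest into a 2-character directory and the remainder."""
    return os.path.join(objects_dir, hex_digest[:2], hex_digest[2:])


def store_blob(objects_dir, data):
    """Store data under its hash. Returns (digest, written)."""
    digest = hashlib.sha256(data).hexdigest()
    path = object_path(objects_dir, digest)
    if os.path.exists(path):
        # Same hash means same content: nothing to compress or write.
        return digest, False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return digest, True
```

With this layout, storing an unchanged file a second time costs one hash and one existence check instead of a compression pass and a write.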

> why were TOML files not considered at the end

I'm just not that familiar with TOML. Maybe that would be a better choice! I saw another commenter who complained about YAML. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you'll have to add the `-p` flag (pretty-print) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in YAML/TOML/JSON/XML or some special format.
