chipx86, last Wednesday at 6:05 AM (4 replies)

Hi, one of the authors of DiffX here. I didn't expect to find this on Hacker News tonight.

I'm going to try to address some of the reasoning behind DiffX here, and answer a few things that have come up in the comments. I'll start by saying, the issues we're addressing are more issues encountered by tools that work with these files, not necessarily the end users of these tools. Most people never have to think hard about the structure of these files, but we do.

A little background. I co-founded Review Board, a code review product that's been around since 2006. We work with a wide variety of source code management solutions, and because of this, we deal extensively with .diff/.patch files. And nobody generates these in any consistent way.

If Git were the only SCM out there, there'd be no need for DiffX. But there are at least a dozen SCMs in use in production across companies today, and more being developed. And each does things differently.

Git, Mercurial, Subversion, ClearCase, Perforce, and others all have their own bespoke type of format, built largely (but not always consistently or fully compatibly) on Unified Diffs. These often augment Unified Diffs by injecting:

* Revision identifiers (which can come in all kinds of forms -- numbers, hashes, paths, reserved words -- and sometimes need to be paired with other information to resolve a file). Sometimes these are on the `---`/`+++` lines, sometimes not.

* Symlink information (some tools provide the old and new symlink paths, some just the new, some neither, some the file contents)

* File modes (similarly, out of those that convey file modes, some provide more details than others, and these can impact application or processing of a patch)

* Commit descriptions (and if not done right, a stray `---` or metadata keyword can break some parsers)

Or any number of other common or SCM-specific metadata.

Pretty much all of these represent data in their own ways. This information goes in the "garbage" area of Unified Diffs, which basically means tools like GNU patch ignore it, but tools aware of that specific variant can parse it out.

At this point in my life, I've written a couple of dozen bespoke diff parsers. Depending on the tool generating the diff, there are all sorts of parsing issues that can come up:

* Varying encodings for file paths and text strings (like commit messages), which can mean a patch on one system doesn't apply on another, depending on the tools generating it.

* BOMs that sometimes show up in strings (we hit this with Perforce years back).

* Messages or metadata can sometimes include characters that resemble Unified Diff content or other variant-specific syntax and can break patching/parsing.

* Differences in how information like symlinks/file modes are conveyed (see above).

* Binary files are almost never able to be represented beyond a "Binary files X and Y differ" line.

* Newlines are sometimes outright broken within the diff (particularly with mixed line endings) and can sometimes break patching.

Just to name a few.
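To make the "content that resembles Unified Diff syntax" issue concrete, here is a minimal sketch (not code from any real parser; the function names and the sample diff are made up for illustration). A removed line whose text starts with `-- `, such as a SQL comment, shows up in the diff as `--- ...`, which a naive scanner mistakes for a file header. Tracking the line counts from the `@@` hunk header avoids that.

```python
import re

# A removed SQL comment ("-- drop the old index") appears as "--- drop the old index".
SAMPLE_DIFF = """\
--- a/schema.sql
+++ b/schema.sql
@@ -1,3 +1,2 @@
 CREATE TABLE t (id INT);
--- drop the old index
 DROP INDEX idx;
"""

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def naive_count_files(diff_text):
    """Hypothetical naive approach: every '--- ' line is taken as a file header."""
    return sum(1 for line in diff_text.splitlines() if line.startswith("--- "))


def hunk_aware_count_files(diff_text):
    """Count file headers, skipping lines that belong to a hunk body."""
    files = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                # Context line: counts against both sides.
                old_left -= 1
                new_left -= 1
            continue
        match = HUNK_RE.match(line)
        if match:
            # An omitted count in a hunk header means 1.
            old_left = int(match.group(1) or 1)
            new_left = int(match.group(2) or 1)
        elif line.startswith("--- "):
            files += 1
    return files


print(naive_count_files(SAMPLE_DIFF), hunk_aware_count_files(SAMPLE_DIFF))
```

The naive version reports two files for a one-file diff. The hunk-aware version gets it right, but only by reading every line of every hunk.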

Many SCMs don't even have a diff format to begin with, just generating a Unified Diff. These don't contain any revision information needed to locate a file. In these cases, or when important information is unavailable in some diff variant, we're stuck rolling our own.

Also, here's something you wouldn't normally expect to be a problem, but in practice it comes up more than you'd think: some very large diffs (we've seen ones hundreds of megabytes in size -- don't get me started) are time-consuming to parse. To know everything about the diff, you need to read and scan every line. To generate a list of filenames or stats on a diff, you need to effectively parse all of it.

So we took all those pain points, talked to developers working on a few SCMs, got their pain points and thoughts, and drafted the initial DiffX spec. Went through several rounds with them, iterated until we got where we are today.

The spec had some important goals:

1. Not being vendor-specific.

Git patches were built for Git, and even Git-like patches from Mercurial or Subversion have quirks that can break parsing in a Git-specific patch parser. There's no grammar for how Git stores the metadata and the clients require knowledge up-front of the value types, which isn't a good fit for some of the SCMs out there.

We wanted to draft something that could be used more generically, able to be adopted by newer tools while also being able to represent the information provided from existing tools.

2. Support for arbitrary and injectable metadata.

Some of the formats we work with don't contain enough information to locate the file + revision within a repository. Some require additional information, like an explicit branch or workspace ID or a counterpart changeset number.

Even Git diffs don't always provide enough. They provide a Blob SHA, but not a commit SHA by default. This is a problem when talking to APIs on Git hosting services that require a commit SHA along with (or instead of) a blob SHA.
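As a rough illustration, the `index` line in a Git diff carries the (usually abbreviated) old and new blob IDs and, optionally, a file mode, but nothing that identifies the commit. The sketch below pulls those fields out. The function name, the returned dictionary keys, and the sample hashes are made up for the example.

```python
import re

# Matches lines such as "index 3f2a1bc..9d8e7f0 100644".
GIT_INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)(?: (\d+))?$")


def parse_git_index_line(line):
    """Return the blob IDs and mode from an 'index' line, or None if no match."""
    match = GIT_INDEX_RE.match(line)
    if match is None:
        return None
    old_blob, new_blob, mode = match.groups()
    return {"old_blob": old_blob, "new_blob": new_blob, "mode": mode}


info = parse_git_index_line("index 3f2a1bc..9d8e7f0 100644")
print(info)
```

Everything available here is a blob ID, so a service that wants a commit SHA has to get it from somewhere else.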

And some have useful data that can't be fetched after the fact.

So a common headache is that we need to inject additional information in the diffs we generate in order to allow the appropriate data to be looked up.

By using a form of metadata storage for the diff file, the commit, and the files within, we have the ability to inject that additional information without worrying about corrupting a state machine or regex or whatever method is used for some parser (or some older version of our own parsers).

We eventually chose JSON here. We initially had a grammar that looked more like Git's format, but found ourselves dealing with some of the same challenges that YAML had. We didn't want the "NO" problem and we didn't want every client to have to decide on what the value of a string in a piece of metadata should be. Some metadata (such as revision IDs in some SCMs) differentiate between a number and a string that may look like a number, and that information is important to know up-front.
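A small sketch of the typing point, using only Python's standard `json` module (the metadata keys are invented for the example and are not taken from the spec). In JSON, a quoted value is always a string and an unquoted number is always a number, so the reader never has to guess. YAML 1.1 parsers, by contrast, commonly read an unquoted `NO` as a boolean.

```python
import json

# Hypothetical metadata: "revision" looks numeric but must stay a string.
raw = '{"country": "NO", "revision": "0042", "changeset": 42}'
meta = json.loads(raw)

for key, value in meta.items():
    print(key, type(value).__name__, repr(value))
```

The string `"0042"` keeps its leading zeros and `"NO"` stays a string, while `42` comes back as an integer.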

The consensus was that there was more value in JSON than some other format, since it's well-understood, parsers are readily available, and there's no sign that it's disappearing or dropping out of maintenance any time soon.

The format allows for future metadata formats here if, say, json5 or YAML#++ becomes a well-adopted standard 10 years from now.

3. Parsing and mutability.

When working with these files, it's sometimes important to scan for information in the file to do some pre-processing. How many files are in the diff? Is this past some threshold that might trigger a rate limit if we fetch data for each from a repository? Are there binary files in this change?

When these files get very large (which can happen in enterprises when posting a change merging two long-running branches together -- no, people shouldn't ever post 100MB diffs, but they do), these operations can get expensive.

So we built in some parsing aids (content lengths for sections, section hierarchy identifiers) to allow for more efficient parsing. We can read a section header, know where we are and what section we're nested in, and jump to the next piece to read. This is far more efficient than parsing-to-scan and avoids a lot of headaches.
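Here is a minimal sketch of the idea, using a simplified toy header syntax as a stand-in rather than the actual DiffX grammar. The assumptions: each header line starts with `#`, the number of dots gives the nesting depth, and an optional `length=N` gives the byte count of the content that follows. The scanner reads headers only and seeks past the content.

```python
import io


def list_sections(stream):
    """Return (depth, name, length) per section, seeking past section content."""
    sections = []
    while True:
        header = stream.readline()
        if not header:
            break
        text = header.decode("utf-8").rstrip("\n")
        if not text.startswith("#"):
            raise ValueError(f"expected a section header, got {text!r}")
        marker, _, options = text[1:].partition(":")
        name = marker.lstrip(".")
        depth = len(marker) - len(name)  # number of leading dots
        opts = dict(
            item.strip().split("=", 1)
            for item in options.split(",")
            if "=" in item
        )
        length = int(opts.get("length", 0))
        stream.seek(length, io.SEEK_CUR)  # skip the content without parsing it
        sections.append((depth, name, length))
    return sections


# Build a toy file so the declared lengths are guaranteed to be correct.
meta_body = b'{"path": "README"}\n'
diff_body = b"--- a/README\n+++ b/README\n@@ -1 +1 @@\n-old\n+new\n"
sample = (
    b"#toy: version=1\n"
    b"#.change:\n"
    b"#..file:\n"
    + b"#...meta: format=json, length=%d\n" % len(meta_body) + meta_body
    + b"#...diff: length=%d\n" % len(diff_body) + diff_body
)

sections = list_sections(io.BytesIO(sample))
file_count = sum(1 for _, name, _ in sections if name == "file")
print(sections, file_count)
```

Counting files this way never looks at the diff bodies, so a body containing `---` or `#` lines can't confuse the scan.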

We also get mutability. Generating a diff and attaching metadata to a file in the middle of a diff becomes a lot faster and safer this way.

A consumer never needs to do this. A tool does.

We figured we'd address the text encoding issues while we were at it, because oh boy can these cause problems. A whole topic of its own.

4. Multi-commit files.

Yep, there's Git format-patch. That works great for Git. If I'm on Perforce or SOS or ClearCase and I want to represent a series of commits, I don't have an equivalent format.

If one wants to be able to send a diff spanning a series of commits somewhere for processing or application, being able to do that with one file is valuable. One file means one thing to upload, no risk of a patch ordering issue or a missing patch in the series. The tool processing the diff file would have all the state it needs up-front.

5. Binary files.

Binary files are important. A lot of projects are more than source code in text format. Images, documents, 3D models... these get left out of diff files today by default.

The exception is Git, which can represent changes to binary files as Git Literal and Git Delta formats. This is largely undocumented (outside of our spec) and not supported by really anything else.

We review binary files, so we wanted this. Talking to other SCM vendors, some found this a pain point as well but didn't have a solution in place. So we wanted this to be documented and addressed in the spec.

This is already very long, but I wanted to give a bit of insight into the kinds of problems and inconsistencies that tools (not necessarily end users) have to deal with, and how this is meant to address some of those problems.


Replies

kvemkon, last Wednesday at 8:29 AM

> length=629

If you work often enough with patches, you know it is not so rare you need to modify the patch itself resulting in different length. Please do not hard-code the length.

xorcist, last Wednesday at 11:37 AM

From the outside, it seems like what you really wanted to write was a diff tool, not a new patch format.

You already are user facing. Why interpret a user facing format behind the scenes? It makes no sense. The document speaks about specifying yet another diff format, but luckily, it does nothing of the sort but specifies a new patch set format. But those are by necessity VCS- and file format specific.

chipx86, last Wednesday at 6:09 AM

Sorry for the absolute wall of text (I tried nesting things under bullet points but that didn't work out well). Hopefully some of that is useful.

Key point: Much of this is about solving issues with tools that work with the varying file formats of diffs. It's not really something end users should ever have to care about.

chipx86, last Wednesday at 6:12 AM

I should also mention, we didn't want to invent a brand-new diff format that required all-new tooling (a replacement for GNU patch, for instance). We want this to be able to work with existing tools that understand Unified Diffs and respect the garbage areas (as most do).

There are a lot of alternative approaches to how one might generate a diff (see VCDIFF for binary files), and much that's worth thinking about for generating diff-like formats that are not line-based. But this is not meant to be those. It is meant to be able to incorporate those as time goes on (as it does with VCDIFF today).
