DiffX – Next-Generation Extensible Diff Format

355 points • by todsacerdoti • last Wednesday at 2:38 AM • 149 comments

Comments

laserbeam • last Wednesday at 4:15 AM

I really don’t like the highly hierarchical format, that there’s a “..meta” and a “...meta” somewhere else. I can imagine we want to annotate the whole diff, each file and each chunk. That’s a total of 3 levels of depth. Let’s just give them distinct names and not go full yaml with a format for once?

This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error prone (it makes little sense to me why the metadata associated with a whole diff should have the same fields as the metadata of a file).

Furthermore, why do we have two formats, JSON and key=value pairs? Is there any reason not to just use one format? It sounds like the number of things we’d want to annotate is quite small. Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq - but not both at once).

Other notes:

- please allow trailing commas in lists

- diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?

- revisions are a file property? Not a commit checksum? (I might just be dumb here)

HelloNurse • last Wednesday at 7:13 AM

A staggering amount of unnecessary and counterproductive scope creep in just 4 items:

    A single diff can’t represent a list of commits

    There’s no standard way to represent binary patches

    Diffs don’t know about text encodings (which is more of a problem than you might think)

    Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.

Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.

blacklion • last Wednesday at 11:32 AM

So, a self-delimited format (JSON) is embedded in a format with lengths? I change one space in the JSON, the JSON is still valid, the whole DiffX file is invalid.

Nice, nice.

Format looks very clunky and messy, to be honest: a mixture of self-invented headers and JSON payloads, a strange structure (without the comments here I would not have noticed the different number of dots in `.meta`), and a need for essentially two parsers.

Idea to have extended diff with standard way to put metadata is good.

This implementation looks bad, sorry.
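
To make the length concern above concrete, here is a minimal sketch. It assumes a header shaped like the `#..meta: format=json, length=270` line quoted elsewhere in this thread, and it assumes `length` counts the bytes of the payload that follows. The `check_section` helper and the sample payload are hypothetical, not part of any DiffX tooling.

```python
import json


def check_section(header: str, payload: bytes) -> bool:
    """Return True if the declared length matches the payload size.

    Assumption: the header looks like "#..meta: format=json, length=270"
    and `length` is the byte count of the payload.
    """
    _, _, options = header.partition(":")
    fields = dict(item.strip().split("=", 1) for item in options.split(","))
    return int(fields["length"]) == len(payload)


original = b'{"stats": {"files": 1}}'
header = f"#..meta: format=json, length={len(original)}"

# One extra space after the first colon: still valid, equivalent JSON.
edited = b'{"stats":  {"files": 1}}'

same_json = json.loads(original) == json.loads(edited)
original_ok = check_section(header, original)
edited_ok = check_section(header, edited)
```

Under those assumptions the two payloads decode to the same object, yet only the original one agrees with the declared length, so a hand edit would also require updating the header.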

ed • last Wednesday at 3:54 AM

The patch format addresses all of these issues, no?

https://git-scm.com/docs/git-format-patch

bawolff • last Wednesday at 7:48 AM

Are these really problems? I feel like i've never really encountered any of these issues and have trouble imagining when they would crop up (except binary files).

- encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

- why would i want a single diff to represent multiple commits? Having multiple diffs seems much more natural.

- metadata... i guess, but also the metadata seems like it would mostly only be useful inside a single system.

xyzzy_plugh • last Wednesday at 3:49 AM

I find this whole document hard to read. A "diff" colloquially refers to the difference between two things -- files, directory trees, whatever. What TFA refers to as a diff has always been known as a patch, at least to me.

This is nothing about diffs, but entirely about patch metadata management. I mean, sure, noble goal, but this is just shuffling bits around. If they proposed that metadata was required to be JSON that would be one thing, but instead it's some weird self-describing length-delimited nonsense that just disguises the same problems that exist today. It's already extensible! Just type words!

I've spent a lot of time parsing things out of git commits and patch files and while some standardization would be neat, this isn't it.

That said I find the argument that git diff style is more or less canonical more compelling than I have in the past. So there's that.

> A single diff can't represent a list of commits

A patch set can! Why on earth would you want that represented by a single diff is beyond me.

redleader55 • last Wednesday at 5:42 AM

What actual problem is this trying to solve? They mention patch/diff format not being good enough, but they don't explain for whom. Are GNU Patch people complaining? What are these people building that needs a better patch format?

itake • last Wednesday at 3:59 AM

One of my issues that remains unsolved with diff tools is they are dependent on new line attributes.

Reviewing changes on a long line (like compressed json or long array) is too difficult.
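
One possible workaround for the long-line case is to diff tokens instead of lines. The following is a minimal sketch using Python's `difflib.SequenceMatcher`; the tokenizer regex and the `token_diff` helper are illustrative assumptions and are not taken from DiffX or from any existing diff tool.

```python
import difflib
import re


def token_diff(old: str, new: str):
    """Return (tag, old_text, new_text) for each changed run of tokens."""
    # Split into words/numbers and single punctuation characters so that
    # one long line becomes a sequence of small comparable units.
    a = re.findall(r"\w+|\W", old)
    b = re.findall(r"\w+|\W", new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return changes


old_line = '{"name":"diffx","version":1,"tags":["a","b"]}'
new_line = '{"name":"diffx","version":2,"tags":["a","b"]}'
changes = token_diff(old_line, new_line)
```

For this pair of lines the only reported change is the replacement of `1` by `2`, instead of the whole line being flagged as removed and re-added.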

tlb • last Wednesday at 9:14 AM

The most general and unambiguous way to represent a diff is to just include the contents of the two files. It's more data, but that's rarely an issue these days.

So instead of `diff a b | patch c`, where the data through the pipe needs to be in some interchange format, you'd run `apply a b c` and the apply command can use whatever internal representation it likes.

Diffs also aren't great for human reading. A color-coded side-by-side view is better. For which you also want to start with the two files.

There's really no need to ever transmit a diff and deal with all the format vagaries when you can just send the two files.

koiueo • last Wednesday at 5:02 AM

> format (string – recommended):
>
> This would indicate the metadata format. Currently, only json is officially supported, and is the default if not provided.

JSON doesn't seem a good choice for representing metadata in a format that aims to be universal. It is unnecessarily complicated for this purpose IMO

greatgib • last Wednesday at 8:33 AM

Extending/reworking the format is probably good, but I don't think that using multiline (indentation-dependent) JSON or YAML would be good for such a thing.

One of the interesting points of diff files is that all commands are on single lines. You can easily parse or manipulate them with simple shell tools, just stripping lines out.

signa11 • last Wednesday at 7:08 AM

difftastic: https://difftastic.wilfred.me.uk/ uses tree-sitter for better diff-info, and is, imho, superior to this.

chipx86 • last Wednesday at 6:05 AM

Hi, one of the authors of DiffX here. I didn't expect to find this on Hacker News tonight.

I'm going to try to address some of the reasoning behind DiffX here, and answer a few things that have come up in the comments. I'll start by saying, the issues we're addressing are more issues encountered by tools that work with these files, not necessarily the end users of these tools. Most people never have to think hard about the structure of these files, but we do.

A little background. I co-founded Review Board, a code review product that's been around since 2006. We work with a wide variety of source code management solutions, and because of this, we deal extensively with .diff/.patch files. And nobody generates these in any consistent way.

If Git were the only SCM out there, there'd be no need for DiffX. But there are at least a dozen SCMs in use in production across companies today, and more being developed. And each does things differently.

Git, Mercurial, Subversion, ClearCase, Perforce, and others all have their own bespoke type of format, built largely (but not always consistently or fully compatibly) on Unified Diffs. These often augment Unified Diffs by injecting:

* Revision identifiers (which can come in all kinds of forms -- numbers, hashes, paths, reserved words -- and sometimes need to be paired with other information to resolve a file). Sometimes these are on the `---`/`+++` lines, sometimes not.

* Symlink information (some tools provide the old and new symlink paths, some just the new, some neither, some the file contents)

* File modes (similarly, out of those that convey file modes, some provide more details than others, and these can impact application or processing of a patch)

* Commit descriptions (and if not done right, a stray `---` or metadata keyword can break some parsers)

Or any number of other common or SCM-specific metadata.

Pretty much all of these represent data in their own ways. This information goes in the "garbage" area of Unified Diffs, which basically means tools like GNU patch ignore it, but tools aware of that specific variant can parse it out.

At this point in my life, I've written a couple of dozen bespoke diff parsers. Depending on the tool generating the diff, there are all sorts of parsing issues that can come up:

* Varying encodings for file paths and text strings (like commit messages), which can mean a patch on one system doesn't apply on another, depending on the tools generating it.

* BOMs that sometimes show up in strings (we hit this with Perforce years back).

* Messages or metadata can sometimes include characters that resemble Unified Diff content or other variant-specific syntax and can break patching/parsing.

* Differences in how information like symlinks/file modes are conveyed (see above).

* Binary files are almost never able to be represented beyond a "Binary files X and Y differ" line.

* Newlines are sometimes outright broken within the diff (particularly with mixed line endings) and can sometimes break patching.

Just to name a few.
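
As a small illustration of the path-encoding item in the list above: the same filename produces different bytes under different encodings, so a byte-for-byte path comparison can fail, and decoding with the wrong guess yields a different name. The filename is a made-up example.

```python
# Hypothetical filename containing non-ASCII characters ("résumé.txt").
name = "r\u00e9sum\u00e9.txt"

utf8_bytes = name.encode("utf-8")      # 2 bytes per accented character
latin1_bytes = name.encode("latin-1")  # 1 byte per accented character

# The two tools would write different bytes for the same path.
bytes_differ = utf8_bytes != latin1_bytes

# A reader that assumes Latin-1 for UTF-8 bytes sees a different filename.
misread = utf8_bytes.decode("latin-1")
```

A diff that does not declare its encoding leaves the consumer to guess which of these interpretations was intended.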

Many SCMs don't even have a diff format to begin with, just generating a Unified Diff. These don't contain any revision information needed to locate a file. In these cases, or when important information is unavailable in some diff variant, we're stuck rolling our own.

Also, something you wouldn't normally expect to be a problem, but which comes up in practice more than you'd think: some very large diffs (we've seen ones hundreds of megabytes in size -- don't get me started) are time-consuming to parse. To know everything about the diff, you need to read and scan every line. To generate a list of filenames or stats on a diff, you effectively need to parse all of it.
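
A rough sketch of why that is: with a plain Unified Diff, even a simple file list with added/removed counts requires visiting every line and tracking hunk sizes, because a content line can itself look like a `+++` header. The `scan_unified_diff` function and the sample below are illustrative assumptions, not code from Review Board.

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def scan_unified_diff(text: str) -> dict:
    """Map each new-file path to [added, removed] by scanning every line."""
    stats = {}
    current = None
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: the first character decides the line type.
            if line.startswith("+"):
                stats[current][0] += 1
                new_left -= 1
            elif line.startswith("-"):
                stats[current][1] += 1
                old_left -= 1
            elif line.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("+++ "):
            current = line[4:].split("\t")[0]
            stats[current] = [0, 0]
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            # A missing count in the hunk header means 1.
            old_left = int(match.group(1) or 1)
            new_left = int(match.group(2) or 1)
    return stats


sample = "\n".join([
    "--- a/hello.txt",
    "+++ b/hello.txt",
    "@@ -1,3 +1,3 @@",
    " one",
    "-two",
    "+++ two",  # an added line whose content starts with "++"
    " three",
    "--- a/other.txt",
    "+++ b/other.txt",
    "@@ -1 +1,2 @@",
    " alpha",
    "+beta",
])
stats = scan_unified_diff(sample)
```

There is no way to skip ahead here: the only way to know where one file's changes end is to count through its hunks.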

So we took all those pain points, talked to developers working on a few SCMs, got their pain points and thoughts, and drafted the initial DiffX spec. Went through several rounds with them, iterated until we got where we are today.

The spec had some important goals:

1. Not being vendor-specific.

Git patches were built for Git, and even Git-like patches from Mercurial or Subversion have quirks that can break parsing in a Git-specific patch parser. There's no grammar for how Git stores the metadata and the clients require knowledge up-front of the value types, which isn't a good fit for some of the SCMs out there.

We wanted to draft something that could be used more generically, able to be adopted by newer tools while also being able to represent the information provided from existing tools.

2. Support for arbitrary and injectable metadata.

Some of the formats we work with don't contain enough information to locate the file + revision within a repository. Some require additional information, like an explicit branch or workspace ID or a counterpart changeset number.

Even Git diffs don't always provide enough. They provide a Blob SHA, but not a commit SHA by default. This is a problem when talking to APIs on Git hosting services that require a commit SHA along with (or instead of) a blob SHA.
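
For illustration, the blob IDs in a Git diff sit on the `index` line of the extended header, and nothing in that header names a commit. A minimal sketch of extracting them; the `blob_ids` helper and the hash values in the sample are made up for the example.

```python
import re

# Matches Git's extended header line, e.g. "index 3b18e51..a042389 100644".
INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)(?: \d+)?$")


def blob_ids(diff_text: str):
    """Return (old_blob, new_blob) pairs from the 'index' lines of a Git diff."""
    pairs = []
    for line in diff_text.splitlines():
        match = INDEX_RE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


sample = "\n".join([
    "diff --git a/hello.txt b/hello.txt",
    "index 3b18e51..a042389 100644",  # illustrative, abbreviated blob IDs
    "--- a/hello.txt",
    "+++ b/hello.txt",
    "@@ -1 +1 @@",
    "-hello",
    "+hello, world",
])
pairs = blob_ids(sample)
```

Everything recoverable from this header is per-file blob state; the commit that contains those blobs would have to be supplied some other way.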

And some have useful data that can't be fetched after the fact.

So a common headache is that we need to inject additional information in the diffs we generate in order to allow the appropriate data to be looked up.

By using a form of metadata storage for the diff file, the commit, and the files within, we have the ability to inject that additional information without worrying about corrupting a state machine or regex or whatever method is used for some parser (or some older version of our own parsers).

We eventually chose JSON here. We initially had a grammar that looked more like Git's format, but found ourselves dealing with some of the same challenges that YAML had. We didn't want the "NO" problem and we didn't want every client to have to decide on what the value of a string in a piece of metadata should be. Some metadata (such as revision IDs in some SCMs) differentiate between a number and a string that may look like a number, and that information is important to know up-front.

The consensus was that there was more value in JSON than some other format, since it's well-understood, parsers are readily available, and there's no sign that it's disappearing or dropping out of maintenance any time soon.

The format allows for future metadata formats here if, say, json5 or YAML#++ becomes a well-adopted standard 10 years from now.
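
A short illustration of the typing point above, using Python's standard `json` module. The key names (`revision`, `changeset`, `answer`) are hypothetical and not taken from the DiffX spec. JSON keeps a quoted value as a string and an unquoted number as a number, whereas YAML 1.1 parsers may read a bare `NO` as boolean false.

```python
import json

# Hypothetical metadata: a revision that merely looks numeric, a real
# number, and a string that some YAML parsers would turn into a boolean.
meta = json.loads('{"revision": "0042", "changeset": 42, "answer": "NO"}')

revision_type = type(meta["revision"]).__name__
changeset_type = type(meta["changeset"]).__name__
answer = meta["answer"]
```

The consumer does not have to guess: `"0042"` stays a string with its leading zeros, `42` is an integer, and `"NO"` is just text.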

3. Parsing and mutability.

When working with these files, it's sometimes important to scan for information in the file to do some pre-processing. How many files are in the diff? Is this past some threshold that might trigger a rate limit if we fetch data for each from a repository? Are there binary files in this change?

When these files get very large (which can happen in enterprises when posting a change merging two long-running branches together -- no, people shouldn't ever post 100MB diffs, but they do), these operations can get expensive.

So we built in some parsing aids (content lengths for sections, section hierarchy identifiers) to allow for more efficient parsing. We can read a section header, know where we are and what section we're nested in, and jump to the next piece to read. This is far more efficient than parsing-to-scan and avoids a lot of headaches.
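
As a rough sketch of the idea, not a spec-compliant parser: assuming headers of the form `#..name: key=value, ...` (the shape of the `#..meta: format=json, length=270` and `#...diff: length=629` lines quoted in this thread), and assuming `length` counts the bytes that follow the header line, a tool can list the sections of a file while seeking past every payload. The `list_sections` helper and the sample file are illustrative.

```python
import io
import re

# Leading dots give the nesting depth; the rest is "name: key=value, ...".
HEADER_RE = re.compile(rb"^#(\.*)([a-z]+):\s*(.*)$")


def list_sections(stream):
    """Yield (depth, name, options) per section, skipping content by length."""
    while True:
        line = stream.readline()
        if not line:
            break
        match = HEADER_RE.match(line.rstrip(b"\n"))
        if not match:
            continue
        options = {}
        for item in match.group(3).decode("ascii").split(","):
            if "=" in item:
                key, value = item.strip().split("=", 1)
                options[key] = value
        yield len(match.group(1)), match.group(2).decode("ascii"), options
        if "length" in options:
            # Jump over the payload without reading or parsing it.
            stream.seek(int(options["length"]), io.SEEK_CUR)


meta = b'{"path": "a.txt"}\n'
diff = b"--- a.txt\n+++ a.txt\n@@ -1 +1 @@\n-old\n+new\n"
data = b"".join([
    b"#diffx: encoding=utf-8, version=1.0\n",
    b"#.change:\n",
    b"#..file:\n",
    b"#...meta: format=json, length=%d\n" % len(meta),
    meta,
    b"#...diff: length=%d\n" % len(diff),
    diff,
])
sections = [(depth, name) for depth, name, _ in list_sections(io.BytesIO(data))]
```

On a seekable file the cost of listing sections then depends on the number of headers, not on the size of the diff bodies.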

We also get mutability. Generating a diff and attaching metadata to a file in the middle of a diff becomes a lot faster and safer this way.

A consumer never needs to do this. A tool does.

We figured we'd address the text encoding issues while we were at it, because oh boy can these cause problems. A whole topic of its own.

4. Multi-commit files.

Yep, there's Git format-patch. That works great for Git. If I'm on Perforce or SOS or ClearCase and I want to represent a series of commits, I don't have an equivalent format.

If one wants to be able to send a diff spanning a series of commits somewhere for processing or application, being able to do that with one file is valuable. One file means one thing to upload, no risk of a patch ordering issue or a missing patch in the series. The tool processing the diff file would have all the state it needs up-front.

5. Binary files.

Binary files are important. A lot of projects are more than source code in text format. Images, documents, 3D models... these get left out of diff files today by default.

The exception is Git, which can represent changes to binary files as Git Literal and Git Delta formats. This is largely undocumented (outside of our spec) and not supported by really anything else.

We review binary files, so we wanted this. Talking to other SCM vendors, some found this a pain point as well but didn't have a solution in place. So we wanted this to be documented and addressed in the spec.

This is already very long, but I wanted to give a bit of insight into the kinds of problems and inconsistencies that tools (not necessarily end users) have to deal with, and how this is meant to address some of those problems.

b0a04gl • last Wednesday at 2:05 PM

tbh this feels like overengineering the wrong pain. devs aren't begging for more metadata in diffs—they're begging for tooling that doesn't explode when someone renames a file or changes line endings. most teams can't even agree on commit message formats, and now we're gonna throw structured diff metadata into the mix? cool idea, but feels like a clean spec solving a messy human problem. the chaos isn't in the diff format—it's in the humans using it.

looneysquash • last Wednesday at 10:14 PM

How does ReviewBoard make use of diffx if none of the existing tools support it?

Is that a chicken and egg problem, or is it useful by itself?

xiphias2 • last Wednesday at 8:02 AM

This format may be backwards compatible, but not forwards compatible.

JSON solved this mostly by standardizing what most implementations were already doing, so that would be a great thing to do.

If git diff isn't documented, the solution is not to create a new format, but to go through the source code and document it.

eisbaw • last Wednesday at 11:35 AM

Why are diffs still so text-based? That should be a last resort. Typically when you edit a text file, your action has meaning beyond the primitive edit. You probably changed a variable name, moved a function above another. Replaced a whole chunk of stuff with other stuff, etc.

The problem with diffs is that they are not easily portable - because their intent is so (accidentally) low-level. Imbue diffX with semantics, and they can become more readable - by humans and AI alike.

karmakaze • last Wednesday at 12:21 PM

Odd that it's a format that couldn't decide on a format: "#..meta: format=json, length=270". The "length=270" also has a redundant/fragile smell to it.

Other thought was when was the last time I had problems with a diff file--can't recall (maybe decades ago). Probably only a problem when working with multiple VCSes in which case you could make a diff translator that understands each one intimately.

bravesoul2 • last Wednesday at 8:06 AM

If disk space is no issue, a good difference format is just the entire original file and the entire new file.

If an editor was involved and you want better mergability you could include the original file and the sequence of CRDT ops.

dedicate • last Wednesday at 5:26 AM

For stuff like commit histories or complex changes, isn't the real power in the tools around the diff (think Git itself, or code review platforms) rather than trying to cram everything into one super-format?

diffxx • last Wednesday at 10:58 PM

I'm ready for their follow up product.

KingOfCoders • last Wednesday at 5:36 AM

Let's add JSON to everything.

charcircuit • last Wednesday at 3:36 AM

>A single diff can’t represent a list of commits

I don't understand why this would be a problem. Each commit can have its own diff.

Jean-Papoulos • last Wednesday at 6:01 AM

This is trying to do multiple things (commit info & file diff) in one. Not a good idea. Commit info should live in the repo metadata (no matter which form this takes), and diff should be its own thing.

WhyNotHugo • last Wednesday at 3:56 PM

They could have built on top of git’s header syntax for metadata (which itself is based on email headers) instead of reinventing it in a new flavour of pseudo-JSON.

xkcd927 strikes again!

yu3zhou4 • last Wednesday at 5:05 AM

I got an xkcd 927 vibe somehow https://xkcd.com/927/

edam30 • last Wednesday at 6:30 AM

Nice

quotemstr • last Wednesday at 7:06 AM

> They don’t standardize encodings, revisions, metadata, or even how filenames or paths are represented!

Sure, but some unified diffs, e.g. the ones produced by git, are quite regular. It's also common practice to express diffs as RFC822 email messages (often because they come that way), with headers and descriptive text.

I can't see DiffX getting traction. It's too alien. Too divorced from present practice, no matter how theoretically robust. It's like XHTML2.

Solving the same problems, I'd just establish conventions for sticking the needed metadata in RFC822-style pseudo-headers above the diff. This approach would work with, not against, existing tooling.
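
A minimal sketch of that convention, using Python's standard `email` parser to split the pseudo-headers from the diff body. The header names `X-Base-Commit` and `X-Path-Encoding` and the commit value are hypothetical, not an established convention.

```python
from email import message_from_string

patch = "\n".join([
    "X-Base-Commit: 1a2b3c4d",   # hypothetical header and value
    "X-Path-Encoding: utf-8",    # hypothetical header
    "",                          # blank line ends the header block
    "--- a/hello.txt",
    "+++ b/hello.txt",
    "@@ -1 +1 @@",
    "-hello",
    "+hello, world",
    "",
])

msg = message_from_string(patch)
base_commit = msg["X-Base-Commit"]
path_encoding = msg["X-Path-Encoding"]
diff_text = msg.get_payload()  # everything after the blank line
```

The body remains an ordinary Unified Diff, so it can still be handed to existing tools once the header block is split off.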

Not everything needs to be JSON.

bronlund • last Wednesday at 1:46 PM

"We have four different standards, why don't we just make one proper!". And that's why we now have five standards.

Jokes aside, good luck with this one :)

crabbone • last Wednesday at 4:51 PM

> #...diff: length=629

What an awful idea... Now if I have to edit a patch I need to count characters in it? Come on...

gjvc • last Wednesday at 5:20 AM

they couldn't call it xdiff (which would match the "extensible diff" name) because it would clash with https://github.com/libgit2/xdiff

curtisszmania • last Wednesday at 5:12 PM

[dead]

forrestthewoods • last Wednesday at 7:03 AM

> Most people and tools work with Unified Diffs. They look like this:

No I don’t. That plus-minus single column view is complete and total trash. Always has been, always will be.

I’ve used Araxis Merge for almost 20 years. Beyond Compare 3 is also a good choice. Not once in my entire career have I ever relied on a “unified diff” or a “patch” or any of that garbage.

account42 • last Wednesday at 9:24 AM

Congratulations, you have added "standard" n+1 for patch and other tools to deal with.

didsomeonesay • last Wednesday at 7:30 AM

If the project owner posted this or is reading here: watch out with the project naming; DivX is a highly litigious brand these days.
