
bawolff • last Wednesday at 7:48 AM • 4 replies

Are these really problems? I feel like I've never really encountered any of these issues and have trouble imagining when they would crop up (except binary files).

- encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

- why would I want a single diff to represent multiple commits? Having multiple diffs seems much more natural.

- metadata... I guess, but also the metadata seems like it would mostly only be useful inside a single system.

Replies

account42 • last Wednesday at 10:07 AM

> - encoding - even if your file is not utf-8, why would that matter? You would still run the patch algorithm the same way. It doesn't really matter if the characters are valid utf-8

Yeah, I don't see a use case for a patch encoding either - just treat the patch data as ASCII-delimited binary mystery goo. Patch files need to be able to deal with mixed-encoding text (e.g. to fix it), so you can't really just have one encoding anyway.

rwmj • last Wednesday at 10:41 AM

They're not problems at all. They probably should have asked people who regularly use diffs what actual problems they have, rather than trying to reinvent some overengineered YAML in a vacuum.

chipx86 • last Wednesday at 8:09 AM

Generally speaking, you probably shouldn't have to deal with these problems unless you're writing a tool that has to interface with certain SCMs or SCMs used in certain environments. I'll give you some examples for each of these points:

1. There are two important areas where encoding can matter: The filename and the diff content.

Git pays attention to filename encoding, but most SCMs don't, so when a diff is generated, it's based on the local encoding. If there are any non-ASCII characters in that filename, a diff generated in one environment with one encoding set can end up not applying to another (or, in our case, not being able to be looked up from a repository). This isn't common but it can happen (we've seen this on Perforce and Subversion).

Then there's the content. Many SCMs will actually give you a representation of a text file and not the raw contents itself. That text file will be re-encoded for your local/preferred encoding, and newlines may be adjusted as well (`\r\n`, `\n`). The text file is then re-encoded back when pushing the change. This allows people in different environments to operate on the same file regardless of what encoding they're working with.

This doesn't necessarily make its way into the diff, though. So when you send a diff from a less-common encoding to a tool to process it, and that tries to apply it to the file checked out with its encoding, it can fail to patch.

The solution is to either know the encoding of the file you're processing, or try to guess it (some tools, like ours, let you specify a list of preferred encodings to try).

It's best if you can know it up-front.
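The "list of preferred encodings to try" idea can be sketched roughly as follows. This is only an illustration, not how any particular tool implements it: the function name `decode_with_fallbacks` and the default encoding list are assumptions made for the example.

```python
def decode_with_fallbacks(data: bytes, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    The encoding list is a hypothetical default. Order matters: latin-1
    accepts every byte sequence, so it only makes sense as a last resort.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the preferred encodings could decode the data")
```

Guessing like this can still pick the wrong encoding when several of them accept the same bytes, which is why knowing the encoding up-front is preferable.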

Bonus Fun Fact: On some SCMs (Perforce comes to mind), checking out a file on Windows and then diffing it on Linux via a shared mount can get you a diff with `\r\r\n` newlines. It's a bad time and breaks patching. And it used to come up a lot, until we worked around it.
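One possible workaround for the `\r\r\n` case is to normalize line endings on the raw diff bytes before handing them to a patcher. The sketch below is an assumption about how that could look, not a description of the workaround mentioned above; `normalize_newlines` is a hypothetical helper.

```python
def normalize_newlines(diff: bytes) -> bytes:
    """Collapse doubled carriage returns and CRLF into plain LF.

    The doubled form is replaced first so that "\r\r\n" does not leave
    a stray "\r" behind after the CRLF pass.
    """
    return diff.replace(b"\r\r\n", b"\n").replace(b"\r\n", b"\n")
```

Blanket normalization like this is only safe when the target files are expected to use `\n`; a file that legitimately uses `\r\n` would need its endings preserved instead.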

Also, Perforce for a while would sometimes normalize encodings incorrectly and you'd end up with BOMs in the diff, breaking GNU patch.
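A stray BOM can be stripped before patching along these lines. Where exactly the BOM lands in such a diff is not stated above, so this sketch assumes a UTF-8 BOM either at the very start of a line or directly after the `+`, `-`, or space prefix of a content line; `strip_boms` is a hypothetical helper.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def strip_boms(diff: bytes) -> bytes:
    """Remove UTF-8 BOMs at line starts or right after a diff line prefix."""
    cleaned = []
    for line in diff.splitlines(keepends=True):
        if line.startswith(BOM):
            # BOM at the very start of the line (e.g. the top of the diff).
            line = line[len(BOM):]
        elif line[:1] in (b"+", b"-", b" ") and line[1:].startswith(BOM):
            # BOM at the start of the file content carried by this diff line.
            line = line[:1] + line[1 + len(BOM):]
        cleaned.append(line)
    return b"".join(cleaned)
```

Note that this changes the patch content, so it only helps when the target file itself has no BOM.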

2. Multiple diffs do seem natural when you're applying and patching them directly. But if you're handing them off to a tool for processing and there's any risk of one file in the sequence not being included, you can end up with breakages that you may not see until later processing.

It's also just really nice having all the state and metadata up-front so we can process it in one go in a consistent way without having to sanity-check all the diffs against each other.

When working locally, it also depends on your tooling. `git format-patch` and `git am` are great, but are for Git. If I'm working with (let's just say) Subversion, I need to do my own thing or find another tool.

3. It's critical for the kind of information needed to locate files in a repository. Some systems need a commit-wide identifier. Some need per-file identifiers. Some need a combination of the two. Some need those plus additional data not otherwise represented in the path or revision (generally more enterprise SCMs targeting certain use cases).

It's also critical for representing information that isn't in the Unified Diff format (namely, anything but the filename). So, symlink information, file modes, SCM-specific properties on a file or directory, to name a few. This information needs to live somewhere if an SCM provides it, and it's up to every SCM to choose how and where to store that data (and then how it's encoded, etc.).
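To illustrate how little a plain unified diff header carries, here is a small sketch using Python's standard `difflib`. The file names and contents are made up for the example.

```python
import difflib

old = ["#!/bin/sh\n", "echo hello\n"]
new = ["#!/bin/sh\n", "echo goodbye\n"]

# Generate a plain unified diff; the header is just the two file lines.
diff = list(difflib.unified_diff(old, new, fromfile="a/run.sh", tofile="b/run.sh"))
header = diff[:2]

print("".join(diff))
```

The header consists of the `---` and `+++` lines with the file names (plus optional timestamps if supplied). Nothing in it can express that `run.sh` became executable or turned into a symlink, which is why Git adds its own extended header lines such as `old mode` and `new mode` on top of the format, and why other SCMs invent their own conventions.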


ris • last Wednesday at 8:16 PM

Binary data - definitely a problem.
