> nailing down Unicode and text encodings was still considered rocket science. Now this is a solved problem
I wish…
Detecting text encoding is only easy if all you need to contend with is UTF-16-with-BOM, UTF-8-with-BOM, UTF-8-without-BOM, and plain ASCII (which is effectively also UTF-8). As soon as you might see UTF-16 or UCS-2 without a BOM, or 8-bit codepages other than plain ASCII (many apps/libs assume these are always CP1252, a superset of the printable characters of ISO-8859-1, which may not be the case), things stop being fully deterministic.
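For what it's worth, the "easy" part is really just BOM sniffing. A minimal Python sketch (the `sniff_encoding` helper and the CP1252 fallback are my own assumptions, not a real charset detector): with a BOM you know, without one you're guessing.

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Handle only the 'easy' cases described above (UTF-32 BOMs
    and real statistical detection deliberately left out)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # UTF-8 with BOM
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"      # Python's utf-16 codec consumes the BOM itself
    try:
        raw.decode("ascii")
        return "ascii"       # also valid UTF-8
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"       # UTF-8 without BOM
    except UnicodeDecodeError:
        # No BOM and not valid UTF-8: could be CP1252, ISO-8859-15, ...
        # This is the non-deterministic part; "cp1252" is just a guess.
        return "cp1252"
```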
Thankfully UTF-8 has largely won out over the many 8-bit encodings, but that leaves the interesting case of UTF-8-with-BOM. The standard recommends against using it, saying plain UTF-8 is the way to go, but to get Excel to correctly load a UTF-8 encoded CSV or similar you must include the BOM (otherwise it assumes CP1252 and characters above 127 come out corrupted). But… some apps/libs are completely unaware that UTF-8-with-BOM is a thing at all, so they load such files with the first column header corrupted.
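If you're in Python, for instance, the stdlib's "utf-8-sig" codec covers both directions: it writes the BOM on encode and strips it on decode. A rough sketch (file name and data made up):

```python
import csv

# Writing with "utf-8-sig" prepends the UTF-8 BOM (EF BB BF),
# which is what makes Excel decode the file as UTF-8 instead of CP1252.
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Jürgen", "Malmö"])

# Reading with "utf-8-sig" strips the BOM if present (and is a no-op
# if it isn't), which avoids the corrupted-first-column-header bug.
with open("export.csv", "r", encoding="utf-8-sig", newline="") as f:
    print(next(csv.reader(f)))  # ['name', 'city'], BOM already gone
```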
Source: we have clients pushing & pulling (or having us push/pull) data back & forth in various CSV formats, and we see some oddities in what we receive and what we are expected to send more regularly than you might think. The real fun comes when something at the client's end processes text badly (multiple steps with more than one of them incorrectly reading UTF8 as CP1252, for example) before we get hold of it, and we have to convince them that what they have sent is non-deterministically corrupt and we can't reliably fix it on the receiving end…
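To make that concrete, here's a small illustration of the double mis-read (sample string invented). Once UTF-8 bytes have been read as CP1252 more than once, and especially once an undefined byte has been "replaced", there's no reliable way to reverse it:

```python
text = "Müller"                                # what the client started with

# One hop mis-reads the UTF-8 bytes as CP1252 ("mojibake")...
hop1 = text.encode("utf-8").decode("cp1252")   # 'MÃ¼ller'

# ...and a second hop does the same thing again before the file reaches us.
hop2 = hop1.encode("utf-8").decode("cp1252")   # 'MÃƒÂ¼ller'
print(hop1, hop2)

# Worse: bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no CP1252 mapping,
# so a sloppy converter substitutes a replacement character and the
# original data is gone for good.
print("ā".encode("utf-8").decode("cp1252", errors="replace"))  # 'Ä\ufffd'
```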
The very fact that the UTF-8 standard itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding ever, and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to tell whether I'm reading one of the weird ASCII derivatives or actual Unicode.
It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe it's a complete mess!
> to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM
Ah, so that’s the trick! I’ve run into this problem a bunch of times in the wild, where some script emits CSV that works on the developer’s machine but fails strangely with real-world data.
Good to know there’s a simple solution. I hope I remember your comment next time I see this!