mcswell • today at 3:19 AM • 2 replies

gnabgib points out that this same article has been posted for comment here three other times since it was written. That said, afaict no one has commented any of these times on what I'm about to say, so hopefully this will be new.

I'm a linguist, and I've worked in endangered languages and in minority languages (many of which will some day become endangered, in the sense of not having native speakers). The advantage of plain text (Unicode) formats for documenting such languages (as opposed to binary formats like Word used to be, or databases, or even PDFs) is that text formats are the only thing that will stand the test of time. The article by Steven Bird and Gary Simons "Seven Dimensions of Portability for Language Documentation and Description" was the seminal paper on this topic, published in 2002. I've given later conference talks on the topic, pointing out that we can still read grammars of Greek and Latin (and Sanskrit) written thousands of years ago. And while the group I led published our grammars in paper form via PDF, we wrote and archived them as XML documents, which (along with JSON) are probably as reproducible a structured format as you can get. I'm hoping that 2000 years from now, someone will find these documents both readable and valuable.

There is of course no replacement for some binary format when it comes to audio.

(By "binary" format I mean file formats that are not sequential and readily interpretable, whereas text files are interpretable once you know the encoding.)

Replies

makeitdouble • today at 7:16 AM

Purely anecdotal, but I hoard a lot of personal documents (shopping receipts, confirmation emails, scans etc.) and for stuff I saved only 10 years ago, the toughest to reopen are the pure text files.

You rightly mention Unicode, as before that there was a jungle of formats. I have some in UTF-16, some in SJIS, a ton in EUC, others were already UTF-8, and many don't have a BOM. I could try each encoding and see what works for each of the files (except on mobile...it's just a PITA to deal with that on mobile).
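A minimal sketch of that "try each encoding and see what works" approach, assuming Python and its standard codec names. The function name `guess_encoding` and the candidate order are illustrative choices, not something from the thread. A decode that succeeds is not proof the guess is right: UTF-16 in particular accepts almost any even-length byte string, which is why it is tried last.

```python
from typing import Optional, Sequence

# Candidate encodings mentioned above, strictest first (the order is an assumption).
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "utf-16")


def guess_encoding(data: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    sample = "日本語".encode("shift_jis")
    print(guess_encoding(sample))
```

For a real archive you would still want to eyeball the decoded text, since several legacy encodings can accept the same bytes and produce different characters.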

But in comparison there's a set of files I've never had issues opening now and then: PDFs and JPEGs. All the files that my scanner produced are still readable absolutely everywhere. Even with slight bitrot they're readable, and with the current OCR processes I could probably convert it all back to text if ever needed.

If I had to archive more stuff now and could afford the space, I'd go for an image format without hesitation.

PS: I'm surprised you don't mention the Unicode character limitations for minority languages or academic use. There will still be characters that either can't be represented, or don't have an exact 1 to 1 match between the code point and the representation.

dwattttt • today at 8:51 AM

This is all true, but I think you're too focused on your area. Finding musical notes that we can interpret correctly from an ancient civilization, would that be "text" or "binary"? I think it's a false choice.

Similarly, cave paintings express the painting someone intended to make better than a textual description of it.
