logoalt Hacker News

makeitdoubletoday at 7:16 AM0 repliesview on HN

Purely anecdotal, but I hoard a lot of personal documents (shopping receipts, confirmation emails, scans etc.) and for stuff I saved only 10 years ago, the toughest to reopen are the pure text files.

You rightly mention Unicode, as before that there was a jungle of formats. I have some in UTF-16, some in SJIS, a ton in EUC, other were already utf-8, many don't have a BOM. I could try each encoding and see what works for each of the files (except on mobile...it's just a PITA to deal with that on mobile).

But in comparison there's a set of file I never had issues opening now and then: PDFs and jpegs. All the files that my scanner produced are still readable absolutely everywhere. Even with slight bitrot they're readable, and with the current OCR processes I could probably put it all back in text if ever needed.

If I had to archive more stuff now and can afford the space, I'd go for an image format without hesitation.

PS: I'm surprised you don't mention the Unicode character limitations for minority languages or academic use. There will still be characters that either can't be represented, or don't have an exact 1 to 1 match between the code point and the representation.