Hacker News

tomekf yesterday at 2:13 PM

How is it done, from a technical point of view?


Replies

mmh0000 yesterday at 4:30 PM

Layers.

PDF is an absurdly complex file format. It's part of the reason there is no single "good" PDF reader, just a lot of mediocre PDF readers that are all terrible in their own way. Which is a topic for another day.

There are several ways to remove data in a PDF:

- Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

- Replace the data. This is what all the "blackout" tools do: find "A" and replace it with "🮋". This is effective and doesn't break formatting, since it's a 1-to-1 replacement. The problem with "replacing" is that not every PDF tool works the same way, and some just change the foreground and background color to black instead; it looks nearly the same, but copy-and-paste still works on the text.

- Then you have the computer illiterate, who think changing the foreground and background color to black is good enough anyway.
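To make the difference between "replace" and "recolor" concrete, here is a minimal sketch in Python. It is a toy model, not how any particular PDF tool works: the `Run` class, the function names, and the sample text are all made up for illustration, and `█` stands in for whatever glyph a real blackout tool would use.

```python
from dataclasses import dataclass

BLOCK = "\u2588"  # stand-in blackout glyph (an assumption, any filler works)


@dataclass
class Run:
    """A hypothetical run of text with a foreground and background color."""
    text: str
    fg: str = "black"
    bg: str = "white"


def redact_by_replacing(run: Run) -> Run:
    # 1-to-1 replacement: same length, so the layout is preserved,
    # but the original characters are no longer stored.
    return Run(BLOCK * len(run.text), run.fg, run.bg)


def redact_by_recoloring(run: Run) -> Run:
    # Only the colors change; the original characters are still stored.
    return Run(run.text, fg="black", bg="black")


def copy_paste(runs) -> str:
    # Copying ignores color and returns whatever characters are stored.
    return "".join(r.text for r in runs)


doc = [Run("Witness: "), Run("Jane Doe")]
replaced = [doc[0], redact_by_replacing(doc[1])]
recolored = [doc[0], redact_by_recoloring(doc[1])]

print(copy_paste(replaced))
print(copy_paste(recolored))
```

Both versions would look the same on screen, but only the replaced one has actually lost the name; the recolored one hands it back on copy.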


3eb7988a1663 yesterday at 10:40 PM

I remember reading that the recommended way for journalists to redact documents is to black them out in the digital version, print it out, and re-scan it. Anything else leaves too many potential ways for data to be smuggled through.


general1465 yesterday at 2:40 PM

People mistake the black highlighter (which adds a black square as another layer) for the redaction tool (which replaces the data with a black square). If the people doing the redactions are computer-illiterate, they won't see the difference.

oliwarner yesterday at 8:03 PM

They drew black boxes over the text. The text is still underneath. On OCR'd scanned documents, the text you'd copy is actually stored in metadata and just linked by position to the image.

Anyway, if you click on a "redaction", you're clicking on the box and can't select the text underneath, but if you just highlight the text around it, you can copy all the original text.

It's a bizarre oversight.
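For the curious, here is roughly what that looks like at the file level, as a hedged sketch. The content stream below is hand-written and uncompressed for illustration (real PDFs usually compress streams and often use font encodings that need decoding), and `extract_text` is a made-up helper, not a library call. The operators themselves are standard PDF: `Do` paints the scanned image, `3 Tr` sets the invisible text mode commonly used for OCR layers, `Tj` shows a string, and `re` plus `f` draw a filled rectangle.

```python
import re

# Hypothetical page content stream: scanned image, invisible OCR text,
# then a black rectangle drawn over the name.
content_stream = b"""
q 612 0 0 792 0 0 cm /Im0 Do Q
BT
3 Tr
/F1 12 Tf
72 720 Td
(Name: Jane Doe) Tj
ET
0 0 0 rg
104 716 52 12 re
f
"""


def extract_text(stream: bytes) -> list:
    # Collect every literal string shown with the Tj operator.
    # Drawing operators such as `re` and `f` play no part in this,
    # so a box painted on top does not hide anything from extraction.
    matches = re.findall(rb"\((.*?)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


print(extract_text(content_stream))
```

The rectangle only affects what gets painted. Anything that reads the text operators, including select-and-copy in a viewer, still sees the string.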

Gigachad today at 3:57 AM

A PDF is less like an image and more like a web page where elements can be stacked on top of each other. You can visually obscure things by sticking a black rectangle over the top, but anyone who inspects the inside of the PDF can remove the rectangle or read the text in the source.

There would also be a mix of text documents, and image scans. The way to censor each is different.

Perfectly censoring documents, particularly digital ones, is surprisingly difficult.
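A toy model of the stacking idea, assuming a one-line "page" where later elements paint over earlier ones. `Element` and `render` are invented for this sketch and have nothing to do with a real PDF library; the point is only that the covered text remains in the element list.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A hypothetical page element on a single-line page."""
    kind: str       # "text" or "rect"
    x: int          # starting column
    width: int      # number of columns covered
    text: str = ""  # characters, used only when kind == "text"


def render(elements, page_width=20) -> str:
    # Paint elements in order; later elements cover earlier ones.
    canvas = [" "] * page_width
    for el in elements:
        for i in range(el.width):
            col = el.x + i
            if 0 <= col < page_width:
                canvas[col] = el.text[i] if el.kind == "text" else "#"
    return "".join(canvas)


page = [
    Element("text", x=0, width=14, text="Name: Jane Doe"),
    Element("rect", x=6, width=8),  # black box stacked over the name
]

visible = render(page)
# Dropping the rectangle from the stack reveals what was under it.
uncovered = render([el for el in page if el.kind != "rect"])
# Inspecting the elements directly never needed rendering at all.
stored = [el.text for el in page if el.kind == "text"]

print(visible.rstrip())
print(uncovered.rstrip())
```

A proper redaction would have to change the text element itself, not add another element above it.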
