X-ray: a Python library for finding bad redactions in PDF documents

454 points by rendx yesterday at 9:54 PM | 85 comments

Comments

mlissner yesterday at 11:03 PM

Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

The analysis was fun. We used S3 batch jobs to analyze millions of documents in a matter of minutes, but we haven’t done the hard part of looking at the results and reporting them out. One day.

embedding-shape yesterday at 11:47 PM

I haven't gone through more than about 10% of the files released today, but I noticed that at least EFTA00037069.pdf, for example, has a `/Prev` pointer, meaning a previous revision of the file is still available inside the PDF itself. In this case the difference is minor (stuff moved around), but if it's in one file, I'm guessing it could be in more. You can run `qpdf --show-object=trailer EFTA00037069.pdf` to see for yourself whether it's there.
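For anyone without qpdf handy, a rough version of the same check can be sketched in Python. This is only a heuristic over the raw bytes, not what qpdf does: it counts `%%EOF` markers and looks for numeric `/Prev` entries. The function name `count_revisions` and the synthetic sample bytes are made up for illustration, and linearized PDFs can also carry a `/Prev` entry without holding an older revision, so treat a hit as a reason to look closer rather than as proof.

```python
import re

# A numeric /Prev entry (byte offset of an earlier xref section).
# The lookahead skips indirect references such as "/Prev 12 0 R",
# which appear in outline (bookmark) items and are unrelated.
PREV_RE = re.compile(rb"/Prev\s+(\d+)\b(?!\s+\d+\s+R)")


def count_revisions(pdf_bytes: bytes) -> dict:
    """Heuristic scan of raw PDF bytes for signs of incremental updates."""
    eof_markers = len(re.findall(rb"%%EOF", pdf_bytes))
    prev_offsets = [int(m) for m in PREV_RE.findall(pdf_bytes)]
    return {
        "eof_markers": eof_markers,
        "prev_offsets": prev_offsets,
        "has_earlier_revision": bool(prev_offsets),
    }


# Synthetic stand-in for a file saved twice (not a valid PDF).
sample = (
    b"%PDF-1.7\n...original objects...\n"
    b"trailer\n<< /Size 10 /Root 1 0 R >>\nstartxref\n1234\n%%EOF\n"
    b"...updated objects...\n"
    b"trailer\n<< /Size 12 /Root 1 0 R /Prev 1234 >>\nstartxref\n5678\n%%EOF\n"
)
print(count_revisions(sample))
```

To try it on a real file, read it with `open(path, "rb").read()` and pass the bytes in. Entries stored inside compressed object streams will not be visible to a byte scan like this.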

I'm almost fully convinced that someone did this badly on purpose, together with the bad redactions. Surely the people tasked with redacting a bunch of files receive some instructions on what to do and what not to do?

jmward01 today at 12:07 AM

Hmmm.. The more I think about this, the more likely it seems that font kerning is a major leak for redaction. Even if the boxes have randomness applied to them, the words around a blacked-out area have exact positioning that constrains the text within, so that only certain letter/space combinations could fit between them. With a little knowledge of the rendering algorithm and some educated guessing about the text, a brute-force search may be able to do a very credible job of discovering the actual text. This isn't my field. Has anyone out there actually worked on this problem?
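A toy version of that search, to make the idea concrete. Everything here is an assumption for illustration: the advance widths in `ADVANCE` are invented rather than taken from a real font, kerning pairs are ignored, and `candidates_that_fit` is a hypothetical helper, not a known tool. It only shows how a measured gap width prunes a list of guesses.

```python
# Invented advance widths in 1/1000 em for a proportional font.
# These are illustrative values only, not metrics from a real typeface.
ADVANCE = {
    "i": 222, "l": 222, "j": 222, "t": 278, "f": 278, "r": 333,
    " ": 278, "m": 833, "w": 722, "M": 833, "W": 944,
}
DEFAULT_ADVANCE = 556  # fallback for every character not listed above


def text_width(text: str, font_size: float) -> float:
    """Width of `text` in points, ignoring kerning pairs."""
    units = sum(ADVANCE.get(ch, DEFAULT_ADVANCE) for ch in text)
    return units * font_size / 1000.0


def candidates_that_fit(candidates, gap_width, font_size, tolerance=0.5):
    """Keep only the guesses whose rendered width matches the gap."""
    return [
        c for c in candidates
        if abs(text_width(c, font_size) - gap_width) <= tolerance
    ]


# Pretend the gap between the visible words was measured from the page.
names = ["Jim Hill", "Ann Wood", "William Moore"]
gap = text_width("Ann Wood", 12)  # stand-in for a real measurement
print(candidates_that_fit(names, gap, 12))
```

With real font metrics and a large dictionary of names, many candidates would likely share a width, so this narrows the field rather than giving a unique answer.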

blitz_skull today at 1:19 AM

Explain like I’m stupid: what is the most gracious interpretation of redaction when releasing files like this?

Why should anyone involved retain any anonymity?

I’m asking in good faith because naively it seems like this should not even exist. All of it should be exposed.

EDIT: I did not think about the innocent folks that might be caught in the crossfire. That checks out. Thanks everyone!

alessandroliva today at 8:20 AM

This being on top of the news about the Epstein files being badly redacted is pretty funny

brotchie yesterday at 11:24 PM

You'd think the go-to workflow for releasing redacted PDFs would be to draw black rectangles and then rasterize to image-only PDFs :shrug:

unfocused yesterday at 10:40 PM

Adobe Acrobat Pro, when used properly, will redact anything in a PDF permanently.

Whoever did these "bad" redactions doesn't even know how to use a PDF Editor.

We have paralegals and lawyers "mark for redaction", then review the documents, then "apply redactions". It has literally been done by thousands of lawyers and paralegals for decades. This is just someone not following the process and procedure, and making mistakes. It's actually quite amateurish. You should never, ever screw up redactions if you follow the proper process. Good on the X-ray project for trying to find errors.

I just want to add that applying black highlights on top of text is, in fact, the "old" way of redaction: it was common to do this, then simply print the pages with the black bars and send the paper as the final product.

Whoever did it is probably old, and may have done it thinking they were going to print it on paper afterwards!! Just guessing as to why someone would do this.

eviks today at 7:22 AM

Pity that such an awful document format, with so many basic failures at being digital, continues to reign in a lot of areas!

shrubble today at 12:30 AM

Shockingly, you can see redaction info from within your browser's PDF viewer. I am using Brave on Linux, and went here:

https://www.justice.gov/multimedia/Court%20Records/Matter%20...

As a test, select with your mouse the entire first line of paragraph number 90, and then paste it into a text editor or a shell. The unredacted text appears!
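The reason copy and paste works is that the text objects are still in the page and a filled rectangle is merely drawn over them. A minimal sketch of how one might flag that situation, assuming word and rectangle bounding boxes have already been extracted by some PDF library: the `Box` type, the `hidden_words` helper, the 0.9 threshold, and the sample coordinates are all hypothetical, and this is not a description of how X-ray itself is implemented.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in page coordinates (points)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(text: Box, rect: Box) -> float:
    """Fraction of the text box's area that the rectangle covers."""
    w = min(text.x1, rect.x1) - max(text.x0, rect.x0)
    h = min(text.y1, rect.y1) - max(text.y0, rect.y0)
    if w <= 0 or h <= 0:
        return 0.0
    area = (text.x1 - text.x0) * (text.y1 - text.y0)
    return (w * h) / area if area > 0 else 0.0


def hidden_words(words, rects, threshold=0.9):
    """Return words that are still extractable but sit under a rectangle."""
    return [
        text for text, box in words
        if any(overlap_fraction(box, r) >= threshold for r in rects)
    ]


# Made-up layout: one line of text with a black box drawn over two words.
words = [
    ("Plaintiff", Box(72, 700, 120, 712)),
    ("Jane", Box(125, 700, 150, 712)),
    ("Doe", Box(153, 700, 175, 712)),
    ("alleges", Box(180, 700, 220, 712)),
]
rects = [Box(124, 698, 176, 714)]
print(hidden_words(words, rects))  # ['Jane', 'Doe']
```

A real check would also need to confirm that the rectangle is actually opaque and drawn above the text, which this sketch does not attempt.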

tamimio today at 4:33 AM

Tech people would be shocked to learn how tech-illiterate non-tech people are. Reminds me of the old days when the IT guy was the AIO in some non-tech facility and was treated like a god!!

unstatusthequo today at 2:56 AM

It’s a bit amusing seeing ediscovery principles go mainstream.

seanw444 yesterday at 10:13 PM

The context for OP posting this is that many of the recently-released Epstein documents were PDFs "redacted" by being drawn on top of.

5ak12agff today at 12:26 AM

Given that no U.S. or Israeli citizen apart from Epstein and Maxwell has experienced severe repercussions and Andrew Windsor is the perfect fall guy, there is the possibility that nothing will be revealed from these uncovered redactions.

The releases haven't yielded anything so far. For all we know, Epstein used other methods of communication for the really sensitive stuff. This would not be a surprise, since the whole Maxwell family was deep into tech (Magellan, Chiliad) and Ehud Barak was the head of Israeli military intelligence in the 1980s.

The story is going to be closed in a bipartisan manner except that it might be used to remove some unwanted politicians. The New York Times has already released an article that "explains" Epstein's wealth which names all figures that appear in "conspiracy theories" in an innocent way. Basically, they claim that Epstein could just steal from billionaires like Wexner and the billionaires would roll over and do nothing.

That is the official confirmation that all intelligence angles will be squashed in a bipartisan manner. For all we know, the "incompetence" in the redactions may be a way of saying: "See, we have nothing to hide."

gigatexal yesterday at 11:19 PM

Hilarious that DOJ didn’t flatten the layers, so you can unredact stuff. What a clown show of incompetent idiots. Or… a skillful one pulled over on the powers that be internally by someone who knew better, knew that they wouldn’t notice, and did this to help us all

dcollect today at 12:19 AM

lol thanks bros

text=about them to damage their credibility when they tried to go public with their stories of being
text=Epstein also threatened harm to victims and helped release damaging stories
=attorneys' fees and case costs in litigation related to this conduct.
=Defendants also attempted to conceal their criminal sex trafficking and abuse
text=$327,497.48 and $6,487.04 in New York City
text=trafficking and abuse conduct.
text=destroy evidence relevant to ongoing court proceedings involving Defendants' criminal sex
text=Epstein also instructed one or more Epstein Enterprise participant-witnesses to
text=trafficked and sexually abused.
text=conduct by paying large sums of money to participant-witnesses, including by paying for their

IceHegel yesterday at 10:17 PM

Given recent high-profile redaction events, I think one simple use of AI would be to have it redact documents according to an objective standard.

That should in theory prevent overly redacted documents for political purposes.

An approach that could be rolled out today would be redacting with human review, but showing what % of redactions the AI would have done, and also showing the prompt given to the AI to perform redactions.
