Hacker News

Recreating Epstein PDFs from raw encoded attachments

289 points by ComputerGuru last Wednesday at 7:19 PM | 87 comments

Comments

dperfect today at 1:26 AM

Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

chrisjj yesterday at 11:15 PM

> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

bawolff yesterday at 11:52 PM

Tesseract supports being trained for specific fonts; that would probably be a good starting point.

https://pretius.com/blog/ocr-tesseract-training-data

pyrolistical yesterday at 11:24 PM

It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process, you cut out the decoder's startup time. This makes it feasible to test all permutations automatically and in roughly linear time.
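A minimal sketch of the steps above, under loud assumptions: a toy keyword grammar stands in for a real PDF decoder, and `KEYWORDS`, `prefix_ok`, `spellings`, and `recover` are hypothetical names used only for illustration. It walks the base64 text one 4-character group at a time, tries both readings of every 1/l, and backtracks when no reading keeps the decoded prefix valid.

```python
import base64

# Assumption: a toy grammar stands in for a real PDF parser.
KEYWORDS = ("obj", "endobj", "stream", "endstream", "xref", "trailer")
AMBIGUOUS = {"1": "1l", "l": "1l"}  # glyphs the OCR cannot tell apart


def prefix_ok(data: bytes) -> bool:
    """Toy validity check: space-separated keywords, last one may be partial."""
    try:
        *complete, last = data.decode("ascii").split(" ")
    except UnicodeDecodeError:
        return False
    return all(tok in KEYWORDS for tok in complete) and any(
        kw.startswith(last) for kw in KEYWORDS
    )


def spellings(group: str) -> list[str]:
    """Every way to read one 4-character base64 group."""
    options = [""]
    for ch in group:
        options = [o + c for o in options for c in AMBIGUOUS.get(ch, ch)]
    return options


def recover(text: str, is_valid=prefix_ok):
    """Depth-first search with backtracking; returns bytes, or None if stuck."""
    if len(text) % 4:
        raise ValueError("base64 text must be a multiple of 4 characters")
    groups = [text[i:i + 4] for i in range(0, len(text), 4)]

    def search(idx: int, decoded: bytes):
        if idx == len(groups):
            return decoded
        for cand in spellings(groups[idx]):
            prefix = decoded + base64.b64decode(cand)
            if is_valid(prefix):
                found = search(idx + 1, prefix)
                if found is not None:
                    return found
        return None  # no reading of this group worked: backtrack

    return search(0, b"")
```

The recursion goes one level deep per group, so a real 76-page attachment would need an explicit stack instead. The hard part is still the validity check: plain "is it printable" is far too weak, because swapping 1 and l only flips a single bit in the decoded bytes.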

ChocMontePy today at 1:48 AM

You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.

percentcer yesterday at 11:25 PM

This is one of those things that seems like a nerd snipe but would be more easily accomplished by brute-forcing it. Just get 76 people to manually type out one page each and you'd be done before the blog post was written.

bushbaba today at 2:29 AM

This proves my paranoia that you should print and rescan redactions. That, or take screenshots of the redacted PDF and convert them back to a PDF.

pimlottc yesterday at 11:08 PM

Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…
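Spelling out that arithmetic as a quick sketch (the variable names are illustrative, and the one-ambiguous-character-per-line figure is the comment's own guess):

```python
pages = 76
lines_per_page = 69

# One ambiguous [1l] character per line, as assumed in the comment above.
ambiguous_chars = pages * lines_per_page

# Each ambiguous character doubles the number of candidate files.
combinations = 2 ** ambiguous_chars
digits = len(str(combinations))

print(f"2^{ambiguous_chars} has {digits} decimal digits")
```

So the exponent checks out (76 × 69 = 5244), and the count itself is a number with well over a thousand digits, which is why any workable approach has to prune as it goes rather than enumerate.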

Hmm. Anyone got some spare CPU time?

kevin_thibedeau yesterday at 11:53 PM

pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm
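A small sketch of the direct-extraction route, assuming poppler's `pdfimages` is installed and on the PATH. The helper names `pdfimages_command` and `extract_images` and the `page` output prefix are made up for illustration; `-all` is the pdfimages option that writes embedded images in their native format rather than re-encoding them.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, out_prefix: Path) -> list[str]:
    """Build the pdfimages call; -all keeps each image in its native format."""
    return ["pdfimages", "-all", str(pdf), str(out_prefix)]


def extract_images(pdf: Path, out_dir: Path) -> list[Path]:
    """Pull the embedded scans out of `pdf` without re-rasterizing pages."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, out_dir / "page"), check=True)
    # pdfimages names its output <prefix>-NNN.<ext>
    return sorted(out_dir.glob("page-*"))
```

Usage would be something like `extract_images(Path("attachment.pdf"), Path("scans"))`, where both paths are placeholders.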

legitster today at 12:30 AM

Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

winddude today at 4:28 AM

Here are a few more to decode:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...

https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...

https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...

And then this one, judging by the name of the file (hanna something) and the content of the email:

"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "

Maybe more sinister (so be careful, I have no idea what the laws are if you uncover you know what Trump and Epstein were into)...

https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...

[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]

nubg today at 12:50 AM

Wait would this give us the unredacted PDFs?

velaia today at 12:13 AM

Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle

FarmerPotato yesterday at 11:01 PM

If only Base64 had used a checksum.
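To illustrate what a checksum would buy, here is a sketch of a purely hypothetical line format (base64 text, a colon, then the CRC-32 of the raw bytes in hex). Nothing like this exists in MIME; `line_with_crc` and `repair_line` are invented names. With a per-line check, each line can be brute-forced on its own instead of multiplying the ambiguity across the whole file.

```python
import base64
import itertools
import zlib


def line_with_crc(raw: bytes) -> str:
    """Hypothetical format: '<base64>:<crc32 of raw bytes as 8 hex digits>'."""
    return f"{base64.b64encode(raw).decode()}:{zlib.crc32(raw):08x}"


def repair_line(line: str):
    """Try every 1/l reading of one line until the CRC matches."""
    text, crc_hex = line.rsplit(":", 1)
    want = int(crc_hex, 16)  # hex has no 'l', so the checksum itself is unambiguous
    spots = [i for i, ch in enumerate(text) if ch in "1l"]
    for choice in itertools.product("1l", repeat=len(spots)):
        chars = list(text)
        for i, ch in zip(spots, choice):
            chars[i] = ch
        raw = base64.b64decode("".join(chars))
        if zlib.crc32(raw) == want:
            return raw
    return None  # no reading matches the checksum
```

A line with k ambiguous characters costs at most 2^k tries, and the lines are independent, so the total work grows with the number of lines rather than exponentially in it.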

linuxguy2 yesterday at 10:47 PM

Love this, absolutely looking forward to some results.

Evidlo today at 2:47 AM

I took a stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.

SomaticPirate today at 4:18 AM

Are there archives of this? I have no doubt that after this post goes viral some of these files might go “missing”. Having a large number of conspiracies validated has led me to firmly plant my aluminum hat.

zahlman yesterday at 11:26 PM

> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.
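A rough sketch of the validity test this would rely on, using Python's standard `zlib` module. The function name `flate_prefix_ok` is made up, and it assumes the stream is zlib-wrapped deflate, which is what PDF's FlateDecode filter uses.

```python
import zlib


def flate_prefix_ok(prefix: bytes) -> bool:
    """True if `prefix` can still be the start of a valid zlib stream.

    A fresh decompressor is used on every call; truncated input is fine,
    but structurally invalid input raises zlib.error.
    """
    try:
        zlib.decompressobj().decompress(prefix)
    except zlib.error:
        return False
    return True
```

The caveat is that a single flipped bit does not always fail right away. The decompressor may keep producing garbage for a while, and some errors only surface at the Adler-32 check at the very end of the stream, so a check like this prunes candidates rather than settling each character on the spot.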

eek2121 yesterday at 11:41 PM

Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

blindriver today at 12:15 AM

On one hand, the DOJ gets shit because it was taking too long to produce the documents, and on the other, they get shit because there are mistakes in the redactions, which happen because there are 3 million pages of documents.

queenkjuul today at 3:03 AM

I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s

iwontberude yesterday at 11:01 PM

This one is irresistible to play with. Indeed a nerd snipe.

prettywoman yesterday at 11:19 PM

[dead]