logoalt Hacker News

Recreating Epstein PDFs from raw encoded attachments

544 pointsby ComputerGuru02/04/2026200 commentsview on HN

Comments

dperfect02/06/2026

Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

show 4 replies
bawolff02/05/2026

Teseract supports being trained for specific fonts, that would probably be a good starting point

https://pretius.com/blog/ocr-tesseract-training-data

pyrolistical02/05/2026

It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if next bits are valid with an 1, if not it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process you cut out the start time. This makes it feasible to test all permutations automatically and linearly

show 2 replies
percentcer02/05/2026

This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.

show 5 replies
legitster02/06/2026

Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

show 5 replies
tcgv02/06/2026

> Then my mom wrote the following: “be careful not to get sucked up in the slime-machine going on here! Since you don’t care that much about money, they can’t buy you at least.”

I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hidhsight I can see the life-experience / wisdom in it, and how it's helped and shaped me.

show 1 reply
ChocMontePy02/06/2026

You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.

show 1 reply
pimlottc02/05/2026

Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?

show 3 replies
kevin_thibedeau02/05/2026

pdftoppm and Ghostscript (invoked via Imagemagick) re-rasterize full pages to generate their output. That's why it was slow. Even worse with a Q16 build of Imagemagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm

show 1 reply
chrisjj02/05/2026

> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

show 3 replies
bushbaba02/06/2026

This proves my paranoia that you should print and rescan redactions. That or do screenshots of the pdf redacted and convert back to a pdf

show 3 replies
velaia02/06/2026

Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crows would love this puzzle

nubg02/06/2026

Wait would this give us the unredacted PDFs?

show 3 replies
iwontberude02/05/2026

This one is irresistible to play with. Indeed a nerd snipe.

show 1 reply
linuxguy202/05/2026

Love this, absolutely looking forward to some results.

Evidlo02/06/2026

I took at stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.

show 1 reply
queenkjuul02/06/2026

I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s

ks204802/06/2026

I wonder if jmail (https://www.jmail.world/) has worked on this?

I tried to find the message in this blog post, but couldn't. (don't see how to search by date).

FarmerPotato02/05/2026

If only Base64 had used a checksum.

show 1 reply
blindriver02/06/2026

On one hand, the DOJ gets shit because it was taking too long to produce the documents, and then on another, they get shit because there are mistakes in the redacting because there are 3 million pages of documents.

show 7 replies
zahlman02/05/2026

> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.

show 2 replies
winddude02/06/2026

here's another few to decode,

https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...

https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...

https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...

and than this one judging by the name of the file (hanna something) and content of the email:

"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "

maybe more sinister (so be careful, i have no ideas what the laws are if you uncover you know what trump and Epstein were into)...

https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...

[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]

show 3 replies
eek212102/05/2026

Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

show 1 reply
SomaticPirate02/06/2026

Are there archives of this? I have no doubt after this post goes viral some of these files might go “missing” Having a large number of conspiracies validated has lead me to firmly plant my aluminum hat

show 1 reply
IshKebab02/06/2026

Disappointing how terrible open source OCR still is.

wtcactus02/06/2026

My non political take about this gift that keeps on giving is that: PDF might seem great for the end user that is just expected to read or print the file they are given, but the technology actually sucks.

PDF is basically a prettify layer on top of the older PS that brings an all lot of baggage. The moment you start trying to do what should be simple stuff like editing lines, merging pages, change resolution of the images, it starts giving you a lot of headaches.

I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.

show 1 reply
prettywoman02/05/2026

[dead]

heraldgeezer02/06/2026

[flagged]

show 1 reply