Recreating Epstein PDFs from raw encoded attachments

544 points • by ComputerGuru • 02/04/2026 • 200 comments

Comments

dperfect • 02/06/2026

Nerdsnipe confirmed :)

Claude Opus came up with this script:

https://pastebin.com/ntE50PkZ

It produces a somewhat-readable PDF (first page at least) with this text output:

https://pastebin.com/SADsJZHd

(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)

bawolff • 02/05/2026

Tesseract supports being trained for specific fonts; that would probably be a good starting point.

https://pretius.com/blog/ocr-tesseract-training-data

pyrolistical • 02/05/2026

It decodes to binary pdf and there are only so many valid encodings. So this is how I would solve it.

1. Get an open source pdf decoder

2. Decode bytes up to first ambiguous char

3. See if the next bits are valid with a 1; if not, it’s an l

4. Might need to backtrack if both 1 and l were valid

By being able to quickly try each char in the middle of the decoding process, you cut out the startup time. This makes it feasible to test all permutations automatically and linearly.
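The steps above can be sketched roughly as follows. This is only an illustration, not the commenter's code: `resolve`, `decode_prefix`, and `prefix_ok` are hypothetical names, and `prefix_ok` stands in for whatever incremental validity check a real PDF decoder would expose. It also assumes the base64 text has already had its line breaks removed.

```python
import base64
from typing import Callable, Optional


def decode_prefix(chars: list[str], upto: int) -> bytes:
    """Decode the complete 4-character base64 groups that end at or before `upto`."""
    usable = upto - (upto % 4)
    return base64.b64decode("".join(chars[:usable]))


def resolve(text: str, prefix_ok: Callable[[bytes], bool]) -> Optional[str]:
    """Try '1' then 'l' at every ambiguous position, backtracking on failure.

    `prefix_ok` (an assumed callback) returns False as soon as the decoded
    prefix cannot be the start of a valid file.
    """
    chars = list(text)
    spots = [i for i, c in enumerate(chars) if c in "1l"]

    def search(k: int) -> bool:
        if k == len(spots):
            # Every ambiguous character is decided: validate the whole payload.
            return prefix_ok(base64.b64decode("".join(chars)))
        pos = spots[k]
        nxt = spots[k + 1] if k + 1 < len(spots) else len(chars)
        for guess in "1l":
            chars[pos] = guess
            # Only bytes before the next undecided character are checked here.
            if prefix_ok(decode_prefix(chars, nxt)) and search(k + 1):
                return True
        return False  # both guesses failed, so the caller must backtrack

    return "".join(chars) if search(0) else None
```

The recursion depth equals the number of ambiguous characters, so a real 76-page attachment would need an explicit stack instead. How much backtracking happens depends entirely on how early `prefix_ok` can reject a wrong guess.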

percentcer • 02/05/2026

This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.

legitster • 02/06/2026

Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.

Unlike every other PDF format that has been attempted, the federal government doesn't have to worry about adoption.

tcgv • 02/06/2026

> Then my mom wrote the following: “be careful not to get sucked up in the slime-machine going on here! Since you don’t care that much about money, they can’t buy you at least.”

I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life-experience / wisdom in it, and how it's helped and shaped me.

ChocMontePy • 02/06/2026

You can use the justice.gov search box to find several different copies of that same email.

The copy linked in the post:

https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...

Three more copies:

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...

https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...

Perhaps having several different versions might make it easier.

pimlottc • 02/05/2026

Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…

Hmm. Anyone got some spare CPU time?
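For scale, the arithmetic in that estimate can be checked directly. This is a quick sketch using only the figures assumed in the comment (76 pages, about 69 lines per page, one ambiguous character per line); the variable names are made up for illustration.

```python
pages = 76
lines_per_page = 69          # approximate figure from the comment
ambiguous_chars = pages * lines_per_page   # one [1l] per line, as assumed above

# Each ambiguous character doubles the number of candidate files.
candidates = 2 ** ambiguous_chars
digits = len(str(candidates))

print(f"{ambiguous_chars} ambiguous characters -> a {digits}-digit number of candidates")
```

That comes to 5244 ambiguous characters and a candidate count with 1,579 decimal digits, which is why blind enumeration is hopeless without some way to reject wrong guesses early.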

kevin_thibedeau • 02/05/2026

pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.

Followup: pdfimages is 13x faster than pdftoppm
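A minimal sketch of the direct-extraction route, assuming poppler's `pdfimages` is installed and on the PATH. The helper names `pdfimages_command` and `extract_images` and the `page` output prefix are hypothetical; `-all` is the pdfimages option that writes each embedded image in its native format instead of re-encoding it.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, out_prefix: Path) -> list[str]:
    """Build the pdfimages invocation; -all keeps each image in its stored format."""
    return ["pdfimages", "-all", str(pdf), str(out_prefix)]


def extract_images(pdf: Path, out_dir: Path) -> None:
    """Dump the embedded page scans without re-rasterizing the pages."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # Output files are named <out_dir>/page-000.<ext>, page-001.<ext>, ...
    subprocess.run(pdfimages_command(pdf, out_dir / "page"), check=True)
```

Because the images come out exactly as stored, the OCR step sees the original scan resolution rather than whatever DPI a rasterizer was told to use.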

chrisjj • 02/05/2026

> it’s safe to say that Pam Bondi’s DoJ did not put its best and brightest on this

Or worse. She did.

bushbaba • 02/06/2026

This proves my paranoia that you should print and rescan redactions. That, or take screenshots of the redacted PDF and convert them back to a PDF.

velaia • 02/06/2026

Bummer that it's not December - the https://www.reddit.com/r/adventofcode/ crowd would love this puzzle

alhamdulillah23 • 02/07/2026

Got it.

Page 1: https://imgur.com/a/jwgu9uH

Page 2: https://imgur.com/a/4Zi3bkk

Use this: https://github.com/KoKuToru/extract_attachment_EFTA00400459

nubg • 02/06/2026

Wait would this give us the unredacted PDFs?

iwontberude • 02/05/2026

This one is irresistible to play with. Indeed a nerd snipe.

linuxguy2 • 02/05/2026

Love this, absolutely looking forward to some results.

Evidlo • 02/06/2026

I took a stab at training Tesseract and holy jeebus is their CLI awful. Just an insanely complicated configuration procedure.

queenkjuul • 02/06/2026

I'm only here to shout out fish shell, a shell finally designed for the modern world of the 90s

ks2048 • 02/06/2026

I wonder if jmail (https://www.jmail.world/) has worked on this?

I tried to find the message in this blog post, but couldn't. (don't see how to search by date).

FarmerPotato • 02/05/2026

If only Base64 had used a checksum.

blindriver • 02/06/2026

On one hand, the DOJ gets shit because it was taking too long to produce the documents, and then on another, they get shit because there are mistakes in the redacting because there are 3 million pages of documents.

zahlman • 02/05/2026

> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.

A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.
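As a rough sketch of that validity test, Python's `zlib` can be fed a partial stream and will raise only when the data is actually malformed. This assumes the PDF streams use the usual zlib-wrapped FlateDecode format; `flate_prefix_ok` is a hypothetical helper name.

```python
import zlib


def flate_prefix_ok(data: bytes) -> bool:
    """Return False if zlib rejects the bytes seen so far, True otherwise.

    A truncated but otherwise valid stream is accepted, so the check can be
    run on a prefix while later characters are still undecided.
    """
    try:
        zlib.decompressobj().decompress(data)
    except zlib.error:
        return False
    return True
```

A wrong guess is not always rejected right away: the decoder may keep producing output for a while before it hits an invalid code or a failed checksum, which is where the "implausible result" test on the decompressed bytes would come in.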

winddude • 02/06/2026

here's another few to decode,

https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...

https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...

https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...

and then this one, judging by the name of the file (hanna something) and content of the email:

"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "

maybe more sinister (so be careful, i have no ideas what the laws are if you uncover you know what trump and Epstein were into)...

https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...

[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]

eek2121 • 02/05/2026

Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.

Cool article, however.

SomaticPirate • 02/06/2026

Are there archives of this? I have no doubt that after this post goes viral some of these files might go “missing”. Having a large number of conspiracies validated has led me to firmly plant my aluminum hat.

IshKebab • 02/06/2026

Disappointing how terrible open source OCR still is.

sorbus-25 • 02/06/2026

Event details: https://web.archive.org/web/20260206040716/https://what2wear...

wtcactus • 02/06/2026

My non political take about this gift that keeps on giving is that: PDF might seem great for the end user that is just expected to read or print the file they are given, but the technology actually sucks.

PDF is basically a prettifying layer on top of the older PS that brings a whole lot of baggage. The moment you start trying to do what should be simple stuff like editing lines, merging pages, or changing the resolution of the images, it starts giving you a lot of headaches.

I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.

prettywoman • 02/05/2026

[dead]

heraldgeezer • 02/06/2026

[flagged]
