Hacker News

A case study in PDF forensics: The Epstein PDFs

157 points by DuffJohnson today at 2:46 PM | 69 comments

Comments

anigbrowl today at 6:10 PM

I found this part interesting:

> There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.
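
For anyone who wants to try the skew measurement described in that passage, here is a minimal sketch. It assumes you have already located two pixel coordinates along the top border of each page; the function names and the sample coordinates are made up for illustration and are not measured from the released files. Identical angles across pages would be consistent with skew applied in software, since real scans usually vary a little from page to page.

```python
import math

def skew_degrees(left, right):
    """Angle of a border edge relative to horizontal, from two (x, y) pixel points on it."""
    (x0, y0), (x1, y1) = left, right
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def skew_is_uniform(angles, tolerance=0.01):
    """True when every page has the same skew within `tolerance` degrees."""
    return max(angles) - min(angles) <= tolerance

# Hypothetical border endpoints for three pages (illustrative numbers only).
pages = [
    ((100, 200), (1100, 210)),
    ((100, 200), (1100, 210)),
    ((100, 200), (1100, 210)),
]
angles = [skew_degrees(a, b) for a, b in pages]
print(angles, skew_is_uniform(angles))
```

The tolerance is an arbitrary choice; a real analysis would need to pick it based on how precisely the border endpoints can be located.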

ted_bunny today at 4:37 PM

Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.

waynenilsen today at 3:34 PM

> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.

hopefully someone is independently archiving all documents

my understanding is that some are being removed

yonatan8070 today at 5:34 PM

A bit off-topic, but I find it kinda funny that the "Decline" button on the cookie popup on this page is labeled "Continue without consent".

embedding-shape today at 3:49 PM

Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
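
A rough sketch of how such a page-by-page comparison could be scored, using only the standard library. This is not the commenter's pipeline: the function names, the whitespace and case normalization, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import difflib

def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not count as mismatches."""
    return " ".join(text.lower().split())

def ocr_similarity(a, b):
    """Similarity ratio in [0, 1] between two OCR outputs for the same page."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()

def flag_mismatches(pages, threshold=0.9):
    """pages: iterable of (page_id, provided_text, new_text); returns ids below the threshold."""
    return [pid for pid, old, new in pages if ocr_similarity(old, new) < threshold]

# Hypothetical sample data, not taken from the released documents.
sample = [
    ("p1", "The quick  brown fox", "the quick brown fox"),
    ("p2", "The quick brown fox", "xx yy zz"),
]
print(flag_mismatches(sample))
```

`SequenceMatcher` is quadratic in the worst case, so for ~500K pages it would probably only be practical as a first pass on short pages or on sampled lines.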

originalvichy today at 3:54 PM

Any guesses why some of the newest files seem to have random ”=” characters in the text? My first thought was OCR, but it seemed to not be linked to characters like ”E” that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a ”=” character is found (although making this work for long search queries would make the search slower).
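
One way such a search tool might look, as a minimal sketch: build a regular expression that tolerates stray "=" characters (optionally followed by a line break) between any two characters of the query. The function names are hypothetical, and the choice to also skip a trailing line break is an assumption, not something the files are known to require.

```python
import re

def tolerant_pattern(query):
    """Compile a case-insensitive pattern that allows runs of '=' inside the query."""
    # Each gap may hold zero or more '=' characters, each optionally followed by CR/LF.
    gap = r"(?:=\r?\n?)*"
    return re.compile(gap.join(re.escape(ch) for ch in query), re.IGNORECASE)

def find_all(query, text):
    """Return (start, end) spans of every tolerant match of `query` in `text`."""
    return [m.span() for m in tolerant_pattern(query).finditer(text)]

print(find_all("flight", "the fli=ght log"))
```

The pattern grows linearly with the query length, which is where the slowdown for long queries mentioned above would come from. Stripping the "=" characters once at indexing time would avoid that cost, at the risk of removing legitimate equals signs.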

_def today at 4:25 PM

I can't even download the archive, the transmission always terminates just before it's finished. Spooky.

bugeats today at 4:00 PM

Somebody ought to train an LLM exclusively on this text, just for funsies.

corygarms today at 3:45 PM

These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.

NoToP today at 5:32 PM

This is so incredibly useful to me right now, for incidental reasons, that I am commenting to make sure I can get back to it.

nkozyra today at 3:53 PM

> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata

Maybe I'm underestimating the full scope of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?
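
To give a sense of how small the metadata-stripping step can be, here is a sketch that drops the JPEG segments that usually carry this data: APP1 (EXIF and XMP), APP13 (IPTC) and COM (comments). It is a simplified illustration, not what the DoJ or any particular tool does, and `strip_jpeg_metadata` is a hypothetical name. It assumes a well-formed file in which every segment before the scan data has a length field.

```python
import struct

# APP1 holds EXIF and XMP, APP13 holds IPTC, COM holds free-text comments.
METADATA_MARKERS = {0xE1, 0xED, 0xFE}

def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of the JPEG bytes without APP1, APP13 and COM segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest untouched.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            out += data[pos:end]
        pos = end
    out += data[pos:]
    return bytes(out)
```

This only covers container-level metadata. It does nothing about information that lives in the pixels themselves, which may be part of why re-rendering is attractive to whoever prepared the files.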

tibbon today at 3:03 PM

That's a lot of PeDoFiles!

(But seriously, great work here!)

mmoss today at 5:13 PM

What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc.

Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld.

meidan_y today at 2:50 PM

(2025) just follow hn guideline, impressive voter ring though
