Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find.
> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above.
hopefully someone is independently archiving all documents
my understanding is that some are being removed
A bit off-topic, but I find it kinda funny that the "Decline" button on the cookie popup on this page is labeled "Continue without consent".
Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them and comparing the results with the OCR the DOJ provided. A lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm sitting on ~500K images to OCR, so this is taking quite a while to run through.
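For anyone wanting to do a similar comparison, here is a minimal sketch of how two OCR outputs for the same page could be scored against each other. It uses only the standard library's difflib; the function names, the whitespace/case normalization, and the 0.9 threshold are my own assumptions, not what the commenter above is actually running.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not counted as mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ocr_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two OCR outputs for the same page."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def flag_mismatches(pages, threshold=0.9):
    """Yield (page_id, score) for pages whose similarity falls below the threshold.

    `pages` is an iterable of (page_id, provided_text, new_text) tuples;
    that tuple layout is a hypothetical choice for this sketch.
    """
    for page_id, provided_text, new_text in pages:
        score = ocr_similarity(provided_text, new_text)
        if score < threshold:
            yield page_id, score
```

SequenceMatcher gets slow on very long inputs, so with ~500K pages it probably makes sense to compare page by page rather than whole documents at once.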
Any guesses why some of the newest files seem to have random "=" characters in the text? My first thought was OCR, but it did not seem to be linked to characters like "E" that could be misread by an OCR tool. My second guess is that it just makes it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart if a "=" character is found (although making this work for long search queries would make the search slower).
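A rough sketch of such a search tool, assuming the stray "=" can show up between any two characters of the query. The function names are made up for illustration. The pattern grows with the query length, which is where the slowdown for long queries comes from.

```python
import re


def tolerant_pattern(query: str, noise: str = "=") -> re.Pattern:
    """Build a regex matching `query` even when `noise` is interleaved between its characters."""
    gap = f"(?:{re.escape(noise)})*"
    return re.compile(gap.join(re.escape(ch) for ch in query), re.IGNORECASE)


def search(text: str, query: str):
    """Return the (start, end) span of every tolerant match of `query` in `text`."""
    return [m.span() for m in tolerant_pattern(query).finditer(text)]
```

For example, `search("the fl=ight log", "flight")` finds the word despite the stray character. Stripping every "=" from the text before indexing would be faster, but it would also eat legitimate "=" characters and shift the match offsets.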
I can't even download the archive, the transmission always terminates just before it's finished. Spooky.
Somebody ought to train an LLM exclusively on this text, just for funsies.
These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files.
This is so incredibly useful to me right now for incidental reasons I am commenting to make sure I can get back to it.
> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata
Maybe I'm underestimating the full extent of the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of this and doing it for decades at this point, right?
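To give a sense of how lightweight the stripping itself can be, here is a hedged sketch that walks the JPEG segment headers and drops the ones that usually carry metadata. It is pure standard library. The choice of which markers to drop (APP1 for EXIF/XMP, APP13 for IPTC, COM for comments) is my assumption, it ignores padding bytes between segments, and it says nothing about what the DOJ actually does.

```python
import struct

# Segments that commonly carry metadata (an assumed, non-exhaustive set):
# APP1 (EXIF/XMP), APP13 (IPTC/Photoshop), COM (free-text comments).
DROP_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG byte string with the metadata segments removed."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest untouched.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            raise ValueError(f"invalid segment length at offset {pos}")
        end = pos + 2 + length
        if marker not in DROP_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no start-of-scan marker found")
```

The pixel data is never decoded or recompressed here, so there is no quality loss, which is the main difference from converting the images to a lower-DPI format.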
What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc.
Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld.
(2025) just follow hn guideline, impressive voter ring though
I found this part interesting:
> There are also other documents that appear to simulate a scanned document but completely lack the “real-world noise” expected with physical paper-based workflows. The much crisper images appear almost perfect without random artifacts or background noise, and with the exact same amount of image skew across multiple pages. Thanks to the borders around each page of text, page skew can easily be measured, such as with VOL00007\IMAGES\0001\EFTA00009229.pdf. It is highly likely these PDFs were created by rendering original content (from a digital document) to an image (e.g., via print to image or save to image functionality) and then applying image processing such as skew, downscaling, and color reduction.
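The skew check described in that passage can be sketched roughly as follows, assuming you already have the pixel coordinates of the two ends of each page's top border (finding those is a separate step not shown here). The function names and the 0.01 degree tolerance are hypothetical, picked only for illustration.

```python
import math


def skew_degrees(left, right):
    """Angle of the top border relative to horizontal, from its two end points (x, y) in pixels."""
    (x1, y1), (x2, y2) = left, right
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def identical_skew(page_corners, tolerance=0.01):
    """True if all pages share the same skew within `tolerance` degrees.

    `page_corners` is a list of (left_point, right_point) pairs, one per page.
    Fewer than two pages gives nothing to compare, so the result is False.
    """
    if len(page_corners) < 2:
        return False
    angles = [skew_degrees(left, right) for left, right in page_corners]
    return max(angles) - min(angles) <= tolerance
```

If the passage is right, physically scanned pages would be expected to show some variation from page to page, so a run of pages where `identical_skew` returns True would be consistent with the skew having been applied digitally.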