Hacker News

slazien today at 2:57 AM

https://www.jmail.world/about

"We compiled these Epstein estate emails from the House Oversight Committee release by converting the PDFs to structured text with an LLM"

and:

"Data Sources

    Gmail emails: House Oversight Committee
    Yahoo emails: DDoSecrets (brought to us by Drop Site News)
Technology

Document parsing and extraction powered by reducto"


Replies

dvrp today at 7:42 AM

Yes. Also, many were PPM images (or encoded as such) inside the PDFs, and then I used cheap, lightweight multimodal LLMs to classify documents from the photos. It was surprisingly cheap: under $1 for a few thousand PDFs/images.
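For anyone curious what "PPM images" means in practice, here is a minimal sketch of reading a PPM header in Python, e.g. to check image dimensions before sending anything to a classifier. This is not the pipeline described above, just an illustration of the Netpbm format; the function name `read_ppm_header` is made up for this example, and it only handles the common case where comments sit between header tokens.

```python
def read_ppm_header(data: bytes):
    """Parse the header of an ASCII (P3) or binary (P6) PPM image.

    Returns (magic, width, height, maxval, offset), where offset is the
    index of the first byte of pixel data.
    Hypothetical helper for illustration; assumes comments appear only
    between header tokens.
    """
    tokens = []
    pos = 0
    n = len(data)
    while len(tokens) < 4:
        # Skip whitespace between header tokens.
        while pos < n and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= n:
            raise ValueError("truncated PPM header")
        # A comment runs from '#' to the end of the line.
        if data[pos:pos + 1] == b"#":
            while pos < n and data[pos:pos + 1] not in (b"\n", b"\r"):
                pos += 1
            continue
        # Read one token up to the next whitespace byte.
        start = pos
        while pos < n and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])

    magic = tokens[0].decode("ascii")
    if magic not in ("P3", "P6"):
        raise ValueError(f"not a PPM image: magic {magic!r}")
    width, height, maxval = (int(t) for t in tokens[1:])
    # A single whitespace byte separates maxval from the pixel data.
    return magic, width, height, maxval, pos + 1


if __name__ == "__main__":
    # A 2x1 binary PPM: one red pixel, one green pixel.
    header = b"P6\n# tiny example\n2 1\n255\n"
    sample = header + bytes([255, 0, 0, 0, 255, 0])
    print(read_ppm_header(sample))
```

With 8-bit binary PPMs (maxval below 256), the pixel data after `offset` is `width * height * 3` bytes of raw RGB, which is part of why PPM files tend to be large compared to PNG or JPEG.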