"We compiled these Epstein estate emails from the House Oversight Committee release by converting the PDFs to structured text with an LLM"
and:
"Data Sources
Gmail emails: House Oversight Committee
Yahoo emails: DDoSecrets (brought to us by Drop Site News)
Technology
Document parsing and extraction powered by reducto"
Yes. Many of the pages were also PPM images (or encoded as such) embedded in the PDFs, so I used cheap, lightweight multimodal LLMs to classify the documents from the page images. It was surprisingly cheap: under $1 for a few thousand PDFs and images.
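For anyone who wants to try something similar, here is a minimal sketch of the first step: reading the header of a binary PPM (P6) image to get its dimensions before passing it to a model. It assumes the images have already been pulled out of the PDFs as standalone PPM files. The function name `read_ppm_header` is hypothetical, and the classification call is left out because it depends on which model and API you use.

```python
def read_ppm_header(data: bytes):
    """Parse a binary PPM (P6) header.

    Returns (width, height, maxval, pixel_offset), where pixel_offset is the
    index of the first byte of raster data. Hypothetical helper for illustration.
    """
    if data[:2] != b"P6":
        raise ValueError("not a binary PPM (P6) file")

    pos = 2
    fields = []
    while len(fields) < 3:
        # Skip whitespace between header tokens.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(data):
            raise ValueError("truncated header")
        # Header comments start with '#' and run to the end of the line.
        if data[pos:pos + 1] == b"#":
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        # Read one decimal field: width, then height, then maxval.
        start = pos
        while pos < len(data) and data[pos:pos + 1].isdigit():
            pos += 1
        if start == pos:
            raise ValueError("malformed header")
        fields.append(int(data[start:pos]))

    width, height, maxval = fields
    # Exactly one whitespace byte separates the header from the pixel data.
    return width, height, maxval, pos + 1


if __name__ == "__main__":
    # A 2x1 image: one red pixel, one green pixel.
    sample = b"P6\n# example\n2 1\n255\n" + bytes([255, 0, 0, 0, 255, 0])
    print(read_ppm_header(sample))
```

Knowing the dimensions up front makes it easy to downscale large scans before sending them to a model, which is one plausible way to keep per-image cost low.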