Real question: what tool do you use? (for long/complex documents with tables, code, maths)
- marker (with --force-ocr) gives me the best results
- Mistral OCR (seems really great, but I never managed to get it to work)
- Mathpix (tried a long time ago)
- docling (gives me garbage, I must be using it wrong)
- Unlimited OCR (will try it)
- ???
poma-ai has really great chunking techniques that split the document based on its structure/hierarchy.
We use it on 200-page IEEE standards that are notoriously complex and filled with tables and diagrams. Highly recommend.
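For anyone wondering what structure-based chunking means in general, here is a minimal sketch. To be clear, this is not poma-ai's implementation or API; it just assumes the OCR tool has already produced Markdown with `#` headings, and the function name `chunk_by_headings` is made up for illustration. Each chunk keeps the path of headings above it, so a table or paragraph deep in a standard still carries its section context.

```python
import re
from typing import List, Tuple

# ATX-style Markdown heading: 1 to 6 '#' characters, then a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> List[Tuple[List[str], str]]:
    """Split Markdown into (heading_path, text) chunks, one per section body.

    Hypothetical helper for illustration only. It does not special-case
    fenced code blocks, so a '#' comment line inside a fence would be
    treated as a heading.
    """
    chunks: List[Tuple[List[str], str]] = []
    path: List[Tuple[int, str]] = []  # stack of (level, title) for open sections
    body: List[str] = []

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any sections at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Calling it on a converted document returns a list like `(["Standard", "Terms", "PHY"], "...")`, and the heading path can be prepended to each chunk before embedding.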
- Azure Document Intelligence (can also return Markdown, including headers and footers)
- AWS Textract