I have a Shipibo (indigenous Peruvian language) to Spanish dictionary that I've been trying to translate into a Shipibo to English dictionary using a couple different llms but keep struggling with formatting (two columns, strange line breaks, but also both Shipibo and Spanish in the definitions make it difficult to grok). That all plus being pretty poorly scanned. May need to give this a try.
I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.
Now I need a catalog, archive, or historian function that archives and pulls the elements easily. Amazing work!
It’s a shame all these models target markdown and not something with more structure and a specification. There are different flavors of Markdown and limited support for footnotes, references, figures, etc.
How's this compare with docling (https://github.com/docling-project/docling)?
How does it compare to Datalab/Marker https://github.com/datalab-to/marker ? We evaluated many PDF->MD converters and this one performed the best, though it is not perfect.
I created a Powershell script to run this locally on any PDF: https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...
It does work, but it is very slow on my older GPU (Nvidia 1080 8GB). I would say it's taking at least 5 minutes per page right now, but maybe more.
Edit: If anyone is interested in trying a PDF to markdown conversion utility built this that is hosted on Cloud Run (with GPU support), let me know. It should be done in about an hour or so and I will post a link up here when it's done.
How does it handle documents with multi column or multi row tables?
e.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-JR1... page 1 rowspan page29 colspan
I'm curious, how does it do with non-english texts? It's my understanding that LLM-based OCR solutions fall way behind traditional ones once you introduce other languages.
With models like these, when multilingual is not mentioned it will perform really bad on real life non-english pdfs.
It's not open-source (nor open-weight): https://huggingface.co/nanonets/Nanonets-OCR-s/discussions/2
[dead]
It would be interesting to know how it compares with Llamaparse, LLMWhisperer, Marker, Reducto
[dead]
Full disclaimer: I work at Nanonets
Excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM model that converts documents into clean, structured Markdown. This model is trained to understand document structure and content context (like tables, equations, images, plots, watermarks, checkboxes, etc.). Key Features:
LaTeX Equation Recognition Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
Image Descriptions for LLMs Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
Signature Detection & Isolation Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
Watermark Extraction Extracts watermark text and stores it within <watermark> tag for traceability.
Smart Checkbox & Radio Button Handling Converts checkboxes to Unicode symbols like , , and for reliable parsing in downstream apps.
Complex Table Extraction Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...