
Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API

81 points by adammajcher | yesterday at 1:10 PM | 30 comments

Comments

prats226 · today at 12:48 AM

Instead of going markdown -> LLM to get JSON, you can just train a slightly bigger model that you can constrain-decode to give JSON right away: https://huggingface.co/nanonets/Nanonets-OCR2-3B

We recently published a cookbook for constrained decoding here: https://nanonets.com/cookbooks/structured-llm-outputs/
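For a rough sense of what that looks like from the client side, here is a minimal sketch (not taken from the linked cookbook) that asks an OpenAI-compatible endpoint, e.g. a vLLM server hosting Nanonets-OCR2-3B, for schema-constrained JSON. The invoice schema, field names, server URL, and model name are all illustrative assumptions, and it only works if the backend actually supports json_schema response formats:

```typescript
// Rough sketch of client-side constrained decoding against an OpenAI-compatible
// server. Schema, field names, URL, and model name are invented for illustration.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1", // assumed local inference server
  apiKey: "not-needed-locally",
});

const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    currency: { type: "string" },
  },
  required: ["invoice_number", "total", "currency"],
  additionalProperties: false,
};

async function extractInvoice(pageImageBase64: string) {
  const response = await client.chat.completions.create({
    model: "nanonets/Nanonets-OCR2-3B", // assumed model name on the server
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract the invoice fields as JSON." },
          { type: "image_url", image_url: { url: `data:image/png;base64,${pageImageBase64}` } },
        ],
      },
    ],
    // Constrained decoding: the server restricts sampling so the output
    // conforms to this JSON Schema (only if the backend supports it).
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", schema: invoiceSchema, strict: true },
    },
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```

If the backend enforces the schema during sampling, the output always parses, which is the point of skipping a separate markdown -> LLM pass.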

sync · yesterday at 4:12 PM

This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR

The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...

binalpatel · yesterday at 8:03 PM

This is admittedly dated, but even back in December 2023 GPT-4 with its Vision preview was able to very reliably do structured extraction, and I'd imagine Gemini 3 Flash is much better than back then.

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

Back-of-the-napkin math (which I could be messing up completely), but I think you could process a 100-page PDF for ~$0.50 or less using Gemini 3 Flash?

> 560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)
> ~1,000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)

(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
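Spelling out that arithmetic (the per-page token counts and per-million-token prices are the commenter's assumptions, not verified figures):

```typescript
// Back-of-the-napkin cost estimate, using the numbers from the comment above.
const pages = 100;
const inputTokensPerPage = 560;
const outputTokensPerPage = 1000;
const inputPricePerMTok = 0.5;  // USD per 1M input tokens (assumed)
const outputPricePerMTok = 3.0; // USD per 1M output tokens (assumed)

const inputCost = (pages * inputTokensPerPage / 1_000_000) * inputPricePerMTok;    // $0.028
const outputCost = (pages * outputTokensPerPage / 1_000_000) * outputPricePerMTok; // $0.30
console.log(`~$${(inputCost + outputCost).toFixed(3)} per 100-page PDF`); // ~$0.328
```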

v3ss0n · yesterday at 3:20 PM

How is this better than Surya/Marker or kreuzberg? https://github.com/kreuzberg-dev/kreuzberg

hersko · yesterday at 2:19 PM

I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and would presumably be far cheaper, since I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
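A minimal sketch of that flow, with an invented prompt and placeholder model name; the PDF-to-PNG conversion is left to a caller-supplied function since the comment doesn't say which tool handles that step:

```typescript
// Text-first extraction with an image fallback, roughly matching the flow above.
// Prompt and model name are placeholders, not the commenter's actual setup.
import pdfParse from "pdf-parse";
import OpenAI from "openai";

const client = new OpenAI();

type ImagePart = { type: "image_url"; image_url: { url: string } };

async function extractWithLLM(userContent: string | ImagePart[]) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model
    messages: [
      { role: "system", content: "Extract the key fields from this document as JSON." },
      { role: "user", content: userContent },
    ],
  });
  return res.choices[0].message.content;
}

export async function extractFromPdf(
  pdfBuffer: Buffer,
  renderFirstPageToPng: (pdf: Buffer) => Promise<Buffer>, // e.g. wraps pdftoppm; supplied by caller
) {
  // Cheap path: pull the text layer with pdf-parse and send plain tokens.
  const { text } = await pdfParse(pdfBuffer);
  if (text.trim().length > 0) {
    try {
      return await extractWithLLM(text);
    } catch {
      // fall through to the image path
    }
  }

  // Fallback: scanned or text-less PDFs get sent as an image instead.
  const png = await renderFirstPageToPng(pdfBuffer);
  return extractWithLLM([
    { type: "image_url", image_url: { url: `data:image/png;base64,${png.toString("base64")}` } },
  ]);
}
```

The text path runs first because plain text input is much cheaper than image input; the image path only kicks in when the text layer is empty or extraction fails.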

sgc · yesterday at 2:55 PM

How does this compare to dots.ocr? I got fantastic results when I tested dots.

https://github.com/rednote-hilab/dots.ocr

fmirkowski · yesterday at 9:36 PM

Having worked with PaddleOCR, Tesseract, and many other OCR tools before, this is still one of the best and smoothest OCR experiences I've ever had. Deployed in minutes.

constantinum · yesterday at 4:18 PM

What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.

Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.

Other tools in this space:

LLMWhisperer/Unstract (AGPL)

Reducto

Extend AI

LlamaParse

Docling

mechazawa · yesterday at 2:10 PM

Is only Bun supported, or also regular Node?

cess11 · yesterday at 7:17 PM

Why is 12GB+ VRAM a requirement? The OCR model looks kind of small (https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main), so I'm assuming it's needed for some processing afterwards.
