
Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API

81 points by adammajcher | yesterday at 1:10 PM | 30 comments

Comments

prats226 · today at 12:48 AM

Instead of going markdown -> LLM to get JSON, you can just train a slightly bigger model that you can constrain-decode to give JSON right away: https://huggingface.co/nanonets/Nanonets-OCR2-3B

We recently published a cookbook for constrained decoding here: https://nanonets.com/cookbooks/structured-llm-outputs/
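For a rough sense of what that looks like from the client side, here is a minimal sketch (not taken from the linked cookbook) that asks an OpenAI-compatible endpoint, e.g. a vLLM server hosting Nanonets-OCR2-3B, for schema-constrained JSON. The invoice schema, field names, server URL, and model name are all illustrative assumptions, and it only works if the backend actually supports json_schema response formats:

```typescript
// Rough sketch of client-side constrained decoding against an OpenAI-compatible
// server. Schema, field names, URL, and model name are invented for illustration.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1", // assumed local inference server
  apiKey: "not-needed-locally",
});

const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    total: { type: "number" },
    currency: { type: "string" },
  },
  required: ["invoice_number", "total", "currency"],
  additionalProperties: false,
};

async function extractInvoice(pageImageBase64: string) {
  const response = await client.chat.completions.create({
    model: "nanonets/Nanonets-OCR2-3B", // assumed model name on the server
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Extract the invoice fields as JSON." },
          { type: "image_url", image_url: { url: `data:image/png;base64,${pageImageBase64}` } },
        ],
      },
    ],
    // Constrained decoding: the server restricts sampling so the output
    // conforms to this JSON Schema (only if the backend supports it).
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", schema: invoiceSchema, strict: true },
    },
  });
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```

If the backend enforces the schema during sampling, the output always parses, which is the point of skipping a separate markdown -> LLM pass.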

sync · yesterday at 4:12 PM

This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR

The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...

binalpatel · yesterday at 8:03 PM

This is admittedly dated, but even back in December 2023 GPT-4 with its Vision preview was able to very reliably do structured extraction, and I'd imagine Gemini 3 Flash is much better than back then.

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

Back-of-the-napkin math (which I could be messing up completely), but I think you could process a 100-page PDF for ~$0.50 or less using Gemini 3 Flash?

> 560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)
> ~1,000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)

(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
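Spelling out that arithmetic (the per-page token counts and per-million-token prices are the commenter's assumptions, not verified figures):

```typescript
// Back-of-the-napkin cost estimate, using the numbers from the comment above.
const pages = 100;
const inputTokensPerPage = 560;
const outputTokensPerPage = 1000;
const inputPricePerMTok = 0.5;  // USD per 1M input tokens (assumed)
const outputPricePerMTok = 3.0; // USD per 1M output tokens (assumed)

const inputCost = (pages * inputTokensPerPage / 1_000_000) * inputPricePerMTok;    // $0.028
const outputCost = (pages * outputTokensPerPage / 1_000_000) * outputPricePerMTok; // $0.30
console.log(`~$${(inputCost + outputCost).toFixed(3)} per 100-page PDF`); // ~$0.328
```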

v3ss0n · yesterday at 3:20 PM

How is this better than Surya/Marker or kreuzberg? https://github.com/kreuzberg-dev/kreuzberg

hersko · yesterday at 2:19 PM

I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and would presumably be far cheaper, since I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
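A minimal sketch of that flow, with an invented prompt and placeholder model name; the PDF-to-PNG conversion is left to a caller-supplied function since the comment doesn't say which tool handles that step:

```typescript
// Text-first extraction with an image fallback, roughly matching the flow above.
// Prompt and model name are placeholders, not the commenter's actual setup.
import pdfParse from "pdf-parse";
import OpenAI from "openai";

const client = new OpenAI();

type ImagePart = { type: "image_url"; image_url: { url: string } };

async function extractWithLLM(userContent: string | ImagePart[]) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model
    messages: [
      { role: "system", content: "Extract the key fields from this document as JSON." },
      { role: "user", content: userContent },
    ],
  });
  return res.choices[0].message.content;
}

export async function extractFromPdf(
  pdfBuffer: Buffer,
  renderFirstPageToPng: (pdf: Buffer) => Promise<Buffer>, // e.g. wraps pdftoppm; supplied by caller
) {
  // Cheap path: pull the text layer with pdf-parse and send plain tokens.
  const { text } = await pdfParse(pdfBuffer);
  if (text.trim().length > 0) {
    try {
      return await extractWithLLM(text);
    } catch {
      // fall through to the image path
    }
  }

  // Fallback: scanned or text-less PDFs get sent as an image instead.
  const png = await renderFirstPageToPng(pdfBuffer);
  return extractWithLLM([
    { type: "image_url", image_url: { url: `data:image/png;base64,${png.toString("base64")}` } },
  ]);
}
```

The text path runs first because plain text input is much cheaper than image input; the image path only kicks in when the text layer is empty or extraction fails.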

sgc · yesterday at 2:55 PM

How does this compare to dots.ocr? I got fantastic results when I tested dots.

https://github.com/rednote-hilab/dots.ocr

fmirkowski · yesterday at 9:36 PM

Having worked with PaddleOCR, Tesseract, and many other OCR tools before, this is still one of the best and smoothest OCR experiences I've ever had. Deployed in minutes.

constantinum · yesterday at 4:18 PM

What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.

Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.

Other tools in this space:

LLMWhisperer/Unstract (AGPL)

Reducto

Extend AI

LlamaParse

Docling

mechazawa · yesterday at 2:10 PM

Is only Bun supported, or also regular Node?

cess11 · yesterday at 7:17 PM

Why is 12GB+ VRAM a requirement? The OCR model looks kind of small (https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main), so I'm assuming it's needed for some processing afterwards.
