I have a flow where i extract text from a pdf with pdf-parse and then feed that to an ai for data ex...

hersko • yesterday at 2:19 PM • 4 replies

I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and is presumably far cheaper, as I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?

Replies

trollbridge • yesterday at 3:27 PM

I always render an image and OCR that, so I don't get odd problems from invisible text. It also avoids being affected by anything added for SEO.

saaaaaam • yesterday at 3:29 PM

There was an interesting discussion on here a couple of months back about images vs text, driven by this article: https://www.seangoedecke.com/text-tokens-as-image-tokens/

Discussion is here: https://news.ycombinator.com/item?id=45652952

unrahul • yesterday at 6:21 PM

I have seen this flow in what people in some startups call "Agentic OCR". It's essentially a hand-coded control flow that tries pdf-parse or a similar inexpensive approach first, and if the result fails a threshold, falls back to screenshot-to-text extraction.
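
A minimal sketch of that kind of control flow, in TypeScript. Nothing here comes from the thread beyond the overall shape: the `TextExtractor` and `Rasterizer` types, the `textLooksUsable` heuristic, and the 200 characters-per-page default are all assumptions for illustration. You would wrap pdf-parse (or similar) and a PDF-to-PNG renderer behind the two function types yourself.

```typescript
// Hypothetical signatures: wrap pdf-parse, a PDF rasterizer, etc. behind these.
type TextExtractor = (pdf: Uint8Array) => Promise<{ text: string; pageCount: number }>;
type Rasterizer = (pdf: Uint8Array) => Promise<Uint8Array[]>; // one PNG per page
type ModelInput =
  | { kind: "text"; text: string }
  | { kind: "images"; pages: Uint8Array[] };

// Assumed heuristic: the text layer counts as usable when it averages at least
// `minCharsPerPage` non-whitespace characters per page. Tune for your documents.
function textLooksUsable(text: string, pageCount: number, minCharsPerPage = 200): boolean {
  if (pageCount <= 0) return false;
  const visibleChars = text.replace(/\s+/g, "").length;
  return visibleChars / pageCount >= minCharsPerPage;
}

// Try the cheap text path first; fall back to page images if extraction
// throws or the extracted text fails the threshold.
async function prepareModelInput(
  pdf: Uint8Array,
  extractText: TextExtractor,
  rasterize: Rasterizer,
  minCharsPerPage = 200,
): Promise<ModelInput> {
  try {
    const { text, pageCount } = await extractText(pdf);
    if (textLooksUsable(text, pageCount, minCharsPerPage)) {
      return { kind: "text", text };
    }
  } catch {
    // Parsing failed; fall through to the image path.
  }
  return { kind: "images", pages: await rasterize(pdf) };
}
```

A character-count threshold only catches empty or near-empty text layers (for example, scanned documents). It would not catch the invisible-text problem mentioned elsewhere in the thread, since hidden text still counts as characters.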

mimim1mi • yesterday at 2:47 PM

By definition, OCR means optical character recognition. Which extraction method works depends on the contents of the PDF. Often the available PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
