Oras • today at 12:15 PM • 12 replies

OCR was solved a long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?

I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?

Replies

chpatrick • today at 12:25 PM

It absolutely hasn't been solved, it's just got pretty decent in recent years.

joss82 • today at 1:01 PM

I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.

OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.

gettingoverit • today at 2:59 PM

Is it? I've never seen a single OCR that would replace a human just typing it by hand.

What if the goal is something actually useful, such as converting a scientific paper PDF back to LaTeX that renders into a pixel-perfect copy? What about converting tables from electronics datasheets into machine-readable form? I wouldn't even expect it in the next decade.

sscaryterry • today at 12:26 PM

Detecting characters: almost. Layout: no.

ljouhet • today at 1:04 PM

Real question: what tool do you use? (for long/complex documents with tables, code, maths)

- marker (with --force-ocr) gives me the best results

- Mistral OCR (seems really great, but I never managed to get it to work)

- Mathpix (tried a long time ago)

- docling (gives me garbage, I must be using it wrong)

- Unlimited OCR (will try it)

- ???

vulture916 • today at 12:20 PM

I haven't done much long-run OCR, so I'm unsure of the current state, but it would seem they have overcome this (from their paper):

"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
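To put rough numbers on the KV cache growth described in that quote, here is a minimal sketch. It assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per generated token. The model dimensions below (24 layers, 16 KV heads, head size 64, fp16) are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size for a single sequence.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder configuration, chosen only for illustration.
config = dict(n_layers=24, n_kv_heads=16, head_dim=64)

for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens, **config) / 2**20
    print(f"{tokens:>7} output tokens -> {mib:,.1f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with output length (96 KiB per token here), which is why transcribing a long document in one pass gets progressively more expensive.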

mamcx • today at 3:26 PM

Aside: what is the best tool for reading receipts/bank statements/invoices?

cannonpalms • today at 12:25 PM

I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.

Aboutplants • today at 12:51 PM

lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go

mschuster91 • today at 1:10 PM

> I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?

Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, since the OCR engine is now context-aware.

Say, for example, some chemical with a random long-ass name. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.

Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
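The point about domain data can be illustrated with a toy sketch. This is not how a context-aware OCR model works internally; it swaps in plain fuzzy matching from Python's standard `difflib` to show why a general word list cannot repair a misread chemical name while a domain-supplemented one can. The vocabularies, the `correct_token` helper, and the misread string are all made up for illustration.

```python
import difflib


def correct_token(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Hypothetical word lists: a tiny "general" one and a chemistry-supplemented one.
general_vocab = ["acid", "methyl", "benzene", "ethanol"]
chem_vocab = general_vocab + ["acetylsalicylic", "tetrahydrofuran"]

# Hypothetical bad read: the letter 'l' was recognized as the digit '1'.
ocr_read = "acetylsa1icylic"

print(correct_token(ocr_read, general_vocab))  # no close match, token is left as-is
print(correct_token(ocr_read, chem_vocab))     # close match found in the domain list
```

The same mechanism is also where the risk mentioned above comes from: any corrector that replaces what was scanned with what it expects can silently turn a correct but unusual token into a plausible wrong one, so a conservative cutoff and keeping the raw read around seem prudent.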

ta988 • today at 12:21 PM

Cost, throughput, latency...

JohnKemeny • today at 12:25 PM

OCR has definitely not been "solved a long time ago". What are you talking about?

In your opinion, what is SOTA here?
