OCR has been solved a long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?
I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?
I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.
OCR still sucks in 2026. Hopefully this improves the situation, but I haven't tested it yet.
Is it? I've never seen a single OCR that would replace a human just typing it by hand.
What if the goal is something actually useful, such as converting a scientific paper PDF back to LaTeX that renders into a pixel-perfect copy? What about converting tables from electronics datasheets into a computer-readable form? I wouldn't expect either even in the next decade.
Real question: what tool do you use? (for long/complex documents with tables, code, maths)
- marker (with --force-ocr) gives me the best results
- Mistral OCR (seems really great, but I never managed to get it to work)
- Mathpix (tried a long time ago)
- docling (gives me garbage, I must be using it wrong)
- Unlimited OCR (will try it)
- ???
I haven't done much long-run OCR, so I'm unsure of the current state, but it would seem they have overcome this (from their paper):
"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
Aside: what is the best tool for reading receipts/bank statements/invoices?
I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.
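The downside from that same quote, the growing KV cache, can be put in rough numbers. This is only a back-of-the-envelope sketch: the layer count, KV head count, head dimension, and 2-byte precision below are made-up defaults for a mid-sized decoder, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence in a decoder-only transformer.

    All default sizes are hypothetical. Each token stores one key vector and
    one value vector per layer, so the cache grows linearly with length.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return seq_len * per_token


for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{tokens:>7} tokens -> {mib:,.0f} MiB")
```

With these assumed sizes each token costs 128 KiB of cache, and every new token has to attend over everything cached so far, which is why a long page of dense text gets both heavier and slower as decoding goes on.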
lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go
> I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?
Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, since the OCR engine is now context-aware.
Say, for example, some chemical with a random long-ass name. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.
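To make the chemical-name example concrete, here is a toy sketch. It is not how an LLM-based decoder works: it swaps in plain fuzzy matching from Python's standard `difflib` against a word list, just to show that the same misread is only recoverable when the engine has the right domain knowledge. The word lists, the `correct` helper, and the 0.8 cutoff are all invented for illustration.

```python
import difflib

# Hypothetical vocabularies: a generic word list vs. one with chemistry terms.
GENERAL_WORDS = ["acid", "methyl", "solution", "sample", "benzene"]
CHEMISTRY_TERMS = ["acetylsalicylic", "paracetamol", "methylphenidate"]


def correct(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the token unchanged if nothing is close enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


misread = "acetylsa1icylic"  # the OCR read "l" as "1"

print(correct(misread, GENERAL_WORDS))    # no close match, token is left as-is
print(correct(misread, CHEMISTRY_TERMS))  # domain vocabulary recovers the name
```

The cutoff matters: set it too low and the helper will happily replace a correct but unusual token with something that merely looks similar.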
Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...
[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
OCR has definitely not "been solved a long time ago", what are you talking about?
In your opinion, what is SOTA here?
It absolutely hasn't been solved, it's just got pretty decent in recent years.