belval • last Friday at 9:00 PM • 2 replies

I've worked on document extraction a lot, and while the tweet is too flippant for my taste, it's not wrong. Mistral is comparing itself to non-VLM computer vision services. While not necessarily what everyone needs, those are a very different beast from VLM-based extraction because they give you precise bounding boxes, usually at the cost of broader "document understanding".

Their failure modes are also vastly different. VLM-based extraction can misread entire sentences or miss entire paragraphs. Sonnet 3 had that issue. Computer vision models will instead make in-word typos.

Replies

wills_forward • yesterday at 12:00 AM

Why not use both? I just built a pipeline for document data extraction that uses PaddleOCR, then Gemini 3 to check + fix errors. It gets close to 99.9% on extraction from financial statements, finally on par with humans.
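
A rough sketch of that two-stage shape, for illustration only and not the pipeline described above. The `ocr` and `review` callables are hypothetical stand-ins for whatever wraps the OCR engine and the VLM, and the similarity guard is an added assumption: since OCR errors tend to be in-word typos while a VLM can rewrite whole sentences, a "fix" that drifts far from the OCR text is discarded.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def extract_with_review(
    image_path: str,
    ocr: Callable[[str], List[str]],   # hypothetical: image path -> text lines
    review: Callable[[str], str],      # hypothetical: OCR line -> corrected line
    min_similarity: float = 0.8,       # assumed threshold, tune on real data
) -> List[str]:
    """Run OCR, then let a second model fix each line, rejecting large rewrites."""
    result = []
    for line in ocr(image_path):
        fixed = review(line)
        # A genuine typo fix stays close to the OCR text; a low ratio
        # suggests the reviewer rewrote the line, so keep the OCR version.
        ratio = SequenceMatcher(None, line, fixed).ratio()
        result.append(fixed if ratio >= min_similarity else line)
    return result
```

Passing the two stages in as plain functions keeps the sketch independent of any particular OCR or model client, and makes the guard easy to test with fakes.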

zerocrates • yesterday at 6:17 AM

Is DeepSeek's not a VLM?
