From a tweet: https://x.com/i/status/2001821298109120856
> can someone help folks at Mistral find more weak baselines to add here? since they can't stomach comparing with SoTA....
> (in case y'all wanna fix it: Chandra, dots.ocr, olmOCR, MinerU, Monkey OCR, and PaddleOCR are a good start)
After clicking on your link I browsed Twitter for a minute and, damn, that place has become weird (or maybe it always was?).
Also, do you know if their benchmarks are available?
On their website, the benchmark table lists categories like “Multilingual (Chinese), Multilingual (East-asian), Multilingual (Eastern europe), Multilingual (English), Multilingual (Western europe), Forms, Handwritten, etc.” However, there’s no reference to the underlying benchmark data.
I'd like to see a comparison with Qwen 3 VL 235B-A22B, which is IME significantly better than MinerU.
In the OP's link, they compare themselves against leaderboard AIs and beat them.
I've worked on document extraction a lot, and while the tweet is too flippant for my taste, it's not wrong. Mistral is comparing itself to non-VLM computer vision services. Those aren't necessarily what everyone needs, but they're a very different beast from VLM-based extraction: they give you precise bounding boxes, usually at the cost of weaker "document understanding".
Their failure modes are also vastly different. VLM-based extraction can misread entire sentences or drop entire paragraphs; Sonnet 3 had that issue. Computer vision models instead make in-word typos.
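To make the contrast concrete, here's a minimal Python sketch of the two output shapes and their typical failure modes. The field names and values are illustrative, not any vendor's actual schema:

```python
# Classic CV OCR service: word-level text with precise pixel bounding boxes.
# Typical failure mode: an in-word typo ("Invoice" misread as "lnvoice").
cv_words = [
    {"text": "lnvoice",   "bbox": (120, 40, 310, 72),   "conf": 0.91},
    {"text": "Total:",    "bbox": (120, 540, 230, 570), "conf": 0.98},
    {"text": "$1,204.00", "bbox": (240, 540, 410, 570), "conf": 0.97},
]

# VLM-based extraction: fluent free-form text, no coordinates.
# Typical failure mode: silently dropping or rewriting a whole span,
# with nothing in the output itself to flag it.
vlm_text = "Invoice\nTotal: $1,204.00\n"

def flag_unmatched(words: list[dict], text: str) -> list[str]:
    """Crude cross-check: CV-detected words that never appear in the VLM text.
    A hit is either a VLM omission or a CV typo -- both failure modes surface."""
    return [w["text"] for w in words if w["text"] not in text]

print(flag_unmatched(cv_words, vlm_text))  # ['lnvoice'] -- here, the CV typo
```

In practice some cross-check like this is about the only way to notice a VLM dropping a paragraph, since its output reads fluently either way.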