We have a benchmark for evaluating VLM on document understanding tasks: https://idp-leaderboard.org/ . But unfortunately, it does not include image to markdown as a task. The problem with evaluating an image to markdown is that even if the order of two blocks are different, it can still be correct. Eg: if you have both seller info and buyer info side by side in the image one model can extract the seller info first, and another model can extract the buyer info first. Both model will be correct but depending on the ground truth if you do fuzzy matching one model will have higher accuracy than the other one.
Normally, a company will train and test on a dataset that is trained on the same type of annotation (either left block first or right block first), and all other models can get a low score on their benchmark because they are trained on the opposite order of annotations.
Hi, author of the model here..
We have a benchmark for evaluating VLM on document understanding tasks: https://idp-leaderboard.org/ . But unfortunately, it does not include image to markdown as a task. The problem with evaluating an image to markdown is that even if the order of two blocks are different, it can still be correct. Eg: if you have both seller info and buyer info side by side in the image one model can extract the seller info first, and another model can extract the buyer info first. Both model will be correct but depending on the ground truth if you do fuzzy matching one model will have higher accuracy than the other one.
Normally, a company will train and test on a dataset that is trained on the same type of annotation (either left block first or right block first), and all other models can get a low score on their benchmark because they are trained on the opposite order of annotations.