Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them, comparing with the OCR DOJ provided, and a lot it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracing the pages from the PDFs, I'm currently sitting on ~500K images to OCR, so this is currently taking quite a while to run through.
Did you take any steps to decrease the dimension size of images, if this increases the performance? I have not tried this as I have not peformed an OCR task like this with an LLM. I would be interested to know at what size the vlm cannot make out the details in text reliably.