If I wanted to achieve 100% recognition, I would combine this method with an image model that recreates the original document from the transcribed text, matching the layout. To do that, give the model everything in the document except the page or paragraph you want to recreate (so it cannot simply copy the passage under test straight from the image). After reconstructing, do an optical comparison that specifically looks for misaligned characters to find the errors. Rinse and repeat. Expensive, but it should get you to 100% recognition.
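A rough sketch of that loop, as a toy and not a real pipeline: `toy_render` is a hypothetical stand-in for the image model (it just turns each character into an 8-bit "glyph"), the scan is assumed to be already segmented into one cell per character, and `correct` guesses a replacement by nearest glyph. All names here are made up for illustration.

```python
from typing import List, Tuple

Glyph = Tuple[int, ...]  # flattened bitmap for one character cell
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed candidate characters


def toy_render(text: str) -> List[Glyph]:
    """Hypothetical renderer: one 8-pixel glyph per character (its code bits)."""
    return [tuple((ord(ch) >> k) & 1 for k in range(8)) for ch in text]


def distance(a: Glyph, b: Glyph) -> int:
    """Number of differing pixels between two glyph cells."""
    return sum(x != y for x, y in zip(a, b))


def find_mismatches(scan: List[Glyph], rendered: List[Glyph]) -> List[int]:
    """Optical comparison: indices of cells where scan and re-render differ."""
    return [i for i, (a, b) in enumerate(zip(scan, rendered)) if distance(a, b) > 0]


def correct(text: str, scan: List[Glyph], bad: List[int]) -> str:
    """Replace each flagged character with the candidate closest to the scan."""
    chars = list(text)
    for i in bad:
        chars[i] = min(ALPHABET, key=lambda ch: distance(toy_render(ch)[0], scan[i]))
    return "".join(chars)


def refine(scan: List[Glyph], text: str, max_rounds: int = 5) -> str:
    """Render, compare, fix, and repeat until nothing mismatches."""
    for _ in range(max_rounds):
        bad = find_mismatches(scan, toy_render(text))
        if not bad:
            break
        text = correct(text, scan, bad)
    return text


scan = toy_render("hello world")           # pretend this is the scanned page
fixed = refine(scan, "hcllo wor1d")        # transcription with two OCR errors
```

In a real setup the renderer would be the expensive part, and the comparison would need to tolerate noise and small layout shifts rather than demand exact pixel equality.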