Hacker News

jmyeet today at 1:29 AM

Translating PDFs is more complicated than that because the structure of a PDF document doesn't lend itself well to this kind of thing.

For example: if there's a dish name with a two-line description below it and some allergy symbols below that, in HTML you can imagine the document structure that produces that. In PDF terms that might be 4 separate objects. The eye can see that the two lines are adjacent and belong together, but the document structure doesn't necessarily represent them that way.
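
To make that concrete, here is a rough sketch of what a tool has to do to recover the grouping. Everything in it is an assumption for illustration: `Fragment`, `group_into_blocks`, the coordinates, and the 1.5x line-gap threshold are all made up, and real PDF coordinates usually run upward from the bottom-left corner rather than downward as they do here.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned run of text, roughly what a PDF page draws (hypothetical model)."""
    text: str
    x: float     # left edge, in points
    y: float     # baseline, measured downward from the top of the page (simplification)
    size: float  # font size, in points


def group_into_blocks(fragments, max_gap_factor=1.5):
    """Guess which fragments belong together, using vertical proximity only.

    A fragment joins the current block when its baseline is within
    max_gap_factor * (previous fragment's font size) of the previous baseline.
    This is a heuristic, not something the PDF itself records.
    """
    blocks = []
    previous = None
    for frag in sorted(fragments, key=lambda f: f.y):
        if previous is not None and frag.y - previous.y <= max_gap_factor * previous.size:
            blocks[-1].append(frag)
        else:
            blocks.append([frag])
        previous = frag
    return blocks


# Four independent objects for one menu item, plus the start of the next item.
menu = [
    Fragment("Pad Thai", 72, 100, 14),
    Fragment("Rice noodles with tamarind,", 72, 116, 10),
    Fragment("peanuts and lime", 72, 128, 10),
    Fragment("(N) (G)", 72, 140, 10),
    Fragment("Green Curry", 72, 180, 14),
]

blocks = group_into_blocks(menu)
for block in blocks:
    print(" | ".join(f.text for f in block))
```

The grouping is inferred from geometry alone, which is exactly the part HTML would give you for free through nesting.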

Translation can also break the layout, because each line is positioned and sized for the text it originally contains. The same goes for resizing the font.

Put another way, PDF should be viewed as a typesetting and layout format, not a document format.


Replies

AlotOfReading today at 2:00 AM

I think you're misunderstanding what I'm describing. It's getting a screenshot of the visible portion of the rendered document, not the document itself with all the tags and nastiness inside. The same feature works with a photo of handwritten text, where obviously no digital document exists. It's not perfect, but usually adequate for menu translation.