Hacker News

FailMore, today at 9:42 AM (2 replies)

Ah, I didn't know that. It's not something I had worked on before, and the file format is so prevalent that I assumed things would be easy, so it was surprising to me.


Replies

SirHumphrey, today at 10:15 AM

Nothing about PDF is easy. Much like what Tom Scott once said about time zones, every time I have to deal with PDFs I pray that PDF.js can be hacked into doing it instead; otherwise I just don’t bother.

It’s one of the few cases where converting it into a picture and chucking it into a multimodal LLM is a more sensible solution than trying to parse it.

caspper69, today at 11:38 AM

You would think that, but PDF is not really a format for text. It's a format that describes typography and graphics layout and formatting. It's not uncommon for a text PDF not to contain all of the text it renders (due to ligatures).
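
To make the ligature point concrete, here is a minimal sketch. It assumes the extractor hands back ligature glyphs as Unicode presentation-form code points (U+FB01, U+FB03); the sample string and the helper name expand_ligatures are made up for illustration. A PDF whose fonts carry no Unicode mapping at all won't be helped by this.

```python
import unicodedata

# Hypothetical extractor output: "office file" typeset with the single-glyph
# ligatures U+FB03 (ffi) and U+FB01 (fi) instead of separate letters.
extracted = "o\ufb03ce \ufb01le"


def expand_ligatures(text: str) -> str:
    """Expand Unicode compatibility characters, Latin ligatures included."""
    # NFKC applies compatibility decomposition, which splits U+FB01 into
    # "f" + "i" and U+FB03 into "f" + "f" + "i", then recomposes.
    return unicodedata.normalize("NFKC", text)


# A naive search misses the word because the letters "f" and "i" are not
# present as separate characters in the extracted string.
print("file" in extracted)                    # False
print("file" in expand_ligatures(extracted))  # True
print(expand_ligatures(extracted))            # office file
```

NFKC also folds other compatibility characters (full-width forms, superscripts and so on), so it's probably something to apply to a search index rather than to the text you display.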