Ah, I didn't know that. It's not something I had worked on before, and the file format is highly prevalent (so I assumed things would be easy), so it was surprising to me
You would think that, but PDF is not really a format for text. It's a format that describes typography and graphics layout & formatting. It's not uncommon for a text pdf to not contain all of the text it renders (due to ligatures).
Nothing about PDF is easy. Similarly to what once Tom Scott said about time zones, every time I must deal with PDFs I pray that PDF.js can be hacked in to doing it instead, otherwise I just don’t bother.
It’s on of the few examples when converting it in to picture and chucking it in a multimodal llm is a more sensible solution than trying to parse it.