It's definitely far easier to emit a controlled, useful subset of PDF than it is to parse PDF documents. I wrote a small PDF library for the Decker ecosystem that focuses only on bitmaps and page layout; it's roughly 4 KB and 135 LoC.
docs/demos: https://beyondloom.com/decker/pdf.html
browsable source: https://github.com/JohnEarnest/Decker/blob/main/examples/dec...
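As a rough illustration of how small a write-only PDF subset can be, here is a minimal Python sketch that emits a one-page PDF containing a single 8-bit grayscale bitmap, scaled to fit the page. This is not the Decker library's code; the function name `build_pdf`, its parameters, and the US Letter default page size are assumptions made up for the example.

```python
def build_pdf(width, height, pixels, page_w=612, page_h=792):
    """Return the bytes of a one-page PDF showing an 8-bit grayscale bitmap.

    Hypothetical helper for illustration only. `pixels` holds width*height
    gray values (0..255), rows ordered top to bottom. Page size is in points.
    """
    if len(pixels) != width * height:
        raise ValueError("pixels must hold exactly width*height bytes")

    # Scale the image to fit the page, preserving aspect ratio, and center it.
    scale = min(page_w / width, page_h / height)
    w, h = width * scale, height * scale
    x, y = (page_w - w) / 2, (page_h - h) / 2
    # An image XObject is drawn into the unit square, so `cm` sets size and position.
    content = f"q {w:.2f} 0 0 {h:.2f} {x:.2f} {y:.2f} cm /Im0 Do Q".encode("ascii")

    def stream(dict_body, data):
        # A stream object: a dictionary with /Length, then the raw bytes.
        return (b"<< " + dict_body + b" /Length %d >>\nstream\n" % len(data)
                + data + b"\nendstream")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /XObject << /Im0 5 0 R >> >> /Contents 4 0 R >>"
        % (page_w, page_h),
        stream(b"", content),
        stream(b"/Type /XObject /Subtype /Image /Width %d /Height %d "
               b"/ColorSpace /DeviceGray /BitsPerComponent 8" % (width, height),
               bytes(pixels)),
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    # Cross-reference table: every entry must be exactly 20 bytes long.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)
```

Something like `open("out.pdf", "wb").write(build_pdf(2, 2, bytes([0, 255, 255, 0])))` should produce a tiny checkerboard blown up to page width. The writer only has to track byte offsets and emit five fixed objects, whereas a parser has to cope with every encoding, filter, and malformed file found in the wild, which is the asymmetry described above.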
I'm working on one right now. It takes arbitrary PDFs and builds composable, dynamic pandoc pipelines whose output matches the source byte for byte. It's very, very complex, but if I can get it finished it will fuck over Adobe, so it's worth it.
This Decker stuff is pretty nifty too.