The surprisingly complex journey to text-selectable client-side generated PDFs

31 points • by FailMore • yesterday at 5:37 AM • 20 comments

Comments

I wonder if using Typst would be a viable solution: the compiler can be built into a wasm component that runs locally in the browser (that's what the Typst webapp does) and it generates good PDFs with working selection/copy/paste.

There's even a package (cmarker) that can translate Markdown to Typst, which could be enough for an MVP.

Worf • today at 10:46 AM

PDFs should be only for printing or maybe for keeping scanned versions of things. For anything else they're just not the right tool for the job. Not for things meant to be accessed on a computer like books, scientific papers or, for some weird reason, catalogs and price lists from websites.

We have responsive and open standards like HTML and EPUB (zipped XHTML) and they work great. arXiv has HTML papers, and libgen and anna's archive often have EPUB versions of books. The issue for me with EPUB is the lack of good readers now.
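To make the "zipped" part concrete, here is a minimal sketch in Python using only the standard library. It builds a tiny EPUB-like archive in memory and lists the XHTML documents inside it. The helper names and the `OEBPS/chapter1.xhtml` path are made up for illustration, and the archive is not a valid EPUB (a real one also needs `META-INF/container.xml` and a package document).

```python
import io
import zipfile
from typing import List


def build_epub_like_zip() -> bytes:
    """Build a tiny in-memory archive shaped like an EPUB (illustrative, not valid)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # EPUB expects the mimetype entry first and stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Hypothetical chapter path; the content is ordinary XHTML.
        zf.writestr(
            "OEBPS/chapter1.xhtml",
            '<html xmlns="http://www.w3.org/1999/xhtml">'
            "<body><p>Hello</p></body></html>",
            compress_type=zipfile.ZIP_DEFLATED,
        )
    return buf.getvalue()


def list_xhtml_documents(epub_bytes: bytes) -> List[str]:
    """Return the names of the (X)HTML documents stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [name for name in zf.namelist()
                if name.endswith((".xhtml", ".html"))]


def read_mimetype(epub_bytes: bytes) -> str:
    """Return the declared media type from the mimetype entry."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.read("mimetype").decode("ascii")
```

Since the text is plain markup inside a zip, selection and copy/paste come for free from whatever renders the XHTML.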

ashishb • today at 10:30 AM

Software engineers drastically underestimate GUI work: web layouts, mobile app layouts, and even PDF layouts are non-trivial to get right in all circumstances.

josefrichter • today at 9:22 AM

It’s not that surprising. It’s one of those well-known Pandora’s boxes of web development: email templates, PDFs, printing, …

gobdovan • today at 10:31 AM

Thanks, this puts into perspective why copy-paste from PDFs is so bad.

I'm months into building a pasteboard transform library that normalises provider-specific data from VS Code, Google Docs, PDFs and a bunch of Chromium apps, so I can start pasting everything everywhere exactly how I want it. It's much, much messier than I expected.

Apps put different UTTypes on the pasteboard that are not really compatible with each other. Usually there's a plain text fallback, then rich text/HTML, then provider-specific data. You show how much insane work is needed just to make text selectable with glyph mappings, layout, links, code blocks, rendered styles, etc. But once you copy from that PDF, most viewers still only expose raw text, and often broken raw text at that...
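A minimal sketch of that fallback order, written in Python for brevity and not taken from any real library: it picks the richest representation present in a pasteboard item, modelled here as a plain dict. `com.example.editor-data` is a made-up provider-specific type, and `pick_representation` is a hypothetical helper; `public.html` and `public.utf8-plain-text` are standard identifiers.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Preference order, richest first.
PREFERRED_TYPES: Sequence[str] = (
    "com.example.editor-data",  # hypothetical provider-specific payload
    "public.html",              # rich text as HTML
    "public.utf8-plain-text",   # plain text fallback
)


def pick_representation(
    items: Mapping[str, bytes],
    preferred: Sequence[str] = PREFERRED_TYPES,
) -> Optional[Tuple[str, bytes]]:
    """Return the (type, data) pair for the richest type present, or None."""
    for type_id in preferred:
        data = items.get(type_id)
        if data:  # skip missing and empty payloads
            return type_id, data
    return None


# A PDF viewer that only exposes (often broken) raw text.
pdf_copy = {"public.utf8-plain-text": b"Hel lo wor ld"}

# A browser copy that offers HTML plus the plain text fallback.
browser_copy = {
    "public.html": b"<b>Hello</b>",
    "public.utf8-plain-text": b"Hello",
}
```

With the PDF copy, the only thing left to pick is the plain text, which is exactly where the breakage shows up.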

alansaber • today at 11:12 AM

You don't know the hell of trawling through PDF XML and HTML construction until you've done it
