logoalt Hacker News

lulzxyesterday at 7:57 PM6 repliesview on HN

I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads.

~41K pages/sec peak throughput.

Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.

~5,000 lines, no dependencies, compiles in <2s.

Why it's fast:

  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)
What it handles:

  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)

Replies

DannyBeetoday at 2:50 AM

FWIW - mupdf is simply not fast. I've done lots of pdf indexing apps, and mupdf is by far the slowest and least able to open valid pdfs when it came to text extraction. It also takes tons of memory.

a better speed comparison would either be multi-process pdfium (since pdfium was forked from foxit before multi-thread support, you can't thread it), multi-threaded foxit, or something like syncfusion (which is quite fast and supports multiple threads). Or even single thread pdfium vs single thread your-code.

These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.

The other thing it doesn't appear you mention is whether you handle putting the words in reading order (IE how they appear on the page), or only stream order (which varies in its relation to apperance order) .

If it's only stream order, sure, that's really fast to do. But also not anywhere near as helpful as reading order, which is what other text-extraction engines do.

Looking at the code, it looks like the code to do reading order exists, but is not what is being benchmarked or used by default?

If so, this is really comparing apples and oranges.

tveitayesterday at 9:24 PM

What kind of performance are you seeing with/without SIMD enabled?

From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.

There is a "--parallel" option, but that is only implemented for the "bench" command.

show 1 reply
cheshire_catyesterday at 9:28 PM

You've released quite a few projects lately, very impressive.

Are you using LLMs for parts of the coding?

What's your work flow when approaching a new project like this?

show 2 replies
littlestymaartoday at 8:11 AM

> I built

You didn't. Claude did. Like it did write this comment.

And you didn't even bother testing it before submitting, which is insulting to everyone.

show 1 reply
jeffbeeyesterday at 9:57 PM

What's fast about mmap?

show 2 replies
jonstewartyesterday at 10:27 PM

What’s the fidelity like compared to tika?

show 1 reply