Hacker News

Zpdf: PDF text extraction in Zig

202 points by lulzx | yesterday at 7:57 PM | 79 comments

Comments

lulzx | yesterday at 7:57 PM

I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads.

~41K pages/sec peak throughput.

Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.

~5,000 lines, no dependencies, compiles in <2s.

Why it's fast:

  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)
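
For readers who want a feel for the first two bullets, here is a minimal sketch in Python (the library itself is Zig, and this is not zpdf's code): map the file into memory instead of calling read(), then search backwards for the `startxref` keyword that anchors PDF parsing. The PDF bytes below are a toy stand-in.

```python
import mmap
import os
import tempfile

# Toy PDF tail: real parsers locate the cross-reference table by
# scanning backwards from EOF for the "startxref" keyword.
pdf_tail = b"%PDF-1.7\n...objects...\nstartxref\n1234\n%%EOF\n"

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(pdf_tail)
    path = f.name

with open(path, "rb") as f:
    # Map the file instead of reading it: the OS pages bytes in on
    # demand, and substring search runs over the mapping with no copies.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos = mm.rfind(b"startxref")  # search from the end of the file
    offset = int(mm[pos + len(b"startxref"):].split()[0])
    mm.close()

os.unlink(path)
print(offset)  # 1234
```

In the real library the backward scan would use SIMD-accelerated search; Python's `rfind` stands in for that here.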

What it handles:

  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
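
The last bullet, surrogate pairs, is the fiddliest of these. A hedged illustration of the decoding step (plain Python, not zpdf's implementation): CID font strings arrive as 16-bit big-endian code units, and characters above U+FFFF are split across a high/low surrogate pair that must be recombined.

```python
def decode_utf16be(data: bytes) -> str:
    """Decode UTF-16BE code units, combining surrogate pairs manually."""
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            lo = units[i + 1]
            if 0xDC00 <= lo <= 0xDFFF:
                # High + low surrogate -> code point above U+FFFF
                out.append(chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(chr(u))
        i += 1
    return "".join(out)

# "A" followed by U+1D49C (script capital A), encoded as a surrogate pair
raw = b"\x00\x41\xd8\x35\xdc\x9c"
print(decode_utf16be(raw))
```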
forgotpwd16 | yesterday at 10:54 PM

  74910,74912c187768,187779
  < [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
  < corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954
                                                                                                                                \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  < std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
  ---
  >
  > [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
  > corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
  >
  > § D.27.2
  > 1954
  >
  > © ISO/IEC
  > N4950
  >
  > wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  > std::string mbstring = myconv.to_bytes(L"Hello\n");
It is indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (That probably also explains much of the speed boost.)
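
For context on where the Unicode gap usually comes from: correct extraction means consulting each font's /ToUnicode CMap, which maps raw glyph codes to Unicode. A minimal sketch of what those entries look like and how a lookup table is built (illustrative only; the CMap fragment is made up, and this is neither zpdf's nor mutool's code):

```python
import re

# Fragment of a /ToUnicode CMap as embedded in PDF font dictionaries:
# each bfchar line maps a glyph code to a UTF-16BE code point.
cmap_fragment = """
beginbfchar
<0041> <0041>
<0042> <00E9>
<0043> <2014>
endbfchar
"""

def parse_bfchar(cmap: str) -> dict[int, str]:
    """Build glyph-code -> Unicode mapping from bfchar entries."""
    mapping = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", cmap):
        mapping[int(src, 16)] = bytes.fromhex(dst.zfill(4)).decode("utf-16-be")
    return mapping

table = parse_bfchar(cmap_fragment)
print(table[0x42])  # é
```

An extractor that skips this lookup and emits raw glyph codes will be faster but wrong for anything beyond simple encodings.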
fainpul | today at 5:48 AM

These vibe-coded tests are terrible:

https://github.com/Lulzx/zpdf/blob/main/python/tests/test_zp...

xvilka | today at 6:14 AM

Test it on the major PDF corpora [1].

[1] https://github.com/pdf-association/pdf-corpora

mpeg | yesterday at 9:27 PM

very nice, it'd be good to see a feature comparison, since when I use mupdf it's not just about speed but about the level of support for all kinds of obscure pdf features, and the accuracy of the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.

the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT

python bindings would be good too

manmal | today at 8:05 AM

Is there the possibility to hook in OCR for text blocks flattened into an image, maybe with some callback? That’s my biggest gripe with dealing with PDFs.

ceving | today at 12:28 PM

Spacing isn't handled quite right yet:

    zpdf extract texbook.pdf | grep -m1 Stanford
    DONALD E. KNUTHStanford UniversityIllustrations by
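(Missing spaces like this typically arise because PDF text operators position glyph runs without explicit space characters, so the extractor has to infer word breaks from horizontal gaps. A hedged sketch of the usual heuristic, with invented coordinates and no claim about zpdf's actual algorithm:)

```python
# Each run: (x_start, x_end, text) in text-space units, roughly as a
# layout pass would emit them for one line of the title page.
runs = [
    (0.0,   98.0,  "DONALD E. KNUTH"),
    (110.0, 205.0, "Stanford University"),
    (216.0, 290.0, "Illustrations by"),
]

def join_runs(runs, space_threshold=3.0):
    """Insert a space between adjacent runs when the horizontal gap
    exceeds a threshold (commonly a fraction of the space width)."""
    out = [runs[0][2]]
    for (_, prev_end, _), (start, _, text) in zip(runs, runs[1:]):
        if start - prev_end > space_threshold:
            out.append(" ")
        out.append(text)
    return "".join(out)

print(join_runs(runs))
# DONALD E. KNUTH Stanford University Illustrations by
```

Tuning that threshold per font and size is exactly the part that is easy to get wrong.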
agentifysh | yesterday at 9:14 PM

excellent stuff, what makes zig so fast?

odie5533 | yesterday at 9:35 PM

Now we just need Python bindings so I can use it in my trash language of choice.

pm2222 | today at 5:13 AM

What format, if any, is free and easy to parse and render? Build one, please.

littlestymaar | yesterday at 10:02 PM

- First commit: 3 hours ago.

- Commit message: LLM-generated.

- README: LLM-generated.

I'm not convinced that projects vibe coded over the evening deserve the HN front page…

Edit: and of course the author's blog is also full of AI slop…

2026 hasn't even started and I already hate it.

nullorempty | today at 4:28 AM

Tomorrow's headlines

fpdf

jpdf

cpdf

cpppdf

bfpdf

ppdf

...

opdf

amkharg26 | today at 4:04 AM

Impressive performance gains! 5x faster than MuPDF is significant, especially for applications processing large volumes of PDFs. Zig's fine-grained memory control without garbage-collection overhead makes it well suited to this kind of performance-critical work.

I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.

Would be interesting to see benchmarks on different PDF types - academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.
