fixed. | alt Hacker News

lulzx • last Tuesday at 11:04 PM • 2 replies

fixed.

Replies

forgotpwd16 • last Tuesday at 11:20 PM

Yeah, sorry for the confusion. When I said Unicode, I meant foreign text, e.g. Greek, rather than (just) the unescaped symbols. For one random Greek textbook[0], zpdf output is (extract | head -15):

  01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208 
  0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

  020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A 
  01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C 
 
  0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208 
  020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

It's like this for the entire book. Mutool extracts the text just fine.

[0]: https://repository.kallipos.gr/handle/11419/15087
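
For anyone puzzling over those hex runs: they look like raw 2-byte character codes printed as hex, which is roughly what you'd expect if the font's code-to-Unicode mapping (the ToUnicode CMap in PDF terms) is never applied. That's a guess from the shape of the output, not something confirmed from zpdf's source. A minimal Python sketch of the missing step, where `split_codes`, `decode`, and `toy_map` are all hypothetical names and the mapping is made up rather than taken from the actual PDF:

```python
def split_codes(hex_run: str, width: int = 4) -> list[int]:
    """Split a hex run into fixed-width codes (4 hex digits = one 2-byte code)."""
    hex_run = hex_run.strip()
    if len(hex_run) % width != 0:
        raise ValueError("hex run length is not a multiple of the code width")
    return [int(hex_run[i:i + width], 16) for i in range(0, len(hex_run), width)]


def decode(codes: list[int], to_unicode: dict[int, str]) -> str:
    """Map each code to text; codes without a mapping become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Made-up mapping for illustration only. A real extractor would build this
# table from the ToUnicode CMap embedded with the font in the PDF.
toy_map = {0x0003: " ", 0x01F9: "\u03b1"}

codes = split_codes("01F9000301F9")
text = decode(codes, toy_map)
print([f"{code:04X}" for code in codes])
print(ascii(text))
```

Without the real table there's no way to recover the Greek text from the dump alone, which would fit with mutool handling the same file fine.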


TZubiri • last Tuesday at 11:08 PM

Lol, but there are 100 competitors in the PDF text extraction space, and some are multi-million-dollar industries: AWS Textract, ABBYY FineReader, PDFBox. I think you may be underestimating the challenge here.