Hacker News

shbooms, yesterday at 11:42 PM

Often you will have requirements that the documents you release be digitally searchable, so in those cases this would not be an option.


Replies

pottertheotter, today at 1:23 AM

This made me think of something I came across recently that’s almost the opposite problem of requiring PDFs to be searchable. A local government would publish PDFs where the text is clearly readable on screen, but the selectable text layer is intentionally scrambled, so copy/paste or search returns garbage. It's a very hostile thing to do, especially with public data!

8note, today at 12:00 AM

Run some OCR on them afterwards to recreate the text layer?
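
For anyone wanting to try this, here is a minimal sketch in Python that shells out to the OCRmyPDF command-line tool. It assumes `ocrmypdf` (and the Tesseract language data for `eng`) is installed; nothing in the thread says which tool was meant. The function names `build_ocr_command` and `recreate_text_layer` and the file names are made up for illustration.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="eng"):
    """Build an OCRmyPDF command that discards the existing text layer."""
    # --force-ocr rasterizes each page and runs OCR on it, so the
    # scrambled text layer is replaced instead of being kept.
    return ["ocrmypdf", "--force-ocr", "-l", language, src, dst]


def recreate_text_layer(src, dst, language="eng"):
    """Run OCRmyPDF on src and write a searchable copy to dst."""
    # Assumption: the ocrmypdf executable is available on PATH.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src, dst, language), check=True)
```

Usage would be something like `recreate_text_layer("scrambled.pdf", "searchable.pdf")`. Because the pages are rasterized, the output is likely to be larger than the input, and the new text is only as accurate as the OCR.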
