Hacker News

Convert potentially dangerous PDFs to safe PDFs

165 points by dp-hackernews yesterday at 10:54 PM | 58 comments

Comments

coppsilgold today at 12:24 AM

While useful, it needs a big red warning for potential leakers. If they were personally served the documents (via email, while logged in, etc.), there really isn't much that can be done to ascertain the safety of leaking them. It's not even safe if there are two or more leakers and they "compare notes" to try to "clean" something for release.

https://en.wikipedia.org/wiki/Traitor_tracing#Watermarking

https://arxiv.org/abs/1111.3597

The watermark can even be contained in the wording itself (multiple versions of sentences, word choice, etc. store the entropy). The only moderately safe thing to leak would be a pure-text, full paraphrase of the material. But that wouldn't inspire much trust as a source.
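A minimal sketch of how wording alone could carry an identifier, assuming a toy scheme in which each "slot" has two interchangeable wordings and therefore stores one bit. The template, the word pairs, and the function names are all hypothetical; real schemes, such as those in the links above, are more robust than this.

```python
# Hypothetical illustration: each slot offers two interchangeable wordings,
# so every slot can carry one bit of a recipient ID.
SLOTS = [
    ("begin", "start"),
    ("large", "big"),
    ("quickly", "rapidly"),
    ("shows", "demonstrates"),
]

TEMPLATE = "We {0} the rollout soon. A {1} team will move {2}, as the plan {3}."


def embed(recipient_id: int) -> str:
    """Pick one wording per slot according to the bits of recipient_id."""
    words = [pair[(recipient_id >> i) & 1] for i, pair in enumerate(SLOTS)]
    return TEMPLATE.format(*words)


def extract(text: str) -> int:
    """Recover the ID by checking which variant of each slot appears."""
    tokens = set(text.replace(",", " ").replace(".", " ").split())
    recipient_id = 0
    for i, (_zero_word, one_word) in enumerate(SLOTS):
        if one_word in tokens:
            recipient_id |= 1 << i
    return recipient_id


def slots_visible_to_colluders(id_a: int, id_b: int) -> list[int]:
    """Slots where two copies differ; the others look identical to both leakers."""
    return [i for i in range(len(SLOTS)) if ((id_a ^ id_b) >> i) & 1]


if __name__ == "__main__":
    # Every ID survives a round trip through plain text.
    for rid in range(2 ** len(SLOTS)):
        assert extract(embed(rid)) == rid
    print(embed(5))
    print(embed(7))
    print(slots_visible_to_colluders(5, 7))
```

In this toy version, two leakers who diff their copies only find the slots where their IDs differ. The slots they share stay invisible to them and still narrow down the set of possible sources, which is one reason "comparing notes" is not a reliable cleaning step.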

majkinetor today at 10:50 AM

Is there any benefit to this tool over opening docs in Windows Sandbox or a VM with networking disabled? The conversion can easily be done with a simple tool that screenshots each page within the sandbox (for example, with a few lines of AHK script).

jevinskie today at 12:57 AM

Seems like a similar but less elegant solution compared to parsing and normalizing to a “safe” subset, rather than just blasting it to pixels.

https://github.com/caradoc-org/caradoc

http://spw16.langsec.org/slides/guillaume-endignoux-slides.p...

chaps today at 12:29 AM

Heh, I've seen this a bunch of times and it's of interest to me, but honestly? It's sooooo limiting by being an interface without a complementary command line tool. Like, I'd like to put this into some workflows but it doesn't really make sense to without using something like pyautogui. But maybe I'm missing something hidden in the documentation.

gu009 today at 1:10 AM

A handy side use for this is compressing PDFs.

For some reason, printing 1 page of an Excel or Word document to a PDF often gets up to around 4MB in size. Passing it through this compresses it quite well.

Just ran a quick test:

- 1-page Excel PDF export: 3.7MB

- Processing with Dangerzone (OCR enabled): 131KB
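For scale, the two sizes above work out to roughly a 28x reduction. A quick check of the arithmetic, assuming decimal units (1MB = 1000KB), which is an assumption since the comment does not say which convention was used:

```python
# Sizes taken from the test above; decimal units are an assumption.
original_kb = 3.7 * 1000  # 1-page Excel PDF export
converted_kb = 131        # after processing with OCR enabled

ratio = original_kb / converted_kb
reduction = 1 - converted_kb / original_kb

print(f"about {ratio:.0f}x smaller, {reduction:.1%} reduction")
```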

dfajgljsldkjag yesterday at 11:00 PM

I personally just upload them to Google Drive. It would be a serious pwn if someone could still pull off a compromise through Google Drive.

PaulDavisThe1st today at 1:00 AM

Is there some reason why just viewing the PDF with a FLOSS, limited PDF viewer (e.g. atril) would not accomplish the same level of safety? What can a "dangerous PDF" do inside atril?

robertk today at 2:33 AM

Why not just open it inside a fully sandboxed Docker container and print it to a static image output there?
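A rough sketch of what that could look like, assuming an image that ships poppler's `pdftoppm`. The image name `example/pdf-rasterizer:latest`, the mount points, and the function name are hypothetical. The script only builds and prints the command, so nothing runs until you execute it yourself.

```python
import shlex
from pathlib import Path

# Hypothetical image name: any image that ships poppler-utils (pdftoppm) would do.
IMAGE = "example/pdf-rasterizer:latest"


def build_docker_command(pdf_path: str, out_dir: str, dpi: int = 150) -> list[str]:
    """Build a `docker run` argv that rasterizes a PDF to PNG pages."""
    pdf = Path(pdf_path).resolve()
    out = Path(out_dir).resolve()
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--read-only",                          # read-only root filesystem
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "-v", f"{pdf}:/in/doc.pdf:ro",          # untrusted input, read-only
        "-v", f"{out}:/out",                    # only writable location
        IMAGE,
        "pdftoppm", "-png", "-r", str(dpi), "/in/doc.pdf", "/out/page",
    ]


if __name__ == "__main__":
    print(shlex.join(build_docker_command("suspicious.pdf", "pages")))
```

This only narrows what a malicious document can reach. A container shares the host kernel, so it is a weaker boundary than a VM, and the resulting PNGs still have to be viewed or reassembled into a PDF by something on the host.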

mike_d today at 12:23 AM

Shameless self promotion: preview.ninja is a site I built that does this and supports 300+ file formats. I'm currently weekend coding version 2.0 which will support 500+ formats and allow direct data extraction in addition to safe viewing.

It is a passion project and will always be free because commercial CDR[1] solutions are insanely expensive and everyone should have access to the tools to compute securely.

1. https://en.wikipedia.org/wiki/Content_Disarm_%26_Reconstruct...

anthk today at 8:44 AM

Why not DJVU with a high DPI instead of a PDF?

snowmobile yesterday at 11:57 PM

It's a neat program, but what's the use for JPGs and PNGs?

rurban today at 4:51 AM

Now teach this to HR departments. They still ask for Word docs or PDFs from untrusted people. ASCII text is frowned upon. Go figure.

It's the employment-readiness check of whether you can trust a company.

nullc today at 6:10 AM

To review documents received from a hostile and dishonest actor in litigation, I used disposable VMs in Qubes on a computer with a one-way (in only) network connection[1], while running the tools (e.g. evince) in valgrind and with another terminal watching for attempted network traffic. That approach did detect attempted network callbacks from some documents, but I don't think any were PDFs.

This would have been useful, but I think I would have layered it on top of other isolation.

([1] constructed from a media converter pair, a fiber splitter to bring the link up on the tx side, and some off the shelf software for multicast file distribution).
