Hacker News

mlissner, yesterday at 11:03 PM

Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

The analysis was fun. We used S3 batch jobs to analyze millions of documents in a matter of minutes, but we haven’t done the hard part of looking at the results and reporting them out. One day.


Replies

thangalin, today at 12:02 AM

https://www.argeliuslabs.com/deep-research-on-pdf-redaction-...

> Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).

Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?
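To make the quoted width leak concrete, here is a minimal sketch of how a redaction box's width could narrow down candidate words. It is not a description of what X-ray does. The advance widths in `ADVANCE`, the word list, and the helper names `text_width` and `candidates_for_box` are all hypothetical, chosen only for illustration; a real attempt would read metrics from the actual font used in the document.

```python
# Hypothetical per-glyph advance widths in font units (1000 units per em).
# These numbers are made up for illustration and do not come from a real font.
ADVANCE = {
    "a": 500, "e": 480, "i": 250, "l": 240, "m": 800,
    "n": 520, "o": 530, "s": 420, "t": 300,
}


def text_width(word, font_size_pt):
    """Width of `word` in points, ignoring kerning."""
    units = sum(ADVANCE[ch] for ch in word)
    return units * font_size_pt / 1000


def candidates_for_box(box_width_pt, words, font_size_pt, tolerance_pt=0.5):
    """Keep only the words whose rendered width fits the redaction box."""
    return [
        w for w in words
        if abs(text_width(w, font_size_pt) - box_width_pt) <= tolerance_pt
    ]


# Five candidate words, all five letters long (hypothetical list).
names = ["mason", "nolan", "allen", "stone", "eaton"]
matches = candidates_for_box(23.2, names, font_size_pt=10)
print(matches)
```

All five candidates have the same number of letters, yet in a proportional font only some of them fit the measured box, so the width carries more information than the character count alone.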

hsbauauvhabzb, today at 5:44 AM

Presumably, with the font's kerning and a pixel-perfect recreation of the source, it would be possible to guess the word very accurately.

The strings oioioi and oooiii will have different widths in some fonts, because kerning is applied per pair of adjacent characters, so the order of the characters matters and not just which ones appear.
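A small sketch of why the two strings can differ in width: with pair kerning, each adjacent pair of glyphs contributes its own adjustment, so the total depends on character order. The `ADVANCE` and `KERN` values below are hypothetical and not taken from any real font, and `kerned_width` is an illustrative helper.

```python
# Hypothetical advance widths and pair-kerning adjustments, in font units.
# Real fonts store comparable data in their kerning tables; these values are invented.
ADVANCE = {"o": 530, "i": 250}
KERN = {
    ("o", "o"): -10,
    ("i", "i"): 15,
    ("o", "i"): -5,
    ("i", "o"): -5,
}


def plain_width(text):
    """Sum of advance widths only; independent of character order."""
    return sum(ADVANCE[ch] for ch in text)


def kerned_width(text):
    """Advance widths plus one kerning adjustment per adjacent pair."""
    pairs = zip(text, text[1:])
    return plain_width(text) + sum(KERN.get(pair, 0) for pair in pairs)


for s in ("oioioi", "oooiii"):
    print(s, plain_width(s), kerned_width(s))
```

Both strings contain three of each letter, so their unkerned widths match. Once the pair adjustments are added the totals diverge, because oioioi is made up of oi and io pairs while oooiii is mostly oo and ii pairs.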
