Agreed entirely.
I want to add some technical details, since this is a peeve I've also had for many years now:
The standard for this is Microsoft's PhotoDNA, a paid and gatekept software-as-a-service which maintains a database of "perceptual hashes." (Unlike cryptographic hashes, these are robust against common modifications).
It'd be very simple for Microsoft to release a small library which (1) wraps the perceptual hash algorithm and (2) provides a Bloom filter (or a newer, similar structure like an XOR filter) so developers can check set membership against it.
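To make that concrete, here's a minimal sketch of what the client side of such a library could look like, assuming a plain Bloom filter with double hashing. The filter parameters and placeholder hash values are mine, and the proprietary perceptual hash itself is deliberately left out:

    # Minimal sketch of the developer-facing surface described above.
    # The perceptual hash values below are placeholders, not real hashes.
    import hashlib
    import math


    class BloomFilter:
        """Plain Bloom filter sized from an expected item count and target FP rate."""

        def __init__(self, expected_items: int, false_positive_rate: float = 1e-6):
            # Standard sizing formulas: m bits, k hash functions.
            self.m = math.ceil(-expected_items * math.log(false_positive_rate) / (math.log(2) ** 2))
            self.k = max(1, round((self.m / expected_items) * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item: bytes):
            # Kirsch-Mitzenmacher double hashing: derive k bit indices from one SHA-256.
            digest = hashlib.sha256(item).digest()
            h1 = int.from_bytes(digest[:16], "big")
            h2 = int.from_bytes(digest[16:], "big") or 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item: bytes) -> None:
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


    # Usage sketch: ship only the filter, never the hash database itself.
    known_hashes = [bytes.fromhex("a3f1"), bytes.fromhex("9b42")]   # placeholder perceptual hashes
    f = BloomFilter(expected_items=len(known_hashes))
    for h in known_hashes:
        f.add(h)

    candidate = bytes.fromhex("a3f1")        # in real use: the perceptual hash of an upload
    if candidate in f:
        print("possible match, escalate for review")   # false positives possible; misses are not

The point of the filter is exactly the asymmetry: it can say "maybe in the set" or "definitely not in the set," but it cannot be enumerated back into the hashes it was built from.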
There are some concerns that an individual perceptual hash can be reversed to create a legible image, so I wouldn't expect or want that hash database to be widely available. But you almost certainly can't do the same with something like a bloom filter.
If Microsoft wanted to keep both the hash algorithm and even an XOR filter of the hash database proprietary, that's understandable, and it would still be workable, because we also have mature implementations of zero-knowledge set-membership proofs.
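For reference, the schemes I'm aware of typically build the ZK layer on top of a committed Merkle tree: prove knowledge of a path to a published root without revealing the leaf. The ZK circuit itself is too much for a comment, but the membership check it wraps is small. This is an illustrative sketch, not any vendor's format:

    # Merkle-membership building block that ZK set-membership proofs are
    # typically layered on. Not zero-knowledge by itself: this prover
    # reveals the leaf. Names and layout are illustrative only.
    import hashlib


    def _h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()


    def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
        # Level 0 holds hashed leaves; each level above hashes pairs of children.
        levels = [[_h(leaf) for leaf in leaves]]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            if len(prev) % 2:                  # duplicate the last node on odd-sized levels
                prev = prev + [prev[-1]]
            levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
        return levels


    def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
        # Collect the sibling hash at each level, plus whether it sits on the right.
        path = []
        for level in levels[:-1]:
            nodes = level if len(level) % 2 == 0 else level + [level[-1]]
            sibling = index ^ 1
            path.append((nodes[sibling], sibling > index))
            index //= 2
        return path


    def verify(root: bytes, leaf: bytes, path: list[tuple[bytes, bool]]) -> bool:
        node = _h(leaf)
        for sibling, sibling_is_right in path:
            node = _h(node + sibling) if sibling_is_right else _h(sibling + node)
        return node == root


    # Usage sketch: the database holder publishes only `root` and hands out
    # per-item proofs; a verifier never needs the full hash set.
    hashes = [b"hash-a", b"hash-b", b"hash-c"]       # placeholder perceptual hashes
    tree = build_tree(hashes)
    root = tree[-1][0]
    assert verify(root, b"hash-b", prove(tree, 1))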
The only reason I could see for keeping it locked up is security by obscurity: making it infeasible for people to find adversarial ways to defeat the proprietary secret sauce in their perceptual hash algorithm. But that means giving up opportunities to improve the algorithm, while excluding so many ways it could be used to combat CSAM.
> There are some concerns that an individual perceptual hash can be reversed to create a legible image,
Yeah no. Those hashes aren't big enough to encode any real image, and definitely not an image that would actually be either "useful" to yer basic pedo, or recognizable as a particular person. Maybe they could produce something that a diffusion model could refine back into something resembling the original... if the model had already been trained on a ton of similar material.
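For a rough sense of scale (the 144-byte figure is the commonly cited PhotoDNA hash length, so take it as an assumption):

    # Back-of-envelope information budget. The 144-byte hash size is an
    # assumption based on commonly cited PhotoDNA descriptions.
    hash_bits = 144 * 8                 # 1,152 bits in the hash
    tiny_image_bits = 64 * 64 * 8       # 32,768 bits for even a 64x64 grayscale image
    print(tiny_image_bits / hash_bits)  # ~28x: the hash can't carry the image's information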
> If Microsoft wanted to keep both the hash algorithm and even an XOR filter of the hash database proprietary
That algorithm leaked years ago. Third party code generates exactly the same hashes on the same input. There are open-literature publications on creating collisions (which can be totally innocent images). They have no actual secrets left.
So, given your high technical acumen, why would you expose yourself to goggle's previously demonstrated willingness to delete your career's and your life's archive of communications?
Stop using goggle!
It's as simple, and as necessary, as that.
No technically astute person should use ANY goggle services at this point...
PhotoDNA is not paid, but it is gatekept.
I’m not a CSAM-detection expert, but after my suspension I ended up doing a lot of research into how these systems work and where they fail. And one important point: Google isn’t just using PhotoDNA-style perceptual hashing.
They’re also running AI-based classifiers on Drive content, and that second layer is far more opaque and far more prone to false positives.
That’s how you get situations like mine: ~700 problematic images in a ~700k-image dataset triggered Google to delete 130,000+ completely unrelated files and shut down my entire developer ecosystem. Hash-matching is predictable.
AI classification is not. And Google's hybrid pipeline:

- isn't independently vetted
- isn't externally audited
- isn't reproducible
- has no recourse when it's wrong
In practice, it’s a black box that can erase an innocent researcher or indie dev overnight. I wrote about this after experiencing it firsthand — how poisoned datasets + opaque AI detection create “weaponized false positives”: https://medium.com/@russoatlarge_93541/weaponized-false-posi...
I agree with the point above: if open, developer-accessible perceptual hashing tools existed — even via bloom filters or ZK membership proofs — this entire class of collateral damage wouldn’t happen.
Instead, Big Tech keeps the detection tools proprietary while outsourcing the liability to everyone else. If their systems are wrong, we pay the cost — not them.