I'm the developer who actually got banned because of this dataset. I used NudeNet offline to benchmark my on-device NSFW app Punge — nothing uploaded, nothing shared.
Your dataset wasn’t the problem. The real problem is that independent developers have zero access to the tools needed to detect CSAM, while Big Tech keeps those capabilities to itself.
Meanwhile, Google and other giants openly use massive datasets like LAION-5B — which also contained CSAM — without facing any consequences at all. Google even used early LAION data to train one of its own models. Nobody bans Google. But when I touched NudeNet for legitimate testing, Google deleted 130,000+ files from my account, even though only ~700 images out of ~700,000 were actually problematic. That's not safety — that's a detection system wildly over-firing, with no independent oversight and no accountability.
Big Tech designed a world where they alone have the scanning tools and the immunity when those tools fail. Everyone else gets punished for their mistakes. So yes — your dataset has done good. ANY dataset is subject to this. There need to be tools and a process available to everyone.
But let’s be honest about where the harm came from: a system rigged so only Big Tech can safely build or host datasets, while indie developers get wiped out by the exact same automated systems Big Tech exempts itself from.
Agreed entirely.
I want to add some technical details, since this is a peeve I've also had for many years now:
The standard for this is Microsoft's PhotoDNA, a paid and gatekept software-as-a-service which maintains a database of "perceptual hashes." (Unlike cryptographic hashes, these are robust against common modifications).
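To make the "perceptual hash" idea concrete, here's a minimal average-hash (aHash) sketch in Python using Pillow. This is not PhotoDNA (that algorithm is proprietary) — it just illustrates why such hashes survive resizing or recompression, where a cryptographic hash would change completely.

    # Minimal average-hash sketch (NOT PhotoDNA; that algorithm is proprietary).
    # Visually similar images produce hashes with a small Hamming distance.
    from PIL import Image

    def average_hash(path, hash_size=8):
        # Downscale to a tiny grayscale thumbnail; this discards exactly the
        # detail that resizing or recompression would change anyway.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        # One bit per pixel: brighter than the average or not.
        bits = "".join("1" if p > avg else "0" for p in pixels)
        return int(bits, 2)

    def hamming_distance(h1, h2):
        return bin(h1 ^ h2).count("1")

    # A small distance (say <= 5 of 64 bits) suggests "same image, minor edits":
    # hamming_distance(average_hash("a.jpg"), average_hash("a_resized.jpg"))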
It'd be very simple for Microsoft to release a small library that (1) wraps the perceptual hash algorithm and (2) provides a bloom filter (or a newer, similar structure, like an XOR filter) so developers can check set membership against it.
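A bloom filter distributed this way is tiny and only answers "possibly in the set" / "definitely not in the set," which is all a client-side pre-check needs. A minimal stdlib-only sketch of that structure (the class name, sizing parameters, and the salted-SHA-256 trick for deriving bit positions are mine, purely for illustration):

    # Minimal Bloom filter sketch: check membership without shipping the hashes.
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=7):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: bytes):
            # Derive k bit positions from the item via salted SHA-256.
            for i in range(self.k):
                digest = hashlib.sha256(bytes([i]) + item).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item: bytes):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item: bytes) -> bool:
            # False -> definitely not in the set; True -> possibly in the set.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # The filter reveals nothing legible about the underlying images, so it
    # could be published even where the raw hash list can't be.
    bf = BloomFilter()
    bf.add((1234567890).to_bytes(8, "big"))                    # a perceptual hash value
    print(bf.might_contain((1234567890).to_bytes(8, "big")))   # True
    print(bf.might_contain((42).to_bytes(8, "big")))           # almost certainly False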
There are some concerns that an individual perceptual hash can be reversed to create a legible image, so I wouldn't expect or want that hash database to be widely available. But you almost certainly can't do the same with something like a bloom filter.
If Microsoft wanted to keep both the hash algorithm and even an XOR filter of the hash database proprietary, that's understandable. Even then it would still be workable, because we also have mature implementations of zero-knowledge set-membership proofs.
The only justification I can see is security by obscurity: keeping everything secret makes it harder for people to find adversarial ways to defeat the proprietary secret sauce in their perceptual hash algorithm. But that means giving up opportunities to improve the algorithm, while excluding so many ways it could be used to combat CSAM.