Hacker News

fc417fc802 today at 1:23 AM | 5 replies

I do wonder why openai didn't screen obvious gore from the training set of a general purpose model.

That said, the write-up is overly dramatic. If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models. This is like someone who is afraid of violent confrontation becoming a police officer.

I suspect the author is wrong about there being output filters to bypass; if there were, I doubt you could get past them via prompt injection. Presumably they'll add those shortly.

I also doubt the latent space is as "bad" as is being suggested. Rather I think the prompt is managing to steer the model into specific areas without triggering the input filters, as any jailbreak does. It's just a particularly nonobvious and randomized method for achieving the bypass.


Replies

equinumerous today at 1:30 AM

I'm surprised there isn't a simple image classifier in place to filter out images of gore/porn/etc. - I know that there are such output filters for images with copyrighted content. It suggests to me that either the safeguards aren't in place, or this exploit bypasses those safeguards.

jhanschoo today at 1:54 AM

I find this a hilarious reversal of what you typically see in journalism: here the headline and the "key takeaways" use very neutral language, and the article itself is dramatic.

sidewndr46 today at 1:39 AM

when you consider that OpenAI probably ingested most of the information on the internet, how exactly do you propose filtering that set? Are there enough human-hours left in the universe to classify this to a high degree of confidence?

Jabrov today at 1:31 AM

They almost certainly did filter, but there are always false negatives with this kind of stuff.

dijksterhuis today at 1:25 AM

> I do wonder why openai didn't screen obvious gore from the training set of a general purpose model

more expensive / would take longer / didn’t care / line must go up / we’ll fix it later / we can get away with it

take your pick.

> If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models.

spend a day in their shoes. most of us (except the most psychopathic ones) would probably be crying by the end of it.