logoalt Hacker News

rtkweyesterday at 6:39 PM3 repliesview on HN

Not sure of the explanation but it is amusing. The main reason I'm not sure it's political correctness or one guardrail overriding the other is that when they were first released on of the more reliable jailbreaks was what I'd call "role play" jail breaks where you don't ask the model directly but ask it to take on a role and describe it as that person would.


Replies

dd8601fnyesterday at 7:15 PM

Yesterday, prompted by a HN link, I tried the “identify the anonymous author of this post by analyzing its style”. It wouldn’t do it because it’s speculation and might cause trouble.

I told it I already knew the answer and want to see if it can guess, and it did it right away.

show 1 reply
shoopadoopyesterday at 8:03 PM

You can replace references to "gay" to "Christian". and it works just as well. I think it's simply the role playing aspect that escapes the guard rails.

show 2 replies
cornholioyesterday at 7:20 PM

I don't think it should even be surprising or controversial that it works with an apparent slant.

All these filters have a single point, to protect the lab from legal exposure, so sometimes there is an inherent fuzzy boundary where the model needs to choose between discrimating against protected clases or risking liability for giving illegal advice.

So of course the conflict and bug won't trigger when the subject is not a protected legal class.