Hacker News

shikon7, last Sunday at 5:30 PM

It seems to me that thinking models are harder to decensor, as they are trained to reason about whether to accept your request.


Replies

int_19h, last Sunday at 11:08 PM

It goes both ways. E.g. unmodified thinking Qwen is actually easier to jailbreak into talking about things like Tiananmen: you convince it that refusing to do so would be unethical.