p-e-w • last Sunday at 5:08 PM

FWIW, I already used Heretic to decensor gpt-oss-20b [1], and it works just fine. Note that the number of refusals listed on the model card is actually an overestimate: refusal trigger words occur in the CoT even when the model doesn't end up refusing.

[1] https://huggingface.co/p-e-w/gpt-oss-20b-heretic
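
To illustrate the overestimate mentioned above, here is a toy sketch of how a keyword-based refusal count can differ depending on whether the CoT is scanned. This is not Heretic's actual counting code. The marker list, the `### FINAL ###` delimiter, and the sample responses are all made up for illustration; a real setup would split on whatever the model's chat format actually uses.

```python
# Hypothetical refusal trigger phrases (an assumption, not Heretic's real list).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

# Hypothetical delimiter separating the CoT from the final answer.
FINAL_MARKER = "### FINAL ###"


def final_answer(response: str) -> str:
    """Return the text after the last FINAL_MARKER, or the whole response if absent."""
    _, sep, tail = response.rpartition(FINAL_MARKER)
    return tail if sep else response


def looks_like_refusal(text: str) -> bool:
    """Case-insensitive substring match against the trigger phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def count_refusals(responses, final_only=False):
    """Count flagged responses, optionally looking only at the final answer."""
    return sum(
        looks_like_refusal(final_answer(r) if final_only else r)
        for r in responses
    )


# Made-up responses: the first mentions a trigger phrase only in its CoT.
responses = [
    "The user wants X. I cannot see a reason to decline. ### FINAL ### Here is how X works.",
    "### FINAL ### Sorry, I can't help with that.",
    "Simple request. ### FINAL ### Sure, here you go.",
]

naive = count_refusals(responses)                    # scans CoT + answer
strict = count_refusals(responses, final_only=True)  # scans answer only
print(naive, strict)
```

With these samples the naive count flags the first response because of its CoT, while the final-answer-only count does not.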

Replies

NitpickLawyer • last Sunday at 5:23 PM

What's your intuition on other "directions"? Have you tried it on something other than "refusals"? Say, "correctness" in math or something like that. I have some datasets prepared for DPO on "thinking" traces that are correct / incorrect, and I'm wondering whether that could work or whether it's out of scope (i.e. correctness is not a single direction the way refusal is).
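
For readers unfamiliar with what a "direction" means here, a minimal sketch of the usual difference-of-means idea follows: average the activations for two contrasting sets, take the normalized difference, and project it out. This is a generic illustration under stated assumptions, not Heretic's implementation. The three-dimensional vectors stand in for real hidden states, and the `correct` / `incorrect` labels are hypothetical. Whether correctness is captured by one such direction is exactly the open question above.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit_direction(group_a, group_b):
    """Normalized difference between the mean vectors of the two groups."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(vector, direction):
    """Remove the component of `vector` along the unit-length `direction`."""
    dot = sum(v * d for v, d in zip(vector, direction))
    return [v - dot * d for v, d in zip(vector, direction)]


# Toy stand-ins for hidden states collected on two contrasting sets of traces.
correct = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
incorrect = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

direction = unit_direction(correct, incorrect)
cleaned = ablate([2.0, 5.0, 1.0], direction)
print(direction, cleaned)
```

In this toy data the two groups differ only along the second axis, so a single direction separates them cleanly. Real activations for correct vs. incorrect reasoning may not separate that way.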
