logoalt Hacker News

tempaccsoz5today at 3:12 AM0 repliesview on HN

Even if you don't fully retrain, you could get what's likely a pretty good safety improvement. Honestly, I'm a bit surprised the main AI labs aren't doing this

You could just include an extra single bit with each token that represents trusted or untrusted. Add an extra RL pass to enforce it.