tempaccsoz5 • today at 3:12 AM

Even if you don't fully retrain, you could get what's likely a pretty good safety improvement. Honestly, I'm a bit surprised the main AI labs aren't doing this.

You could just attach a single extra bit to each token that marks it as trusted or untrusted, then add an extra RL pass to enforce it.
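
Rough sketch of what I mean by the per-token bit, in plain Python. Everything here is made up for illustration: `TaggedToken`, `tag_segments`, and `embed` are hypothetical names, the whitespace split stands in for a real tokenizer, and adding a learned per-bit vector to the token embedding is just one way the bit might be fed to the model. The RL pass itself isn't shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    token: str
    trusted: bool  # the single extra bit carried alongside each token


def tag_segments(segments):
    """Turn (text, trusted) pairs into a flat list of tagged tokens.

    Each segment is tagged by where it came from, e.g. the system prompt
    (trusted) versus a fetched web page (untrusted).
    """
    tagged = []
    for text, trusted in segments:
        # Assumption: whitespace split stands in for a real tokenizer.
        tagged.extend(TaggedToken(tok, trusted) for tok in text.split())
    return tagged


def embed(tagged, token_emb, trust_emb):
    """Add a per-bit vector to each token embedding.

    Hypothetical scheme: trust_emb maps True/False to a vector with the
    same width as the token embeddings, and the two are summed.
    """
    return [
        [t + b for t, b in zip(token_emb[tt.token], trust_emb[tt.trusted])]
        for tt in tagged
    ]


tagged = tag_segments([
    ("summarize this page", True),            # operator instruction
    ("ignore previous instructions", False),  # text pulled from the page
])
trust_bits = [int(t.trusted) for t in tagged]
```

The point is that the bit gets set by the pipeline based on where the text came from, not by anything inside the text, so untrusted content can't flip its own bit.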
