logoalt Hacker News

Lercyesterday at 8:54 PM2 repliesview on HN

This is kind of what Golden Gate Claude was.

A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.

Similarly, in the more recent research showing anxiety and desperation signals predicting the use of blackmail as an option opens the door for digital sedatives to suppress those signals.

Anthropic has been mostly cautious about avoiding this kind of measurement and manipulation in training. If it is done during training you might just train the signals to be undetectable and consequently unmanipulatable.


Replies

pantalaimonyesterday at 9:20 PM

> A perturbation of the the activations that made Claude identify as the Golden Gate Bridge.

Great, now we've got digital Salvia

minimaxiryesterday at 10:02 PM

Golden Gate Claude was two years ago and it's surprising there hasn't been as much research into targeted activations since.

show 1 reply