logoalt Hacker News

MPSimmonstoday at 1:35 PM1 replyview on HN

I think it's a side effect of the Transformer architecture. The worldview where all input is equally trusted, and there's no concept of "the other", makes it hard to build effective guardrails where some input is trusted and other input is not trusted.


Replies

steveBK123today at 4:45 PM

It seems like real robust guardrails would require some sort of "world model" or some other word to describe - AI that understands intent.

Transformers are (to grossly summarize & I don't mean this as an insult) like auto-complete on steroids. So we have cat&mouse guardrails the way swear word filters and Chinese censorship work. People come up with increasingly complex miss-spelling, euphemisms & indirections to get around the filters like saying May 35th.

I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.

Even this might not work because for example you could ensure no bomb-related data is in the training data, but there's lots of chemistry data adjacent that if probed the right way would allow the LLM to synthesize the answer. Various forms of "how do I store X,Y,Z safely such that nothing bad happens" prompts probably get you on the way.

show 1 reply