> If you make an LLM more safe, you are going to shift the weight for defensive actions as well.
>
> There’s no physical way to assign weights to have one and not the other.
Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?
If no, how does a cybersec firm train its employees?
If yes, how can you make the bold claim that it's possible for a human to differentiate between the two cases using incoming text as their basis for judgement, but IMpossible for an LLM to be configured to do the same? Note that if some hypothetical, completely deterministic LLM that always rejects "attack" requests and accepts "defense" ones can exist, then the claim that it's impossible is false. Producing nondeterministic output for a given input is not a hard requirement for language models.
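To make the determinism point concrete, here is a toy sketch of the difference between greedy (argmax) decoding and sampling. The logits, the action labels, and the helper names are all made up for illustration. This is not any real model's API, and it ignores hardware-level floating-point effects that can still introduce variation in practice.

```python
import math
import random

# Hypothetical next-token scores for one fixed input; values are invented.
TOY_LOGITS = [2.0, 0.5, -1.0]
TOY_LABELS = ["refuse", "comply", "ask_for_context"]  # illustrative only


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Argmax decoding: the same logits always give the same index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_pick(logits, rng, temperature=1.0):
    """Sampling: the index is drawn at random from the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    greedy = {TOY_LABELS[greedy_pick(TOY_LOGITS)] for _ in range(1000)}
    print("greedy outcomes:", greedy)  # a single label, every time

    rng = random.Random()
    sampled = {TOY_LABELS[sampled_pick(TOY_LOGITS, rng)] for _ in range(1000)}
    print("sampled outcomes:", sampled)  # can contain several labels
```

The randomness lives entirely in the decoding step, not in the weights: swap the sampler for an argmax and the same input maps to the same output.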