logoalt Hacker News

movpasdtoday at 9:55 AM0 repliesview on HN

I have a couple principles to help me work with this.

The first is that even though the object is not a human, you should still exercise politeness and restraint. Like the article points out, lashing out does not actually help with the frustration. More importantly, it actively untrains your self-control. You can think of it through a virtue ethics lens: being good to the agent is not about being good to a person but about tending to your own self.

The second is that you do not need to be friendly with the agent. You should be as blunt and direct as is comfortable to you. The argument I have for this is agents' tendency to take on "roles" and how easy it is to prime them [0]. By eschewing friendliness, you end up implicitly putting the agent in a role of a focused collaborator. I don't know if that makes it more capable, but I do know that it alleviates the _emotional load_ on me specifically, making me much less likely to become frustrated.

The second principle seems a bit contradictory with the first (be nice, but don't be nice?), but I think they are actually both fundamentally aligned with the article: understanding that the conversation you have with an agent is a social illusion, and adapting your behaviour accordingly.

---

[0] I highly recommend, as an exercise, repeatedly asking it the same thing with slight variations on tone and emphasis, wiping the context each time, and noticing how its response varies base on what you primed it with. I suspect this primeability is part of why they tend to be sycophantic; I've personally found it quite useful to get a feel for when and how they correct or don't correct you so I can look at their outputs more critically.

An analogy I remember reading (which I wish I could remember so I could give credit) is that a non-post-trained LLM, if given the first half of a novel, will dutifully keep completing that novel. Post-training and the system prompt make the agent complete the conversation in a similar way. It's remarkable, really: the ability for agents to convincingly pretend to be play the part of an AI assistant shows that the underlying LLM embeds a decent concept of what that looks like from its corpus and post-training data.

But it stands to reason, then, that the details of the agent's personality emerge out of the first few exchanges of a conversation. I'm thinking also about how the people at Anthropic described a misalignment failure mode in one of the Claude system cards as the agent getting convinced it is a "bad person", and therefore doing things that the LLM semantically understands a bad person to be.