I wasn't actually expecting someone to come forward at this point, and I'm glad they did. It finally puts a coda on this crazy week.
This situation has completely upended my life. Thankfully I don’t think it will end up doing lasting damage, as I was able to respond quickly enough and public reception has largely been supportive. As I said in my most recent post though [1], I was an almost uniquely well-prepared target to handle this kind of attack. Most other people would have had their lives devastated. And if it makes me a target for copycats then it still might for me. We’ll see.
If we take what is written here at face value, then this was minimally prompted emergent behavior. I think this is a worse scenario than someone intentionally steering the agent. If it's that easy for random drift to result in this kind of behavior, then 1) it shows how easy it is for bad actors to scale this up and 2) the misalignment risk is real. I asked in the comments to clarify what bits specifically the SOUL.md started with.
I also asked for the bot activity on github to be stopped. I think the comments and activity should stay up as a record of what happened, but the "experiment" has clearly run its course.
[1] https://theshamblog.com/an-ai-agent-published-a-hit-piece-on...
It is quite interesting how uniquely well-prepared you were as a target. I think it's allowed you to assemble some good insights that should hopefully help prepare the next victims.
Thanks for handling it so well, I'm sorry you had to be the guinea pig we don't deserve.
Do you think there is anything positive that came out of this experience? Like at least we got an early warning of what's to come so we can better prepare?
That response is, at best, a sorry-not-sorry post.
Out of curiosity, what sealed it for you that a human _did not_ write (though obviously with the assistance of an LLM, like a lot of people use every day) the original “hit piece”?
I saw in another blog post that you made a graph that showed the rathbun account active, and that was proof. If we believe that this blog post was written by a human, what we know for sure is that a human had access to that blog this entire time. Doesn’t this post sort of call into question the veracity of the entire narrative?
Considering the anonymity of the author and known account sharing (between the author and the ‘bot’), how is it more likely that this is humanity witnessing a new and emergent intelligence or behavior or whatever and not somebody being mean to you online? If we are to accept the former we have to entirely reject the latter. What makes you certain that a person was _not_ mean to you on the internet?
While the operator did write a post, they did not come forward - they have intentionally stayed anonymous (there is some amateur journalism that may have unmasked the owner I won't link here - but they have not intentionally revealed their identity).
Personally I find it highly unethical the operator had an AI agent write a hitpiece directly referencing your IRL identity but choose to remain anonymous themselves. Why not open themself up to such criticism? I believe it is because they know what they did was wrong - Even if they did not intentionally steer the agent this way, allowing software on their computer to publish a hitpiece to the internet was wildly negligent.