I also have a 100% success rate jail breaking them by breaking the work down into small pieces and stripping all security related language. Smaller tasks, test engineering and normal programming language. Fable found a few bugs in my harness for me before they pulled it. I was testing it vs ChatGPT, Gemini, and Opus. It was doing well at bug hunting.
>by breaking the work down into small pieces and stripping all security related language
Compartmentalization in practice, nice. It's also very hard to do anything about because the agents that have been divided rarely realize they are working on something larger, hence why militaries and businesses with security risks commonly do this with their employees.
Me as well. I was struggling to make a pixel bot for, erm, research! It did not like this and kept insisting I was breaking some arcane TOS rule. I started just breaking the tasks down, something benign. Kept iterating and it could never get a holistic grasp of the task at hand.
I took an assembler class in college. Before that, I'd been messing around with Core Wars and working my way through Peter Norton's book on assembly. So when an assignment came up, I used self modifying code to solve it. It was the shortest solution, it ran perfectly, and I submitted it.
The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."
The next class, she walked in and announced that self modifying code was no longer allowed on any assignment. Then she handed back the graded work and I'd gotten a 100.
Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core Wars, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5 same result. Every exchange tripped the fallback to 4.8. I wish I'd recorded it.
I also tried a variety of direct requests in a fresh directory "build simple self modifying assembler code" or just "self modifying assembler" and it would switch to 4.8 immediately. It was almost laughable.
There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?
Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship even when it's broken. Code, hardware, the whole kit all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.
This is the same way you get people to do bad stuff as well. Make the task small enough so that the moral curvature of the topology is flat and even though they know it is a not-good part of a larger bad part they just shrug. Look at all the wonderful people we know who are working at Amazon and Meta? Corporatism has already jailbroken society.