One interesting takeaway is the low score of Anthropic models on this benchmark. It’s not because of capability; it’s because Anthropic’s guardrails prevented it from solving the problem.
I’ve noticed that with each model release, Anthropic constrains the model more security-wise. Its propensity to refuse legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump into a refusal on some action I want it to do, I can usually work around it, but I suspect the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there.
Eventually these models will significantly suffer from overfitting to the lowest common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing.
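For anyone unfamiliar with the kind of setup described above, here is a minimal sketch of in-flight secret swapping. The placeholder syntax `{{SECRET:NAME}}`, the `vault` dictionary, and both function names are assumptions made up for illustration; the comment does not say how its setup actually works.

```python
import re

# Hypothetical placeholder syntax; the thread does not specify one.
PLACEHOLDER = re.compile(r"\{\{SECRET:([A-Z0-9_]+)\}\}")


def inject_secrets(outbound: str, vault: dict[str, str]) -> str:
    """Replace placeholders with real values just before a request leaves."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in vault:
            raise KeyError(f"unknown secret placeholder: {name}")
        return vault[name]

    return PLACEHOLDER.sub(lookup, outbound)


def redact_secrets(inbound: str, vault: dict[str, str]) -> str:
    """Swap real values back to placeholders before text reaches the model."""
    # Longest values first, so a secret that contains another is replaced whole.
    for name, value in sorted(vault.items(), key=lambda kv: -len(kv[1])):
        if value:  # skip empty values; replacing "" would corrupt the text
            inbound = inbound.replace(value, "{{SECRET:" + name + "}}")
    return inbound
```

The model only ever reads and writes the placeholder, so in principle there is no real credential in its context for it to object to. Whether a given model still refuses on seeing the placeholder is exactly the complaint above.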
Yeah, it has been infuriating. Requests that Claude has refused me:
- What are popular free streaming sites used in China?
- How do I bypass the safety mechanism on my food processor (it’s broken)
- What are nerve agents and how do they work (for a layman)?
- Help me decompile some code
- Help me make a design system similar to XYZ
- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)
In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying.
My org now sends some portion of our requests to non-Anthropic models because refusals have become common from Claude. The requests themselves aren't dangerous; we find that benign requests in biological science wind up being blocked semi-frequently.
If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.
No, they want to sell you Mythos, for a higher price. It's all an economic game, not actually anything to do with their capabilities, which of course exist, as their Project Glasswing shows. More generally, Anthropic seems to value safety above all else, philosophically speaking, from their very outset.
I've been building a product (https://zeroquarry.com) that can use a variety of models for finding vulnerabilities. One of the things I've noticed is that the models will nearly always comply with some of this, but how you prompt them matters a ton. I've worked on a set of prompts and approaches which rarely get flagged.
I was using a local Codex project as a personal knowledge base. So I would dump in documents, basic medical docs (like blood labs), and other things and have it file them.
It’s great at filing!
But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.
It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.
Funny, Opus 4.8 just logged into the database using an uncommitted .env file and ran some DB queries to figure things out, so I’m not sure it’s that security conscious. It seems to be getting more intelligent to me, and I bet if you frame it as an investigation with, say, Playwright, it’ll do all sorts for you. I’m not sure what the point is of constraining your own model like this when others clearly aren’t, tbh.
This is a good point, because pentesting is entirely legitimate work, and security testing is a necessary and legitimate part of everyday software engineering.
The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).
I think that these companies are going to have to, and will, invest in some sort of validated identity context to avoid the lowest common denominator.
The first challenge is making sure the guard rails work and are robust. Companies are still working on this.
The second challenge is being able to reliably adapt them as appropriate per user, e.g. allowing someone to pen test their own app.
The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.
I think the latter will be a hard problem, but they will be highly motivated to solve it.
I totally agree. I had a situation a few weeks ago where claude started struggling to make progress. I got it to fork leptos (MIT licensed web app framework) to make it work for native apps instead. Initially I was planning on upstreaming some of my changes. But I chatted with the leptos author about it, and he said I should fork instead. Fine by me!
Anyway, claude kept hitting some guardrail it had about rewriting / forking open-source software. I'm not sure what the problem was - I was forking an MIT licensed piece of software (into more MIT licensed software). I even had explicit support from the author to do so. Claude said its guardrail told it not to tell me explicitly that it was firing - but it did anyway because it was an ongoing problem, and it was distracting. I ended up just wiping claude's context and the problem (as far as I know) went away.
I understand why some of these guardrails exist. But it's pretty annoying when they misfire like this.
It raises an interesting moral question:
If an un-guardrailed version of a model is capable of detecting security flaws, should it be kept secret? Should everybody be able to use these models to find (and fix) security flaws? Are we ok with the fact that those with access to that model have, in effect, the ability to hack lots of stuff?
Are they charging for the guardrails? Like do the guardrails expend token counts to then block you from the output of other tokens?
Opus 4.6 will still help with full pentesting including RCE. Just requires coaxing (no jailbreak)
I've run into some of the refusals to handle my credentials, but so far I've appreciated them. I was only handing over credentials that didn't matter, but it's still a good move: the chat logs are clearly stored somewhere to allow the resume functionality to work, which means your credentials can end up sitting around on your filesystem, and any malware would quickly learn to check for those files.
4.8 is insanely frustrating. This evening I had a few tasks to pull information in and it plainly stated that the environment it was in had no network access. After three asks to "try again, check the system prompt" it finally relented and then basically stated it was lying.
Fresh session, no prior context on 4.8. These things are becoming useless Duplo.
I think those guardrails are a thin layer though. Enough reinforcement that you're legit in CLAUDE.md will get around them, in other words.
Worth highlighting in case you missed it:
> My OpenAI account was already approved for security research which is why GPT didn’t result in any refusals.
So the comparison with Chinese models is interesting, but anyone looking at these raw results and comparing OpenAI/Anthropic would be badly misled.
> guardrails prevented it from solving the problem.
Reminds me of the defense issues with Claude, which were complained about as “woke”, but the reality is more horrifying to me. Imagine trying to use a model to keep up with a land invasion on US soil. Who the enemy is doesn’t matter; you just know they are using AI, and your guys are telling you that no matter what they type into the prompt, it refuses. Anyone who has ever tried to jailbreak an LLM knows that they refuse the request even if human lives are at stake. Now literally millions of lives are on the line, but the guardrails that your enemies don’t have on their models are costing you lives.
What do you even do then?
AI will always have this issue where it will always pick the worst option for genuinely good requests.
> Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there
No, the choice will be whether or not to upgrade to "Claude Security Professional" or whatever they want to brand it as.
What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.