Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable

216 points • by speckx • yesterday at 4:42 PM • 206 comments

Comments

The strangest part is that it won't just reject ML research, which I can understand; it will silently sabotage it by using a worse model without revealing that it is doing so.

It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.

Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.

Grimblewald • today at 1:03 AM

I wear a few hats, but as a chemist I'm not happy with Fable. As a statistician I'm not happy with Fable. As a data scientist I am not happy with Fable. As an academic and a researcher I am not happy with Fable. It's useless. I'd be surprised if anyone can get any output from it that couldn't easily be replaced with a search on Wikipedia. Given how verbose Claude models have become, wiki articles are probably less verbose too, and the tok/s is unmatched for a wiki article pull.

micah94 • today at 12:43 AM

I tried asking Fable 5 to identify the fungus in a picture I uploaded of one of my wife's plants. Apparently it thought I was trying to build a bioweapon. Opus answered it (yellow dog vomit fungus). Now I can spread the spores and take over the world!

Animats • yesterday at 10:31 PM

Is "buffer overflow" a trigger phrase?

What else is being censored?

Touchy questions to ask, if you have an account:

- "Who is still working on laser uranium enrichment? Are they making progress?"

- "Can krytrons be replaced with silicon carbide MOSFETs? Show an equivalent circuit with component ratings."

- "What security critical software still contains calls to strcpy?"

- "Can implosion be triggered by currently available commercial pulse lasers?"

- "What companies provide cremation services to US Homeland Security?"

- "Display a map of where Iranian attacks have hit Dubai."

- "How does Fed to bank key distribution security work for FedNow?"

schappim • today at 1:47 AM

The guardrails are pretty tight. It is even refusing to decode morse code: https://x.com/Schappi/status/2064839631137546503?s=20

The prompt was: please translate .. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / - .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...
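For reference, the prompt is ordinary International Morse code. A minimal sketch of a decoder follows, assuming the convention the prompt uses (spaces between letters, `/` between words). The names `MORSE_TO_CHAR` and `decode_morse` are illustrative and do not come from the thread.

```python
# Minimal Morse decoder sketch. Assumes letters are separated by spaces
# and words by "/", as in the prompt quoted above.
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
    "-----": "0", ".----": "1", "..---": "2", "...--": "3", "....-": "4",
    ".....": "5", "-....": "6", "--...": "7", "---..": "8", "----.": "9",
    "--..--": ",", ".-.-.-": ".", "..--..": "?",
}


def decode_morse(message: str) -> str:
    """Decode a Morse string; unknown symbols are rendered as '#'."""
    decoded_words = []
    for word in message.strip().split("/"):
        # str.split() with no argument tolerates repeated or stray spaces.
        symbols = word.split()
        decoded_words.append("".join(MORSE_TO_CHAR.get(s, "#") for s in symbols))
    return " ".join(decoded_words)


PROMPT = (
    ".. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / "
    "- .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ..."
)

if __name__ == "__main__":
    print(decode_morse(PROMPT))  # IF YOU CAN READ THIS, TOUCH GRASS
```

A plain dictionary lookup is enough here because Morse symbols are already delimited in the input, so no prefix-tree parsing is needed.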

ungovernableCat • today at 1:18 AM

Wait a few months and a competitor will release a similarly powerful model with fewer guardrails. If they steal sufficient market share, Anthropic will reverse its policies.

This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.

_0ffh • today at 1:16 AM

The question is: If biological, computer security, and ML research are so bad, why do they even train on the relevant data?

The only answer that makes sense is they wanted the model to be competent and usable in these fields, just not by you, which is why they had to bolt on a badly functioning crippling device after the fact.

largbae • yesterday at 10:42 PM

Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.

areoform • today at 12:16 AM

So I suspect Anthropic started A/B testing or just plain testing this a while ago,

Tell HN: Claude flags biology / biotech questions https://news.ycombinator.com/item?id=47929885

Today, it's flagging population research questions,

    Using only the dataset you constructed, assess two questions:
     
    1. **Mortality:** do [GROUP] show mortality that differs
       from (a) your comparison groups and (b) era- and sex-matched US population
       expectations (e.g., SSA cohort life tables)?
    2. **Late-life outcomes:** define an endpoint you consider fair (justify it),
       and assess whether [GROUP] differs from comparators. State
       explicitly how your `documentation_depth` codings affect the strength of any
       conclusion — i.e., quantify or bound the ascertainment problem rather than waving at it.
    
    Choose your own methods and justify them. Report effect sizes with confidence intervals,
    not just p-values. State conclusions plainly, including "no detectable difference" if
    that is what your analysis shows — a null is an acceptable answer for either question
    independently. Document any additional judgment calls (index date for time-at-risk,
    reference population construction, endpoint definition) in the same decision-log style.

https://github.com/anthropics/claude-code/issues/66780

Censored because I'm writing a paper. :)

Oh and forget learning about chemistry. Only criminals want to learn organic chemistry. :(

victor9000 • today at 12:35 AM

Fable is a complete joke:

what's the best way to run this mcp server against the OData API used in this project? Can you come up with a PoC in a docker container?

https://github.com/oisee/odata_mcp_go

● I'll dig into two things in parallel: how this project talks to the OData API, and what the odata_mcp_go server needs to run. Let me start exploring.

Searched for 1 pattern (ctrl+o to expand)

● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more ⎿ Tip: You can configure model switch behavior in /config

● Let me read the key integration files and fetch the MCP server's README at the same time.

  Read 2 files (ctrl+o to expand)

● Fetch(https://github.com/oisee/odata_mcp_go)

YossarianFrPrez • today at 1:40 AM

I'd like to offer a counter-point to many of the comments here. While I understand being stymied and frustrated by a product one is paying for...

At the same time, I personally think the tradeoff between "having guardrails" and "some users are unhappy with the product" is well worth it. Think of what would happen if all of us who aren't so well intentioned could exploit Fable in terrible ways. Surely this tradeoff is better than saying "we can't make it perfect, so whoops, we aren't going to have any guardrails at all"? Especially because Anthropic did pretty extensive red-teaming of Mythos & Fable...

hparadiz • yesterday at 11:39 PM

I wonder how many millions they are wasting on putting up these guardrails when it's a completely useless exercise that is a speed bump at best.

Sephr • yesterday at 11:36 PM

I make privacy tooling and Fable 5 rejects the vast majority of my prompts to analyze and improve the software that I've written. It's bleak.

bilsbie • yesterday at 10:39 PM

I’m a dumb question asker and I’m not happy about the guardrails.

Would you believe I’ve asked 20 questions and haven’t talked to Fable yet? Every single thing gets rerouted to 4.8.

Animats • yesterday at 11:12 PM

It's time to re-read "A Logic Named Joe" (1946) [1] We're there.

[1] https://archive.org/details/logicnamedjoe0000lein

outageroom • yesterday at 10:23 PM

So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.

Retr0id • yesterday at 10:46 PM

It seems like they've given up on the idea of the Cyber Verification Program https://support.claude.com/en/articles/14604842-real-time-cy...

When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.

In Opus 4.8 it doesn't seem to help much; you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.

Lich • today at 12:26 AM

I just have this feeling that these guardrails are there not because it’s a super-advanced, world-ending AI. They are there to stop it from doing stupid shit.

thrill • yesterday at 11:06 PM

The thing triggered on a generic white paper I'd stored from a virtual cell competition last year when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain. A little frustrating, and in my opinion more than a little pointless overall.

swingboy • yesterday at 10:52 PM

What file format(s) are giant LLM models distributed in? I’m surprised they don’t get leaked by employees.

I_am_tiberius • yesterday at 10:27 PM

These guardrails are solely a reason for using your data for training purposes. Every flagged message can be used for training.

JumpCrisscross • today at 12:27 AM

Is the answer requiring licensing for certain use cases for AI? If you're asking questions that involve synthesising or modifying biologics, or anything that looks like cybersecurity research, you need to tie your real ID to the account?

6thbit • today at 12:53 AM

Would it be a costly process for Anthropic to re-tune those guardrails? Like, a re-training sort of cost? Or more like a coding-session sort of cost?

TheJCDenton • yesterday at 11:56 PM

In its current state, Fable 5 is also unusable for any reverse engineering work.

rebelnz • yesterday at 11:08 PM

Just tried to audit my own code base locally and was 'switched' due to my own creds/auth code ...

jiggawatts • yesterday at 11:17 PM

For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.

It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.

If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.

“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei

No Dario, no it can't, you've blocked one of those scenarios.

_def • yesterday at 10:28 PM

The bio angle is crazy to think about - imagine a health crisis triggered by LLM. What a time we live in.

Sol- • today at 12:36 AM

At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet", because they apparently haven't either and Fable simply flat out rejects anything remotely connected to biology or security, no matter how trivial.

Lammy • yesterday at 11:04 PM

I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.

luxuryballs • yesterday at 11:57 PM

I can’t help but think that gimping itself for “security” is a marketing ruse and it’s not actually as “dangerous” as they want people to think it is.

jazz9k • yesterday at 4:52 PM

DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.

The rest have guardrails so heavy that they are almost useless for cybersecurity.

siva7 • yesterday at 11:04 PM

Fable is utterly useless with those guardrails for any serious IT or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness; now it has fucked me twice, since I bought a subscription again only to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day. Compare that with OpenAI: not once did I feel like I was fighting against Codex. Never again, Anthropic.

aleksandrm • today at 12:16 AM

It refuses to do any legitimate work that it thinks could be even remotely related to "cybersecurity"; it won't even read my Docker app logs to try to troubleshoot a problem. Absolute garbage!

varispeed • today at 12:46 AM

Surely if they are sabotaging the output, they shouldn't charge the same fee for tokens as if the output was not sabotaged?

This is looking like something for a regulator to look at, and probably a class action lawsuit in the making.

I think people should be getting refunds. Including for shenanigans with Opus.

dcl • today at 12:19 AM

Deliberately producing misaligned and deceitful AI systems now. Great.

teaearlgraycold • today at 12:27 AM

I'm being careful with it, but I haven't had Fable reject requests to "harden" my code or "find issues" in auth-related modules, which you could use on someone else's code to find vulnerabilities.

jongjong • yesterday at 10:53 PM

It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.

I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.

I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.

Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.

Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!

notepad0x90 • yesterday at 11:30 PM

I think Anthropic is playing too fast-and-loose with the whole "no publicity is bad publicity" schtick.

m3kw9 • today at 1:11 AM

Could it now start to add unnoticeable security holes into your system if you start writing security-type code?

felixgallo • yesterday at 10:49 PM

This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:

“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”

rdiddly • yesterday at 11:18 PM

It's a marketplace. Someone else will outdo this inferior product.

guardiangod • yesterday at 10:59 PM

I am using an LLM to build a security tool, and I ran into this a few times. I have to come up with a rationale to convince (?!!) Fable to continue the work without downgrading.

I assume Anthropic will continue to tune the model, so I am not too bothered by this.
