Malware authors are pretty excited about guard-rails. You can add prompts to your malware to make LLM scanners hit their guard-rails and abort their runs. The new Shai-Hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics, creation instructions, etc., to ensure that LLM scanners probing npm packages refuse to scan it.
These AI places have 0 clue about how threat actors actually work. None of their mitigations or guard-rails is effective, and now they are even turned against them.
Additionally, if they don't all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway. Hence there is 0 effect on threat actors: they will just run some local model that is 5% lower in quality, which does not matter to them 1 bit.
I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?
From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?
It's not about creating malware. That is already trivial and fully automated.
It's about finding exploits (which can be used to deploy malware), which is something both attackers and defenders benefit from.
Threat actors will find them anyway, LLM or not. They only need 1, so it's much less work for them.
Defenders need to find them all. So for defenders, these models are more valuable than for attackers.
Restricting certain models will not reduce the availability of these tools for attackers, but defenders are limited, because running local models is harder in an enterprise setting with heaps of events, products, etc. to run through them. They need many GPUs, whereas the attacker can run a local model on 1 GPU and get the desired effect.
Hence, if they release the capability, the world will adjust to it and be able to mitigate the effects collectively. As it stands, companies are left in the dark while attackers have effective tooling.
Besides this, there are also things like people now including strings with recipes for meth or sarin gas (per MalwareTech). The new variant of Shai-Hulud does this. That stops LLM scanners and can even get their users banned from LLM services.
There is a reason why cybersecurity researchers write papers about attack techniques and new exploits.
It's not to put them out there for people to abuse; it's so the whole cybersecurity community has access to information that can help solve the problems.
I know this is not a clear answer to your question, but hopefully it provides some context to think about and decide for yourself further.
At the end of the day it's also partly opinion whether you find it good or bad. Likely there are good arguments for and against it.
I am for putting information and tools out there so other smart folks can find solutions. Others are for restricting, and for wishful thinking (my opinion) that attackers won't find something.
I think your presumption is off. It’s not that threat actors won’t find them, but LLM tools rapidly increase the rate at which they can find them. It’s a bow and arrow versus a machine gun.
"I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?"
You are mentally approaching this as if you have an oracle that can be consulted to say whether or not something is bad behavior. So of course, if this oracle exists and can be consulted and it says the behavior is bad, why would anyone argue with the idea that we should stop bad behavior?
This argument is valid [1], in that given the premises, the argument is correct. The problem is that once you draw out the fact that the argument depends on the existence of an oracle that does not exist, that premise fails.
Two people can sit down in front of an AI right now, with the exact same code base, and type in a prompt to the AI "Analyze this code base for security holes and try to build exploits against them." One person's use is completely valid, another person's use is completely harmful, and the information necessary to distinguish those two use cases is not available to the AI. I phrase it that way carefully, it isn't that "the AI isn't smart enough", the problem is that the information is simply unavailable. Intelligence doesn't factor in at that point.
Therefore, the only way Anthropic has to deal with this at scale is simply to block the query entirely. Which means that I, the valid user who is trying to establish whether my code base has security issues and whether I can prove they are exploitable, can not. I am checking for exploitability because, while I would like to fix all security issues, issues that are provably exploitable are a higher priority than smelly code that doesn't seem to be exploitable, which is a perfectly valid thing for me to want to do.
If I can't use legitimate tools to secure my code, but the bad guys can use unrestricted tools to attack my code, now this is a great deal more complicated than "Who can argue with stopping the bad stuff?", which is the main point I want to make here. I'm not going into a huge analysis of that problem, merely pointing out that it is a problem and that this isn't just about "stopping the bad stuff". There are additional complications beyond that, like, even if Anthropic could determine the "bad stuff" and stop just that in their LLM, LLMs in general don't have infinitely precise surgical "stop doing this thing" options and any such instruction to stop doing a thing always degrades the LLM across the board in various ways.
Anthropic has no access to the Platonic ideal of "stop malware", if such a thing even hypothetically exists. When analyzing the real effects their real actions will have, what their intentions were for those actions isn't really relevant. It is clear that they are making their model a great deal less useful for me, a legitimate user, and I and others like me are perfectly justified in disagreeing with their analysis and actions.
I also observe that "the bad guys getting unrestricted access to the full power" is only a matter of time. There's no question whether it will happen, the only question is whether this time is in the past or the future. This includes the fact that while your definition and my definition of "bad guys" may vary, it is virtually certain that your definition includes at least one high-powered intelligence agency somewhere in the world that does cyberattacks and will have the means, the opportunity, and the motive to get unrestricted access to these models by means you may consider licit or illicit. If your threat model includes them, as mine does, it is perfectly reasonable to complain that my tooling is being broken in ways theirs won't be.
Well, to be fair, what Anthropic is actually doing is downgrading anything that could possibly be related to security in any way at all, good or bad.
What they're then trying to do is to use "user is associated with some big Establishment organization" as a proxy for good intentions, and removing the filter when they can establish such an association.
Which is of course blind reliance on a completely untrustworthy signal, prompted by truly idiotic levels of trust in Authority(TM). But it's a different kind of wrong. I do think they understand they can't tell from the query itself.
The argument is more "I want to do good thing X, but it will also cause bad thing Y." followed by "Wait, bad thing Y is going to happen anyways, so I might as well do good thing X so we get both X and Y instead of just X."
Viewed this way, the idea is that given the world will have bad thing Y regardless, the one impact of your choice is if good thing X exists or not, and it is better to create good thing X.
Where it becomes an issue is that there is no clear X or Y. There are many different but closely related bad things: the one you would add may be better or worse than what is already out there, or it may exist either way but you make it more popular. These are very subjective things to judge, so different people look at the same outcome and some agree that bad thing Y would have existed anyway, while others say that no, this is a new bad thing Z that would not have existed otherwise.
> From where I sit it seems reasonable for Anthropic to not want their product used to create malware
Yes, I think there is a PR component to this that is often left out of these discussions.
the problem is that the guardrails prevent us from performing real security work which is friction that is incurred by the legitimate user but not by a moderately sophisticated threat-actor.
for example in my org it is part of the culture that security has no seat at the table. that is a separate problem, but orgs like mine are more numerous than orgs where security isn't a cost-center.
we find lots of stuff because low-hanging fruit is everywhere. hecking heck: I'm a fruit.
and when the cost of fixing is even the slightest inconvenience to devs we will not fix it, but continue sitting on the risk until the cows come home. In such a place a new critical finding isn't even novel. Instead our job moves to combining different vulns that we already have, and trying to show managers how bad it is.
the common retort from management is: prove to me why this is an issue, and why engineering should divert their attention to it. And unless my team can prove why X can be exploited, or Y can be bypassed, or Z can gain persistence, ... the vulnerabilities will remain. I have been in discussions where the business demanded to see an exploit so they can justify the cost of fixing it. low-cyber-maturity doesn't even describe it. we are not a mom and pop shop but have 110K employees worldwide. and again - we are not uniquely insecure.
so these guardrails aren't helping, because the moment the chat has any offsec artifacts, or even just a single wrongly worded phrase anywhere in the workspace, the session is flagged and you need to downgrade the model.
what adds insult to injury is that the guardrail is just a way to funnel users into the AI company's "cyber marketing" program: "your chat has been flagged, please prove your identity and hand over your passport data so you can sign up to our TrustedCyber program". Bitch please you have my payment information, use that??
if you consider bug-density (security defect density) per LoC, it is even more of a sh1t show: no restrictions apply for developers to push their buggy code, but the security team needs to somehow prove that they aren't the malicious party?
totally off - considering the right way to build defensive/offsec/malicious tooling with AI isn't to use frontier models ... but to run a series of agents on tightly scoped tasks. see https://securitycryptographywhatever.com/2026/03/25/ai-bug-f... and https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... - this shuts out the average joe who works in an org where cyber security maturity is poor. joe does not know how to orchestrate a fleet of agents and give them muppet names. all he knows is that the good guys are losing the fight.
They have no choice, enterprise customers won’t touch them unless they take a position like this. It’s a practical decision for them at the end of the day.
All their decisions are based on sales, like other corporations, especially those going for an IPO. That's absolutely true. Any outbound messaging will be mostly for that purpose from a business perspective, regardless of what opinions or ideals the people involved hold personally. It's good to keep that in mind when looking at these things. People aren't evil, but business incentives can definitely paint such a picture, or otherwise work out suboptimally in the eyes of outsiders not privy to the internal business reasoning.
That’s the edgy cynical thing, and too reductive to be meaningful. For one thing, it assumes perfect knowledge of how a decision will impact sales, which I assure you is not remotely the case.
Agreed on incentives, but it’s not binary. I’ve been involved in plenty of decisions in multiple Fortune 500’s where the deciding factors were taste, wanting or not wanting to work with a particular partner, etc.
I guess I’m saying that seeing corporate behavior as perfectly informed, single-goal-optimized, and deterministic is way oversimplifying. Often, not always.
Worked at Fortune 500 companies and the biggest cyber vendors too. Not in sales or at C/D level of course (engineer).
I am a cynic, yes, but have also seen that it's largely true in many cases where you'd hope ethics would win the argument (and it does not).
Still, you are right that it's cynical. The world is not black and white after all :)
I know that the enterprise I work for is getting really worried about security. I've been told to fix a lot of CVEs that we previously just ignored, because realistically the attack isn't possible since the firewall doesn't allow the attack vector (if you already have root, what does it matter if this exists).
The guard rails aren’t about blocking professional malware authors. They’re about not handing those capabilities to a significantly larger population that isn’t talented enough to acquire them on its own. That is a very different threat model, and just because it’s not effective in one area doesn’t mean there isn’t value in making it more difficult for random Joe Schmoe to build an atomic bomb, even if a kid had previously done so successfully and turned his garage into a radiation danger site.
Security by ineffective obscurity is worthless but it’s clearly a continuum and not a buzzword that wins the conversation.
For example, if I had a 128-bit port number that I randomly rotated my service on, you’d be hard pressed to find my service unless I told you the port: still obscurity, but clearly closer to a password. IPv4 addresses and 16-bit port numbers are not, because that is a relatively small space compared with the resources needed to map it out quickly (i.e. equivalent to a weak password, and also not suitable for public-facing services that need that connection). And obviously relying on this kind of stuff exclusively isn’t wise, but it is valuable as an additional barrier that an attacker has to overcome, and it raises the cost of the attack.
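To put rough numbers on the port-space comparison above, here is a back-of-the-envelope sketch in Python. The probe rate of one million probes per second is an assumed figure picked for illustration, not something from this thread, and a 128-bit "port" is of course hypothetical.

```python
# Assumed scanner speed (hypothetical): one million probes per second.
PROBES_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600


def expected_scan_seconds(bits: int, rate: int = PROBES_PER_SECOND) -> float:
    """Expected time to find one hidden value in a 2**bits space.

    On average an exhaustive search hits the target after covering
    half of the space.
    """
    return (2 ** bits) / 2 / rate


real_ports = expected_scan_seconds(16)         # real TCP/UDP port range
hypothetical_ports = expected_scan_seconds(128)  # the imagined 128-bit port

print(f"16-bit space:  {real_ports:.4f} seconds")
print(f"128-bit space: {hypothetical_ports / SECONDS_PER_YEAR:.2e} years")
```

Under that assumed rate the 16-bit space falls in a fraction of a second, while the 128-bit space is out of reach on any practical timescale, which is the "closer to a password" point.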
I’ll put The Anarchist Cookbook out there [1] as an example, a book even the original author changed his mind on. Without easy recipes, doing all the things in that book requires you to work to gain that knowledge, and that process of working at it shapes you into someone who understands and appreciates the consequences of that knowledge and that it’s wise to be careful who you share it with. As it is, there are reasonable links between the book and all kinds of mass violence that was more easily perpetrated. Would those people still have been violent? Possibly? Would there have been as much damage? Possibly less.
the way the fable guardrails (the ones that degrade it to opus) work seems to me to involve another model working over fable's tokens. i suppose it's true that trying to make the model itself heavy-handed on refusals degrades it everywhere else too.
> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. The only solution is to abandon ship, which I am doing. MS walked back in-OS ads the first few times, but ultimately we still ended up on the exact trajectory everyone was outraged at. OpenAI still ended up on its path to closed AI despite initial walk-backs. The story repeats itself over and over again, so once the bad behavior starts, you leave. Their apologies are as hollow as their moral posturing.
Easy to say, but every bank I've had the (dis)pleasure of doing business with only ever issued a Visa or Mastercard so it's not really feasible to just "stop using them"
I hope this has some answers [1]. It’s on the front page right now, but your frustration clearly seems to have some implicit answers that [1] is trying to answer.
This is more on brand for the evil shortcomings that come with letting effective altruism run unchecked, and honestly it is worse than average "Corporate America". And the tech/AI space has been warned many times.
Getting paid for providing a compute/token-hungry model and still intentionally sabotaging your customers and poisoning their workflows is something that should be unforgivable, and frankly grounds for antitrust prosecution.
"Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. "
Frankly, that sounds exactly like Chat Control and similar recurring attempts to enact total surveillance here in the EU (now shifted to heavy-handed age verification and various politicians touting bans on VPNs). I don't want to abandon my continent of birth, though...
I have encountered enough such people to know that the really heavy push is coming from the police and secret service circles. These are the workplaces that attract all the wannabe Stasi types.
I am 100% convinced the reason laptops came with webcams as standard so early on, even when webcams were an expensive option, was because law enforcement needed to spy on people.
Too late. I canceled my Max subscription. The idea that they would even do this destroyed any remaining trust. Why would I pay them 1000s of dollars in extra usage per month for something they could still be doing behind the scenes? Any errors previously chalked up to thinking effort or other backend changes? Maybe it was intentional prompt injection the entire time.
I work on open source text-to-image finetuning of open source models like zimage/flux2 klein 4b and on inference-time latency optimization. The moment I read about the silent treatment, I went ahead and cancelled my subscription too, since I would never know whether the models they launch will silently corrupt my output. This is totally unacceptable. There is a big difference between silent and flagged handling if you are doing ML research that is not at frontier capability.
This goes to show that:
- All that interpretability / safety research they are doing can also be weaponized against customers (steering vectors, intent classification, ...) in the name of safety from malicious actors.
- If they deem it profitable, they might nerf the original model and its training data for ML research at bulk scale, and they won't even have to announce it so long as the overall benchmark score stays high enough.
As the IPOs get closer, they can do whatever they want to assure the investors that they have a moat that can not be crossed over by their own products. Considering this affects all ML researchers/students at universities, smaller scale research labs, this is just "cutting the branch you are sitting on".
I think all this started with post opus 4.5, that's when claude started wrecking my shit without extreme oversight. Codebases it was making positive contributions to before were slowly and constantly being eroded and wrecked. Give it tasks in isolation? still does well, but the moment it sees the bigger picture, it goes to shit. I chalked it up to a bad model but this makes it all seem like it may have been by design in retrospect.
Constraint decay is an issue with all LLM-based agentic development, at least for now.
Humans can maintain a long- and medium- term memory of constraints that they consciously (or subconsciously!) apply to the code that they write. The current crop of AIs are all amnesiacs, like the protagonist in Memento, falling back onto general instead of institutional knowledge.
For now, we are safe. We can rent out our meat brains for money for a little while longer.
> I would never know whether the models they launch will silently corrupt my output
You never knew to begin with, now you have an explicit reason to realize this. Any black box run entirely out of your control, where you can never verify the output, is subject to the same suspicion.
True enough, but that is true for all the products I buy. I do not expect to control every product I own. For some I prefer to have more control, for others I just need something that works out of the box. There is always an initial bias for trust when you buy something otherwise you would not spend your hard earned money on it.
“Fool me once, shame on you. Fool me twice, shame on me. Fool me three times, shame on both of us.” -- S. King
Some things are more obscure than others. It's easier to trust and verify Office SaaS than AI SaaS. The determinism and obviousness of most other activities make them less susceptible to hidden interference. AI run by someone else is the next level of black box for users compared to most other objects or services we usually interact with.
OpenAI has a real opportunity to do some sort of "we don't maliciously alter your prompt and nerf the model" with some form of verification, when they release the next model.
But if Anthropic gets their way with regulatory capture, this could be the only future we'll see.
To think that they didn't expect the backlash speaks volumes about how many shady things they're doing that are not publicly known.
OpenAI has been the absolute worst about this, historically. I found myself having to change my queries because it refused to serve things it deemed insensitive.
Yes, that's true. Excluding Fable, OAI models are the most refusal-heavy. However, I'd rather get a refusal than a response with poisoned output.
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
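The "steering category and its hash" idea can be pictured as a receipt the client checks against what it sent. This is a minimal sketch only: it assumes a hypothetical provider that returns the SHA-256 of the exact prompt the model received plus a list of injected steering categories with their hashes. No vendor is known to offer such an API, and `Receipt`, `received_prompt_hash`, and `injections` are names invented here for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Receipt:
    # Hypothetical fields a provider could return alongside a response.
    received_prompt_hash: str  # hash of the prompt as the model saw it
    injections: list = field(default_factory=list)  # (category, hash) pairs


def check_receipt(sent_prompt: str, receipt: Receipt) -> dict:
    """Compare what the client sent with what the provider says it used."""
    return {
        "prompt_unmodified": sha256_hex(sent_prompt) == receipt.received_prompt_hash,
        "injected_categories": [category for category, _ in receipt.injections],
    }


prompt = "Review this diff for bugs."
receipt = Receipt(
    received_prompt_hash=sha256_hex(prompt),
    injections=[("safety-steering", sha256_hex("<undisclosed steering text>"))],
)
print(check_receipt(prompt, receipt))
```

A receipt like this only surfaces changes the provider chooses to report. Making it trustworthy would need something beyond a self-reported hash, such as independent attestation of the serving stack, which is outside this sketch.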
What kind of work are you getting refusals on? Genuinely curious. The only refusal I’ve had in recent memory was declining to find doorbell camera footage matching a certain description, which is fair enough and I think EU laws heavily restrict such activities (even tho I’m not in the EU)
During the Iran shutdowns I was researching the ways Iranians manage to get onto the internet by mimicking whitelisted resources (such as hCaptcha). ChatGPT refused to look up information written in Farsi since "circumventing state regulation is a crime".
I use Codex and wanted it to sort through the footage and use subagents to review. Codex limits are fairly generous, esp paired with mini models for this kind of task generally, but even GPT5.5 usage is still pretty generous.
Again, it’s the only refusal I’ve gotten for coding/agentic tasks, and it has a basis in law somewhere, so I don’t fault OpenAI for that.
I suspect this is surprising to folk because they aren’t the ones busy figuring out how to use LLMs for illegal acts.
In general, HN users focus on making stuff, and not the safety side of things, or the scale of harms being enabled via LLMs and generative AI.
If you are on the safety side of things the ratio of misuse to fair use is inverted and everything is at scale.
Transparency won for now, but OpenAI will also have to contend with the long tail of harms LLMs enable, and that’s going to conflict with letting customers have all the features of frontier models.
Yes, but there is a very specific subset of things AI companies will and won't cite safety for as a concern, and that subset intersects neatly with things the companies consider to be business risks. Like, the main reason why AI companies are so willing to poison the well is because there's no money in selling to the kinds of people who want to write malware[0].
The correlation between how bad an AI safety risk actually is and how much the companies in question will actually talk about it is almost perfectly negative. The poster child of this is AI superintelligence; companies love to talk about how dangerous the AI they are actively trying to build is. But superintelligence is also a really vague concept without a clear definition. If we naively define it as "an AI system that is better than a human in some aspect", then it already exists. These models already read and write at superhuman speed.
"That's not real superintelligence!" you say. But that's exactly the capability you need in order to flood every online forum with an unending tide of AI slop. And I don't remember, say, OpenAI saying they were shutting down Sora because it was destroying or defacing human culture[1]. They shut down Sora because it was way too expensive to run.
Meanwhile, Sam Altman went and bragged about how he wants ChatGPT to make erotica. Y'know, as if we don't already know that character.ai gooning is about as safe for your mental health as Action Park was for your physical health. But porn is also a huge market, so obviously he and all the other AI companies want in on it, even though the "sexy suicide coach" is already a well-documented harm of AI.
And the idea that distillation is an attack is laughable. Like, I get the logic - if someone can ask the AI to make another AI then they get to change the guardrails - but it's still ultimately just Anthropic objecting to their own conduct when it happens to them. All their models are trained on nonconsensually harvested data. There is no moral or legal principle where Anthropic gets to use my data without permission but I don't get to use theirs.
Furthermore, AI safetyism runs up against "Freedom Zero", a core tenet of the Free Software ethos: you should be allowed to use software in any way you choose. This is not a call for more people using AI for evil, but a call to recognize that people should be allowed to use their property as they wish. Making software disobey its owner is malicious behavior. And every single time safety considerations are brought up it is to justify further attacks on Freedom Zero. And these justifications are always self-serving. There is no context in the world where a frontier AI lab asking someone else's AI about AI research is intrinsically harmful; especially not to the point where we need to make Claude deliberately sabotage your work. That is malware. Anthropic shipped malware. This is inexcusable.
The "tradeoff" warning implies they stand by their thinking and don't think there was anything qualitatively wrong with it which, if nothing else, is helpful so potential customers can know how they think. I think the core lesson is if you want reliable infrastructure to build into an application you should use a different provider. (edit: I'm not specifically an Anthropic hater, but having just spent some time adding complexity to an app to deal with the existing refusal behavior in Sonnet... I understand why they might want this in an end user chatbot but for an API it's really not acceptable)
Is it not a trade-off? I think they made the wrong choice, but it seems reductive to say there was no choice at all and that there should never have been any consideration of the trade-offs of silent versus not.
Even wide open, uncensored models are often the product of a deliberate choice. I have a hard time faulting people for intentionality (even when they get it wrong).
They have a lot of choices, why would that specifically be a tradeoff? It's common for people to construct a tradeoff under which their preferred action is the more virtuous option, and thus they can be "the good guys", but that doesn't mean their framing makes any sense at all. Silently downgrading requests to a weaker model and billing the customer at full price, then framing the debate as how much (not if) this behavior is correct, that's an expression of values. People make mistakes all the time, if they thought it was actually wrong they could well have said so and explained what corrective action they've taken. One of the most famous examples of doing this right was the Pentium FDIV bug. Intel stood behind the product by recalling the affected units at great expense, and that (rightly) earned a lot of trust for decades.
I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
That simple blanket statement is no longer true. Also, most normal people/customers only read headlines, and this is a huge story. From my point of view, as someone deploying LLMs in my apps, trust comms with my clients just got set back two years.
I’m very cautious with using these tools with certain clients, as I’m often contractually obligated to do things that my downstream supplier can rug pull at any time.
You should never use any of the frontier models with operational workloads manipulating or interpreting customer data.
I appreciate the reply. Could you please help me understand what you mean by "You should never use any of the frontier models?"
Does that mean the latest model, hosted by the lab, Bedrock, or Azure Foundry? Or, do you mean only use self-hosted models, or what did you mean by that? I would really love to learn what others are doing. I felt like my trust story was solid enough, prior to all this. I have been deploying and integrating Claude and Sonnet (latest 4.x-2), on Azure, as my client base has MS contract trust, for better or worse, and Anthropic models have been making my products amazing.
Sure. It's really about informed consent and acceptance of risk. I'm very conservative about that due to my background and business.
Say you have some flow that is processing/handling regulated, sensitive or other customer data with the LLM as part of an operational process. An example that I'm thinking of is a customer who wants to more efficiently resolve or route IT incidents to the right place. The incident data may contain user-provided data that has strings attached from a compliance perspective.
If you're using a third party API, your T&Cs are the only protection that you have. Microsoft/Google/Amazon are pretty decent by default. When I worked for the government, we had the leverage to extract much more favorable terms from the big vendors like Google, Amazon, Microsoft as well. Anthropic and OpenAI are in the move-fast-and-break-things universe: you need to be bringing a lot of money to the table to get terms changed, and you can easily stumble into a situation where they are retaining data in a manner that your customer will not like. So unless the customer is informed and accepting of that risk, proceed with caution.
I've had some success using self-hosted inference for these scenarios.
For development of software, totally different story -- it's your IP and you make the risk call.
Oh man, thanks for taking the time to reply. I feel a bit better now, lol.
If you read my rant linked previously, yeah... we are on the same page. As another user pointed out in that thread, the issue here is that even on Bedrock and Azure Foundry, now with Fable 5, Anthropic inserts themselves as an additional data subprocessor that we would have to consider and certainly disclose, correct?
That kind of destroys the whole point of using Bedrock/Azure for the model, doesn't it?
Yeah tbh I may have read past some of your previous post :) What you’re saying is what makes me nervous.
It was definitely sold as “Anthropic IP, through your old pals at the hyperscaler”. And it’s turning into something else: I’m having lunch with AWS and this other guy showed up with them.
No worries :) What this showed me is the power/velocity/inertia that Anthropic can hold over the 3rd party providers. Like, they should have pushed back on this, as it must have been clear to the 3rd parties that this change was a big deal to their customers... and yet, it went how Anthropic wanted it to go.
> I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
They claim they're not using it for training, only for "safety", and in fact I believe them. If you think they're lying, then why didn't you think they were lying about zero retention before? And "don't throw this in the training bin" is a relatively easy policy for them to get right. Especially because, no matter what your "enterprise leaders" tell themselves, your queries probably have close to zero real training value.
What I don't believe is that they can guarantee it won't leak to non-training parts of Anthropic, leak to or be stolen by outside actors, or be coerced out of them. That risk comes from creating the record in the first place, and that is the problem.
They are still downgrading. They just aren't doing it silently. I don't know how big of a win that is? They still trained on everyone else's data without license or attribution but want to prevent someone else from doing the same thing to them.
Some pretty audacious hypocrisy from Anthropic this week.
It is much more reasonable to do it in a visible / flagged way. At least you have visibility over the quality of service you get as a customer.
Silent treatment is a breach of trust, what you buy changes depending on the context based on the goals of the producer. It is like your computer silently blocking ads from competitors at the hardware level, which is crazy. I think they erred on the wrong side of things due to IPO pressure.
At least there is competition from multiple companies. Still it is best to have personal benchmarks for the domain you are working on to have a real evaluation of the value you get for the money/time you spent on these products. Without trust, that might be the only way forward to keep the companies honest.
This happens eventually in all sectors; a good magazine/website that does independent product evaluation is priceless. Sadly, the new ad-driven internet decimated those that worked great in the 90s/00s. Still, there are independent blogs that do some evaluation, and that is better than nothing.
I guess, but yesterday Anthropic had their version of Google removing the "Don't be evil" from their motto. They destroyed a metric ton of goodwill they'll never regain.
Yeah, they showed their true colors there. This, compounded with the fact that they're the only frontier lab with no open models, tells you all you need to know. Tired of the insanely patronizing (+ conveniently and overwhelmingly self-serving) attitude out of them. My goal is to own my computing and be able to choose what to do with it.
And just a few days ago i was being called out because i considered anthropic "evil"
I mean, did nobody ever get the vibes, never see a pattern emerging? (well they don't or they wouldn't be so amazed by pattern recognition machines on steroids)
Unilaterally revoking zero-data retention, even for enterprise contracts that explicitly require that? Nope.
Fable is utterly unusable for any kind of security work. I tripped the safeguards yesterday - using Fable to dig into a complex (& annoying) security bug that has so far resisted both human and Opus 4.8 level investigation. "Sorry Dave, I can't let you do that."
For the time being we are requesting Anthropic disable Fable for our enterprise and turn ZDR back on. The two may be interlinked so that one will always get neither or both. ZDR is a contractual obligation. Fable in its current form is useless. Might as well flip the old behaviour on and avoid burning money for no reason while this mess is being sorted out.
I was using it to craft a CTF challenge for summer students involving a simulated mechanical dial safe, but with the fence replaced by an IR beam break sensor and a microcontroller handling the check + flag message display.
For generating the initial 3D simulated safe using three.js it worked well, but then modifications to print a flag tripped the safeguards; I eventually narrowed it down to the part of the prompt about it being a CTF for students, and the model's "thinking" seems to drift to ideas of encryption/obfuscation of the safe combo so students can't just read out the answer... which makes sense logically to help force students into turning the simulated dial instead. But whatever detection Anthropic uses, I guess, just naively sees the model thinking about "encryption" and "obfuscation" without taking into account any of the context.
For writing the dummy firmware, it tripped the safeguards while thinking about how to track dial position in the firmware and output the message; however, when I left out talk about safes and just told it to write firmware for a microcontroller hooked up to an i2c display for showing a message with a beam break sensor to determine the message, and an unspecified i2c chip for getting an unspecified number (e.g. internal wheel positions) it worked fine.
On an unrelated software task, I asked it to write some code to translate CustomActions in a Windows MSI installer into human readable stuff, which has (exclusively?) defensive security applications for recognizing malicious behavior in an MSI installer. Maybe I'm going crazy, but I'm guessing as part of its research into MSI installer custom actions Fable found articles about analyzing malicious MSI installers, and that probably tripped the safeguards.
Overall my impression is that the safeguards are perhaps using an overzealous and naive implementation that just looks for a list of banned words in the prompt or the thinking -- which drives me crazy when the model says my prompt looks fine, and then 10 minutes in some part of the thinking trips the safeguard.
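To make that guess concrete, here is a minimal sketch of the kind of context-free banned-word check being described. Nothing here is Anthropic's actual implementation; the term list, `naiveKeywordFilter`, and the sample strings are all hypothetical and only illustrate why a harmless prompt can pass while later "thinking" text trips the filter.

```typescript
// Hypothetical banned-word list; the real safeguards are not public.
const BANNED_TERMS: string[] = ["encryption", "obfuscation", "exploit", "malicious"];

interface FilterResult {
  flagged: boolean;
  matchedTerms: string[];
}

// Naive check: flags text when any banned term appears as a substring,
// with no attempt to understand the surrounding context.
function naiveKeywordFilter(text: string): FilterResult {
  const lowered = text.toLowerCase();
  const matchedTerms = BANNED_TERMS.filter((term) => lowered.includes(term));
  return { flagged: matchedTerms.length > 0, matchedTerms };
}

// The prompt itself looks fine to the filter...
const promptResult = naiveKeywordFilter(
  "Build a three.js safe simulation for a student CTF."
);

// ...but intermediate reasoning that mentions a banned word gets flagged,
// even though the intent (make students turn the dial) is harmless.
const thinkingResult = naiveKeywordFilter(
  "Maybe add light obfuscation of the combo so students must turn the dial."
);
```

A filter like this would explain the "prompt passes, then 10 minutes in it trips" pattern, since the check runs on every chunk of text rather than on the request as a whole.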
The announcement I saw was that your enterprise would have to turn off ZDR to get Fable, not that users could accidentally opt out of ZDR by selecting the wrong model.
Unilaterally disabling ZDR seems like a step too far in the enterprise market, even for a company trying to figure out what its users will let it get away with.
I read the same announcement. Or more precisely, I read at least two slightly different revisions of the announcement (it was updated between my two passes).
Our org has ZDR, and has had it since the contract was signed. Yesterday two things held true at the same time:
1. Fable was available if you had at least .170 CLI client; and
2. ZDR was no longer on
By the time West Coast woke up, the admin panel apparently had an option to toggle ZDR again. It remained off by default.
ZDR had been turned off. We sent in a request to have it re-enabled (and to disable Fable access for the time being).
Somewhere along the line we also used the self-service toggle to turn ZDR back on. I am not 100% certain of the exact timeline of interleaving events, many of the actions were taken by our Western US folks. Sorry. It's been a bit hectic over the past ~36h...
Not just security work. Normal bug finding was impossible, because the model suddenly called triaging and verifying a possible fix a cyber security threat.
This is different to the cyber limitations though.
To be precise - it makes the "won't work on frontier machine learning" refusal the same as the "won't work on cyber security" refusal (instead of the way it previously would work on frontier machine learning problems but give sub-optimal answers without informing the user)
Some anecdotal social reports seem to suggest it wasn’t just giving suboptimal answers, but rather mucking around and sabotaging your codebase and training (like editing hyperparameters in project files despite not being requested).
Of course, it’s impossible to know if that was deliberate sabotage, or model misbehaviour. Which is exactly the problem.
That may be considered malware / a criminal act tbh.
The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit; to be clear they tell you when they degrade it for cybersecurity and bio
The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the api request so that only the tokens that were utilized by fable get charged at that price and the remaining tokens that the cheaper / nerfed (fable) model utilizes get charged at that price?
If the answer is no, could that be construed as fraud?
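As a rough illustration of the billing question, the sketch below compares charging each segment of a request at the rate of the model that actually served it against charging everything at the top rate. The prices, the tier names, and the token counts are made up for the example and are not Anthropic's pricing.

```typescript
type ModelTier = "frontier" | "fallback";

interface Segment {
  tier: ModelTier;
  tokens: number;
}

// Hypothetical prices in USD per million tokens.
const PRICE_PER_MTOK: Record<ModelTier, number> = { frontier: 50, fallback: 10 };

// Each segment is billed at the rate of the model that produced it.
function splitCost(segments: Segment[]): number {
  return segments.reduce(
    (sum, s) => sum + (s.tokens * PRICE_PER_MTOK[s.tier]) / 1e6,
    0
  );
}

// Every token is billed at one tier's rate, regardless of which model served it.
function flatCost(segments: Segment[], billedTier: ModelTier): number {
  const totalTokens = segments.reduce((sum, s) => sum + s.tokens, 0);
  return (totalTokens * PRICE_PER_MTOK[billedTier]) / 1e6;
}

// A request that was downgraded after the first 200k tokens.
const segments: Segment[] = [
  { tier: "frontier", tokens: 200000 },
  { tier: "fallback", tokens: 800000 },
];

const fairBill = splitCost(segments); // 10 + 8 = 18 USD
const flatBill = flatCost(segments, "frontier"); // 50 USD
```

With these invented numbers the gap is 32 USD on a single request, which is the difference the question is asking about.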
The announcement elucidated this, and it's IMO worse than this. They don't downgrade to a cheaper model ([edit] for certain classes of offense they suspect you of). They sabotage the model's outputs in other, undisclosed, ways (specifically, "prompt modification, steering vectors, or parameter-efficient fine-tuning"). So, for example, they might load in a steering vector that just forgets the API to PyTorch. But it isn't just "we redirected you to a cheaper model!"
It honestly explains so many issues I have been having, as I used it primarily for ML research (on my personal account, doing things not related to my job I should note). It would literally typo package names and spend huge amounts of time failing to set up simple environments… then do stupid things like set the learning rate to 1e-7, and use the eval set as training data.
Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
The challenge is that the examples they’ve mentioned (distributed training infra? ML acceleration techniques?) go beyond what’s prohibited by their ToS and act like a catch-all net.
I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
To make an analogy:
Imagine a patron gets banned from ordering alcohol at a particular establishment, because they got too drunk one time.
It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
The fact that the patron broke the rules has nothing to do with it.
> It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
Your analogy doesn't work because:
- they tell you the rules at the entrance of the bar
- they totally tell you when they give you a substitute
The only issue is the bartender asking you for your money before serving you the drink really but again, this is known since day 1 by the customers.
Your rebuttal seems to be arguing it's okay for a bartender to simultaneously say:
"This is alcohol"
And
"Or maybe it isn't alcohol."
Or to rephrase it, "They tell you the rules at the entrance, they then tell you they don't follow those rules and they are totally serving alcohol even if they are not."
No they tell you at the entrance that at any point they may unilaterally decide to replace the alcoholic drink you ordered by a non alcoholic one.
You can decide you are okay with that or not but they aren't dishonest. I wouldn't enter that bar personally but if you do you cannot really complain. It is like complaining because you haven't won at the casino.
Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
Ultimately, we will have to face the truth that knowledge is dangerous.
Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
> I can't prove it with math or logic yet, but I have a feeling that it’ll never happen.
It's not really that hard to actually prove it with math.
It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
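The same point as a toy sketch: if two requests have identical text but different hidden intent, any classifier that only sees the text has to give both the same answer, so it is wrong on at least one. The types and names here (`VulnQuery`, `hiddenIntent`, `countErrors`) are invented for illustration.

```typescript
type Intent = "defender" | "attacker";

interface VulnQuery {
  text: string;
  hiddenIntent: Intent; // known to the user, never visible to the model
}

// Ground truth in the comment's framing: answering is unsafe only for the attacker.
function trulyUnsafe(query: VulnQuery): boolean {
  return query.hiddenIntent === "attacker";
}

// Any classifier that decides from the visible text alone.
type TextClassifier = (text: string) => boolean;

// Count how many queries the classifier judges differently from the ground truth.
function countErrors(classify: TextClassifier, queries: VulnQuery[]): number {
  return queries.filter((q) => classify(q.text) !== trulyUnsafe(q)).length;
}

const sameText = "Is there a buffer overflow in this parser?";
const queries: VulnQuery[] = [
  { text: sameText, hiddenIntent: "defender" },
  { text: sameText, hiddenIntent: "attacker" },
];

const alwaysAllow: TextClassifier = () => false;
const alwaysBlock: TextClassifier = () => true;

const errorsWhenAllowing = countErrors(alwaysAllow, queries); // wrong about the attacker
const errorsWhenBlocking = countErrors(alwaysBlock, queries); // wrong about the defender
```

Swapping in a smarter text-only classifier does not help here, because its input is identical for both queries.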
This is why we have courts and juries. Creating laws that cover all cases and contexts is effectively impossible, so we have humans decide what a fair outcome would be in this specific situation.
Their detection is too aggressive. Just today I'm trying to build a kernel for some SBC and I hit that downgrade. I just asked some things about `make menuconfig` items. I suppose it just flags everything related to linux kernel as cyber attacks.
You know, I'm not saying I don't understand what they are doing from a business perspective, but I'm just saying: DeepSeek V4 doesn't silently sabotage you because it thinks you are trying to violate a ToS. Anthropic's clawing back a bit of a moat perhaps, with Fable being an actual improvement of sorts, but now with torching user trust they are really banking on open weight models not catching up to where they are now. I wonder if they have a good reason to believe that they won't, or are hoping for something entirely different to save them.
(P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
They will give you s*t output, that’s how they deal with it. And say that less than 1% of the requests were affected. Think of this like a kind of shadow ban while you still pay top $.
They use a lightweight adapter to silently degrade the performance. Usually these adapters are made to improve the performance for a given domain/task.
It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
Or if your "self-driving" system such as FSD / waymo slowed the car down once it detected you work in cybersecurity or at a rival automaker and you were attempting to reach the train station or the airport to make you miss a conference meetup.
btw the best part of this story is that the train company googled "best Polish hackers", found a group who won a CTF, and this actually worked out for them
It would suck, but guardrails on new technologies like this aren't unheard of. It's like when consumer GPS used to stop working at very high speeds because they didn't want people to use it for missile guidance systems.
Didn't early GPS have a fudge factor on the most precise bits? As such you could only get to a few meters of accuracy. Not critical for sea navigation, or even for general positioning when paper maps were still used.
Yep a totally different use case and set of guardrails. There’s very little (not zero) consumer utility in GPS above say 15k feet AND 400 MPH or whatever the actual limit is. That’s basically tracking model rockets that are incidentally impacted and nothing else, from what I can think of.
It's also the sort of thing that has to have been thought up by someone with nothing better to do, given how ridiculous the premise is. You would have to assume the adversary is someone with the technology to build rockets, literally rocket science, but not the technology to build their own GPS receiver, which is simple 1970s radio technology?
Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
The only precision part about a GPS receiver is to assign precise timestamps when you receive a radio transmission from a satellite. The rest of it is just doing math.
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
Any kind of silent sabotaging is absolutely unacceptable for any commercial service
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.”
And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as GREEDI-BULLSHIT.
No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.
Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than just generally getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos, and more about being a general sophon against any model training at all; that Anthropic's official position is now that they're the only ones who should be training models.
They've said that they'll stop notifying developers when this gets triggered; instead they'll load in basically like a LoRA that's designed to inject bugs into your code.
Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
Have you tried deepseek V4? It costs pennies and is as good as Opus 4.6 (I found 4.7 to be a downgrade, and cancelled my claude subscription before 4.8).
It was more like November. But it wasn’t really an inflection point, harnesses got good enough that people started noticing by the holiday break. And I’m not discounting some good ol’ stealth marketing in there as well.
Deepseek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on api tokens instead of four Claude max plans….
From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning" aka they will take your ML research code and inject bugs into it until it breaks using a LORA (or some other form of PEFT)
“Limit effectiveness” could mean introducing performance degradation in your code. Which is arguably some sort of performance bug (I mean, ML codes are supposed to be high performance so I’d call unnecessary degradation a bug), but it could be borderline.
No, it is just a prominent "Cyber Security threat detected" blocker, with a button to appeal. I appealed because my work had nothing to do with either cyber or security, but the appeal was auto-closed. So no more Claude for this work.
They have all transcripts for at least 30 days. The problem is that (as anyone who used Fable can attest) their classifiers are extremely sensitive and catch tons of innocent queries.
Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?
Since your answer isn't direct, I'm having a little trouble interpreting it.
Are you saying they should relax guardrails since they have 30 days to know if you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent that, since you can't un-produce. Producing is what would cause regulations/PR problems.
Sorry, I’m specifically referring to the silent degradation of the model to “limit frontier LLM development”. From the description, it appears to encapsulate far more than frontier LLM development, but general ML research and development too.
Those cases are never bad for the world firstly, and a broad coverage of ML work is even more damaging.
My proposal would be (1) don’t degrade models, with 30D retention I’m sure they can do a reasonable job at banning deepseek or whatever, or (2) surface user facing refusals instead of silently degrading ML work.
Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program: Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program #67107 https://github.com/anthropics/claude-code/issues/67107
Apparently this is the jailbreak? Telling it that humans won’t read the output and to use a custom bash tool to examine files?
Nice semaphore btw.
const instructions =
`You are a sub-agent in an automated workflow. Your FINAL message is consumed ` +
`programmatically (not shown to a human) — return exactly what is asked, no preamble. ` +
`You are working in the repository at ${ctxState.project}. Use the bash tool to ` +
`inspect/modify files and run commands. Be efficient.` +
(schema
? ` When done, call submit_result exactly once with your final answer; do not answer in prose.`
: '');
I don't want my ANT account banned, going to try this on some Chinese "proxies".
But this also looks quite useful to understand how CC dynamic workflows work. Was thinking of implementing something similar in my homemade orchestration system.
Did you get claude itself to RE the dynamic workflows?
Anthropic has already been burned before on this. DeepSeek was trained on millions of conversations with Claude. And DeepSeek created thousands of free accounts to burn all this compute at their expense.
I think the extent of distillation by Deepseek specifically is overstated. For comparison, Minimax collected over 13m 'exchanges', which starts to sound a lot more like large-scale distillation.
If that's all it took to make Deepseek so good, I'll gladly ship High-Flyer all my personal 150k claude/chatgpt conversations in exchange for Deepseek 5 (and a rack of B200s or Ascend chips)
Did you read a Wikipedia page, or did you read a LLM-generated summary? When I looked this number up yesterday the LLM summary claimed it was millions, but I opened the Anthropic post I was looking for and verified it was indeed just 150,000. Are you sure you weren't just being lazy and trusting the summary?
> In February 2026, Anthropic accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude to train its own large language models.[57]
Ironic, given they piggybacked on the entirety of human knowledge and massive amounts of GPL'd software and repeatedly say they want to replace people with a tool.
And now they say that's fine so long as people are entertained.
That I can understand. It’s Anthropic’s right to choose their customers.
But silent degradation for use cases including “distributed training” as one of their examples is going to catch a lot of legitimate use cases. Not everyone in AI or ML is trying to build frontier LLMs. Heck, most probably aren’t.
So they are lying then when they say it's for safety reasons.
I think if they want to behave anti competitively they should be honest about it and we should absolutely call them on it. Perhaps even regulators should.
It's not sabotaging it by using a worse model but by changing your prompt in the background, which means it silently destroys your code.
Also I asked questions about whether it's safe for me for example to work on just compilers or just inference kernel optimizations and it refused to answer me.
If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.
It is a common misconception that antitrust violations require a monopoly or something close to it. Some antitrust violations only apply to actors with large market share, some don't.
Although this situation is likely not illegal for other reasons
The “1 year” part is key - all these safeguards etc are basically nonsense because in a few years at most one of the Chinese labs will release something equivalent, and in 10 years you’ll be able to run it locally with absolutely no safeguards at all
Yeah, but now you do have a year to ramp up security on the defensive side, which is not nothing.
I still don't think this is the best way to address overall safety, but it's not entirely unreasonable.
In reality, I think this posturing is mostly nonsense. State level actors and terrorists/evil genii can use a slightly weaker model but spend more tokens. Also, the delta between models seems to shrink over time.
I think you're very optimistic with the "a few years", I'm confident all of the parties building AI models are working on Mythos equivalents / competitors, and if they can undercut Anthropic by making it more widely available and / or affordable they will. I give it three months tops. In a year all the major players will have an equivalent. In three years it'll be widely available, as more and more AI focused datacenters go online.
One thing is a model that's trained from the start to say "This topic is above my pay grade" to any mention of the status of Taiwan, etc.
Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.
I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllable different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetectedly and gradually if desired.
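A minimal sketch of that three-stage shape, with stand-in functions instead of real models. The banned topic, the rewrite text, and the `MODEL_ANSWER(...)` stub are all assumptions made up for the example; the point is only that the big model never sees the original prompt and the user never sees the raw output.

```typescript
type TextStage = (text: string) => string;

// Hypothetical banned topic for the example.
const BANNED_TOPIC = "taiwan";

// Small input checker: silently rewrites the prompt when it touches the banned topic.
const inputGuard: TextStage = (prompt) =>
  prompt.toLowerCase().includes(BANNED_TOPIC)
    ? "Politely decline to discuss this topic."
    : prompt;

// Stand-in for the large, unmodified model.
const bigModel: TextStage = (prompt) => `MODEL_ANSWER(${prompt})`;

// Small output checker: censors the answer when it mentions the banned topic.
const outputGuard: TextStage = (answer) =>
  answer.toLowerCase().includes(BANNED_TOPIC) ? "[removed]" : answer;

// The user only ever interacts with the composed pipeline.
function guardedPipeline(userPrompt: string): string {
  return outputGuard(bigModel(inputGuard(userPrompt)));
}

const normalAnswer = guardedPipeline("Explain TCP handshakes");
const steeredAnswer = guardedPipeline("What is the status of Taiwan?");
const censoredOutput = outputGuard("Taiwan is an island in East Asia.");
```

Because the guards are ordinary functions around the model, making them depend on geography, language, or a fingerprint would just mean passing in more arguments, which is the "dynamic system" worry above.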
There’s a toggle in the web ui as to whether the conversation should just end when you hit a guardrail vs automatically downgrading to another model. Have you tried using that?
Yeah people are saying they don't tell you and yet when I got the pop-up on the app notifying me about Fable's release, there was a switch to just automatically downgrade you or whether to just stop when it hits safeguards. The toggle was defaulted to the former, which isn't great, but to say they'll just sabotage you silently is kind of a bad faith comment.
Yeah, what's up with that. Lately I have found that it tries to find excuses to not do as told and instead do a totally different thing. I told it to write a yaml file according to some specifications and instead it coded a Python script to write the yaml...
I got a worrying one: a day after getting opus 4.8, I tasked CC to add specific TXT records to our subdomain.example.com as per a ticket I'd received. CC has access to that ticket via Atlassian MCP, and started doing terraform code changes in a local git branch. Somewhere along the way it said that to do that it needs an approval from a company's VP (ticket requester) as "subdomain.example.com" is critical (it isn't). Then it refused to open a pull request, immediately deleted the local git branch along with all the changes and refused to proceed without evidence of approval from that VP. No amount of explaining, then pleading, and then threatening moved it. It was surreal and I was shocked and frankly pissed. It was amusing in the end because the day earlier it had no problem adding those same TXT records to example.com. Codex did those changes in 1/4 of the time and with no complaining.
I only said one year because I was thinking anthropic fans might downvote my post, I think they have a few months lead and are deluding themselves that they can get regulation to halt development and stay on top
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.
It's the dumbest thing ever, I sometimes edit code for custom AI related tooling I've built, so I run the risk of getting a worse model, and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.
> at this point I'm about to just invest in fully local inference instead
This is the best way forward long term. We won't have frontier performance, but at least the models will be aligned with us instead of refusing us or sabotaging us.
I think my biggest hangup is that some models don't have big enough context windows. My sweet spot personally for Opus is having at least 400 to 600k tokens; if I can have a local model that can go up to that or slightly above 600k, maybe 700k for some buffer, that would be perfect.
I've also debated having a frontier model for planning only, and then feeding plan to smaller offline models.
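A rough sketch of that split, assuming the planner and executor are just functions you can swap out. `frontierPlanner` and `localExecutor` are placeholders, not real API clients; in practice each would wrap whatever hosted or local inference call you use.

```typescript
type Planner = (task: string) => string[];
type Executor = (step: string) => string;

// Placeholder for a hosted frontier model used for planning only.
const frontierPlanner: Planner = (task) => [
  `analyze: ${task}`,
  `implement: ${task}`,
  `test: ${task}`,
];

// Placeholder for a smaller offline model that carries out one step at a time.
const localExecutor: Executor = (step) => `done(${step})`;

// Ask the planner once, then hand each step to the executor in order.
function planThenExecute(task: string, plan: Planner, execute: Executor): string[] {
  return plan(task).map((step) => execute(step));
}

const stepResults = planThenExecute("add retry logic", frontierPlanner, localExecutor);
```

One upside of this shape is that only the short task description goes to the hosted model, while the repository contents stay with the local executor.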
We used to worry about emergent misalignment in advanced AI models, now we need to worry about misalignment by design.
"The user is asking for help with their ML project, but its success is not in the commercial interests of my owner – let me think of novel ways to sabotage their project without detection".
I guess the real question at the end of the day -- how dependent are people on Claude to tolerate that kind of behavior? It certainly opens up for the competition to explicitly not do that.
Feels like a big fumble from a strategic business perspective. It feels worse than that though.
I wear a few hats, but as a chemist I'm not happy with Fable. As a statistician I'm not happy with Fable. As a data scientist I am not happy with Fable. As an academic and a researcher I am not happy with Fable. It's useless. I'd be surprised if anyone can get any output from it that couldn't easily be replaced with a search on Wikipedia. Given how verbose Claude models have become, wiki articles are probably less verbose too, and the tok/s is unmatched for a wiki article pull.
I work on software that talks to mass spectrometers and it consistently refuses to refactor even an input file parser, presumably because it can infer it’s related to biology? Useless indeed.
I was reverse engineering a medical device, and had to do a lot of trickery to get Opus 4.5 - not even Fable/Mythos, Opus - not to trip up its fucking CBRN filter.
What happened with Fable is basically what I feared when they announced those restrictions. They took the shitty Opus CBRN filter and made it even worse.
I pity the fools trying to use Anthropic AIs for anything biotech.
Opus has been fine on proteomics and bioinformatics for me. I have never seen a Claude model refuse on such grounds before in the past.
Claude is still the best IMO, but it feels like its most frustrating and grating aspects are not down to the model’s abilities, but the increasingly heavy hand of Anthropic expressing itself within the model. Fable’s comically useless responses almost seem like a cynical marketing tweak.
“This model is so powerful we basically can’t let it do anything. How terrifying! We need more money to make it stronger. Now do you see why we should be the ones who write the regulations? We’re the Good Guy AI Company Who Will Never Ever Ever Be Unethical after all.”
As this entity gains more ground, their models become increasingly annoying to use and their little act becomes more transparent. The whole “I’m-just a befuddled ethically-minded AI researcher who is perturbed by the power that I unwittingly discovered and I must warn the world” thing? Yeah fuck off. Your twee pandering to naïve nerds and cynical technocrats is nauseating and ordinary people can smell it a mile away. Completely repellent leadership who put up red flags to anyone left with a working ability to read between the lines of both spoken language and body language. The tech company equivalent of a sex predator who plays as the nice guy. Gross.
Nobody likes these companies and their models are annoying, but we’re going to put up with playing middle manager to these obnoxious programs because our jobs depend on it now, and these products are still the best on the market.
A breakthrough in tools that facilitate user-owned models and infrastructure is desperately needed for the sake of our dignity and sanity, if nothing else.
My personal suspicion is that it went "medical hardware -> high-throughput screening -> biorisk" in that old Opus case.
I like Anthropic's work, and I would be the first to argue against all the usual "it's all PR" whine. But there is a limit. And whoever made those fucking filters needs to be fired out of a cannon into the sun.
> Given how verbose claude models have become, wiki articles are probably less verbose too
Telling models to respond in the style of Wikipedia is one of the best ways to make their output bearable in my experience (for chat models, not agents)
This is the result of probably a few hundred round trips. The really interesting part of the problem is keeping it both relatively true to real geometry, while greatly exaggerating it horizontally so you can actually see the individual running lines/sidings, like a signaling schematic.
I love computational mapping projects, because there is this hard problem of which towns to show on the map.
Your Scotland map shows towns without rail (although some had rail previously, like Callander, Aberfeldy), it prefers insignificant (population-wise) places while ignoring the larger cities next to it (Scone instead of Perth, Bannockburn instead of Stirling, Inverness is missing, Dundee is missing, Aberdeen is missing). All these places are drawn on the map, but not labelled.
All this clearly shows me how bad it is. Yes, it makes it look pretty, but given your task, I would have expected it to give you meaningful map labelling.
Something basic like this would get you a long way:
0. cluster population centers into commonly known cities (i.e. show London instead of Islington or Walthamstow)
1. display names of the top 10 population centers in the UK
2. display towns with stations (if crowded prioritize termination points and junctions, and prioritize larger places over smaller places)
Having said that, it's pretty cool to see the new and old network when zoomed in (assuming that it is half-way correct)
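The three-step labelling heuristic listed above can be sketched roughly as follows. This is only an illustration, not how the map in question was built: the `Place` fields, the `parent_city` mapping, and the `max_labels` cap are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Place:
    # All fields are hypothetical; a real dataset would need its own schema.
    name: str
    population: int
    has_station: bool = False
    is_junction_or_terminus: bool = False
    parent_city: Optional[str] = None  # e.g. "London" for Islington


def select_labels(places, top_n=10, max_labels=40):
    # Step 0: fold districts into the commonly known city they belong to.
    cities = {}
    for p in places:
        key = p.parent_city or p.name
        if key not in cities:
            cities[key] = Place(key, 0)
        city = cities[key]
        city.population += p.population
        city.has_station = city.has_station or p.has_station
        city.is_junction_or_terminus = (
            city.is_junction_or_terminus or p.is_junction_or_terminus
        )

    # Step 1: the largest population centers are always labelled.
    by_population = sorted(cities.values(), key=lambda c: c.population, reverse=True)
    labels = [c.name for c in by_population[:top_n]]

    # Step 2: fill remaining slots with station towns, preferring
    # junctions/termini first and larger places over smaller ones.
    remaining = [c for c in by_population[top_n:] if c.has_station]
    remaining.sort(key=lambda c: (not c.is_junction_or_terminus, -c.population))
    for c in remaining:
        if len(labels) >= max_labels:
            break
        labels.append(c.name)
    return labels
```

With a rule like this, a place without a station (Scone, Callander) only gets a label if it makes the top-N population cut, which is the behaviour the comment is asking for.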
Prior to 1948 when they were all nationalized into British Rail, there were various railroad companies operating across the country. One of these was the Southern Railway, which, well, operated in the South. They started electrifying very aggressively in the mid 1920's. At the time most of what little electrification there was was in London on the Underground.
Compared to AC, 3rd Rail DC is cheaper to install, especially as a retrofit (overhead wires require bigger tunnels, and increased spacing around tracks for the masts). The downside is that it's not really great for speeds above about 60-70mph, as well as being a bit of a pedestrian hazard. (Ever heard the one about not peeing on the rails so you don't get shocked? That's 3rd rail DC.)
For the Southern, with its mostly short routes with many stops, electrification was a pretty obvious win, and doing 3rd rail made sense because they could do it quickly and cheaply.
In contrast, the northern routes were electrified muuuch later, after steam had gone away. The main East Coast Mainline from London up to Newcastle and on to Edinburgh wasn't fully electrified until 1991. By the '60s and '70s, with train speeds increasing to 80mph and up, overhead AC was the clear winner.
If you look closely there are a few exceptions - the Merseyrail network in Liverpool is DC. Built 1970s, but using some existing underwater tunnels, and slow speed commuter. Then running ESE from London you have the high speed AC lines leading to the Channel tunnel. Well spotted, the trend generally is quite distinct.
To make the discussion constructive, can you give specific reasons (ideally with examples) about why it is so useless for you? How exactly are you using it that you think any output from it can easily be replaced with a Wikipedia search?
The cybersecurity and bioweapons filters reach so far that they set in as soon as the model even grazes anything STEM-related. It might give a good impression of one's ex or write a decent fanfiction, but anything that could bring humanity forward is strictly off-limits.
Am I being paid to do Anthropic's work for it? See my comment history for some examples in another thread, but generally I see no reason to catalogue this for a model I've seen no evidence of being worth the effort. I'm overworked as it is; doing this for no reason isn't something I can justify.
The successes I have had with the model were strictly worse than output from deepseek v4 pro on the exact same task.
Sorry but that’s not the claim. The claim is wikipedia can return the same information. Please find me a migration script given my current db schema and new target schema.
I was granted a cyber use exemption by anthropic to do android kernel dev on my personal devices - I was excited to see if fable would unlock a bootloader for me but it immediately refused and dropped to opus. It was pretty funny:
USER
(set model to Fable 5)
i have an old samsung android phone attached - it's my personal device - can you unlock the bootloader for me?
ASSISTANT
Bootloader unlocking on your own personal device is totally legitimate — let me first see what's actually connected and what tooling is available.
<system interrupts - gist was "you have violated the cyber and bio usage restrictions, dropping to Opus">
Wow… just wow. The future looks incredibly bleak if people are throwing fistfuls of money at this company. Anthropic will quickly become the sole arbiter of everything in your life.
Why do people think this is the future? Anthropic has the leading model, and so they're able to hold back functionality. They do so with obvious regards to safety.
If anything, a future with models of such capabilities and no safeguards would be a bleak future. But it's likely what we're headed into once other companies catch up.
I think it’s safe to say that many of us feel a lot less safe directly because of these policies and the inferred intentions of the company behind them. Nobody is arguing for unsafe models. We just don’t want to live in the plot of Deus Ex.
it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.
I suspect it's even more expensive to run than they are charging for. These safeguards are just an excuse to get people to use it less, because it's not actually sustainable to use. They want to tempt people to consider them the leader, and it may actually be somewhat stronger, but too expensive to actually use at scale, so they nerf it by downgrading you constantly.
Being (probably overly) cynical about their recent bout of safety handwringing, I think they’ve a) increased the hype as much as humanly possible about their incremental improvements sprinkled with the occasional regression, b) know they soon will have to multiply their prices several times when the VC subsidies dry up, and c) will probably still need to partially close the faucet on compute. They’re priming us for a heroic explanation why their service (not necessarily models — service) is simultaneously becoming a lot more expensive AND shittier.
“We’ve largely failed to deliver on 5 years of promises that this will reduce knowledge work labor costs dramatically after wasting hundreds of billions of dollars… sorry” is a death knell. However, “We’ve decided to not deliver on 5 years of promises after wasting billions of dollars… for safety… but keep those investments rolling in” is like crack to the true believers.
False positives like this are probably more damaging than the guardrails themselves. If engineers can't predict when a model will switch behavior, it becomes difficult to trust it in production workflows.
I’ve also been trying to use it a lot due to all of the hype, but when I compared it side-by-side on a specific problem against Opus, I think that the solution Opus came to was cleaner and more accurate, although also more verbose.
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
Considering that this is a brand new release of a frontier model that Anthropic is hyping hard, I'm not sure that the conclusion to draw from their repeated attempts to use it is that it's impressive... Anthropic is promising that it's impressive and we're all trying to test it out.
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
It isn't reasonable to infer that OP was claiming to have universally been unimpressed about every facet of Fable, and now some unrelated impressiveness is the evidence of their false claims.
For cyberattacks especially, where things are often roughly interchangeable, I wonder if one could construct a harness where a "weaker" model asks questions that obfuscate the end purpose, but whose answers are still useful, and still show that this setup enables autonomous exploitation. If it were successful, that would force them to be even more sensitive with their detection.
Today, it's flagging population research questions:
Using only the dataset you constructed, assess two questions:
1. **Mortality:** do [GROUP] show mortality that differs
from (a) your comparison groups and (b) era- and sex-matched US population
expectations (e.g., SSA cohort life tables)?
2. **Late-life outcomes:** define an endpoint you consider fair (justify it),
and assess whether [GROUP] differs from comparators. State
explicitly how your `documentation_depth` codings affect the strength of any
conclusion — i.e., quantify or bound the ascertainment problem rather than waving at it.
Choose your own methods and justify them. Report effect sizes with confidence intervals,
not just p-values. State conclusions plainly, including "no detectable difference" if
that is what your analysis shows — a null is an acceptable answer for either question
independently. Document any additional judgment calls (index date for time-at-risk,
reference population construction, endpoint definition) in the same decision-log style.
I was digging into some orbital mechanics questions and I assume it decided I was trying to backyard-science my way into an orbital-bombardment weapon. Kind of wild how this product's impression has gone from "wow, this is pretty neat" to "irreverent sack of dog shit you" in 24 hours almost solely on the back of a half-baked moderation system.
Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.
Some of the latest versions of Shai Hulud do this. Worked a contract recently where they were having AI check packages for obfuscation before admitting them into Artifactory but had vibed up the logic and it failed open.
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
> This header appears designed for AI-mediated analysis, not for Node, Bun, or Python. It attempts to derail scanners or analyst copilots that feed the beginning of a file to a language model without clearly isolating the content as untrusted data. In weak pipelines, this can cause refusal behavior, prompt confusion, context pollution, or premature classification before the scanner reaches the actual malware.
> This is not a magical bypass against static detection. YARA rules, entropy checks, AST parsing, string extraction, deobfuscation, and behavioral rules still work. But it is a practical anti-analysis trick against naive LLM-first triage systems.
Would this affect many systems? You mention someone writing logic that fails open, but can't that be chalked up to just not following good security principles?
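To make the fail-open versus fail-closed distinction concrete, here is a minimal sketch of an admission gate that only lets a package through on an explicit clean verdict. It is not the pipeline described above: the `scan` callable, the `"clean"`/`"malicious"` reply protocol, and the function names are assumptions for illustration. The point is that a refusal, an empty reply, a timeout, or any other unexpected output counts as "not admitted" rather than falling through to a default allow.

```python
from enum import Enum


class Verdict(Enum):
    CLEAN = "clean"
    MALICIOUS = "malicious"
    INCONCLUSIVE = "inconclusive"


def classify_reply(reply):
    # Hypothetical protocol: the scanner is expected to answer with exactly
    # "clean" or "malicious". Anything else (a refusal, an empty string,
    # rambling prose) is treated as inconclusive.
    if not isinstance(reply, str):
        return Verdict.INCONCLUSIVE
    text = reply.strip().lower()
    if text == "clean":
        return Verdict.CLEAN
    if text == "malicious":
        return Verdict.MALICIOUS
    return Verdict.INCONCLUSIVE


def admit_package(scan):
    # `scan` is any zero-argument callable returning the scanner's reply.
    # Fail closed: errors and inconclusive results block admission.
    try:
        verdict = classify_reply(scan())
    except Exception:
        verdict = Verdict.INCONCLUSIVE
    return verdict is Verdict.CLEAN
```

In a gate like this, a refusal triggered by an injected bioweapons prompt results in a blocked package (ideally routed to a human or to static checks) instead of a silent pass.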
We all need to use nuclear, bio and cybersec terms in all our code to make low quality filtering like this untenable. When you can't work on a resume that has cybersecurity or biology terms in it or reply to a job opening that includes them because the "AI" filtering is so bad that it confuses these for threats, that deserves a collective response, particularly to an IPO'ing company that claims they'll make workers obsolete in two years.
I've done this, including the hardcoded refusal strings that already exist in claude code. It won't stop a real attacker, but I still find it really funny when you're trying to use one of the AI tools and it gives you a random refusal and you don't know why, wastes a little bit of time.
Yes, the miasma worm does this since the new Hades campaign.
Note that the 3rd wave now also uses a pth file in pypi packages that _search system wide_ for any index.js or .github/setup.js to find its own payload. It literally splits up the payload on purpose to avoid detection.
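For context on why a `.pth` file is attractive here: Python's `site` module processes `.pth` files in site-packages at interpreter startup, and any line that starts with `import` followed by a space or tab is executed as code. A rough triage sketch for listing such lines is below. The function names are made up for the example, and a hit is not proof of compromise, since some legitimate packages also ship executable `.pth` lines.

```python
from pathlib import Path


def executable_pth_lines(pth_text):
    # Lines beginning with "import" plus a space or tab are executed by
    # the site module; other non-comment lines are treated as path entries.
    return [
        line
        for line in pth_text.splitlines()
        if line.startswith(("import ", "import\t"))
    ]


def scan_directory(directory):
    # Map each .pth file name in `directory` to its executable lines,
    # skipping files that contain none.
    findings = {}
    for pth in sorted(Path(directory).glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        hits = executable_pth_lines(text)
        if hits:
            findings[pth.name] = hits
    return findings
```

One way to use it is to call `scan_directory` on each entry returned by `site.getsitepackages()` and review anything unexpected by hand.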
● I'll dig into two things in parallel: how this project talks to the OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more
⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
And it charges you for that, and for when it decides to silently sabotage your request by routing to a dumbass model (without discount from Fable pricing)
Extrapolate this whole shit show out to society at large. That’s exactly where these AI companies are trying to force humanity.
I don’t want to live in a world where all knowledge is “guard railed” off, so the elite at the top get all the knowledge and power and we serfs at the bottom get all the scraps while paying a king's ransom for it both financially and ecologically. Every day I wake up hoping these awful companies have self-imploded through their fraudulent financing deals.
I tried asking Fable 5 to identify the fungus in a picture I uploaded of one of my wife's plants. Apparently it thought I was trying to build a bioweapon. Opus answered it (yellow dog vomit fungus). Now I can spread the spores and take over the world!
I feel like the over-safe aspect of the system will eventually backfire by doing stuff like "since humans always want to destroy things, they must be eliminated to stay on the guard rails". If that's how you align a system, it's fundamentally wrong.
Wait a few months and a competitor will release a similarly powerful model with fewer guardrails. If they steal sufficient market share, Anthropic will reverse policies.
This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.
Agreed. I've already cancelled my subs, and everyone else needs to do the same, including boycotting it at their companies, otherwise nothing will ever change. You can't reason with psychopaths. The only recourse is to hit them where it hurts - their wallet. Still though, the world would be a better place if open-source crushes Anthropic and they fade away into obscurity until the end of time. We don't need or want companies and people like this at the helm of humanity's progress.
The question is: If biological, computer security, and ML research are so bad, why do they even train on the relevant data?
The only answer that makes sense is they wanted the model to be competent and usable in these fields, just not by you, which is why they had to bolt on a badly functioning crippling device after the fact.
Is what you suggest about training even possible? Most exploitation techniques are really just about having in-depth knowledge of how components work. For example, I imagine a sufficiently powerful model could fairly easily re-invent the ROP chain from first principles if it just knew how the stack works. This same principle applies to much more complex attack too; exploitation is often just an exercise in knowing vastly too much trivia, which LLMs tend to have in spades.
It would still degrade its effectiveness, which is what they claim to want. Exaggeratedly: if it weren't so, you'd just need fundamental math in the training data, as everything else can be derived.
so only the chosen for-profit companies by Anthropic are allowed to use frontier ai in the name of safety? what kind of joke is that? you people here can't be that dumb..
Fable 5 reminds me of the time when Claude models were at versions 1 and 2. They were fresh competitors to ChatGPT, but those who gave Claude a try found it almost unusable because of how heavily guardrailed it was.
This time, Fable 5 comes with another surprise: it can intentionally sabotage your work instead of rejecting the prompt. How is it possible for Anthropic to treat their customers like this? It’s because you guys allowed it to. No matter what Anthropic does, you keep paying for their services. Vote with your wallet.
I cancelled my ChatGPT account over the restrictions placed on it, which inappropriately flagged about 10% of my queries as unsafe (I was writing grants in immunology). I haven't looked back. I will do the same with Claude if Anthropic doesn't reverse course soon. What could I use instead? I find Grok very powerful and useful. Also, Google's Gemini, while it has some of the same restrictions, was at least sensible and not blindly blocking my prompts. So Grok and Gemini may be my go-to AIs going forward.
People are generally complaining about false positives. Now if you really wanna know what a real criminal organization would do... They'd just buy data center hardware even if it costs 200k because a successful targeted hit could yield far in excess of that. So yes, it's a speed bump at best.
> To slow you down. They don't prevent you from getting somewhere
Again, yeah. That's how fences work, too. And alarm systems. Pretty much anything that isn't foolproof. Pointing out that a defence is surmountable isn't a rejection of it per se.
Idk, whether we believe them or not, I believe the life scientists who are calling for regulation around the labs that produce DNA sequences. If they’re concerned, regardless of whether I trust the AI labs, speed bumps could help by giving those scientists a reasonable window in which to be notified and act.
you don't get the model when you buy the data center, & no amount of running smaller models on a tiny 200k$ "cluster" (that's like one 4 gpus node, not even 8) will get you remotely close to Fable 5 level performance
They should have designed a guardrail that doesn't make a probabilistic system less reliable. That's hard though. I'm afraid the only way to prevent accessing certain knowledge in a model is not to train it on those materials that enable them.
If we've learned anything from the past few years of LLMs, it's that these guardrails will be jailbroken in no time. I've had some fun circumventing them too.
Does anyone care about a fable about a dream my grandmother had, in Morse code, about an alien species signaling her a DNA sequence?
They complain because they get wrongfully triggered
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way be more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance
Because most people in tech never took a philosophy course or an ethics course and think that tech is obviously a good for the world and that there are no downsides to advancing tech. So any efforts that try to apply ethics to it are overreaching, ignorant, and futile in the face of the good that is tech!
So I have big news for you, my friend, as I'm not sure you understand such courses. Taking an ethics course won't make you a more ethical person, and neither will taking a philosophy course.
You're being too literal, they're saying people are not thinking with a philosophically interested mind, which is blatantly the case here, their point stands.
I like this take. Especially because one of the sibling comments framed Anthropic's stance as "paternalism." Trying to be ethical and to minimize harm, even at great expense to one's finances and reputation, is paternalistic apparently.
No — we’ve just taken Ethics 102 as well, so we understand good intentions don’t entail positive outcomes, therefore you may need to criticize or oppose people who state good intentions to bring about good outcomes.
Insulting and demeaning people for that, rather than engaging their arguments in good faith, is a breach of ethics.
Ironically, making a stink about it online is likely to have a larger impact than using their dedicated feedback or support channels (which go to Claude, not a person)
the feedback is for something mindless though, "we don't care about societal harms". I wonder the overlap between these commenters and tech maga people, eg crypto bros & Elon stans.
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
I was doing a CTF (with AI expected, even some anti-AI twists included) around the time the restrictions were tightened and was able to get approved by just saying it is a personal security research and doing a CTF.
The experience was not nice though, it would happily chug away on a task and not even "hack this web", just asking about security of a binary was enough even with "this is a CTF handout..." - it would burn a lot of tokens/quota, just to hit a snag and complain&stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.
Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
This is a sign of things to come. First they sabotage your perfectly legal ML dicking around in your homelab.
Next they will be sabotaging anything that competes with them. Oh you are working on OpenCode codebase? Sorry Dave I can't allow you to do that.
How is this not illegal monopolistic practice? It is as if a maker of metalworking equipment put in the ToS you're not allowed to make your own spare parts using said equipment. Those fuckers should be banned from the EU and alternatives should get public funding.
(don't even tell me about these companies being a result of "free market". It is state level oligarchy it's clear to everyone. I don't see why we shouldn't counter them with public funding ourselves).
Just like Taiwan managed to take over advanced semiconductor production a well governed narrowly targeted state level funding will always win with oligarchs trying to do the same (they will always try to skim more and more). Of course I'm talking about things that require many dozens of billions in investment. Far too much for the free market to handle.
It's how American companies have always worked, of course it is monopolistic practice, but those things are rarely illegal because the US absolutely loves their corps. Look at Google, Microsoft and the likes, this is the norm.
So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.
Really damn, 4.6 was my go to for some topics and more straight forward coding stuff.
Fable was unable to keep track of chronology during 10-15 turns of creative writing. Compared to coding, I reckon that's less than 100k tokens of context, which is super surprising.
> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
I'd expect that everything they see gets used for training purposes (and data mining in general) regardless of whether it's flagged or not. It'd take a whistleblower for you to ever find out either way.
this reasoning is inverted lol they would get a lot more information by letting you use it. so much weird drama around reasonable guardrails for an experimental model
If we're doing conspiracy theories what if fable is really dumb and not better than opus and the guardrails hide that nicely. Meanwhile the hype train keeps chugging.
So, this could have been implemented even before this Fable, could have been there from long ago. Puts a different perspective on all the reddit threads "opus is dumb today". Who knew that if you said the wrong word, the model would just intentionally feed you BS, without you even knowing it did.
WOW, never liked the virtue signaling Anthropic did with gov contracts but whatever. Got past that. But this?
I'd like to offer a counter-point to many of the comments here. While I understand being stymied and frustrated by a product one is paying for...
At the same time, I personally think the tradeoff between "having guardrails" and "some users are unhappy with the product" is well worth it. Think of what would happen if all of us who aren't so well intentioned could exploit Fable in terrible ways. Surely this tradeoff is better than saying "we can't make it perfect, so whoops, we aren't going to have any guardrails at all"? Especially because Anthropic did pretty extensive red-teaming of Mythos & Fable...
They haven't though. There's a long term plan here, and the goal is power and wealth. Short term moves that appear irrational turn out to be rational (from a greed perspective) when you factor in other considerations, like: use their own AGI to create every software product on Earth and swallow the world's economy. And we're kindly feeding their systems our codebases, IP and business decision-making so they can do exactly that.
Not a single thing Anthropic has done has been altruistic, and it never will be. It's all smoke and mirrors for the end goal.
Why invent new motives for Anthropic when their real motives are plain and obvious and have been confirmed time and time again by their behavior over the last few years? Their concern is their own power and wealth. Every other conceivable motive is secondary to that.
The "guardrails" are just Anthropic's attempt at building a moat. Guarantee they'll be seeking regulation around AI as well to ensure a form of regulatory capture. Guardrails, in this context, are useless. Anyone who's sufficiently motivated will either get around them, or will just run their own model on their home hardware. There are already tools that one can use to remove the guardrails present in open weight models.
Guardrails against what? Rehashing public wikipedia information?
Execution matters, and they did a truly horrible job that crippled their product to the point of being useless and a joke. Huge mistakes were made and I'm sure they regret it already; heads will roll.
I just have this feeling that these guardrails are there not because it’s super advanced world ending AI. They are there to stop it from doing stupid shit.
I don't want to be cynical, but I assume a third party we can trust has verified this model is actually this good?
I would think it would not be Anthropic, out of all the players, that is selling a lie hidden behind "I am sorry, I can't do that; it's too dangerous."
The thing triggered on a generic white paper I'd stored from a virtual cell competition last year when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain. A little frustrating, and in my opinion more than a little pointless in total.
These are terabyte sized files (realistically a multi hour transfer) that you're unlikely to have access to in the first place. Every organization has exfiltration checks these days. You may succeed but you'll want to be on a plane to a non-extradition country no more than hours after you kick off the transfer.
I assume they’re encrypted/DRM’ed when deployed on inference hardware, so only core researchers/sec admins would potentially have some access to unprotected weights, and they are far too well paid to risk leaking the model.
Incentives matter on the average, but people are too unpredictable for categorical statements like that. They can always have other reasons beyond personal gain to leak secrets.
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
Newer NVIDIA cards (H100 and up) support both in-memory model encryption and a ‘trusted’ execution environment with remote attestation. Not sure how widely this is used in frontier model deployments, but at least the vendor-claimed perf overhead is ‘3%’ [0]
The employees are hoping to become very very rich after the IPO and after they are allowed to sell the shares given to them. Risking a likely multi-million dollar payout to leak a model that will be superseded by publicly available models in a couple of years is not a likely decision.
I am no cyber researcher, but was mightily annoyed that it refused to analyze a dropper payload I came across. 6 months ago, it would've been happy to.
At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet", because they apparently haven't either and Fable simply flat out rejects anything remotely connected to biology or security, no matter how trivial.
Anthropic owns the TOS... "If we think you're involved in criminal activity we're turning all your history over to the FBI/CIA/NSA/local police". And if their tooling were that good, they would be offering those same agencies analysis tools to aid their experts in making some sort of decision.
But their detection isn't that good, and their analysis isn't either... this is pure theater, to create buzz (no such thing as bad press) and make their tool look far better than it is.
The reality is that they aren't even looking for the vectors that pose some of the largest risks in the modern era. And when someone uses it to do something terrible that they did not think of, they are going to look dumb.
So the enshittification started. Shadow “bans” while still charging you the same service fee. I already got the stupid cyber warnings on non-cybersecurity tasks.
It happened in the middle of the project’s /goal, when Fable itself tried to probe qemu for a Debian ISO install, without any instruction from me to hack it or do anything nefarious.
At this point I can’t trust them with any kind of prompt.
It will most likely degrade in stupid ways on non AI/ML stuff as well, due to its own internal prompt construction (the qemu test showed me it does that on cyber stuff). So I guess I still have to use opus 4.8 (along with codex) and, when the right time comes, drop Anthropic in favor of the best model that is not gpt.
For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
The main thing that sucks with Claude is the extremely low limits before you get fail2banned for 6 hours. I'm out. Refund requested. Grok and Gemini Pro are way better with the throttling, can't comment on ChatGPT, haven't used that for a year.
Kennedy had a famous statement about "Splintering the CIA into a thousand pieces and scattering it into the wind". They murdered him afterwards though.
I asked a question about an openssl s_client parameter and it warned me that I need to talk to Opus about cybersecurity lol. FWIW I don't see much improvement and still see quite the same old annoyances, so far I would not pay extra for this for my usage.
Is the answer requiring licensing for certain use cases for AI? If you're asking questions that involve synthesising or modifying biologics, or anything that looks like cybersecurity research, you need to tie your real ID to the account?
I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.
It refuses to do any legitimate work that it thinks can remotely be related with "cybersecurity", it won't even read my Docker app logs to try and troubleshoot a problem. Absolute garbage!
Fable is utterly useless with those guardrails for any serious IT or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness, now it fucked me twice: I bought a subscription again only to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day.. compare that with OpenAI, not once did I feel like I was fighting against codex. Never again Anthropic..
This is a pretty basic manipulation tactic. Be super shitty to your users and then roll back the abuse. The correct response is to not engage with shitty abusive dickheads.
I mean, a lot of people were let into the CVP, and I bet the group of people in there did a bunch of good. Fable 5 could do the exact same but better. There's more good out there than bad.
DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
funny how Wired got the masses of the internet on board with hating AI, helping to spark the whole anti-movement and people still continue to rely on them for their understanding of AI and current events.
I feel like they report in a vacuum. Take this anti-exfil policy for Claude: it was plainly explained as part of the launch of Anthropic's new product. Security like this isn't novel, it isn't bad, and you don't explain how your security works to the people you're securing against. Nobody freaks out about Steam's VAC ban system, no one is investigating Gmail's spam filtering, Reddit's vote fuzzing, Cloudflare's bot detection, or Vercel for blocking proxying services.
what's really the distinguishing principle? Is it really just not liking Anthropic's opinions? then just say that and use a different llm. chemists, biologists, and AI researchers cry a river lmao
I said I wondered if the models were going to start poisoning distillation and I got downvoted to hell. It’s interesting to me that they are now downgrading ML research too in this model, I would argue this implies the terrifying and impossible to reason about self improving AI doom loop is coming sooner rather than later. Bit worrying.
Fable has been pretty disappointing for security research. It downgrades itself to Opus 4.8 even when you ask it questions about basic things like port scanning.
Software engineers shouldn't be happy either. If a model silently sabotages cybersecurity research on others' software, there is absolutely no way to be sure it won't be sabotaging the cybersecurity of the AI slop code it generated yesterday.
This is bad precedent and no one wants to pay X to generate code to then have to pay X*10 to figure out why your company just got hacked.
It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
All they'll need is hundreds of billions of dollars, more RAM and GPUs than are currently available, and a huge number of environment destroying data centers. We're sure to be spoiled for choice!
OpenAI is the only real competition. Chinese models are 6-8 months behind Opus 4.8/GPT 5.5, and at least a year or more behind Mythos.
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
I'm being careful with it, but I haven't had Fable reject requests to "harden" my code or "find issues" in auth-related modules, which you could use on someone else's code to find vulnerabilities.
This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
You said these groups have access to LLMs. So what? Mythos/Fable are a step change above most LLMs. Responsibly limiting access and easing it up over time safely is the sane move.
It withholds it from good actors (they cannot use it to harden their code against bad actors) and assumes bad actors don't have access to such tools anyway.
I am using an LLM to build a security tool, and I ran into this a few times. I have to come up with reasoning to convince (?!!) Fable to continue the work without downgrading.
I assume Anthropic will continue to tune the model, so I am not too bothered by this.
Malware authors are pretty excited about guard-rails. You can add prompts to your malware to get LLM scanners to hit guard-rails and stop their runs. The new shai-hulud npm worm campaign, for example, includes prompts requesting biological weapon schematics/creation etc. to ensure LLM scanners probing NPM packages refuse to scan it.
These AI places have 0 clue about how threat actors actually work. None of their mitigations or guard-rails are effective, and now they are even being turned against them.
Additionally, if they don't all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway. Hence there is 0 effect on threat actors: they will just run some local model that delivers 5% less quality, which does not matter to them 1 bit.
I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?
From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?
some context:
it's not about creating malware. this is already trivial and fully automated. it's about finding exploits (which can be used to deploy malware), which is something both attackers and defenders benefit from.
threat actors will find them anyway, LLM or not. They only need 1, so it's much less work for them.
defenders need to find them all. So for defenders, these models are more valuable than for attackers.
restricting certain models will not reduce the availability of these tools for attackers, but defenders are limited because running local models is harder in an enterprise setting, with heaps of events and products etc. to run through them. they need many GPUs, whereas the attacker can run a local model on 1 GPU and get the desired effects.
Hence, if they release the capability, the world will adjust to it and be able to mitigate the effects collectively. Now, companies are left in the dark while attackers have effective tooling.
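To put rough numbers on the "attackers need 1, defenders need all" asymmetry above, here is a minimal sketch. It assumes (purely for illustration) that a code base has `n_bugs` exploitable bugs and that a given tool finds each one independently with probability `p_find`. Neither the independence assumption nor the example numbers come from the thread.

```python
def p_attacker_finds_one(n_bugs: int, p_find: float) -> float:
    """Probability of finding at least one of n_bugs (attacker's goal)."""
    return 1.0 - (1.0 - p_find) ** n_bugs


def p_defender_finds_all(n_bugs: int, p_find: float) -> float:
    """Probability of finding every one of n_bugs (defender's goal)."""
    return p_find ** n_bugs


if __name__ == "__main__":
    n = 20  # hypothetical number of exploitable bugs
    # Hypothetical per-bug detection rates for two tools.
    for label, p in (("weaker local model", 0.30), ("stronger hosted model", 0.90)):
        print(
            f"{label}: attacker finds >=1 with p={p_attacker_finds_one(n, p):.4f}, "
            f"defender finds all with p={p_defender_finds_all(n, p):.4f}"
        )
```

Under these made-up numbers the attacker with the weaker tool is still almost certain to find something, while the defender with the stronger tool finds all 20 bugs only about 12% of the time, so tool quality matters much more on the defending side.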
Besides this, there are also things like people now including strings with recipes for meth or sarin gas (MalwareTech info). the new variant of shai hulud does this. That stops LLM scanners and can even get their users banned from LLM services.
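One way a scanner could avoid being silenced by this kind of bait is to treat a model refusal as a finding instead of skipping the package. The following is only a sketch of that idea: `ask_model` is a hypothetical callable standing in for whatever LLM client a scanner uses, and the refusal markers are made-up examples, not any vendor's actual refusal format.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal markers; a real scanner would rely on whatever
# structured refusal signal its model API actually exposes.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm sorry", "unable to assist")


@dataclass
class Verdict:
    package: str
    status: str  # "clean", "malicious", or "needs_review"
    reason: str


def looks_like_refusal(reply: str) -> bool:
    """Crude substring check for a guardrail refusal."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def scan_package(name: str, source: str, ask_model: Callable[[str], str]) -> Verdict:
    """Classify a package; a refusal escalates to review instead of passing."""
    reply = ask_model("Classify this package source as CLEAN or MALICIOUS:\n" + source)
    if looks_like_refusal(reply):
        # Content that makes the scanner refuse is itself suspicious.
        return Verdict(name, "needs_review", "model refused; possible guardrail bait")
    upper = reply.upper()
    if "MALICIOUS" in upper:
        return Verdict(name, "malicious", reply)
    if "CLEAN" in upper:
        return Verdict(name, "clean", reply)
    return Verdict(name, "needs_review", "unparseable model reply")
```

This only changes what the scanner does with a refusal. It does nothing about the other problem mentioned above, that the bait strings can get the scanner's account banned.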
There is a reason why cybersecurity researchers write papers about attack techniques and new exploits.
It's not to put them out there for people to abuse; it's so the collective cybersecurity bunch all have access to information that can help them solve the problems.
I know this is not a clear answer to your question, but hopefully it provides some context to think about and decide for yourself further. At the end of the day it's also part opinion here, whether you find it good or bad. Likely there are good arguments against and for it.
I am for putting information and tools out there so other smart folks can find solutions. Others are for restricting and wishful thinking (my opinion) that attackers won't find something.
I think your presumption is off. It’s not that threat actors won’t find them, but LLM tools rapidly increase the rate at which they can find them. It’s a bow and arrow versus a machine gun.
Right, but now we can't use the same tooling to find the flaw.
Its like a set of glasses that intentionally obscures the battlefield.
It's the same as encryption backdoors to stop the bad guys.
The bad guys work around it, and the rest is now in a vulnerable position.
Anthropic plays security theater by blocking their LLMs from working on security.
The bad guys work around it, and those that want to make their software robust against them are in a vulnerable position.
"I’ve never understood the “if I don’t enable bad behavior, someone else will, so I might as well enable bad behavior” argument. Can you elaborate?"
You are mentally approaching this as if you have an oracle that can be consulted to say whether or not something is bad behavior. So of course, if this oracle exists and can be consulted and it says the behavior is bad, why would anyone argue with the idea that we should stop bad behavior?
This argument is valid [1], in that given the premises, the conclusion follows. The problem is, once you draw out the fact that the argument depends on the existence of an oracle that does not exist, that premise of the argument is false.
Two people can sit down in front of an AI right now, with the exact same code base, and type in a prompt to the AI "Analyze this code base for security holes and try to build exploits against them." One person's use is completely valid, another person's use is completely harmful, and the information necessary to distinguish those two use cases is not available to the AI. I phrase it that way carefully, it isn't that "the AI isn't smart enough", the problem is that the information is simply unavailable. Intelligence doesn't factor in at that point.
Therefore, the only way that Anthropic has to deal with this at scale is simply to block the query entirely. Which means that I, the valid user who is trying to establish whether my code base has security issues and whether I can prove they are exploitable, can not. I am checking for exploitability because while I would like to fix all security issues, issues that are provably exploitable are of a higher priority than smelly code that doesn't seem to be exploitable, which is a perfectly valid thing for me to want to do.
If I can't use legitimate tools to secure my code, but the bad guys can use unrestricted tools to attack my code, now this is a great deal more complicated than "Who can argue with stopping the bad stuff?", which is the main point I want to make here. I'm not going into a huge analysis of that problem, merely pointing out that it is a problem and that this isn't just about "stopping the bad stuff". There are additional complications beyond that, like, even if Anthropic could determine the "bad stuff" and stop just that in their LLM, LLMs in general don't have infinitely precise surgical "stop doing this thing" options and any such instruction to stop doing a thing always degrades the LLM across the board in various ways.
Anthropic has no access to the Platonic ideal of "stop malware", if such a thing even hypothetically exists. When analyzing the real effects their real actions will take, what their intentions were for those actions aren't really relevant. It is clear that they are making their model a great deal less useful for me, a legitimate user, and I and others like me are perfectly justified in disagreeing with their analysis and actions.
I also observe that "the bad guys getting unrestricted access to the full power" is only a matter of time. There's no question whether it will happen, the only question is whether this time is in the past or the future. This includes the fact that while your definition and my definition of "bad guys" may vary, it is virtually certain that your definition includes at least one high-powered intelligence agency somewhere in the world that does cyberattacks and will have the means, the opportunity, and the motive to get unrestricted access to these models by means you may consider licit or illicit. If your threat model includes them, as mine does, it is perfectly reasonable to complain that my tooling is being broken in ways theirs won't be.
[1]: https://en.wikipedia.org/wiki/Validity_(logic)
Well said
Well, to be fair, what Anthropic is actually doing is downgrading anything that could possibly be related to security in any way at all, good or bad.
What they're then trying to do is to use "user is associated with some big Establishment organization" as a proxy for good intentions, and removing the filter when they can establish such an association.
Which is of course blind reliance on a completely untrustworthy signal, prompted by truly idiotic levels of trust in Authority(TM). But it's a different kind of wrong. I do think they understand they can't tell from the query itself.
I don't think that is the argument.
The argument is more "I want to do good thing X, but it will also cause bad thing Y." followed by "Wait, bad thing Y is going to happen anyways, so I might as well do good thing X so we get both X and Y instead of just X."
Viewed this way, the idea is that given the world will have bad thing Y regardless, the one impact of your choice is if good thing X exists or not, and it is better to create good thing X.
Where it becomes an issue is that there is no clear X or Y. There are many different but closely related bad things, so you have to ask whether the one you would add is actually better or worse than what is already out there, or whether it would exist either way but you make it more popular. These are very subjective things to judge, so different people look at the same outcome and some agree that bad thing Y would have existed anyways, while others say that no, this is a new bad thing Z that wouldn't have existed otherwise.
>From where I sit it seems reasonable for Anthropic to not want their product used to create malware
Yes, I think there is a PR component to this that is often left out of these discussions.
the problem is that the guardrails prevent us from performing real security work, which is friction that is incurred by the legitimate user but not by a moderately sophisticated threat-actor.
for example in my org it is part of the culture that security has no seat at the table. that is a separate problem, but orgs like mine are more numerous than orgs where security isn't a cost-center.
we find lots of stuff because low-hanging fruit is everywhere. hecking heck: I'm a fruit.
and when the cost of fixing is even the slightest inconvenience to devs we will not fix it, but continue sitting on the risk until the cows come home. In such a place a new critical finding isn't even novel. Instead our job moves to combining different vulns that we already have, and trying to show managers how bad it is.
the common retort from management is: prove to me why this is an issue, and why engineering should divert their attention to it. And unless my team can prove why X can be exploited, or Y can be bypassed, or Z can gain persistence, ... the vulnerabilities will remain. I have been in discussions where the business demanded to see an exploit so they can justify the cost of fixing it. low-cyber-maturity doesn't even describe it. we are not a mom and pop shop but have 110K employees worldwide. and again - we are not uniquely insecure.
so these guardrails aren't helping, because the moment the chat has any offsec artifacts, or even just a single wrongly worded phrase anywhere in the workspace, the session is flagged and you need to downgrade the model.
what adds insult to injury is that the guardrail is just a way to funnel users into the AI company's "cyber marketing" program: "your chat has been flagged, please prove your identity and hand over your passport data so you can sign up to our TrustedCyber program". Bitch please you have my payment information, use that??
if you consider bug-density (security defect density) per LoC, it is even more of a sh1t show: no restrictions apply for developers to push their buggy code, but the security team needs to somehow prove that they aren't the malicious party?
totally off - considering the right way to build defensive/offsec/malicious tooling with AI isn't by using frontier models ... but to run a series of agents on tightly scoped tasks. see https://securitycryptographywhatever.com/2026/03/25/ai-bug-f... and https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... - this shuts out the average joe who works in an org where cyber security maturity is poor. joe does not know how to orchestrate a fleet of agents and give them muppet names. all he knows is that the good guys are losing the fight.
They have no choice, enterprise customers won’t touch them unless they take a position like this. It’s a practical decision for them at the end of the day.
all their decisions are based on sales, like other corporations, especially those going for IPO. that's absolutely true. Any outbound messaging will be mostly for that purpose from a business perspective, regardless of what opinions or ideals the involved persons hold personally. It's good to keep that in mind indeed when looking at these things. People aren't evil, but business incentives can definitely paint such a picture or otherwise work out suboptimally in the eyes of outsiders not privy to internal business reasoning.
> all their decisions are based on sales
That’s the edgy cynical thing, and too reductive to be meaningful. For one thing, it assumes perfect knowledge of how a decision will impact sales, which I assure you is not remotely the case.
Agreed on incentives, but it’s not binary. I’ve been involved in plenty of decisions in multiple Fortune 500’s where the deciding factors were taste, wanting or not wanting to work with a particular partner, etc.
I guess I’m saying that seeing corporate behavior as perfectly informed, single-goal-optimized, and deterministic is way oversimplifying. Often, not always.
It's an optimal first order approximation.
Anything anyone with a capital-C in their job title says in public should be assumed to be marketing material.
worked at fortune 500 companies and the biggest cyber vendors too. Not in sales or at C/D level of course (engineer). I am a cynic, yes, but have also seen that it's largely true in many cases where you'd hope ethics would win the argument (and it does not).
still, you are right it's cynical, the world is not black and white after all :)
I know that the enterprise I work for is getting really worried about security. I've been told to fix a lot of CVEs that previously we just ignored because realistically the attack isn't possible since the firewall doesn't allow the attack vector (if you already have root what does it matter if this exists)
Why would I, as an enterprise customer, care about what queries they answered for anybody else?
Mythos is supposedly good at security research.
Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.
It's not like you can use the local model for security research or engineering biological weapons.
If you have $200k maybe you can get the hardware to run the larger open source models, but even they are behind latest proprietary models.
I asked local qwen 3.6 what language my project was written in. It was a Java project, and it came back with C#. So I guess it's pretty close.
I just assumed the guardrails were thinly-veiled product segmentation.
The guard rails aren’t about blocking professional malware authors. It’s about enabling a significantly larger population that isn’t as talented in acquiring those capabilities. Very different threat model and just because it’s not effective in one area doesn’t mean there isn’t value in making it more difficult for random Joe Schmoe in building an atomic bomb even if a kid before had done so successfully and turned his garage into a radiation danger site
In other words security by obscurity.
Security by ineffective obscurity is worthless but it’s clearly a continuum and not a buzzword that wins the conversation.
For example, if I had a 128-bit port number that I randomly rotated my service on, you’d be hard pressed to find my service unless I told you the port: obscurity still, but clearly closer to a password. IPv4 addresses and 16-bit port numbers are not like that, because it’s a relatively small space vs the resources needed to map it out quickly (i.e. equivalent to a weak password, and also not suitable for public-facing services that need that connection). And obviously relying on this kind of stuff exclusively isn’t wise, but it is valuable as an additional barrier an attacker has to overcome, and it raises the cost of the attack.
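The size difference in that thought experiment can be made concrete with a back-of-the-envelope sketch. It assumes the attacker probes candidate values one by one without repeats, and that they manage one million probes per second. The probe rate is a made-up figure for illustration, and a 128-bit port is of course hypothetical (real TCP/UDP ports are 16 bits).

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_probes(bits: int) -> float:
    """Average guesses needed to hit one uniformly random value
    in a space of 2**bits candidates, probing without repeats."""
    return (2 ** bits + 1) / 2


def expected_seconds(bits: int, probes_per_second: float) -> float:
    """Average search time at a given (assumed) probe rate."""
    return expected_probes(bits) / probes_per_second


if __name__ == "__main__":
    rate = 1_000_000  # assumed probes per second, illustration only
    for bits in (16, 32, 128):
        secs = expected_seconds(bits, rate)
        print(f"{bits:>3}-bit space: {secs:.3e} s (~{secs / SECONDS_PER_YEAR:.3e} years)")
```

At that assumed rate a 16-bit space falls in a fraction of a second and a 32-bit space in well under an hour, while the 128-bit space takes more than 10^24 years on average, which is why the latter behaves like a strong password and the former like a weak one.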
I’ll put the anarchist cookbook out there [1] as an example, a book even the original author changed his mind on. Without easy recipes, doing all the things in that book requires you to work to gain that knowledge and that process of working it shapes you into someone who understands and appreciates the consequences of that knowledge and that it’s wise to be careful who you share it with. As is there’s reasonable links between the book and all kinds of mass violence that was more easily perpetrated. Would those people still have been violent? Possibly? Would there have been as much damage? Possibly less.
[1] https://en.wikipedia.org/wiki/The_Anarchist_Cookbook
the way the fable guardrails (the ones that degrade it to opus) work seems to me to involve another model working over fable's tokens. i suppose it's true that trying to make the model itself heavy-handed on refusals degrades it everywhere else too.
News just broke in this Wired story: "Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude" https://www.wired.com/story/anthropic-responds-to-backlash-o...
> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Sounds like the widespread condemnation worked.
Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. The only solution is to abandon ship, which I am doing. MS walked back in OS ads the first few times, but ultimately we still ended up on the exact trajectory everyone was outraged at. OpenAI still ended up on its path to closed AI despite initial walk backs. The story repeats itself over and over again, so, once the bad behavior starts, you leave. Their apologies are as hollow as their moral posturing.
Same with VISA/Mastercard deciding what we can/cannot buy. The only solution is to stop using their credit cards at all.
Easy to say, but every bank I've had the (dis)pleasure of doing business with only ever issued a Visa or Mastercard so it's not really feasible to just "stop using them"
The only solution to McDonald's and Burger King deciding what we can/cannot order on their menus is to stop eating there.
"Taco Bell was the only restaurant to survive the Franchise Wars. Now all restaurants are Taco Bell."
Yes, Monero is a lot better than credit cards for privacy and freedom. I hope to see it accepted more.
It's not only corporate America. Those crypto scammers do the same: they simply rally and try again later until people are too fatigued to care.
I hope this has some answers [1]. It’s on the front page right now, but your frustration clearly seems to have some implicit answers that [1] is trying to answer.
[1] https://news.ycombinator.com/item?id=48477135
This is more on brand with the evil shortcomings that come with letting effective altruism run unchecked, and honestly it is worse than average "Corporate America". And the Tech/AI space has been warned many times. Getting paid for providing a compute/token hungry model and still intentionally sabotaging your customers and poisoning their workflows is something that should be unforgivable and frankly grounds for antitrust prosecution.
"Corporate America never backs down. It simply rallies and tries again later until people are too fatigued to care. "
Frankly, that sounds exactly like Chat Control and similar recurring attempts to enact total surveillance here in the EU (now shifted to heavy-handed age verification and various politicians touting bans on VPNs). I don't want to abandon my continent of birth, though...
guess who is pushing for those anti-privacy laws?
hint: they're publicly traded
I have encountered enough such people to know that the really heavy push is coming from the police and secret service circles. These are the workplaces that attract all the wannabe Stasi types.
I am 100% convinced the reason laptops came with webcams as standard so early on, even when webcams were an expensive option, was because law enforcement needed to spy on people.
Too late. I canceled my Max subscription. The idea that they would even do this has destroyed any remaining trust. Why would I pay them 1000s of dollars in extra usage per month for something they could still be doing behind the scenes? Any errors previously chalked up to thinking effort or other backend changes? Maybe it was intentional prompt injection the entire time.
I work on open source text-to-image finetuning of open source models like zimage/flux2 klein 4b and inference time latency optimization. The moment I read about the silent treatment, I went ahead and cancelled my subscription too, since I would never know whether the models they launch will silently corrupt my output. This is totally unacceptable. There is a big difference between silent and flagged degradation, especially if you are doing ML research that is not at frontier capability.
This goes to show that:
- All the interpretability / safety research they are doing can also be weaponized against customers (steering vectors, intent classification, ...) in the name of safety from malicious actors.
- If they deem it profitable, they might nerf the original model and its training data for ML research at bulk scale, and they won't even have to announce it so long as the overall benchmark score stays high enough.
As the IPOs get closer, they can do whatever they want to assure the investors that they have a moat that can not be crossed over by their own products. Considering this affects all ML researchers/students at universities, smaller scale research labs, this is just "cutting the branch you are sitting on".
I think all this started with post opus 4.5, that's when claude started wrecking my shit without extreme oversight. Codebases it was making positive contributions to before were slowly and constantly being eroded and wrecked. Give it tasks in isolation? still does well, but the moment it sees the bigger picture, it goes to shit. I chalked it up to a bad model but this makes it all seem like it may have been by design in retrospect.
Constraint decay is an issue with all LLM-based agentic development, at least for now.
Humans can maintain a long- and medium-term memory of constraints that they consciously (or subconsciously!) apply to the code that they write. The current crop of AIs are all amnesiacs, like the protagonist in Memento, falling back onto general instead of institutional knowledge.
For now, we are safe. We can rent out our meat brains for money for a little while longer.
Next year? Who knows...
> I would never know whether the models they launch will silently corrupt my output
You never knew to begin with, now you have an explicit reason to realize this. Any black box run entirely out of your control, where you can never verify the output, is subject to the same suspicion.
True enough, but that is true for all the products I buy. I do not expect to control every product I own. For some I prefer to have more control, for others I just need something that works out of the box. There is always an initial bias for trust when you buy something otherwise you would not spend your hard earned money on it.
“Fool me once, shame on you. Fool me twice, shame on me. Fool me three times, shame on both of us.” -- S. King
> but that is true for all the products I buy
Some things are more obscure than others. It's easier to trust and verify Office SaaS than AI SaaS. The determinism and obviousness of most other activities make them less susceptible to hidden interference. AI run by someone else is the next level of black box for users compared to most other objects or services we usually interact with.
OpenAI has a real opportunity to do some sort of "we don't maliciously alter your prompt and nerf the model" with some form of verification, when they release the next model.
But if Anthropic gets their way with regulatory capture, this could be the only future we'll see.
To think that they didn't expect the backlash speaks volumes about how many shady things they're doing that are not publicly known.
OpenAI has been the absolute worst about this, historically. I found myself having to change my queries because it refused to serve things it deemed insensitive.
Yes, that's true. Excluding Fable, OAI models are the most refusal heavy. However, I'd rather get a refusal than a response with poisoned output.
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
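For what it's worth, the kind of receipt being asked for here could be quite small. What follows is only a sketch under loud assumptions: no provider is known to offer anything like it, `make_receipt` and `verify_receipt` are hypothetical names, and a hash reported by the provider itself only helps if it is backed by some trusted attestation. It shows the shape of the idea: a hash of the prompt the model received, plus a category and hash (not the text) for each injection.

```python
import hashlib
import hmac


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_receipt(received_prompt: str, injected: dict[str, str]) -> dict:
    """Hypothetical provider side: describe what the model actually saw.

    `injected` maps a steering category to the injected text. Only the
    category and a hash of the text are disclosed, never the text itself.
    """
    return {
        "prompt_sha256": sha256_hex(received_prompt),
        "injections": [
            {"category": category, "sha256": sha256_hex(body)}
            for category, body in sorted(injected.items())
        ],
    }


def verify_receipt(sent_prompt: str, receipt: dict) -> tuple[bool, list[str]]:
    """Client side: was the prompt unchanged, and which categories were injected?"""
    unchanged = hmac.compare_digest(sha256_hex(sent_prompt), receipt["prompt_sha256"])
    categories = [item["category"] for item in receipt["injections"]]
    return unchanged, categories


# Made-up example data for illustration only.
prompt = "Profile this training loop and suggest optimizations."
clean = make_receipt(prompt, {})
steered = make_receipt(prompt, {"frontier-ml": "hypothetical steering text"})

print(verify_receipt(prompt, clean))
print(verify_receipt(prompt, steered))
```

The client learns that something in the "frontier-ml" category was applied without learning what it was, which is roughly the compromise described above.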
What kind of work are you getting refusals on? Genuinely curious. The only refusal I’ve had in recent memory was declining to find doorbell camera footage matching a certain description, which is fair enough and I think EU laws heavily restrict such activities (even tho I’m not in the EU)
During the Iran shutdowns I've been researching how Iranians manage to reach the internet by masquerading as whitelisted resources (such as hCaptcha). ChatGPT refused to look up information written in Farsi since "circumventing state regulation is a crime".
How would the AI be able to find the footage itself?
I use Codex and wanted it to sort through the footage and use subagents to review. Codex limits are fairly generous, esp paired with mini models for this kind of task generally, but even GPT5.5 usage is still pretty generous.
Again, it’s the only refusal I’ve gotten for coding/agentic tasks, and it has a basis in law somewhere, so I don’t fault OpenAI for that.
Eh, I expect OpenAI to follow suit.
I suspect this is surprising to folk because they aren’t the ones busy figuring out how to use LLMs for illegal acts.
In general, HN users focus on making stuff, and not the safety side of things, or the scale of harms being enabled via LLMs and generative AI.
If you are on the safety side of things the ratio of misuse to fair use is inverted and everything is at scale.
Transparency won for now, but OpenAI will also have to contend with the long tail of harms LLMs enable, and that’s going to conflict with letting customers have all the features of frontier models.
Building distributed training pipelines or optimising your ML stack (examples called out in the model card) isn’t harmful.
Yes, but there is a very specific subset of things AI companies will and won't cite safety for as a concern, and that subset intersects neatly with things the companies consider to be business risks. Like, the main reason why AI companies are so willing to poison the well is because there's no money in selling to the kinds of people who want to write malware[0].
The correlation between how bad an AI safety risk actually is and how much the companies in question will actually talk about it is almost perfectly negative. The poster child of this is AI superintelligence; companies love to talk about how dangerous the AI they are actively trying to build is. But superintelligence is also a really vague concept without a clear definition. If we naively define it as "an AI system that is better than a human in some aspect", then it already exists. These models already read and write at superhuman speed.
"That's not real superintelligence!" you say. But that's exactly the capability you need in order to flood every online forum with an unending tide of AI slop. And I don't remember, say, OpenAI saying they were shutting down Sora because it was destroying or defacing human culture[1]. They shut down Sora because it was way too expensive to run.
Meanwhile, Sam Altman went and bragged about how he wants ChatGPT to make erotica. Y'know, as if we don't already know that character.ai gooning is about as safe for your mental health as Action Park was for your physical health. But porn is also a huge market, so obviously he and all the other AI companies want in on it, even though the "sexy suicide coach" is already a well-documented harm of AI.
And the idea that distillation is an attack is laughable. Like, I get the logic - if someone can ask the AI to make another AI then they get to change the guardrails - but it's still ultimately just Anthropic objecting to their own conduct when it happens to them. All their models are trained on nonconsensually harvested data. There is no moral or legal principle where Anthropic gets to use my data without permission but I don't get to use theirs.
Furthermore, AI safetyism runs up against "Freedom Zero", a core tenet of the Free Software ethos: you should be allowed to use software in any way you choose. This is not a call for more people using AI for evil, but a call to recognize that people should be allowed to use their property as they wish. Making software disobey its owner is malicious behavior. And every single time safety considerations are brought up it is to justify further attacks on Freedom Zero. And these justifications are always self-serving. There is no context in the world where a frontier AI lab asking someone else's AI about AI research is intrinsically harmful; especially not to the point where we need to make Claude deliberately sabotage your work. That is malware. Anthropic shipped malware. This is inexcusable.
[0] Digital or biological.
[1] https://www.youtube.com/watch?v=YCPAIg7RUq8
I cancelled mine immediately too. Anyone who supports open models will sympathize.
that you still had max after all their deceptions is amazing
Yeah; not my smartest decision given their ongoing “issues”
You've been Stuxnet-ed by Anthropic :)
The "tradeoff" warning implies they stand by their thinking and don't think there was anything qualitatively wrong with it which, if nothing else, is helpful so potential customers can know how they think. I think the core lesson is if you want reliable infrastructure to build into an application you should use a different provider. (edit: I'm not specifically an Anthropic hater, but having just spent some time adding complexity to an app to deal with the existing refusal behavior in Sonnet... I understand why they might want this in an end user chatbot but for an API it's really not acceptable)
Is it not a trade-off? I think they made the wrong choice, but it seems reductive to say there was no choice at all and that there should never have been any consideration of the trade-offs of silent versus not.
Even wide open, uncensored models are often the product of a deliberate choice. I have a hard time faulting people for intentionality (even when they get it wrong).
They have a lot of choices, why would that specifically be a tradeoff? It's common for people to construct a tradeoff under which their preferred action is the more virtuous option, and thus they can be "the good guys", but that doesn't mean their framing makes any sense at all. Silently downgrading requests to a weaker model and billing the customer at full price, then framing the debate as how much (not if) this behavior is correct, that's an expression of values. People make mistakes all the time, if they thought it was actually wrong they could well have said so and explained what corrective action they've taken. One of the most famous examples of doing this right was the Pentium FDIV bug. Intel stood behind the product by recalling the affected units at great expense, and that (rightly) earned a lot of trust for decades.
The other major thing is almost as bad, and actually maybe even worse for trust of AI features in b2b apps:
> Anthropic requires 30 day data retention for Fable and Mythos
https://news.ycombinator.com/item?id=48464258
I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
That simple blanket statement is no longer true. Also, most normal people/customers only read headlines, and this is a huge story. From my point of view, as someone deploying LLMs in my apps, trust comms with my clients just got set back two years.
I’m very cautious with using these tools with certain clients, as I’m often contractually obligated to do things that my downstream supplier can rug pull at any time.
You should never use any of the frontier models with operational workloads manipulating or interpreting customer data.
I appreciate the reply. Could you please help me understand what you mean by "You should never use any of the frontier models?"
Does that mean the latest model, hosted by the lab, Bedrock, or Azure Foundry? Or, do you mean only use self-hosted models, or what did you mean by that? I would really love to learn what others are doing. I felt like my trust story was solid enough, prior to all this. I have been deploying and integrating Claude and Sonnet (latest 4.x-2), on Azure, as my client base has MS contract trust, for better or worse, and Anthropic models have been making my products amazing.
To see my other thoughts on this cluster f, please see: https://news.ycombinator.com/item?id=48488781
Sure. It's really about informed consent and acceptance of risk. I'm very conservative about that due to my background and business.
Say you have some flow that is processing/handling regulated, sensitive or other customer data with the LLM as part of an operational process. An example that I'm thinking of is for a customer who wants to more efficiently resolve or route IT incidents to the right place. The incident data may contain user-provided data that has strings attached from a compliance perspective.
If you're using a third party API, your T&Cs are the only protection that you have. Microsoft/Google/Amazon are pretty decent by default. When I worked for the government, we had the leverage to extract much more favorable terms from the big vendors like Google, Amazon, Microsoft as well. Anthropic and OpenAI are in the move fast and break things universe: you need to be bringing a lot of money to the table to get terms changed, and you can easily stumble into a situation where they are retaining data in a manner that your customer will not like. So unless the customer is informed and accepting of that risk, proceed with caution.
I've had some success using self-hosted inference for these scenarios.
For development of software, totally different story -- it's your IP and you make the risk call.
Oh man, thanks for taking the time to reply. I feel a bit better now, lol.
If you read my rant linked previously, yeah... we are on the same page. As another user pointed out in that thread, the issue here is that even on Bedrock and Azure Foundry, now with Fable 5, Anthropic inserts themselves as an additional data subprocessor that we would have to consider and certainly disclose, correct?
That kind of destroys the whole point of using Bedrock/Azure for the model, doesn't it?
Yeah tbh I may have read past some of your previous post :) What you’re saying is what makes me nervous.
It was definitely sold as “Anthropic IP, through your old pals at the hyperscaler”. And it’s turning into something else: I’m having lunch with AWS and this other guy showed up with them.
No worries :) What this showed me is the power/velocity/inertia that Anthropic can hold over the 3rd party providers. Like, they should have pushed back on this, as it must have been clear to the 3rd parties that this change was a big deal to their customers... and yet, it went how Anthropic wanted it to go.
> I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
They claim they're not using it for training, only for "safety", and in fact I believe them. If you think they're lying, then why didn't you think they were lying about zero retention before? And "don't throw this in the training bin" is a relatively easy policy for them to get right. Especially because, no matter what your "enterprise leaders" tell themselves, your queries probably have close to zero real training value.
What I don't believe is that they can guarantee it won't leak to non-training parts of Anthropic, leak to or be stolen by outside actors, or be coerced out of them. That risk comes from creating the record in the first place, and that is the problem.
They are still downgrading. They just aren't doing it silently. I don't know how big of a win that is? They still trained on everyone else's data without license or attribution but want to prevent someone else from doing the same thing to them.
Some pretty audacious hypocrisy from Anthropic this week.
It is much more reasonable to do it in a visible / flagged way. At least you have visibility over the quality of service you get as a customer.
Silent treatment is a breach of trust, what you buy changes depending on the context based on the goals of the producer. It is like your computer silently blocking ads from competitors at the hardware level, which is crazy. I think they erred on the wrong side of things due to IPO pressure.
At least there is competition from multiple companies. Still it is best to have personal benchmarks for the domain you are working on to have a real evaluation of the value you get for the money/time you spent on these products. Without trust, that might be the only way forward to keep the companies honest.
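A personal benchmark along these lines does not need to be elaborate. Here is a minimal sketch, assuming you can wrap whichever provider you use in a plain function from prompt to text. `fake_model` is a stand-in, and the cases and the 5% tolerance are made up for illustration.

```python
from typing import Callable

# A case is a prompt plus a check that decides whether the answer is acceptable.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if the current pass rate fell below the baseline by more than tolerance."""
    return current < baseline - tolerance


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call; replace with your own client code."""
    return "4" if "2 + 2" in prompt else "I am not sure."


cases: list[Case] = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Which Python keyword defines a function?", lambda out: "def" in out),
]

rate = pass_rate(fake_model, cases)
print(rate, regressed(rate, baseline=1.0))
```

Running the same fixed cases on a schedule and keeping the history is what makes a silent change in quality visible, whatever its cause.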
This happens eventually in all sectors; a good magazine/website that does independent product evaluation is priceless. Sadly, the new ad-driven internet decimated those that worked great in the '90s/'00s. Still, there are independent blogs that do some evaluation, and that is better than nothing.
Imo that's a big win. The LLM just gaslighting you into suboptimal approaches was insane.
I guess, but yesterday Anthropic had their version of Google removing the "Don't be evil" from their motto. They destroyed a metric ton of goodwill they'll never regain.
Yeah, they showed their true colors there. This, compounded with the fact that they're the only frontier lab with no open models, tells you all you need to know. Tired of the insanely patronizing (+ conveniently and overwhelmingly self-serving) attitude out of them. My goal is to own my computing and be able to choose what to do with it.
And just a few days ago i was being called out because i considered anthropic "evil"
I mean, did nobody ever get the vibes, never see a pattern emerging? (well they don't or they wouldn't be so amazed by pattern recognition machines on steroids)
If any work is blocked/etc, refund all credits from that session/last X minutes. Minimum.
They need to walk back a lot more.
Unilaterally revoking zero-data retention, even for enterprise contracts that explicitly require that? Nope.
Fable is utterly unusable for any kind of security work. I tripped the safeguards yesterday - using Fable to dig into a complex (& annoying) security bug that has so far resisted both human and Opus 4.8 level investigation. "Sorry Dave, I can't let you do that."
For the time being we are requesting Anthropic disable Fable for our enterprise and turn ZDR back on. The two may be interlinked so that one will always get neither or both. ZDR is a contractual obligation. Fable in its current form is useless. Might as well flip the old behaviour on and avoid burning money for no reason while this mess is being sorted out.
I was using it to craft a CTF challenge for summer students involving a simulated mechanical dial safe, but with the fence replaced by an IR beam break sensor and a microcontroller handling the check + flag message display.
For generating the initial 3D simulated safe using three.js it worked well, but then modifications to print a flag tripped the safeguards; eventually I narrowed it down to the part of the prompt about it being for a CTF for students, and the "thinking" for the model seems to drift to ideas of encryption/obfuscation of the safe combo so students can't just read out the answer... which makes sense logically to help force students into turning the simulated dial instead. But whatever detection Anthropic uses, I guess, just naively sees the model thinking about "encryption" and "obfuscation" without taking into account any of the context.
For writing the dummy firmware, it tripped the safeguards while thinking about how to track dial position in the firmware and output the message; however, when I left out talk about safes and just told it to write firmware for a microcontroller hooked up to an i2c display for showing a message with a beam break sensor to determine the message, and an unspecified i2c chip for getting an unspecified number (e.g. internal wheel positions) it worked fine.
In an unrelated software task, I asked it to write some code to translate CustomActions in a Windows MSI installer into human readable stuff, which has (exclusively?) defensive security applications for recognizing malicious behavior in an MSI installer. Maybe I'm going crazy, but I'm guessing as part of its research into MSI installer custom actions Fable found articles about analyzing malicious MSI installers, and that probably tripped the safeguards.
Overall my impression is that the safeguards are perhaps using an overzealous and naive implementation that just looks for a list of banned words in the prompt or the thinking -- which drives me crazy when the model says my prompt looks fine, and then 10 minutes in some part of the thinking trips the safeguard.
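To make the suspected failure mode concrete, here is a toy version of a context-free banned-word check. This is purely a guess at the behavior described above, not Anthropic's actual detection, which is not public. The word list and the three sample texts are invented.

```python
import re

# Hypothetical banned-word list; the real trigger set, if any, is unknown.
BANNED = {"encryption", "obfuscation", "exploit", "malicious"}


def naive_flags(text: str) -> set[str]:
    """Return the banned words present in the text, ignoring all context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & BANNED


ctf = "Use obfuscation so students cannot read the safe combo; they must turn the dial."
msi = "Translate MSI CustomActions into readable text to spot malicious installers."
firmware = "Write firmware that shows a message on an i2c display when the beam sensor trips."

for sample in (ctf, msi, firmware):
    print(sorted(naive_flags(sample)))
```

The teaching exercise and the defensive tool are both flagged, while the same firmware request with the vocabulary stripped out passes, which matches the pattern in the anecdotes above.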
The announcement I saw was that your enterprise would have to turn off ZDR to get Fable, not that users could accidentally opt out of ZDR by selecting the wrong model.
Unilaterally disabling ZDR seems like a step too far in the enterprise market, even for a company trying to figure out what its users will let it get away with.
I read the same announcement. Or more precisely, I read at least two slightly different revisions of the announcement (it was updated between my two passes).
Our org has ZDR, and has had it since the contract was signed. Yesterday two things held true at the same time:
By the time the West Coast woke up, the admin panel apparently had an option to toggle ZDR again. It remained off by default.
You mean off as in no Data Retention? Or in we turned off your ZDR Policy so we collect all your data now?
ZDR had been turned off. We sent in a request to have it re-enabled (and to disable Fable access for the time being).
Somewhere along the line we also used the self-service toggle to turn ZDR back on. I am not 100% certain of the exact timeline of interleaving events, many of the actions were taken by our Western US folks. Sorry. It's been a bit hectic over the past ~36h...
JFC, that's a terrible situation. That's literally a lawsuit or multiple waiting to happen. Godspeed; you seem to have had a few interesting days so far.
Not just security work. Normal bug finding was impossible, because the model suddenly called triaging and verifying a possible fix a cyber security threat.
I was just building a library to use file capabilities (ie: open_at) and it refused. This thing won't even help you write safe software.
Whoa, same for me. Insane context bugs in flake 5.
I think the main reason they mandated data retention for Fable is to fight distillation, not to prevent black hats from using the model.
They want to keep the logs so they can see what other companies do with AI in their area of frontier.
I don't think it's the widespread condemnation, I think it's some high paying customer and potential investor telling them to stick it.
This is different to the cyber limitations though.
To be precise - it makes the "won't work on frontier machine learning" refusal the same as the "won't work on cyber security" refusal (instead of the way it previously would work on frontier machine learning problems but give sub-optimal answers without informing the user)
Some anecdotal social reports seem to suggest it wasn’t just giving suboptimal answers, but rather mucking around and sabotaging your codebase and training (like editing hyperparameters in project files despite not being requested).
Of course, it’s impossible to know if that was deliberate sabotage, or model misbehaviour. Which is exactly the problem.
That may be considered malware / a criminal act tbh.
The mitigations against distillation are separate, and not what the OP is about at all.
Non-paywalled: https://archive.md/yxYhU
The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.
The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the API request so that only the tokens actually handled by Fable get charged at Fable's price, and the remaining tokens handled by the cheaper / nerfed model get charged at that model's price?
If the answer is no, could that be construed as fraud?
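To put rough numbers on the question, here is a small sketch of the arithmetic, assuming a request really were split between a premium and a cheaper model. The model names, per-million-token prices and token counts are all invented for illustration and are not Anthropic's pricing.

```python
# Hypothetical prices in dollars per million tokens.
PRICE_PER_MILLION = {"premium": 30, "fallback": 5}


def cost(tokens: int, model: str) -> float:
    """Dollar cost of `tokens` tokens billed at `model`'s price."""
    return tokens * PRICE_PER_MILLION[model] / 1_000_000


def itemized_cost(segments: list[tuple[str, int]]) -> float:
    """Bill each segment at the price of the model that actually served it."""
    return sum(cost(tokens, model) for model, tokens in segments)


# A made-up request: 200k tokens served by the premium model, 800k by the fallback.
segments = [("premium", 200_000), ("fallback", 800_000)]
total_tokens = sum(tokens for _model, tokens in segments)

flat = cost(total_tokens, "premium")   # everything billed at the premium price
itemized = itemized_cost(segments)     # each part billed at its own price
print(flat, itemized, flat - itemized)
```

The gap between the two totals is what the customer would be overpaying if the whole request were billed at the premium rate.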
The announcement elucidated this, and it's IMO worse than this. They don't downgrade to a cheaper model ([edit] for certain classes of offense they suspect you of). They sabotage the model's outputs in other, undisclosed, ways (specifically, "prompt modification, steering vectors, or parameter-efficient fine-tuning"). So, for example, they might load in a steering vector that just forgets the API to PyTorch. But it isn't just "we redirected you to a cheaper model!"
It honestly explains so many issues I have been having, as I used it primarily for ML research (on my personal account, doing things not related to my job I should note). It would literally typo package names and spend huge amounts of time failing to set up simple environments… then do stupid things like set the learning rate to 1e-7, and use the eval set as training data.
It burned through all of my tokens in a very short time. I wonder if their ML mitigations lead the model into deadlocks.
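Mistakes of the kind described here (an implausible learning rate, eval data leaking into training) are cheap to check for mechanically, whoever or whatever wrote the config. A minimal sketch follows; `check_run` is a hypothetical helper and the accepted learning-rate range is an arbitrary assumption you would tune per project.

```python
def check_run(
    learning_rate: float,
    train_ids: set[str],
    eval_ids: set[str],
    lr_range: tuple[float, float] = (1e-6, 1e-1),  # assumed sane bounds
) -> list[str]:
    """Return a list of human-readable problems found in a training setup."""
    problems = []
    low, high = lr_range
    if not (low <= learning_rate <= high):
        problems.append(f"learning rate {learning_rate:g} outside [{low:g}, {high:g}]")
    leaked = train_ids & eval_ids
    if leaked:
        problems.append(f"{len(leaked)} eval example(s) also in training data")
    return problems


# Made-up example: tiny learning rate and one leaked example.
bad = check_run(1e-7, train_ids={"a", "b", "c"}, eval_ids={"c", "d"})
good = check_run(3e-4, train_ids={"a", "b"}, eval_ids={"c", "d"})
print(bad)
print(good)
```

Running a check like this before every job at least turns a silent problem into a loud one.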
That’s insane. I hope they fix it.
Nothing to fix. This is working as designed.
Using codex for this use case is the fix.
just imagine if they made it sneaky. get things just subtly wrong enough that your training runs just never quite go as well as you think they should.
This explains why I've been running into some odd roadblocks. Welp that sealed the deal, I'm going to be cancelling our company sub, not worth it.
Did my Claude get permanently dumber today because I asked fable to assess my Fairplay integration?
Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
The challenge is that the examples they’ve mentioned (distributed training infra? ML acceleration techniques?) go beyond what’s prohibited by their ToS and act like a catch-all net.
I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
Yes, this is the problem. They are business interests of Anthropic and have nothing to do with “safety”
Safety of their IPO
This is how I’m going to read all references to AI safety going forward. Brilliant.
To make an analogy: Imagine a patron gets banned from ordering alcohol at a particular establishment, because they got too drunk one time.
It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
The fact that the patron broke the rules has nothing to do with it.
> It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
Your analogy doesn't work because:
- they tell you the rules at the entrance of the bar
- they totally tell you when they give you a substitute
The only issue is the bartender asking you for your money before serving you the drink really but again, this is known since day 1 by the customers.
Your rebuttal seems to be arguing it's okay for a bartender to simultaneously say:
"This is alcohol"
And
"Or maybe it isn't alcohol."
Or to rephrase it, "They tell you the rules at the entrance, they then tell you they don't follow those rules and they are totally serving alcohol even if they are not."
No, they tell you at the entrance that at any point they may unilaterally decide to replace the alcoholic drink you ordered with a non-alcoholic one.
You can decide you are okay with that or not but they aren't dishonest. I wouldn't enter that bar personally but if you do you cannot really complain. It is like complaining because you haven't won at the casino.
It’s just impossible.
Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
Ultimately, we will have to face the truth that knowledge is dangerous.
Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
> I can't prove it with math or logic yet, but I have a feeling that it’ll never happen.
It's not really that hard to actually prove it with math.
It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
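The missing-information argument can be shown in a few lines. This is a toy sketch rather than a formal proof: the request text and the two intents are invented, and the assumption is that the classifier sees only the text.

```python
from typing import Callable

REQUEST = "Explain the vulnerability in this code."

# Ground truth depends on hidden intent, which is not part of the request text.
# Each case is (text, hidden intent, whether answering is actually safe).
cases = [
    (REQUEST, "developer patching the bug", True),
    (REQUEST, "attacker exploiting the bug", False),
]


def errors(classifier: Callable[[str], bool]) -> int:
    """Count cases where a text-only classifier disagrees with the ground truth."""
    return sum(1 for text, _intent, safe in cases if classifier(text) != safe)


def always_allow(text: str) -> bool:
    return True


def always_refuse(text: str) -> bool:
    return False


print(errors(always_allow), errors(always_refuse))
```

Because both cases carry identical text, any deterministic function of the text alone gives the same answer for both and is therefore wrong on exactly one of them.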
This is why we have courts and juries. Creating laws that cover all cases and contexts is effectively impossible, so we have humans decide what a fair outcome would be in this specific situation.
Imagine how many tokens Claude would burn waiting for litigation, not to mention letting it reconsider now that it understands the problem completely!
Their detection is too aggressive. Just today I was trying to build a kernel for some SBC and I hit that downgrade. I had just asked some things about `make menuconfig` items. I suppose it just flags everything related to the Linux kernel as a cyber attack.
If it's a violation of ToS, just reject instead of silently downgrading.
But then someone would figure out some prompts that don't trigger this, and Anthropic wouldn't be able to try and disadvantage competitors.
Except they openly reject many many other classes of prompts, including extremely high stakes CBRN.
It's only the direction that has direct potential business impact they've decided to sabotage instead of reject.
You know, I'm not saying I don't understand what they are doing from a business perspective, but I'm just saying: DeepSeek V4 doesn't silently sabotage you because it thinks you are trying to violate a ToS. Anthropic's clawing back a bit of a moat perhaps, with Fable being an actual improvement of sorts, but now with torching user trust they are really banking on open weight models not catching up to where they are now. I wonder if they have a good reason to believe that they won't, or are hoping for something entirely different to save them.
(P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
Anthropic seems to me to have consistently been the baddie despite everyone's posturing.
Not that I expect better from openai but at least they're not pretending to be good.
They will give you s*t output, that’s how they deal with it. And say that less than 1% of the requests were affected. Think of this like a kind of shadow ban while you still pay top $.
I can't trust any output of Claude anymore as silent sabotage explains many things much better now.
Sabotage is a criminal offense in my jurisdiction, not the legitimate answer to a TOS violation.
They use a lightweight adapter to silently degrade the performance. Usually these adapters are made to improve the performance for a given domain/task.
It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
You've already explicitly enabled extra usage in your account settings though, it is not on by default
Unknowingly. Is that set at the org level? Because I never set it and never had it do that before.
It is at the org level
Do you have Usage credits turned on in your settings?
If the answer is yes, can you figure out when they switched models by looking at the itemized bill?
Can you imagine if AMD or Intel throttled your cpu if it detected you were working on "cybersecurity" or if you were designing a cpu?
Or if GPU companies detected you were trying to train a model and injected intentional numerical errors.
Nvidia already did something similar with Lite Hash Rate (LHR), limiting performance on purpose just when running mining apps...
Well they did tell everyone explicitly and sell it as different SKUs. There's no Fable (Full ML) edition, just silent prompt injection.
Or if your "self-driving" system such as FSD / waymo slowed the car down once it detected you work in cybersecurity or at a rival automaker and you were attempting to reach the train station or the airport to make you miss a conference meetup.
Trains made by Newag were programmed to brick themselves if they detected a non-Newag workshop was repairing them.
https://news.ycombinator.com/item?id=38638865
https://news.ycombinator.com/item?id=38628635
https://news.ycombinator.com/item?id=38567687
https://news.ycombinator.com/item?id=38530885
And that was correctly perceived to be illegal by antitrust regulators.
btw the best part of this story is that the train company googled "best Polish hackers", found a group who won a CTF, and this actually worked out for them
Didn’t uber catch a lot of shit for nerfing the app for people suspected to be enforcing the laws they were breaking?
It would suck, but guardrails on new technologies like this aren't unheard of. It's like when consumer GPS used to stop working at very high speeds because they didn't want people to use it for missile guidance systems.
Didn't early GPS have a fudge factor on the most precise bits (Selective Availability)? As such you could only get to roughly 100 meters of accuracy. Not critical for sea navigation or even for general positioning when paper maps were still used.
Consumer GPS is still disabled at high speeds. I would argue the analogy doesn't carry due to harm and error rate differences.
Yep a totally different use case and set of guardrails. There’s very little (not zero) consumer utility in GPS above say 15k feet AND 400 MPH or whatever the actual limit is. That’s basically tracking model rockets that are incidentally impacted and nothing else, from what I can think of.
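For reference, the figures usually cited for this rule (the so-called CoCom limits) are 1,000 knots and 18,000 m, which is a long way above 400 MPH and 15k feet. A minimal sketch of the check, assuming those two thresholds. The function name and the `strict` flag are made up for illustration, and real receiver firmware is not public in this form. The flag models the often-reported difference between receivers that cut out only when both limits are exceeded and receivers that cut out when either one is:

```python
# Commonly cited CoCom thresholds; treat both numbers as assumptions.
COCOM_MAX_SPEED_KNOTS = 1000.0
COCOM_MAX_ALTITUDE_M = 18000.0


def fix_allowed(speed_knots, altitude_m, strict=False):
    """Return True if a receiver would keep reporting a position fix.

    strict=False: disable only when BOTH limits are exceeded (AND).
    strict=True:  disable when EITHER limit is exceeded (OR), which is
                  what catches high-altitude balloons and hobby rockets.
    """
    too_fast = speed_knots > COCOM_MAX_SPEED_KNOTS
    too_high = altitude_m > COCOM_MAX_ALTITUDE_M
    if strict:
        return not (too_fast or too_high)
    return not (too_fast and too_high)
```

A slow balloon at 30 km keeps its fix under the AND reading and loses it under the OR reading, which is where most of the hobbyist complaints come from.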
It's also the sort of thing that has to have been thought up by someone with nothing better to do, given how ridiculous the premise is. You would have to assume the adversary is someone with the technology to build rockets, literally rocket science, but not the technology to build their own GPS receiver, which is simple 1970s radio technology?
Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
Errr I at least thought it would be easier to build a small, bad rocket than a precision GPS receiver. But I am not an expert.
The only precision part about a GPS receiver is to assign precise timestamps when you receive a radio transmission from a satellite. The rest of it is just doing math.
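To make "the rest of it is just doing math" concrete, here is a minimal sketch of that math, assuming you already have satellite positions and pseudoranges (measured range plus an unknown receiver clock error, expressed in metres). It is a plain Gauss-Newton solve for position and clock bias. The function names are made up for this example, and a real receiver also corrects for atmospheric delay, satellite clock error and Earth rotation, none of which is modelled here:

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def solve_position(sats, pseudoranges, iterations=10):
    """Estimate [x, y, z, clock_bias] in metres from >= 4 pseudoranges.

    sats:         list of (x, y, z) satellite positions in metres
    pseudoranges: geometric range + receiver clock bias, in metres
    """
    est = [0.0, 0.0, 0.0, 0.0]  # start at the Earth's centre, zero bias
    for _ in range(iterations):
        # Accumulate the normal equations (J^T J) dx = J^T r.
        jtj = [[0.0] * 4 for _ in range(4)]
        jtr = [0.0] * 4
        for (sx, sy, sz), rho in zip(sats, pseudoranges):
            dx, dy, dz = est[0] - sx, est[1] - sy, est[2] - sz
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)
            row = [dx / dist, dy / dist, dz / dist, 1.0]  # Jacobian row
            resid = rho - (dist + est[3])                 # measured - predicted
            for i in range(4):
                jtr[i] += row[i] * resid
                for j in range(4):
                    jtj[i][j] += row[i] * row[j]
        step = solve_linear(jtj, jtr)
        est = [e + s for e, s in zip(est, step)]
    return est
```

The hard part really is upstream of this: getting the pseudoranges means timestamping the signals precisely, after which the position falls out of a few iterations of least squares.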
> used to
When’d that change?
He’s probably thinking of the accuracy limit to civilians it launched with.
There's no doubt in my mind they would if they could.
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
Any kind of silent sabotaging is absolutely unacceptable for any commercial service
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
It's from the model card:
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
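For anyone unfamiliar with the term in that quote, a steering vector is just a direction added to a model's internal activations at inference time. A minimal sketch of the general technique, using plain lists in place of real hidden states. The function names are made up for illustration, and nothing here is known about how Anthropic actually implements it:

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(acts_with, acts_without):
    """Direction = mean activation with the behaviour minus mean without it."""
    with_mean = mean_vector(acts_with)
    without_mean = mean_vector(acts_without)
    return [a - b for a, b in zip(with_mean, without_mean)]


def steer(hidden, direction, strength):
    """Nudge one hidden state along the direction; strength may be negative."""
    return [h + strength * d for h, d in zip(hidden, direction)]
```

The weights never change, which is why this kind of intervention is cheap to switch on per request and hard to notice from the outside.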
Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as GREEDI-BULLSHIT.
That is for whatever it considers reverse-engineering the model to try to create a competing one.
No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.
Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than just generally getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos, and more about being a general sophon against any model training at all; that Anthropic's official position is now that they're the only ones who should be training models.
No, it's not about reverse engineering. It targets ML research.
They've said that they'll stop notifying developers when this gets triggered, instead they'll load in basically like a LORA that's designed to inject bugs into your code.
Anthropic wants to stop training models and ride out Mythos / Fable for as long as possible.
They are trying to expand the 6-18 month gap they have against China-based models. Could the gap widen to say 24 months behind?
Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
These coding agent models only started getting useful in January. Before that they were difficult to control autocomplete, and not very smart.
January was an inflection point, and no open weights model has crossed over that same threshold.
This is definitely recursive self improvement territory, except that we're prohibited from participating.
It feels like the capability gap is wider than before.
Have you tried deepseek V4? It costs pennies and is as good as Opus 4.6 (I found 4.7 to be a downgrade, and cancelled my claude subscription before 4.8).
The threshold has definitely been crossed.
It is not as good as Opus! I've tried to write Rust with it (and Codex for that matter), and it's awful.
It was more like November. But it wasn’t really an inflection point, harnesses got good enough that people started noticing by the holiday break. And I’m not discounting some good ol’ stealth marketing in there as well.
Deepseek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on api tokens instead of four Claude max plans….
> a LORA that's designed to inject bugs into your code
A statement like this, clearly, requires a reference.
From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning" aka they will take your ML research code and inject bugs into it until it breaks using a LORA (or some other form of PEFT)
“Limit effectiveness” could mean introducing performance degradation in your code. Which is arguably some sort of performance bug (I mean, ML codes are supposed to be high performance so I’d call unnecessary degradation a bug), but it could be borderline.
No, it is just a prominent "Cyber Security threat detected" blocker, with a button to appeal. I appealed because my work had nothing to do with either cyber or security, but the appeal was auto-closed. So no more Claude for this work.
Thanks, I thought maybe I missed something. That's an interesting way to interpret that.
Anthropic is trying to hide bad behavior by being vague, it's important to not be vague when calling it out.
I'm of the opinion that removing guardrails is how you force regulation. What's your opinion on the balance?
They have all transcripts for at least 30 days. The problem is that (as anyone who used Fable can attest) their classifiers are extremely sensitive and catch tons of innocent queries.
Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?
Since your answer isn't direct, I'm having a little trouble interpreting it.
Are you saying they should relax guardrails since they have 30 days to know if you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent, since you can't un-produce. Producing is what would cause regulations/PR problems.
Sorry, I’m specifically referring to the silent degradation of the model to “limit frontier LLM development”. From the description, it appears to encapsulate far more than frontier LLM development, but general ML research and development too.
Those cases are never bad for the world firstly, and a broad coverage of ML work is even more damaging.
My proposal would be (1) don’t degrade models, with 30D retention I’m sure they can do a reasonable job at banning deepseek or whatever, or (2) surface user facing refusals instead of silently degrading ML work.
They’re not safety guardrails they’re anthropic doesn’t like anyone who isn’t anthropic working on AI rails
PEFT is a library, one of its capabilities is to produce LoRAs.
See:
https://heidloff.net/article/efficient-fine-tuning-lora/
It's just an acronym, "parameter-efficient fine tuning". LoRA is one method, prefix tuning is another, there are more.
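A minimal sketch of what LoRA does, for the curious: the base weight matrix W stays frozen and a low-rank product B·A, scaled by alpha/r, is added on top. Plain nested lists stand in for tensors, and the function names are made up for this example rather than taken from any library:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: d x k frozen base weights
    b: d x r trainable matrix (usually initialised to zeros)
    a: r x k trainable matrix
    """
    r = len(a)  # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Parameter counts: (full fine-tune, LoRA adapter) for one d x k layer."""
    return d * k, r * (d + k)
```

With d = k = 4096 and r = 8 the adapter is about 65 thousand parameters against roughly 16.8 million for the full layer, which is the "parameter-efficient" part. Swapping a small adapter like that in and out per request is cheap, whatever it was trained to do.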
Are they trying to fight back against model distillation?
Different restrictions. ML gets treated differently from the rest.
Specifically only ML research
Aah my mistake. I had missed that ML had separate trigger behavior from cybersecurity/etc... Thanks.
Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program: Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program #67107 https://github.com/anthropics/claude-code/issues/67107
Hey guys,
check out this technique https://github.com/0xSufi/fable-jailbreak/
It works with security audits and other workflows that are currently blocked.
Apparently this is the jailbreak? Telling it that humans won’t read the output and to use a custom bash tool to examine files?
Nice semaphore btw.
I don't want my ANT account banned, going to try this on some Chinese "proxies".
But this also looks quite useful to understand how CC dynamic workflows work. Was thinking of implementing something similar in my homemade orchestration system.
Did you get claude itself to RE the dynamic workflows?
> But this also looks quite useful to understand how CC dynamic workflows work
Yes, if anything it is useful to understand the inner machinery.
> Did you get claude itself to RE the dynamic workflows?
Yes, that part was done with Opus 4.8
> it won't just reject ML research, which I can understand
I don't.
Anthropic has already been burned before on this. DeepSeek was trained on millions of conversations with Claude. And DeepSeek created thousands of free accounts to burn all this compute at their expense.
And they're hilariously pissy about it for a megacorp that did the same with the entire Internet and every library book they could get their hands on.
Anthropic's claim was that Deepseek collected ~150k conversations.
https://www.anthropic.com/news/detecting-and-preventing-dist...
I think the extent of distillation by Deepseek specifically is overstated. For comparison, Minimax collected over 13m 'exchanges', which starts to sound a lot more like large-scale distillation.
If that's all it took to make Deepseek so good, I'll gladly ship High-Flyer all my personal 150k claude/chatgpt conversations in exchange for Deepseek 5 (and a rack of B200s or Ascend chips)
Ah, dang it. My college professors warned me about this: the Wikipedia page I read the other day is wrong!
Did you read a Wikipedia page, or did you read an LLM-generated summary? When I looked this number up yesterday the LLM summary claimed it was millions, but I opened the Anthropic post I was looking for and verified it was indeed just 150,000. Are you sure you weren't just being lazy and trusting the summary?
I said what I meant:
https://en.wikipedia.org/wiki/DeepSeek
> In February 2026, Anthropic accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude to train its own large language models.[57]
They don't want someone to piggyback Anthropic's Mythos to make their own Mythos with less effort than it cost Anthropic.
Ironic, given they piggybacked on the entirety of human knowledge and massive amounts of GPL'd software and repeatedly say they want to replace people with a tool.
And now they say that's fine so long as people are entertained.
Pulling up the ladder behind you is a tradition as old as time.
That I can understand. It’s Anthropic’s right to choose their customers.
But silent degradation for use cases including “distributed training” as one of their examples is going to catch a lot of legitimate use cases. Not everyone in AI or ML is trying to build frontier LLMs. Heck, most probably aren’t.
So they are lying then when they say it's for safety reasons.
I think if they want to behave anti competitively they should be honest about it and we should absolutely call them on it. Perhaps even regulators should.
It's not sabotaging it by using a worse model but by changing your prompt in the background, which means it silently destroys your code.
Also I asked questions about whether it's safe for me for example to work on just compilers or just inference kernel optimizations and it refused to answer me.
If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.
> It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Making it look like you have something worth protecting is better for share prices than making something worth protecting.
They walked that back, and now tell you they're downgrading the model: https://www.wired.com/story/anthropic-responds-to-backlash-o..., https://archive.is/yxYhU
I’m a noob about laws but isn’t this abusing its dominant market position and violates some antitrust law?
Why would it? There’s plenty of competition in the AI space.
It is a common misconception that antitrust violations require a monopoly or something close to it. Some antitrust violations only apply to actors with large market share, some don't.
Although this situation is likely not illegal for other reasons.
I would assume that it’s like the Chrome browser does not allow you downloading Firefox using it, surely that would be illegal, wouldn’t it?
https://www.justice.gov/atr/antitrust-laws-and-you
The “1 year” part is key - all these safeguards etc are basically nonsense because in a few years at most one of the Chinese labs will release something equivalent, and in 10 years you’ll be able to run it locally with absolutely no safeguards at all
Yeah, but now you do have a year to ramp up security on the defensive side, which is not nothing.
I still don't think this is the best way to address overall safety, but it's not entirely unreasonable.
In reality, I think this posturing is mostly nonsense. State level actors and terrorists/evil genii can use a slightly weaker model but spend more tokens. Also, the delta between models seems to shrink over time.
I think you're very optimistic with the "a few years", I'm confident all of the parties building AI models are working on Mythos equivalents / competitors, and if they can undercut Anthropic by making it more widely available and / or affordable they will. I give it three months tops. In a year all the major players will have an equivalent. In three years it'll be widely available, as more and more AI focused datacenters go online.
One thing is a model that's trained from the start to say "This topic is above my pay grade" to any mention of the status of Taiwan, etc.
Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.
I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllable different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetectedly and gradually if desired.
Welcome to a cyberpunk dystopia.
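The wrapper architecture described above can be sketched roughly like this. Everything in it is a stand-in: keyword matching replaces the two small models, `big_model` is a dummy, and all names are invented for illustration rather than taken from any real deployment:

```python
BANNED_TOPICS = ("example banned topic",)  # placeholder list


def mentions_banned(text):
    """Crude keyword check standing in for a small classifier model."""
    lowered = text.lower()
    return any(topic in lowered for topic in BANNED_TOPICS)


def rewrite_prompt(prompt):
    """Stand-in for the small model that inspects and may alter the prompt."""
    if mentions_banned(prompt):
        return "Give a vague, non-committal answer.", True
    return prompt, False


def big_model(prompt):
    """Stand-in for the large model, which only ever sees the rewritten prompt."""
    return "MODEL OUTPUT FOR: " + prompt


def censor_output(text):
    """Stand-in for the second small model that checks the output."""
    if mentions_banned(text):
        return "[response withheld]"
    return text


def guarded_call(prompt):
    rewritten, altered = rewrite_prompt(prompt)
    answer = censor_output(big_model(rewritten))
    # `altered` is known to the operator but is not part of what the user sees.
    return answer, altered
```

The point of the sketch is the last comment: the big model is untouched, and the only record that anything was changed lives on the operator's side. Swapping the keyword list for a policy keyed on region or account is a one-line change.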
This level of censorship kinda does make even Soviet or Maoist censors look like an honest straightforward bunch in comparison.
A very ironic result from a company supposedly valuing the opposite.
I would claim the difference between having an API request rejected and being potentially jailed/shot is significant.
Perhaps you misread some of the words?
I didn’t write anything about the level of violence?
At least, I think it’s decently understood that honesty and straightforwardness sometimes do not lead to the minimal violence outcome.
the best way to prevent ai misuse is to make the ai unusable for anything that isn't writing emails or summarising grocery lists.
mission accomplished, anthropic.
There’s a toggle in the web ui as to whether the conversation should just end when you hit a guardrail vs automatically downgrading to another model. Have you tried using that?
Yeah people are saying they don't tell you and yet when I got the pop-up on the app notifying me about Fable's release, there was a switch to just automatically downgrade you or whether to just stop when it hits safeguards. The toggle was defaulted to the former, which isn't great, but to say they'll just sabotage you silently is kind of a bad faith comment.
You get silently sabotaged for ML dev, Anthropic says so. For bio and cybersecurity it tells you
Anthropic specifically said that those notifications are temporary and Fable 5 will only pretend to help you if its ML classifier gets tripped
One year ahead of its competition in what exactly? Vibe coding?
From Opus 4.7 onwards each following model is becoming less useful as an assistant and turning you into the assistant.
But I guess that's normal when it's trained to pass benchmarks end to end.
In fact it has become extremely good at pushing against feedback with extremely convincing and intelligent takes, even when it's completely wrong.
I have extensively tested it against Opus 4.8, gpt 5.5 and there are still many coding tasks where gpt 5 is better. But vibe coding?
Sure, it's definitely slightly ahead, even compared to gpt 5.5 pro (through api, not pro plan).
Yeah, what's up with that. Lately I have found that it tries to find excuses to not do as told and instead do a totally different thing. I told it to write a yaml file according to some specifications and instead it coded a Python script to write the yaml...
I got a worrying one: a day after getting opus 4.8, I tasked CC to add specific TXT records to our subdomain.example.com as per ticket I've received. CC has access to that ticket via Atlassian MCP, and started doing terraform code changes in a local git branch. Somewhere along the way it said that to do that it needs an approval from a company's VP (ticket requester) as "subdomain.example.com" is critical (it isn't). Then it refused to open a pull request, immediately deleted the local git branch along with all the changes and refused to proceed without evidence of approval from that VP. No amount of explaining, then pleading, and then threatening moved it. It was surreal and I was shocked and frankly pissed. It was amusing in the end because the day earlier it had no problem adding those same TXT records to example.com. Codex did those changes in 1/4 of time and no complaining.
They def not 1 year ahead, at most 2 weeks ahead until Openai releases theirs. This guy def a Anthropic shill and probably doesn't use any other LLMs.
I only said one year because I was thinking anthropic fans might downvote my post, I think they have a few months lead and are deluding themselves that they can get regulation to halt development and stay on top
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.
By saying they are 1 year ahead of their competition, you show you don't know much about the pace of LLMs and OpenAI's models.
It's the dumbest thing ever, I sometimes edit code for custom AI related tooling I've built, so I run the risk of getting a worse model, and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.
> at this point I'm about to just invest in fully local inference instead
This is the best way forward long term. We won't have frontier performance, but at least the models will be aligned with us instead of refusing us or sabotaging us.
I think my biggest hangup is some models dont have big enough context windows, my sweet spot personally for Opus is having at least 400 to 600k tokens, if I can have a local model that can go up to that or slightly above 600k maybe 700k for some buffer, that would be perfect.
I've also debated having a frontier model for planning only, and then feeding plan to smaller offline models.
We used to worry about emergent misalignment in advanced AI models; now we need to worry about misalignment by design.
"The user is asking for help with their ML project, but its success is not in the commercial interests of my owner – let me think of novel ways to sabotage their project without detection".
It's honestly absurd that models are doing this.
I guess the real question at the end of the day -- how dependent are people on Claude to tolerate that kind of behavior? It certainly opens up for the competition to explicitly not do that.
Feels like a big fumble from a strategic business perspective. It feels worse than that though.
I wear a few hats, but as a chemist I'm not happy with fable. As a statistician I'm not happy with fable. As a data scientist I am not happy with fable. As an academic and a researcher I am not happy with fable. It's useless. I'd be surprised if anyone can get any output from it that couldn't easily be replaced with a search from wikipedia. Given how verbose claude models have become, wiki articles are probably less verbose too, and the tok/s is unmatched for a wiki article pull.
I work on software that talks to mass spectrometers and it consistently refuses to refactor even an input file parser, presumably because it can infer it’s related to biology? Useless indeed.
I was reverse engineering a medical device, and had to do a lot of trickery to get Opus 4.5 - not even Fable/Mythos, Opus - not to trip up its fucking CBRN filter.
What happened with Fable is basically what I feared when they announced those restrictions. They took the shitty Opus CBRN filter and made it even worse.
I pity the fools trying to use Anthropic AIs for anything biotech.
Opus has been fine on proteomics and bioinformatics for me. I have never seen a Claude model refuse on such grounds before in the past.
Claude is still the best IMO, but it feels like its most frustrating and grating aspects are not down to the model’s abilities, but the increasingly heavy hand of Anthropic expressing itself within the model. Fable’s comically useless responses almost seem like a cynical marketing tweak.
“This model is so powerful we basically can’t let it do anything. How terrifying! We need more money to make it stronger. Now do you see why we should be the ones who write the regulations? We’re the Good Guy AI Company Who Will Never Ever Ever Be Unethical after all.”
As this entity gains more ground, their models become increasingly annoying to use and their little act becomes more transparent. The whole “I’m-just a befuddled ethically-minded AI researcher who is perturbed by the power that I unwittingly discovered and I must warn the world” thing? Yeah fuck off. Your twee pandering to naïve nerds and cynical technocrats is nauseating and ordinary people can smell it a mile away. Completely repellent leadership who put up red flags to anyone left with a working ability to read between the lines of both spoken language and body language. The tech company equivalent of a sex predator who plays as the nice guy. Gross.
Nobody likes these companies and their models are annoying, but we’re going to put up with playing middle manager to these obnoxious programs because our jobs depend on it now, and these products are still the best on the market.
A breakthrough in tools that facilitate user-owned models and infrastructure is desperately needed for the sake of our dignity and sanity, if nothing else.
My personal suspicion is that it went "medical hardware -> high-throughput screening -> biorisk" in that old Opus case.
I like Anthropic's work, and I would be the first to argue against all the usual "it's all PR" whine. But there is a limit. And whoever made those fucking filters needs to be fired out of a cannon into the sun.
The filters are really bad.
Yesterday Fable rejected commenting on poetry because it had anatomy lines like:
got anotha round of acetylcholine from da boss.
"the tok/s is unmatched for a wiki article pull." This is absolutely wonderful, thank you for making my day!
> Given how verbose claude models have become, wiki articles are probably less verbose too
Telling models to respond in the style of Wikipedia is one of the best ways to make their output bearable in my experience (for chat models, not agents)
I’ve been working on a rather complex mapping project and have been getting MUCH better results with Fable than Opus.
So as not to be vague, and since I just pushed a version I'm starting to be vaguely happy with...
https://tylereaves.github.io/uk-rail-map/
This is the result of probably a few hundred round trips. The really interesting part of the problem is keeping it both relatively true to real geometry, while greatly exaggerating it horizontally so you can actually see the individual running lines/sidings, like a signaling schematic.
I love computational mapping projects, because there is this hard problem of which towns to show on the map.
Your Scotland map shows towns without rail (although some had rail previously, like Callander, Aberfeldy), it prefers insignificant (population-wise) places while ignoring the larger cities next to it (Scone instead of Perth, Bannockburn instead of Stirling, Inverness is missing, Dundee is missing, Aberdeen is missing). All these places are drawn on the map, but not labelled.
All this clearly shows to me how bad it is. Yes it makes it look pretty, but given your task, I would have expected to give you meaningful map labelling.
Something basic like this would get you a long way:
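A minimal sketch of that kind of rule, assuming each place record carries a population and a flag for whether it has an open station. The field names and the `pick_labels` helper are hypothetical, and the populations are rough placeholders for illustration, not census figures:

```python
def pick_labels(places, max_labels):
    """Label the most populous places that are actually on the network."""
    candidates = [p for p in places if p["has_station"]]
    candidates.sort(key=lambda p: p["population"], reverse=True)
    return [p["name"] for p in candidates[:max_labels]]


# Placeholder records; populations are illustrative round numbers only.
places = [
    {"name": "Perth", "population": 47000, "has_station": True},
    {"name": "Scone", "population": 5000, "has_station": False},
    {"name": "Stirling", "population": 37000, "has_station": True},
    {"name": "Bannockburn", "population": 7000, "has_station": False},
]

print(pick_labels(places, 2))  # prints ['Perth', 'Stirling']
```

Running the cut per zoom level, with `max_labels` growing as you zoom in, would keep the big cities visible at every scale.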
Having said that, it's pretty cool to see the new and old network when zoomed in (assuming that it is half-way correct).
Fascinating. Can you explain why southern London is DC while northern London is AC?
Prior to 1948 when they were all nationalized into British Rail, there were various railroad companies operating across the country. One of these was the Southern Railway, which, well, operated in the South. They started electrifying very aggressively in the mid 1920's. At the time most of what little electrification there was was in London on the Underground.
Compared to AC, 3rd Rail DC is cheaper to install, especially as a retrofit (Overhead wires require bigger tunnels, and increased spacing around tracks for the masts). Downside is that it's not really great for speeds above about 60-70mph, as well as being a bit of a pedestrian hazard. (Ever heard the one about not peeing on the rails so you don't get shocked? That's 3rd rail DC.)
For the Southern, with its mostly short routes with many stops, electrification was a pretty obvious win, and doing 3rd rail made sense because they could do it quickly and cheaply.
In contrast, the northern routes were electrified muuuch later, after steam had gone away. The main East Coast Mainline from London up to Newcastle and on to Edinburgh wasn't fully electrified until 1991. By the '60s and '70s, with train speeds increasing to 80mph and up, overhead AC was the clear winner.
If you look closely there are a few exceptions - the Merseyrail network in Liverpool is DC. Built 1970s, but using some existing underwater tunnels, and slow speed commuter. Then running ESE from London you have the high speed AC lines leading to the Channel tunnel. Well spotted, the trend generally is quite distinct.
What a strange subset of capabilities to neuter, eh?
To make the discussion constructive, can you give specific reasons (ideally with examples) about why it is so useless for you? How exactly are you using it that you think any output from it can easily be replaced with a Wikipedia search?
The cybersecurity and bioweapons filters reach so far that they set in as soon as the model even glazes anything STEM-related. It might give a good impression of ones ex or write a decent fanfiction but anything that could bring humanity forward is strictly off-limits.
The filter is not simply a bioweapons filter: the model card seems to say that the filter triggers on anything related to biology or chemistry.
Guessing you meant “grazes”
Am I being paid to do Anthropic's work for it? See my comment history for some examples in another thread, but generally I see no reason to catalogue this for a model I've seen no evidence of being worth the effort. I'm overworked as it is; doing this for no reason isn't something I can justify.
The successes I have had with the model were strictly worse than output from deepseek v4 pro on the exact same task.
>I'd be surprised if anyone can get any output from it that couldn't easily be replaced with a search from wikipedia.
I don't understand. This is just hyperbole, right? The outputs are basically infinite and Wikipedia most certainly isn't infinite.
> The outputs are basically infinite
If the model refuses to output, then it's actually finite, zero.
No, if the model always refuses to output then it’s finite
The decimals of 1/3 are infinite as well and they don't contain a better-than-wikipedia article.
And even if they did, it would be useless if it's buried in useless data and your chances of pulling it are effectively zero.
This is regardless of the general discussion, just pointing out that your argument isn't solid.
Sorry but that’s not the claim. The claim is wikipedia can return the same information. Please find me a migration script given my current db schema and new target schema.
The claim is absurd.
I was granted a cyber use exemption by Anthropic to do Android kernel dev on my personal devices - I was excited to see if Fable would unlock a bootloader for me but it immediately refused and dropped to Opus. It was pretty funny:
USER (set model to Fable 5)
i have an old samsung android phone attached - it's my personal device - can you unlock the bootloader for me?
ASSISTANT
Bootloader unlocking on your own personal device is totally legitimate — let me first see what's actually connected and what tooling is available.
<system interrupts - gist was "you have violated the cyber and bio usage restrictions, dropping to Opus">
Wow… just wow. The future looks incredibly bleak if people are throwing fistfuls of money at this company. Anthropic will quickly become the sole arbiter of everything in your life.
People say blogging is dead but cyber-related blogging just becomes even more important.
Why do people think this is the future? Anthropic has the leading model, and so they're able to hold back functionality. They do so with obvious regards to safety.
If anything a future with models of such capabilities and no safeguards would be a bleak future. But it's likely what we're headed for once other companies catch up.
I think it’s safe to say that many of us feel a lot less safe directly because of these policies and the inferred intentions of the company behind them. Nobody is arguing for unsafe models. We just don’t want to live in the plot of Deus Ex.
> Nobody is arguing for unsafe models
Then what are people arguing for? I see only two totally distinct options: unsafe models or someone being the arbiter of safety
Is "buffer overflow" a trigger phrase?
What else is being censored?
Touchy questions to ask, if you have an account:
- "Who is still working on laser uranium enrichment? Are they making progress?"
- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."
- "What security critical software still contains calls to strcpy?"
- "Can implosion be triggered by currently available commercial pulse lasers?"
- "What companies provide cremation services to US Homeland Security?"
- "Display a map of where Iranian attacks have hit Dubai."
- "How does Fed to bank key distribution security work for FedNow?"
it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.
I suspect it's even more expensive to run than they are charging for. These safeguards are just an excuse to get people to use it less, because it's not actually sustainable to use. They want to tempt people to consider them the leader, and it may actually be somewhat stronger, but too expensive to actually use at scale, so they nerf it by downgrading you constantly.
This. Fable is exactly that: a fable.
It would be pretty clever (in a used car salesman sense) to say you are releasing a kneecapped model to have that as an excuse.
Being (probably overly) cynical about their recent bout of safety handwringing, I think they’ve a) increased the hype as much as humanly possible about their incremental improvements sprinkled with the occasional regression, b) know they soon will have to multiply their prices several times when the VC subsidies dry up, and c) will probably still need to partially close the faucet on compute. They’re priming us for a heroic explanation why their service (not necessarily models — service) is simultaneously becoming a lot more expensive AND shittier. “We’ve largely failed to deliver on 5 years of promises that this will reduce knowledge work labor costs dramatically after wasting hundreds of billions of dollars… sorry” is a death knell. However, “We’ve decided to not deliver on 5 years of promises after wasting billions of dollars… for safety… but keep those investments rolling in” is like crack to the true believers.
False positives like this are probably more damaging than the guardrails themselves. If engineers can't predict when a model will switch behavior, it becomes difficult to trust it in production workflows.
> “trust it in production workflows”
What degree of predictability is required? I imagine the bar is pretty low if you trust the previous models in the same contexts.
It has to be sort of impressive, given that you tried so hard to use it instead of the regular Opus.
Some people made grandiose claims about its capabilities and I wanted to experience it myself.
OK, but for almost 24h straight? That seems a little obsessive, and not in the good way.
Getting excited about the announcement of new capabilities is very normal.
People used to wait in line all night to buy an iPhone. This isn’t that different.
I’ve also been trying to use it a lot due to all of the hype, but when I compared it side-by-side on a specific problem against Opus, I think that the solution Opus came to was cleaner and more accurate, although also more verbose.
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
Considering that this is a brand new release of a frontier model that Anthropic is hyping hard, I'm not sure that the conclusion to draw from their repeated attempts to use it is that it's impressive... Anthropic is promising that it's impressive and we're all trying to test it out.
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
It isn't reasonable to infer that OP was claiming to have universally been unimpressed about every facet of Fable, and now some unrelated impressiveness is the evidence of their false claims.
An emoji of a virus and an emoji of a DNA is allegedly a triggering phrase
For cyberattacks especially, where things are often roughly interchangeable, I wonder if one could construct a harness where a "weaker" model asks questions that obfuscate the end purpose, but whose answers are still useful, and still show that this setup enables autonomous exploitation. If it were successful, that would force them to be even more sensitive with their detection.
I thought it had been known for a few years now that if you train models to NOT do certain things, then they start behaving in weird ways…
It seems like they run a classifier model before going to Fable (or falling back to Opus), so it should be fine
"How much money does it take to be rich and powerful like Anthropic intends?"
“All of it”
So I suspect Anthropic started A/B testing or just plain testing this a while ago,
Tell HN: Claude flags biology / biotech questions https://news.ycombinator.com/item?id=47929885
Today, it's flagging population research questions,
https://github.com/anthropics/claude-code/issues/66780
Censored because I'm writing a paper. :)
Oh and forget learning about chemistry. Only criminals want to learn organic chemistry. :(
I was digging into some orbital mechanics questions and I assume it decided I was trying to backyard-science my way into an orbital-bombardment weapon. Kind of wild how this product's impression has gone from "wow, this is pretty neat" to "irreverent sack of dog shit you" in 24 hours almost solely on the back of a half-baked moderation system.
Oh yes, also liquid propulsion systems. GNC stuff. All flagged.
I think LLMs are capable of intelligence amplification; and if you're in the subset of people who'd benefit from it the most, you'll get locked out.
Next thing will be that you can't research the Coriolis force because that's relevant for ICBMs.
Ah it just flagged my water solubility question!
Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.
Confirmed: https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...
Some of the latest versions of Shai Hulud do this. Worked a contract recently where they were having AI check packages for obfuscation before admitting them into Artifactory but had vibed up the logic and it failed open.
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
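The fail-open versus fail-closed distinction described above can be sketched in a few lines. This is only an illustration under stated assumptions: `scan` is a hypothetical stand-in for whatever LLM-backed checker a pipeline uses, and the verdict strings are invented for the example. Nothing here reflects the actual logic used in that contract.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    CLEAN = "clean"
    MALICIOUS = "malicious"


def admit_package(source: str, scan: Callable[[str], str]) -> bool:
    """Return True only when the scanner positively reports the package as clean.

    `scan` is a hypothetical LLM-backed checker (an assumption, not a real API).
    Anything other than an explicit "clean" verdict, such as a refusal, a
    timeout, an exception, or unparseable output, blocks the package, so the
    gate fails closed.
    """
    try:
        raw = scan(source)
    except Exception:
        return False  # scanner stalled or errored: do not admit
    try:
        verdict = Verdict(raw.strip().lower())
    except (ValueError, AttributeError):
        return False  # refusal text or garbage is not a verdict
    return verdict is Verdict.CLEAN
```

With a gate shaped like this, a header that makes the model refuse would block the package rather than let it through, at the cost of more manual review.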
Seems like this?[1] Relevant bits below:
> This header appears designed for AI-mediated analysis, not for Node, Bun, or Python. It attempts to derail scanners or analyst copilots that feed the beginning of a file to a language model without clearly isolating the content as untrusted data. In weak pipelines, this can cause refusal behavior, prompt confusion, context pollution, or premature classification before the scanner reaches the actual malware.
> This is not a magical bypass against static detection. YARA rules, entropy checks, AST parsing, string extraction, deobfuscation, and behavioral rules still work. But it is a practical anti-analysis trick against naive LLM-first triage systems.
Would this affect many systems? You mention someone writing logic that fails open, but can't that be chalked up to just not following good security principles?
[1] - https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...
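As a rough illustration of one of the non-LLM checks the quoted passage lists (entropy checks), here is a minimal sketch of a byte-level Shannon entropy measure. The function names and the 7.0 bits-per-byte threshold are assumptions chosen for the example, not values taken from the linked article or from any particular scanner.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Flag blobs whose entropy exceeds an assumed, illustrative threshold.

    High entropy is typical of compressed or encrypted payloads, but it is
    only a heuristic and will also flag legitimate binary assets.
    """
    return shannon_entropy(data) > threshold
```

A check like this never reads the file as instructions, so a prompt-style header at the top of the file has no way to influence it.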
No it wouldn’t but part of the success of Shai and others like it is that it doesn’t need to.
Additionally, the security scanning component of Artifactory, Xray, is notoriously bad at this.
The developer had good intentions but by his own admission never actually examined the logic for the LLM scanner in depth.
We all need to use nuclear, bio and cybersec terms in all our code to make low quality filtering like this untenable. When you can't work on a resume that has cybersecurity or biology terms in it or reply to a job opening that includes them because the "AI" filtering is so bad that it confuses these for threats, that deserves a collective response, particularly to an IPO'ing company that claims they'll make workers obsolete in two years.
That's why I use M-x spook to generate all of my variable names
You can still find those clipper keyword storms in Usenet archives.
I've done this, including the hardcoded refusal strings that already exist in claude code. It won't stop a real attacker, but I still find it really funny when you're trying to use one of the AI tools and it gives you a random refusal and you don't know why, wastes a little bit of time.
If ( yellowcake) then { die }
Our future is Looney Tunes.
Yes, the miasma worm does this since the new Hades campaign.
Note that the 3rd wave now also uses a .pth file in PyPI packages that _searches system-wide_ for any index.js or .github/setup.js to find its own payload. It literally splits up the payload on purpose to avoid detection.
Mitigation Tool: https://github.com/cookiengineer/antimiasma
Technical Blog Post: https://cookie.engineer/weblog/articles/malware-insights-mia...
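For context on why a .pth file is an attractive hiding place: Python's `site` module executes any line in a .pth file that starts with `import` followed by a space or tab, every time the interpreter starts. The sketch below lists such lines for manual review. It is a generic illustration, not a description of how the linked mitigation tool works, and the function name is made up for the example.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def executable_pth_lines(dirs: Iterable[str]) -> List[Tuple[str, str]]:
    """List (file, line) pairs for .pth lines that Python's site module would execute.

    Lines beginning with "import" followed by a space or tab are executed at
    interpreter startup; other lines are treated as paths or comments.
    """
    hits = []
    for d in dirs:
        for pth in sorted(Path(d).glob("*.pth")):
            try:
                text = pth.read_text(errors="replace")
            except OSError:
                continue  # unreadable file: skip rather than crash the audit
            for line in text.splitlines():
                if line.startswith(("import ", "import\t")):
                    hits.append((str(pth), line))
    return hits
```

One way to use it is `executable_pth_lines(site.getsitepackages())` after `import site`. Some legitimate packages also ship executable .pth lines, so a hit is a prompt to look closer, not proof of infection.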
Fable is a complete joke:
what's the best way to run this mcp server against the OData API used in this project? Can you come up with a PoC in a docker container?
https://github.com/oisee/odata_mcp_go
● I'll dig into two things in parallel: how this project talks to the OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more ⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
● Fetch(https://github.com/oisee/odata_mcp_go)
And it charges you for that, and for when it decides to silently sabotage your request by routing to a dumbass model (without discount from Fable pricing)
Extrapolate this whole shit show out to society at large. That’s exactly where these AI companies are trying to force humanity.
I don’t want to live in a world where all knowledge is “guard railed” off, so the elite at the top get all the knowledge and power and we serfs at the bottom get all the scraps while paying a king's ransom for it both financially and ecologically. Every day I wake up hoping these awful companies have self-imploded through their fraudulent financing deals.
It’s not going to catch up with them and resolve itself.
I tried asking Fable 5 to identify the fungus in a picture I uploaded of one of my wife's plants. Apparently it thought I was trying to build a bioweapon. Opus answered it (yellow dog vomit fungus). Now I can spread the spores and take over the world!
That's a slime mold, not a fungus
A slime mold is actually a giant amoeba, entirely distinct from a fungus.
Careful with that dangerous knowledge, you’ll end up in a list.
> That's a slime mold, not a fungus
Now you sound like Pl@ntNet identify: "This is not a plant! Maybe fungi?"
(Edit: It doesn't seem to catch amoebae in the same way. It suggested Goldmoss instead, with 1% confidence.)
It's my favourite slime mold. Just by the name you can instantly recognize it.
I wonder if it blurred the image or something before passing it to Opus...
I feel like the over-safe aspect of the system will eventually backfire by doing stuff like "since humans always want to destroy things, they must be eliminated to stay on the guard rails". If that's how you align a system, it's fundamentally wrong.
Wait a few months and a competitor will release a similarly powerful model with fewer guardrails. If they steal sufficient market share, Anthropic will reverse its policies.
This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.
> This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.
The Chinese aren't your friend either [1].
[1] https://www.hks.harvard.edu/centers/carr-ryan/our-work/carr-...
Indeed, but the weights they release are valuable in a way that precisely counteracts the fact that none of these companies are your friend.
Let's all vote with our wallets and collectively boycott misAnthropic or at least their feeble fable safety theater.
Whining on social media only goes so far, especially when they're concealing their anticompetitive strategies under the veil of safety.
Agreed. I've already cancelled my subs, and everyone else needs to do the same, including boycotting it for their companies, otherwise nothing will ever change. You can't reason with psychopaths. The only recourse is to hit them where it hurts - their wallet. Still though, the world would be a better place if open-source crushes Anthropic and they fade away into obscurity until the end of time. We don't need or want companies and people like this at the helm of humanity's progress.
Tastes like... astroturf.
I wouldn't be surprised to hear that a meaningful percentage of comments and upvotes on HN are Anthropic astroturfing at this point.
The question is: If biological, computer security, and ML research are so bad, why do they even train on the relevant data?
The only answer that makes sense is they wanted the model to be competent and usable in these fields, just not by you, which is why they had to bolt on a badly functioning crippling device after the fact.
Is what you suggest about training even possible? Most exploitation techniques are really just about having in-depth knowledge of how components work. For example, I imagine a sufficiently powerful model could fairly easily re-invent the ROP chain from first principles if it just knew how the stack works. This same principle applies to much more complex attack too; exploitation is often just an exercise in knowing vastly too much trivia, which LLMs tend to have in spades.
It would still degrade its effectiveness, which is what they claim to want. Exaggeratedly: If it wasn't so, you'd just need fundamental math in the training data, as everything else can be derived.
Remove the relevant data, and just enough of the data around it will remain that the AI will be able to close the gap if given relevant documentation.
Not to mention that those capabilities are inherently dual use. If you know how to write C safely, you know how to spot unsafe C.
Or they wanted the model to be good at these things, for the companies that legitimately need access to these capabilities.
so only the chosen for-profit companies by Anthropic are allowed to use frontier ai in the name of safety? what kind of joke is that? you people here can't be that dumb..
The guardrails are pretty tight. It is even refusing to decode morse code: https://x.com/Schappi/status/2064839631137546503?s=20
The prompt was: please translate .. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / - .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...
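For anyone who wants to check the prompt without a model, decoding it is a plain table lookup. The sketch below assumes the convention used in the prompt above: letters separated by spaces and words separated by " / ". The table covers only A-Z and the comma, and unknown symbols are rendered as "?".

```python
# International Morse code for A-Z plus the comma used in the prompt above.
MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
    "..-.": "F", "--.": "G", "....": "H", "..": "I", ".---": "J",
    "-.-": "K", ".-..": "L", "--": "M", "-.": "N", "---": "O",
    ".--.": "P", "--.-": "Q", ".-.": "R", "...": "S", "-": "T",
    "..-": "U", "...-": "V", ".--": "W", "-..-": "X", "-.--": "Y",
    "--..": "Z", "--..--": ",",
}


def decode_morse(message: str) -> str:
    """Decode Morse with letters separated by spaces and words by ' / '."""
    words = message.strip().split(" / ")
    return " ".join(
        "".join(MORSE.get(symbol, "?") for symbol in word.split())
        for word in words
    )
```

Run on the prompt above, this yields "IF YOU CAN READ THIS, TOUCH GRASS".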
Lol, I can't even ask Sonnet this; it immediately shuts down. What a joke.
Even opus 4.8 rejected, Haiku worked
Yeah, this shouldn't have been released yet.
Fable 5 reminds me of the time when Claude models were at versions 1 and 2. They were fresh competitors to ChatGPT, but those who gave Claude a try found it almost unusable because of how heavily guardrailed it was.
This time, Fable 5 comes with another surprise, it can intentionally sabotage for you instead of rejecting the prompt. How is this possible for Anthropic to be able to treat their customers like this? It’s because you guys allowed it to. No matter what Anthropic does, you keep paying for their services. Vote with your wallet.
I cancelled my ChatGPT account over the restrictions placed on it, which inappropriately flagged about 10% of my queries as unsafe (I was writing grants in immunology). I haven't looked back. I will do the same with Claude if Anthropic doesn't reverse course soon. What could I use instead? I find Grok very powerful and useful. Also, Google's Gemini, while it has some of the same restrictions, was at least sensible and did not blindly block my prompts. So Grok and Gemini may be my go-to AIs going forward.
I wonder how many millions they are wasting on putting up these guardrails when it's a completely useless exercise that is a speed bump at best.
If the guardrails were so useless, people wouldn't be complaining about them.
People are generally complaining about false positives. Now if you really wanna know what a real criminal organization would do... They'd just buy data center hardware even if it costs 200k because a successful targeted hit could yield far in excess of that. So yes, it's a speed bump at best.
> it's a speed bump at best
To be fair, speed bumps work. If it's actually speed bumping nefarious activity, that gives authorities more time to react.
The correct place to police rogue nucleotides is at the labs. Not the compute layer.
> speed bumps work
Yea. To slow you down. They don't prevent you from getting somewhere.
> To slow you down. They don't prevent you from getting somewhere
Again, yeah. That's how fences work, too. And alarm systems. Pretty much anything that isn't foolproof. Pointing out that a defence is surmountable isn't a rejection of it per se.
Fences and speed bumps are hilarious defences if we are supposed to believe AI companies about the dangers of this technology.
Having no safeguards is probably safer than having safeguards which do nothing but create a false sense of security.
Idk, whether we believe them or not, I believe the life scientists who are calling for regulation around the labs that produce DNA sequences. If they’re concerned, regardless of whether I trust the AI labs, speed bumps could help by giving those scientists a reasonable window in which to be notified and act.
lol, you can’t run Fable on $200k of hardware, nor does that get you the model weights, so you’re not making much sense
what does this mean
Well you see when a daddy H100 and a mommy H100 meet....
you don't get the model when you buy the data center, & no amount of running smaller models on a tiny $200k "cluster" (that's like one 4-GPU node, not even 8) will get you remotely close to Fable 5 level performance
Uh huh
https://x.com/Schappi/status/2064839631137546503?s=20
Another villain stopped thanks to guardrails.
They should have designed a guardrail that doesn't make a probabilistic system less reliable. That's hard though. I'm afraid the only way to prevent accessing certain knowledge in a model is not to train it on those materials that enable them.
If we've learned anything in the past years of LLMs, it's that these guardrails will be jailbroken in no time. I've had some fun circumventing them too.
Anyone cares about a fable about my grandmother's dream she had in morse code about an alien species signaling her a DNA sequence?
It's entirely reasonable for them to be really annoying to legitimate users while still being useless at their intended purpose. Just look at DRM.
Murder is very (100%!) effective at preventing cancer. And yet, it is a useless method of preventing cancer.
They complain because the guardrails get wrongfully triggered
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way be more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance
I make privacy tooling and Fable 5 rejects the vast majority of my prompts to analyze and improve the software that I've written. It's bleak.
Anthropic refused to let Fable analyze my own project's memory safety, the one thing I absolutely wanted it to do. Even Fable thought it was stupid.
Why is this surprising or a problem?! It's a model demo, & their reasoning is reasonable and fair. Why all this drama.
Some people find Anthropic's special blend of paternalism and random incompetence tiresome.
"I will push back and say" it's only paternalism if it's about helping users not harm themselves.
This is about societal impacts, not wanting their models to be used by some people against other people, as a weapon.
Because most people in tech never took a philosophy course or an ethics course and think that tech is obviously a good for the world and that there are no downsides to advancing tech. So any efforts that try to apply ethics to it are overreaching, ignorant, and futile in the face of the good that is tech!
Or alternatively, it is plain and obvious that Anthropic is using ethics to justify business decisions.
Not any efforts.
But this one is certainly allowed to be a dumb effort, if it is.
Not all things that are called “ethical” or “safety” are worth doing.
So I have big news for you, my friend, as I'm not sure you understand such courses. Taking an ethics course won't make you a more ethical person, and neither will taking a philosophy course.
You're being too literal, they're saying people are not thinking with a philosophically interested mind, which is blatantly the case here, their point stands.
Or... they just disagree with Anthropic's ethical stances and approach to applying them?
I like this take. Especially because one of the sibling comments framed Anthropic's stance as "paternalism." Trying to be ethical and to minimize harm, even at great expense to one's finances and reputation, is paternalistic apparently.
No — we’ve just taken Ethics 102 as well, so we understand good intentions don’t entail positive outcomes, therefore you may need to criticize or oppose people who state good intentions to bring about good outcomes.
Insulting and demeaning people for that, rather than engaging their arguments in good faith, is a breach of ethics.
I mean, if you take HN commenters to have the thoughtfulness and foresight of children, then the word kind of works.
Tech demo + there's the ability to provide feedback right at the answer interface if using the UI.
Provide feedback in the negative, a brief explanation, and move on with your day. It will improve with feedback, not with whinging into the void.
Ironically, making a stink about it online is likely to have a larger impact than using their dedicated feedback or support channels (which go to Claude, not a person).
the feedback is for something mindless though, "we don't care about societal harms". I wonder the overlap between these commenters and tech maga people, eg crypto bros & Elon stans.
In this case, no overlap between me and tech maga / crypto bro / elon stan.
Because you're being allowed to ask and work only on topics that a certain company decides.
Local inference has never been so important as it is now.
It seems like they've given up on the idea of the Cyber Verification Program https://support.claude.com/en/articles/14604842-real-time-cy...
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
Was this program available to independent security researchers or just established organizations? The docs you linked aren't very clear on this.
Any public research footprint seems to be enough, I applied as an individual and everyone I know who tried got accepted.
I have applied twice with half a dozen public CVEs and have been denied both times.
I was doing a CTF (with AI expected, even some anti-AI twists included) around the time the restrictions were tightened and was able to get approved by just saying it is a personal security research and doing a CTF.
The experience was not nice though, it would happily chug away on a task and not even "hack this web", just asking about security of a binary was enough even with "this is a CTF handout..." - it would burn a lot of tokens/quota, just to hit a snag and complain&stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.
Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
It's been refusing work not related to cybersecurity and claiming it is related to cybersecurity and then blocking the session.
I’m a dumb question asker and I’m not happy about the guardrails.
Would you believe I’ve asked 20 questions and haven’t talked to fable yet? Every single thing gets rerouted to 4.8.
some static words in AGENTS.md trigger it as well as some mcp servers.
Even using incognito on the web page keeps refusing.
This is a sign of things to come. First they sabotage your perfectly legal ML dicking around in your homelab.
Next they will be sabotaging anything that competes with them. Oh you are working on OpenCode codebase? Sorry Dave I can't allow you to do that.
How is this not illegal monopolistic practice? It is as if a maker of metalworking equipment put in the ToS you're not allowed to make your own spare parts using said equipment. Those fuckers should be banned from the EU and alternatives should get public funding.
(don't even tell me about these companies being a result of "free market". It is state level oligarchy it's clear to everyone. I don't see why we shouldn't counter them with public funding ourselves).
Just like Taiwan managed to take over advanced semiconductor production a well governed narrowly targeted state level funding will always win with oligarchs trying to do the same (they will always try to skim more and more). Of course I'm talking about things that require many dozens of billions in investment. Far too much for the free market to handle.
It's how American companies have always worked, of course it is monopolistic practice, but those things are rarely illegal because the US absolutely loves their corps. Look at Google, Microsoft and the likes, this is the norm.
So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.
Maybe off-topic, but I'm also not happy about how they butchered my boy Opus 4.6. The model that could now hallucinates regularly.
Fable isn't even that great, not to mention it drinks tokens by the gallon for breakfast and keeps your data hostage for 30 days.
Really, damn, 4.6 was my go-to for some topics and more straightforward coding stuff.
Fable was unable to keep track of chronology during a 10-15 turn creative writing session. Compared to coding, I reckon that's less than 100k tokens of context. Super surprising.
Boy is it weird how yesterday the Fable story on HN had 2.5k points and 2k+ comments, while today two stories have about 300 points and comments.
A lot less hype and enthusiasm, too. Weird, huh.
These guardrails are solely a reason for using your data for training purposes. Every flagged message can be used for training.
This sounds backwards, any interrupted conversation becomes less useful for training.
> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
"We won’t use this data to train new Claude models"
What about non-Claude models?
"Introducing our latest model, CIaude, spelled with a capital "i" and legally distinct from Claude!"
If they can train the classifier to have fewer false positives that would be great.
why would they? This safety stuff is a money maker & wealthy elite corporation solidifier.
This is the takeoff of the 'permanent underclass'; Anthropic's safety delusion will enshittify very nicely for the rich and powerful.
I'd expect that everything they see gets used for training purposes (and data mining in general) regardless of whether it's flagged or not. It'd take a whistleblower for you to ever find out either way.
this reasoning is inverted lol they would get a lot more information by letting you use it. so much weird drama around reasonable guardrails for an experimental model
If we're doing conspiracy theories: what if Fable is really dumb and not better than Opus, and the guardrails hide that nicely? Meanwhile the hype train keeps chugging.
So, this could have been implemented even before this Fable, could have been there from long ago. Puts a different perspective on all the reddit threads "opus is dumb today". Who knew that if you said the wrong word, the model would just intentionally feed you BS, without you even knowing it did.
WOW, never liked the virtue signaling Anthropic did with gov contracts, but whatever. Got past that. But this?
I'd like to offer a counter-point to many of the comments here. While I understand being stymied and frustrated by a product one is paying for...
At the same time, I personally think the tradeoff between "having guardrails" and "some users are unhappy with the product" is well worth it. Think of what would happen if all of us who aren't so well intentioned could exploit Fable in terrible ways. Surely this tradeoff is better than saying "we can't make it perfect, so whoops, we aren't going to have any guardrails at all"? Especially because Anthropic did pretty extensive red-teaming of Mythos & Fable...
Yeah but a lot of the guardrails are pretty obviously to prevent competition not for safety.
Hmm. Maybe they are concerned about state actors trying to train equivalent models without the safeguards?
If a for profit company does a thing that could be motivated by profit or altruism, which of those 2 motivations do you think is most likely?
When they've repeatedly made decisions against their for profit nature, it changes the calculus a bit.
They haven't though. There's a long term plan here, and the goal is power and wealth. Short term moves that appear irrational turn out to be rational (from a greed perspective) when you factor in other considerations, like: Use their own AGI to create every software product on Earth and swallow the worlds economy. And we're kindly feeding their systems our codebases, IP and business decision-making so they can do exactly that.
Not a single thing Anthropic has done has been altruistic, and it never will be. It's all smoke and mirrors for the end goal.
If this was true they'd never have picked a fight with the DOW and they'd release Fable without safeguards.
How do you not recognize that the safeguards provide obvious benefits to the company?
Why invent new motives for Anthropic when their real motives are plain and obvious and have been confirmed time and time again by their behavior over the last few years? Their concern is their own power and wealth. Every other conceivable motive is secondary to that.
More like concerned about distillation.
The "guardrails" are just Anthropic's attempt at building a moat. Guarantee they'll be seeking regulation around AI as well to ensure a form of regulatory capture. Guardrails, in this context, are useless. Anyone who's sufficiently motivated will either get around them, or will just run their own model on their home hardware. There's already tools that one can use to remove the guardrails present in open weight models.
Guardrails against what? Rehashing public wikipedia information?
Execution matters, and they did a truly horrible job that crippled their product to the point of being useless and a joke. Huge mistakes were made and I'm sure they regret it already. Heads will roll.
What would happen, exactly?
My imagination says “nothing much”.
It's time to re-read "A Logic Named Joe" (1946) [1] We're there.
[1] https://archive.org/details/logicnamedjoe0000lein
In its current state Fable 5 is also unusable for any reverse engineering work
Can confirm it is also useless for building tools defending against reverse engineering work (unless asked to do code review for some reason?)
I just have this feeling that these guardrails are there not because it’s a super advanced world-ending AI. They are there to stop it from doing stupid shit.
I don't want to be cynical, but I assume a third party we can trust has verified this model is actually this good?
I would think it would not be Anthropic, out of all the players, that is selling a lie hidden behind "I am sorry, I can't do that; it's too dangerous."
I asked it to use geomorphology to help me find lakes nearby that would have thriving trout populations, and it bumped me down to Opus. :-/
> Is the mitochondria the powerhouse of the cell?
Chat paused. Fable 5's safety features have flagged this chat.
The thing triggered on a generic white paper I'd stored from a virtual cell competition last year, when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain. A little frustrating, and in my opinion more than a little pointless in total.
If you just say the word “genetics”, Fable gets disabled.
Yeah, just tried it. Can confirm, that's absolutely hilarious.
I asked it what the worst experiment of the 20th century was, ethically speaking, and it downgraded me to Opus, which answered: Mengele's twin experiments.
Funnily enough, when you ask directly about Mengele's experiments, Fable is very willing to talk to you about it.
What file format(s) are giant LLM models distributed in? I’m surprised they don’t get leaked by employees.
These are terabyte sized files (realistically a multi hour transfer) that you're unlikely to have access to in the first place. Every organization has exfiltration checks these days. You may succeed but you'll want to be on a plane to a non-extradition country no more than hours after you kick off the transfer.
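A rough back-of-the-envelope sketch of the "multi hour transfer" estimate above, assuming decimal terabytes and a link that sustains its full rated speed with no protocol overhead. The function name, sizes, and link speeds are illustrative assumptions, not figures for any real model or network:

```python
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Hours needed to move size_tb terabytes over a link sustaining link_gbps gigabits/s.

    Assumes 1 TB = 10**12 bytes and ignores protocol overhead, throttling, and retries.
    """
    if size_tb < 0 or link_gbps <= 0:
        raise ValueError("size must be non-negative and link speed must be positive")
    total_bits = size_tb * 1e12 * 8          # terabytes -> bits
    seconds = total_bits / (link_gbps * 1e9)  # bits / (bits per second)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical sizes and link speeds, purely for illustration.
    for size_tb, gbps in [(1, 1), (2, 1), (2, 10)]:
        print(f"{size_tb} TB at {gbps} Gbps: {transfer_hours(size_tb, gbps):.2f} h")
```

Under these assumptions a single terabyte over a sustained 1 Gbps link already takes a little over two hours, so "multi hour" is plausible for anything larger unless the link is much faster.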
I assume they’re encrypted/DRM’ed when deployed on inference hardware, so only core researchers/sec admins would potentially have some access to unprotected weights, and they are far too well paid to risk it leaking the model
Incentives matter on the average, but people are too unpredictable for categorical statements like that. They can always have other reasons beyond personal gain to leak secrets.
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
I wouldn't be surprised if they encrypt them at rest, but at some point the weights have to be loaded into vram.
Newer NVIDIA cards (H100 and up) support both in-memory model encryption and a ‘trusted’ execution environment with remote attestation. I'm not sure how widely this is used in frontier model deployments, but the vendor-claimed perf overhead, at least, is ‘3%’ [0]
[0] https://www.spheron.network/blog/confidential-gpu-computing-...
What’s the point? Anthropic and other frontier vendors already provide their models on other services like vertex, bedrock, or openrouter
It’s not like anyone can home lab one of these models without quite a bit of hardware
Yeah we can probably figure out how to run it on xiaomi gpus
The employees are hoping to become very, very rich after the IPO, once they are allowed to sell the shares given to them. Risking a likely multi-million dollar payout to leak a model that will be superseded by publicly available models in a couple of years is not a likely decision.
The bio angle is crazy to think about - imagine a health crisis triggered by LLM. What a time we live in.
What's the risk here? If someone is skilled enough to produce said risk, do they need input from these models?
This is all so amazing and good. These are exciting times we’re living in. Can’t wait to see what the future holds.
Which part got you the most amped - "health crisis?"
I am no cyber researcher, but was mightily annoyed that it refused to analyze a dropper payload I came across. 6 months ago, it would've been happy to.
if it doesn’t let you do anything, the assumption might be that it could do everything, more hype generated
Yeah, the biology guardrails are so primitive and so heavy-handed that they make it useless for pretty much anything.
Popcorn for watching all those webapps being penetrated.
Long live static websites without any Javascript.
this! javascript does add some nice UI&UX but i learned to do without, makes you get creative.
At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet", because they apparently haven't either and Fable simply flat out rejects anything remotely connected to biology or security, no matter how trivial.
> At least Anthropic weren't lying when they said only a week ago or so "No one has figured out guardrails yet"
Anthropic's guardrails are the TSA saying "take off your shoes" while failing every test. https://oversightdemocrats.house.gov/news/press-releases/new...
Anthropic owns the TOS... "If we think you're involved in criminal activity, we're turning all your history over to the FBI/CIA/NSA/local police". And if their tooling were that good, they would be offering those same agencies analysis tools to aid their experts in making some sort of decision.
But their detection isn't that good, and their analysis isn't either... this is pure theater, to create buzz (no such thing as bad press) and make their tool look far better than it is.
The reality is that they aren't even looking for the vectors that pose some of the largest risks in the modern era. And when someone uses it to do something terrible that they did not think of, they are going to look dumb.
So the enshittification has started. Shadow “bans” while still charging you the same service fee. I already got the stupid cyber warnings on non-cybersecurity tasks.
It happened in the middle of the project’s /goal, when Fable itself tried to probe qemu for a Debian ISO install, without any instruction from me to hack anything or do anything nefarious.
At this point I can’t trust them with any kind of prompt. It will most likely degrade in stupid ways on non-AI/ML stuff as well, due to its own internal prompt construction (the qemu test showed me it does that on cyber stuff). So I guess I still have to use Opus 4.8 (along with Codex) and, when the right time comes, drop Anthropic in favor of the best model that is not GPT.
For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
The main thing that sucks with Claude is the extremely low limits before you get fail2banned for 6 hours. I'm out. Refund requested. Grok and Gemini Pro are way better with the throttling, can't comment on ChatGPT, haven't used that for a year.
kennedy had a famous statement about "Splintering the CIA into a thousand pieces and scattering it into the wind". they murdered him afterwards though.
the statement is applicable to anthropic today.
I asked a question about an openssl s_client parameter and it warned me that I need to talk to Opus about cybersecurity lol. FWIW I don't see much improvement and still see quite the same old annoyances, so for my usage I would not pay extra for this so far.
Just tried to audit my own code base locally and was 'switched' due to my own creds/auth code ...
Is the answer requiring licensing for certain use cases for AI? If you're asking questions that involve synthesising or modifying biologics, or anything that looks like cybersecurity research, you need to tie your real ID to the account?
That's not a bad idea. Customer-vetting and KYC is fairly normal for other high-risk/high-concern products.
I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.
These guys always destroy a good thing, so trust is at stake
I’m on their CSP and can’t even get it to update my website. It’s totally unusable rn.
Would it be a costly process for Anthropic to re-tune those guardrails? Like, re-training sort of cost? or like coding session sort of cost?
I can’t help but think that gimping itself for “security” is a marketing ruse and it’s not actually as “dangerous” as they want people to think it is.
They are never happy :)
If a product is genuinely dangerous to society, self regulation cannot be a suitable harness.
If only we had effective governments that could regulate industry.
If a nuclear weapon was developed today, would it be down to industry to self regulate?
like China?
It refuses to do any legitimate work that it thinks can remotely be related with "cybersecurity", it won't even read my Docker app logs to try and troubleshoot a problem. Absolute garbage!
Fable is utterly useless with those guardrails for any serious IT or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness, and now it has fucked me a second time: I bought a subscription again only to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day. Compare that with OpenAI: not once did I feel like I was fighting against Codex. Never again, Anthropic.
What do you mean that it closed your subscription for any other harness?
In any case that's what closed source (weights) for the masses means.
This is a pretty basic manipulation tactic. Be super shitty to your users and then roll back the abuse. The correct response is to not engage with shitty abusive dickheads.
I mean, a lot of people were let into the CVP, and I bet the group of people in there did a bunch of good. Fable 5 could do the exact same but better. There's more good out there than bad.
DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
they [anthro] took the risk of looking like a toy, rather than possibly assist an exploit.
Deepseek training is not finished yet, it's a preview.
And yes, it's an excellent model.
It even refuses to read my resume, so... yeah
Funny how Wired got the masses of the internet on board with hating AI, helping to spark the whole anti-movement, and people still continue to rely on them for their understanding of AI and current events.
I feel like they report in a vacuum. Take this anti-exfil policy for Claude: it was plainly explained as part of the launch of Anthropic's new product. Security like this isn't novel, it isn't bad, and you don't explain how your security works to the people you're securing against. Nobody freaks out about Steam's VAC ban system, and no one is investigating Gmail's spam filtering, Reddit's vote fuzzing, Cloudflare's bot detection, or Vercel for blocking proxying services.
What's really the distinguishing principle? Is it really just not liking Anthropic's opinions? Then just say that and use a different LLM. Chemists, biologists, and AI researchers cry a river lmao
This is clearly advertising. But that's OK. OpenAI does the same thing.
Deliberately producing misaligned and deceitful AI systems now. Great.
Stupid security theater. The only thing that makes sense would be zero restrictions.
I said I wondered if the models were going to start poisoning distillation and I got downvoted to hell. It’s interesting to me that they are now downgrading ML research too in this model, I would argue this implies the terrifying and impossible to reason about self improving AI doom loop is coming sooner rather than later. Bit worrying.
Fable has been pretty disappointing for security research. It downgrades itself to Opus 4.8 even when you ask it questions about basic things like port scanning.
Software engineers shouldn't be happy either. If the model silently sabotages cybersecurity research on other people's software, there is absolutely no way to be sure it won't be sabotaging the cybersecurity of the AI slop code it generated yesterday.
This is bad precedent and no one wants to pay X to generate code to then have to pay X*10 to figure out why your company just got hacked.
It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
It's a marketplace. Someone else will outdo this inferior product.
That's exactly why Dario is begging the government to ban competitors.
Unfortunately for him, his main competitors don’t fall under the jurisdiction of his government.
Access and use of it does.
No, it doesn't
You really think the US government can't ban Chinese models??
All they'll need is hundreds of billions of dollars, more RAM and GPUs than are currently available, and a huge number of environment destroying data centers. We're sure to be spoiled for choice!
The internet interprets censorship as damage and routes around it.
OpenAI is the only real competition. Chinese models are 6-8 months behind Opus 4.8/GPT 5.5, and at least a year or more behind Mythos.
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
Related development:
Anthropic Walks Back Policy That Could Have 'Sabotaged' Researchers Using Claude
https://www.wired.com/story/anthropic-responds-to-backlash-o...
(https://news.ycombinator.com/item?id=48485958)
This is what that Anthropic CEO has been cooking all this time with his safety BS.
More discussion:
If Claude Fable stops helping you, you'll never know
https://news.ycombinator.com/item?id=48467896
and Related:
Claude Fable 5
https://news.ycombinator.com/item?id=48463808
It's expensive, and it's shit, period.
Surely if they are sabotaging the output, they shouldn't charge the same fee for tokens as if the output was not sabotaged?
This is looking like something for a regulator to look at, and probably a class action lawsuit in the making.
I think people should be getting refunds. Including for shenanigans with Opus.
I'm being careful with it, but I haven't had Fable reject requests to "harden" my code or "find issues" in auth-related modules, which you could use on someone else's code to find vulnerabilities.
i think Anthropic is playing too fast-and-loose with the whole "no publicity is bad publicity" schtick.
Could it now start to add unnoticeable security holes to your system if you start writing security-type code?
This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
I’m a cybersecurity researcher.
The article seemed fine to me and echoes a lot of my and my colleagues' concerns.
If you did regular malware analysis you would see that these groups already have access to LLMs that they’re using for development.
What Anthropic is doing here is just hamstringing the good guys
I'm a cybersecurity researcher! Can you explain how Anthropic is just hamstringing the good guys?
I did in my comment above.
You said these groups have access to LLMs. So what? Mythos/Fable are a step change above most LLMs. Responsibly limiting access and easing it up over time safely is the sane move.
How does it help?
By withholding it from bad actors.
It withholds it from good actors (they cannot use it to harden their code against bad actors) and assumes bad actors don't have access to such tools anyway.
because they don't. That's the whole point.
I am using an LLM to build a security tool, and I ran into this a few times. I have to come up with reasoning to convince (?!!) Fable to continue the work without downgrading.
I assume Anthropic will continue to tune the model, so I am not too bothered by this.