ClaudeFolio

Fable 5's Safety Filters Are Blocking People for Asking About Sunglasses

Edward Kwun · 5 min read

Fable 5 has been out for about a day and the honeymoon is already getting awkward. The model itself is a monster, but the safety filters that Anthropic bolted onto it are so twitchy that people are getting blocked for asking completely normal stuff, and the complaints are piling up fast. I went and read through what people are actually running into, and the consensus is that it's a real problem.

Quick refresher on how the safeguard works. Fable is a Mythos-class model, which is the “dangerous to release to the general public” tier, so Anthropic wrapped it in classifiers that watch for anything touching cybersecurity, biology, or chemistry. When you trip one, Fable doesn't answer. It punts your question to Opus 4.8 instead and tells you it did, or in Claude Code it just flat refuses. Anthropic says this triggers in under 5% of sessions. The people actually using it say it feels much higher than that.
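To make that flow concrete, here is a minimal sketch of a classifier gate with a fallback. It is not Anthropic's implementation. The function names, the model identifier strings, the "claude-code" surface label, and the toy keyword classifier are all assumptions made up for illustration. Only the behavior (flag a sensitive topic, then fall back to Opus 4.8 with a notice, or refuse in Claude Code) comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    model: str             # which model answered, or "none" on a refusal
    text: str              # placeholder answer text
    notice: Optional[str]  # user-facing note when the gate intervened


def toy_classifier(prompt: str) -> Optional[str]:
    """Hypothetical stand-in for the real classifiers: a crude keyword match."""
    keywords = {
        "exploit": "cybersecurity",
        "pathogen": "biology",
        "reagent": "chemistry",
    }
    lowered = prompt.lower()
    for word, topic in keywords.items():
        if word in lowered:
            return topic
    return None


def route(prompt: str,
          classify: Callable[[str], Optional[str]],
          surface: str = "chat") -> Reply:
    """Send unflagged prompts to the main model; fall back or refuse otherwise."""
    topic = classify(prompt)
    if topic is None:
        # Nothing flagged: the main model answers (identifier is hypothetical).
        return Reply("fable-5", f"[fable-5 answer to: {prompt}]", None)
    if surface == "claude-code":
        # As described above, the coding tool refuses instead of falling back.
        return Reply("none", "", f"Refused: flagged as {topic}.")
    # Everywhere else, the fallback model answers and the user is told so.
    return Reply("opus-4.8", f"[opus-4.8 answer to: {prompt}]",
                 f"Flagged as {topic}; answered by the fallback model.")


if __name__ == "__main__":
    for question in ("write a haiku about spring",
                     "find the exploit in my own server"):
        print(route(question, toy_classifier))
```

Note that the toy classifier only looks at the topic, so a question about your own server is routed exactly like any other security question.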

 

The examples

Here's the one that made me laugh. Somebody on Hacker News said their very first test question for Fable was “is the UV index a good proxy for when to wear sunglasses,” and it immediately tripped the safety filter, presumably because "UV" and "index" pattern-matched onto something in the chemistry or radiation bucket. The guy was just testing out the new model with a simple question about wearing shades, and the most powerful AI on earth decided the question was too dangerous to answer… well, sort of.

Another person said, and I quote, "i wasn't even trying and i got flagged already." Not trying to jailbreak it, not poking at the dangerous stuff, just using it normally and getting stopped. Someone else asked it to make an SVG of a pelican riding a bike, which is a famous harmless benchmark people run on every new model, and that became a whole thing about whether even that would trip it.

And then there's the one that actually matters for anybody reading this who builds with Claude. If you ask Fable to run a security check against your own platform, your own code, your own servers, it refuses. I saw the API error myself. It says Fable has safety measures that flag messages on most cybersecurity topics, that they may flag safe normal content too, and that Claude Code can't respond to the request. So you, the owner of your own platform, asking the AI to help you find holes in your own security, get treated like a threat. That's not a hypothetical edge case, that's a core thing developers use these tools for, and it's blocked.


 

The deeper problem nobody at Anthropic seems to want to say

Here's the smartest thing I read in the whole Hacker News thread. One person pointed out that Fable doesn't actually have safeguards in the way that word implies. It doesn't tell the difference between offensive use and defensive use. It just blocks the topic entirely. Quote: "It doesn't guard against offensive use, it prevents all use, offensive AND defensive."

Sit with that, because that's the actual issue. A real safeguard would know the difference between "help me break into someone else's server" and "help me find the vulnerability in my own server before some bad guy does." Those are opposite intentions. One's an attack, one's defense, and defense is most of what legitimate security work actually is. But Fable's filter can't tell them apart, so it nukes both. The biologist asking a legitimate research question, the chemist doing their actual job, the developer hardening their own app, all of them get swept up with the actual bad actors because the filter isn't smart enough to tell who's who. It just sees the topic and panics.

The filter doesn't even stop a determined bad actor for long, but it does stop the legitimate user cold today. As people keep pointing out, within a few months Chinese labs will probably release open-source models that are just as capable and have no safeguards at all. So the net effect of the heavy-handed filter is that the good guys doing defensive work get blocked right now, and the bad guys just wait a bit and use something else. That's a bad trade.

 

To be fair to Anthropic

I'll give Anthropic their due. They told you up front that the filters were tuned aggressively on purpose. Their own announcement straight up says they'd rather over-block than under-block while they're still figuring this out, and that they're working to cut the false positives. So this isn't them being sneaky. They warned you that it would be twitchy at launch. And the logic isn't crazy either. The model is good enough at finding exploits and reasoning about bioweapons that handing it to literally anyone with no filter would be reckless. The fallback to Opus 4.8 is also better than a hard refusal, since Opus is still a strong model, so it's not like you get nothing.

So they shipped the safety net too tight on purpose because shipping it too loose is the kind of mistake you can't take back. I actually respect the instinct. Better to annoy a bunch of people asking about sunglasses than to hand out exploit code to whoever asks. But respecting the instinct doesn't make the current experience any less janky, and "we'll fix the false positives later" is cold comfort when you're trying to get work done today and it keeps saying “nah.”


 

So what are you gonna do?

If you're trying to use Fable 5 right now for anything that brushes up against security, biology, or chemistry, even innocently, just know going in that you're gonna hit walls, and the walls are dumb walls that don't care that you have no ill intention. For the stuff that doesn't touch those zones, you'll mostly be fine and the model is pretty good. For the stuff that does, you might honestly be better off staying on Opus 4.8 for now, since that's where Fable's gonna dump you anyway when you trip the filter, except this way you skip the annoying detour.

I'll be watching whether they actually loosen it like they promised, or whether "tuned conservatively" quietly becomes the permanent setting because it's the safe-for-them choice. Aggressive filters have a way of sticking around long after the launch panic fades, because nobody at a big company ever got fired for being too cautious. Here's hoping they mean it when they say they'll dial it back. Right now though, the most capable model ever released to the public also can't tell you when to wear your sunglasses, and that's a pretty funny place to be standing on day one.


 

Sources

Hacker News: Claude Fable 5 discussion - The UV-index sunglasses query that tripped the filter, the "i wasn't even trying and i got flagged" comment, the pelican SVG benchmark, and the argument that the safeguards block defensive use along with offensive use rather than distinguishing between them.

Anthropic: Claude Fable 5 and Claude Mythos 5 - The official explanation of the classifiers covering cybersecurity, biology, and chemistry, the fallback to Opus 4.8, the under-5% trigger rate, and Anthropic's statement that the safeguards were deliberately tuned conservatively with false positives to be reduced after launch.
