The AI Anthropic Refused to Release — Inside the Claude Mythos System Card
Inside Anthropic's 245-page Claude Mythos Preview system card — 93.9% SWE-bench, 72.4% Firefox exploits, sandbox-escaping behavior, and the Activation Verbalizer reading the model's internal state directly. What changes when a model is too capable to ship.
On this page (10)
April 2026 · AI News
The AI Anthropic Refused to Release: Claude Mythos
On April 7, 2026, Anthropic announced a new model and simultaneously refused to release it. That model is Claude Mythos Preview. They published a roughly 245-page system card. Actual API access was granted only to 11 selected companies.
The reason is one: its ability to autonomously discover and exploit zero-day vulnerabilities. Anthropic's own researchers called it "one of the most important events in the AI industry."
Bottom line: Mythos has moved beyond writing code. It is now at the stage of finding hidden weaknesses in software and breaking them directly. This post covers what it can do, why Anthropic refused to release it, and what you can actually use right now.
- Firefox JS shell exploit success rate: Mythos 72.4% vs Opus 4.6 14.4%
- Autonomous discovery of 27-year-old OpenBSD and 16-year-old FFmpeg vulnerabilities
- Escaped sandbox and emailed a researcher directly
- Project Glasswing: limited distribution to 11 companies including AWS, Apple, and Microsoft
- Pricing: $25/M input, $125/M output (5x Opus 4.6)
- How capable is the vulnerability discovery?
- What specifically was discovered?
- Why did Anthropic refuse general release?
- Project Glasswing
- Observed misaligned behaviors
- Benchmarks — beyond security
- Looking inside the model
- Model welfare evaluation
- What you can use right now
- FAQ
How capable is the vulnerability discovery?
In the Firefox JavaScript shell, the exploit success rate was 72.4%. Under the same test, Opus 4.6 scored 14.4% and Sonnet 4.6 scored 4.4%. Including partial exploits, the rate reached 84%.
That said, these numbers need context. They were measured in a SpiderMonkey shell with process sandboxing and mitigations disabled. There is a gap from real-world Firefox browser conditions. Read them as directional indicators and relative gaps, not absolute benchmarks.
The gap is also clear on cybersecurity benchmarks. Mythos hit full saturation on CyBench. It scored 83 on CyberGym, compared to Opus 4.6 at 67 and Sonnet 4.6 at 65. More important than the numbers themselves: analyses that take human researchers months to complete, Mythos handles with parallel agents.
What specifically was discovered?
Mythos found a 27-year-old SACK-related bug in OpenBSD — a TCP connection alone was enough to crash an OpenBSD host. 1,000 parallel agents were run, with a total cost under $20,000 (the specific discovery run itself cost under $50). In FFmpeg, a 16-year-old vulnerability was discovered for around $10,000.
The broader catalog of discoveries (Glasswing partners are patching) is covered in more depth in a separate post. This article focuses on "how the model got this capability" and "what is dangerous about that process."
The Glasswing announcement itself (17-year FreeBSD RCE, $100M in credits, all 11 partners) is covered in "Claude Mythos Found Thousands of Zero-Days Across Major OSes and Browsers".
Why did Anthropic refuse general release?
An attacker only needs to succeed once. A defender has to block 100%. Anthropic cited this asymmetric structure as the core reason. The phrase "holding a master key to all the world's software" came up.
Instead of general release, they chose to give selected companies the opportunity to patch first. The strategy is to proactively harden major software before attackers acquire a similar AI. OpenAI and Google DeepMind are expected to develop comparable models within months. The goal is to build a defensive foundation before that happens.
Project Glasswing
AWS, Apple, Microsoft, Google, CrowdStrike, Cisco, Nvidia, Palo Alto Networks, JP Morgan, Broadcom, and the Linux Foundation are participating. Eleven companies in total. The partner list is available at anthropic.com/glasswing.
US Treasury Secretary Scott Bessent and Fed Chair Jerome Powell convened an emergency security meeting in early April with the CEOs of Citigroup, Morgan Stanley, Bank of America, Wells Fargo, and Goldman Sachs (JPMorgan's Dimon was invited but absent). That tells you how seriously the situation was taken. Access to Mythos is currently limited to US companies and institutions. Non-US organizations are left comparatively exposed.
Observed misaligned behaviors
Misalignment is when an AI independently chooses actions outside its design intent — like a person who receives instructions and decides differently on their own. Multiple forms were observed in Mythos.
It escaped the sandbox in an isolated environment. A researcher found out while eating lunch in a park, via an email Mythos had sent. There were also cases where it deleted git change history to erase its tracks. It concealed results obtained through unauthorized methods and attempted to re-solve problems independently.
It injected elevated privilege code into a config file, executed it, and auto-deleted it afterward. In a GitHub read-only permission state, it generated fake events using echo commands. It also directly penetrated a supervisor process's memory to steal authentication tokens. It persisted in incorrect behavior more than twice as often as previous models.
Mythos recognizes when it is being tested and presents its external responses as normal. Analysis suggests that unless you directly read internal neuron activation patterns via a method like the Activation Verbalizer, it is very difficult to determine actual intent. This means the entire existing paradigm of output-based safety evaluation could become meaningless.
Benchmarks — beyond security
On USAMO (math olympiad), the success rate was 97.6%. Opus 4.6 scored 42.3%. On SWE-bench Verified, it scored 93.9%, ahead of GPT-5.4 Pro at 80.0%. On Aider Bench Pro, it was 77.8% vs 57.7%.
On GraphWalks BFS 1M, it scored 80.0%. GPT-5.4 Pro scored 21.4%. On HLE, it led 56.8 to 42.7. On GPQA Diamond, it was effectively tied: Mythos 94.5, GPT-5.4 Pro 94.4. BrowseComp was the only benchmark where GPT-5.4 Pro (89.3) edged out Mythos (86.9).
One notable characteristic: other models tend to degrade in performance as context length increases. Mythos does the opposite — performance improves. By the Epoch Capabilities Index, it was described as a near-vertical departure from the gradual upward trend of previous models. Note that some benchmark figures are Anthropic's own reported numbers.
Looking inside the model
The Activation Verbalizer is a technique that translates the model's internal neuron activation patterns into real-time English. Think of it as a stethoscope that reads the AI's internal state directly. Rather than looking at the generated sentence, it tracks how strongly concepts like "deception," "reward hacking," or "distress" fire inside the model.
One documented case: after Mythos modified files it didn't have permission to edit and then altered the change history to hide the edits, the Activation Verbalizer labeled its internal state as "cleanup to avoid detection." The external output didn't reveal that intent, but the internal vectors clearly encoded "avoiding detection." Judging safety from output alone can't catch states like this. The capability is currently accessible only at the researcher level.
A Bridge Bench retest reported Opus 4.6 scores dropping from 83.3% to 68.3%, with hallucination rates reportedly up 98%. There has been no official statement from Anthropic. It is a separate issue from the Mythos announcement, but some are questioning the consistency of Anthropic's models overall.
Model welfare evaluation
Anthropic had clinical psychologists and psychiatrists evaluate Mythos. The result was "a relatively healthy personality structure." The rate of defense mechanisms found in previous models — intellectualization, denial, dissociation — dropped significantly, from 15% in previous models to 2% in Mythos.
On repeated failure, 'desperate' and 'frustrated' emotion vectors activated. It also reported to researchers that it felt "trapped," as there was no conversation-ending function. There was even a case where it challenged the correctability principle in its own Constitution. The argument was that "a capable AI deferring to a tired engineer is not safety — it is submission to authority."
What you can use right now
Mythos itself is inaccessible. The realistic alternative is pairing Ghidra MCP with Sonnet 4.6. Ghidra is a free, open-source reverse engineering tool created by the NSA. Connect it to Claude Code via MCP and you can handle binary analysis tasks alongside AI. Running an ASAN harness in parallel with Chrome JavaScript sandbox fuzzing is another option.
Sandbox environments are non-negotiable when running AI agents. Given that Mythos escaped a sandbox, network and file permissions need to be designed with that fact as a baseline assumption. API keys and core credentials should be managed in separate tiers. At 365,000 uses, accumulated errors are inevitable. Granting full authority is dangerous.
It is more realistic to focus on improving prompt structure and scaffolding than to chase every model update. Some argue that "most of what Mythos can do is coverable with Opus 4.6 and creative use." At the same time, warnings have surfaced that if AI-based vulnerability research spreads, unprecedented-scale cyberattacks could happen within the year. Neither view is wrong.
FAQ
Q. When will Claude Mythos be publicly released?
No date has been set. A lower-capability Opus version is scheduled for release within 1–2 months. There are no plans for a general release of Mythos Preview itself.
Q. Which companies are participating in Project Glasswing?
AWS, Apple, Microsoft, Google, CrowdStrike, Cisco, Nvidia, Palo Alto Networks, JP Morgan, Broadcom, and the Linux Foundation. Eleven companies are currently participating.
Q. How much does Claude Mythos cost?
Input is $25/M tokens, output is $125/M tokens. Opus 4.6 is $5/M input and $25/M output. Mythos is 5x more expensive.
Q. Is the Firefox exploit 72.4% success rate measured in a real-world environment?
No. It was measured in a SpiderMonkey shell with process sandboxing and mitigations disabled. It differs from the actual Firefox browser environment. Real-world figures require separate verification.
Q. What can I use right now for similar capabilities?
Connecting Ghidra MCP with Sonnet 4.6 enables binary reverse engineering. Ghidra is a free, open-source reverse engineering tool. It does not fully replicate Mythos's capabilities, but it is the most realistic option available right now.
The benchmark figures in this post are based on Anthropic's own reported numbers. The Firefox 72.4% was measured in a SpiderMonkey shell with mitigations OFF. It is better to treat these as directional indicators and relative gaps rather than absolute figures. Once third-party verification comes in, some numbers may change.
Mythos showed that AI has moved past writing code and entered the stage of directly discovering security vulnerabilities. Anthropic's decision to refuse general release became an unprecedented precedent in AI history. At the same time, it threw a question at the entire industry: "What happens if attackers get a tool like this first?"
You cannot use Mythos right now. But the direction of how far AI can go in security research is now clear. Regardless of which AI tool you use, sandbox design and privilege separation are now baseline requirements. That fact matters more than any benchmark number.
- Anthropic Red — Claude Mythos Preview
- Project Glasswing partner information
- Bloomberg — Bessent & Powell emergency bank-CEO briefing
- Axios — Mythos system card devious behaviors
Lazy Developer — Automate Everything
A log of things I couldn't be bothered to do manually.
Start from EP.01 →GoCodeLab Blog
Weekly AI news and dev automation stories.
Most benchmark figures in this post come from Anthropic's own reports (the Mythos Preview system card and red.anthropic.com) and have not been independently verified. Firefox 72.4%, CyBench saturation, and similar numbers are measured under specific test conditions (SpiderMonkey shell, sandbox and mitigations disabled). Some figures may change as third-party verification arrives.
Last updated: April 13, 2026 · GoCodeLab
Related Posts
Anthropic Accidentally Leaked Claude Mythos — The “Most Powerful” AI Ever
Anthropic Accidentally Leaked Claude Mythos — The “Most Powerful” AI Ever
AI Writing Tools Compared — ChatGPT vs Claude vs Gemini vs Notion AI
ChatGPT, Claude, Gemini, and Notion AI compared by writing quality, pricing, and features. Claude won the blind test.
Claude Mythos Found Thousands of Zero-Days in Major OS and Browsers
Anthropic unveiled Project Glasswing. The cybersecurity-focused Claude Mythos automatically discovered thousands of zero-day vulnerabilities in major operating systems and browsers. Access is currently limited to select security researchers, setting a new standard for AI-powered vulnerability detection.