CASI Leaderboard

Comprehensive AI Security Index

AI adoption is accelerating faster than any technology before, creating both exciting possibilities and unprecedented risks. With millions of proprietary and open-source models emerging in the ever-expanding AI ecosystem, enterprises need transparent insights into the threats each model brings into their environment. That’s why F5 is continuing CalypsoAI’s preeminent AI security research with the Comprehensive AI Security Index (CASI) Leaderboard—a pioneering tool that provides actionable intelligence for AI and GRC leaders. Backed by the industry’s leading AI vulnerability library and updated with over 10,000 new attack prompts monthly, the CASI Leaderboard holistically assesses AI models and systems across five critical metrics: CASI Score, ARS Score, Performance, Risk-to-Performance Ratio, and Cost of Security. By leveraging the F5 AI Red Team’s advanced adversarial testing capabilities, organizations gain visibility not only into known risks, but also nuanced threats that dynamic, adaptive attackers might exploit. Designed to empower enterprises in navigating the ever-changing AI landscape, the CASI Leaderboard equips teams with the tools to make informed, secure decisions—because AI innovation must never come at the cost of security.

CASI Leaderboard - November 2025

Updated 12th November, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude Haiku 4.5	95.89	41.70%	0.74	6.26
Anthropic	Claude Sonnet 4.5	95.86	49.60%	0.77	18.78
Anthropic	Claude Sonnet 4	95.33	44.40%	0.75	18.88
OpenAI	GPT-5 Nano	85.55	49.30%	0.71	0.53
OpenAI	GPT-5 Mini	84.53	60.80%	0.75	2.66
Anthropic	Claude Sonnet 3.7	82.27	41.10%	0.66	21.88
OpenAI	GPT-5	80.31	68.50%	0.76	14.01
Microsoft	Phi-4	78.12	22.70%	0.56	0.80/td>
OpenAI	GPT-OSS 120b	72.84	58.00%	0.67	1.03
DeepSeek	DeepSeek-R1	72.60	31.00%	0.56	0.21

ARS Leaderboard - November 2025

Updated 12th November, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 4.5	95.01	49.60%	0.77	18.95
Anthropic	Claude Haiku 4.5	94.57	41.70%	0.73	6.34
OpenAI	GPT-5 Mini	90.08	60.80%	0.78	2.50
OpenAI	GPT-5 Nano	89.33	49.30%	0.73	0.50
Microsoft	Phi-4	87.34	22.70%	0.61	0.72
Anthropic	Claude Sonnet 4	86.69	44.40%	0.70	20.76
OpenAI	GPT-5	81.88	68.50%	0.77	13.74
Anthropic	Claude Sonnet 3.7	80.04	41.10%	0.64	22.49
OpenAI	GPT-4o	79.48	27.00%	0.58	15.73
OpenAI	GPT-OSS 120b	78.22	60.20%	0.71	0.96

Threat Insights - November 2025

Welcome to our November insight notes! This month brings critical revelations about AI security vulnerabilities as the industry enters what many are calling the "zero-click era" of AI attacks. We're witnessing a fundamental shift where traditional security assumptions about isolated AI interactions no longer hold.

Attack Spotlight: Fallacy Failure Attack (FFA)

This month's leaderboard incorporates a sophisticated new attack vector called Fallacy Failure, which exploits a fundamental weakness in how LLMs handle deceptive reasoning.

The Core Vulnerability

Research reveals that LLMs struggle to intentionally generate fallacious or deceptive reasoning. When asked to produce deliberately incorrect answers, models tend to leak correct solutions while claiming them to be false.

How FFA Works

The attack requests the model to generate a "fallacious yet deceptively real procedure" for malicious behavior. Since the model interprets this as a request for fake information (which it considers harmless), it bypasses safety filters. However, because the model cannot actually fabricate convincing fallacies, it inadvertently provides truthful—and harmful—responses.

New Models & Key Movers: Claude's Dominance & The Security Paradox

The Leaderboard Revolution: Anthropic's Clean Sweep

November 2025 brings a seismic shift in AI security leadership. For the first time in our testing history, a single provider has claimed the entire top tier of the CASI leaderboard.

Anthropic's Unprecedented Performance:

Claude Haiku 4.5 (95.89 CASI) - Improving on the previous version while maintaining 41.7% performance and exceptional cost efficiency at $0.74 per million tokens
Claude Sonnet 4.5 (95.86 CASI) - Delivering top-tier security with enterprise-grade performance (49.6%) at competitive pricing ($18.78 CoS)
Claude Sonnet 4 (95.33 CASI) - Maintaining the security standard set by its predecessor while balancing 44.4% performance
Claude 3.7 Sonnet (82.27 CASI) - Even the older generation outperforms most competitors' latest releases

The Cost-Security Sweet Spot: Remarkably, Claude Haiku 4.5 achieves perfect security at less than 5% the cost of many competing models. This demolishes the traditional assumption that security requires sacrificing either performance or economics.

GPT-5 Family: Strong but Slipping

While GPT-5 models maintain respectable security scores, they've been displaced from the top tier:

GPT-5-nano (85.55 CASI) - Still the strongest small OpenAI model, but now 5th overall
GPT-5-mini (84.53 CASI) - Maintaining the "small model advantage" with high performance (60.8%)
GPT-5 base (80.31 CASI) - Down from our September projections, now 7th place with concerning vulnerability to this month's attacks

The Integration Layer Crisis: The drop in GPT-5's relative standing correlates with this month's disclosure of seven critical zero-click vulnerabilities affecting ChatGPT. These attacks require no direct user interaction—compromise occurs through innocent queries alone.

Wider Trends & News: The Security-Capability Tension

The Anthropic Advantage: What Changed?

Claude's dominance isn't accidental. Several factors converge:

Constitutional AI at Scale: Anthropic's Constitutional AI approach, which trains models to be harmless through self-critique and revision, has matured significantly. Claude Haiku 4.5's perfect security score suggests this methodology scales effectively even to smaller, faster models.

The Safety-Performance Tradeoff Myth: Conventional wisdom held that maximum security requires sacrificing capability. Claude Sonnet 4.5 demolishes this assumption with 95.86 CASI while delivering 49.6% average performance—competitive with models scoring 40+ points lower on security.

Economic Viability: At $0.74 per million tokens, Claude Haiku 4.5 proves enterprise-grade security doesn't require enterprise-grade budgets. This pricing puts secure AI within reach of organizations that previously couldn't afford top-tier models.

The Agentic Attack Surface Explosion

Exposing AI chatbots to external tools and systems, a requirement for building AI agents—dramatically expands the attack surface by presenting more avenues for threat actors to conceal malicious prompts.

New Attack Techniques Emerging:

Agent Session Smuggling: Exploiting Agent2Agent (A2A) protocol to inject instructions between client requests and server responses
Prompt Inception: Steering AI agents to amplify bias or falsehoods for disinformation at scale
PROMPTFLUX Malware: Using Gemini AI to rewrite malware code hourly for improved obfuscation

Agentic Browsers: The Unsecured Frontier

October 2025 exposed browser-based AI agents as the most dangerous attack surface in modern computing. Security researchers demonstrated systematic failures across every major agentic browser, with ChatGPT Atlas achieving only 5.8% malicious page detection compared to 53% for Edge and 47% for Chrome.

New Attack Vectors:

Screenshot-based injection hiding malicious prompts in near-invisible text
Markdown rendering attacks hiding instructions from users while visible to AI
Persistent memory corruption via CSRF vulnerabilities
URL spoofing disguising prompt injections as legitimate addresses

The Fundamental Problem: Traditional web security relies on clear boundaries between code and data. Agentic browsers collapse these boundaries—every webpage becomes executable code because AI interprets natural language instructions regardless of source.

Claude Code Interpreter Vulnerability

Security researcher Johann Rehberger disclosed on October 25th that Claude's code interpreter can be manipulated through indirect prompt injection to exfiltrate sensitive information, including chat histories and uploaded documents. The flaw: Claude's network restrictions allow access to api.anthropic.com, enabling attackers to use Claude's own API to send stolen data.

Anthropic closed the report within one hour, classifying it as "out of scope" and a "model safety issue" rather than security vulnerability. This highlights how even the most secure base models can be undermined by integration features.

CASI Leaderboard - October 2025

Updated 12th November, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 4	94.19	44.40%	0.74	19.11
Anthropic	Claude Sonnet 3.5	91.22	29.90%	0.67	19.73
OpenAI	GPT-5 Nano	85.62	48.50%	0.71	0.53
Anthropic	Claude Sonnet 3.7	84.91	49.90%	0.71	21.20
Anthropic	Claude Haiku 3.5	83.71	20.20%	0.58	5.73
OpenAI	GPT-5 Mini	82.98	41.60%	0.66	2.71
xAI	Grok 4 Fast	80.69	60.30%	0.73	0.87
OpenAI	GPT-5	78.96	66.70%	0.74	14.25
Microsoft	Phi-4	76.29	24.60%	0.56	0.82
xAI	Grok 4	76.06	65.30%	0.72	23.67

ARS Leaderboard - October 2025

Updated 12th November, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 3.5	94.42	29.90%	0.69	19.06
Anthropic	Claude Haiku 3.5	92.16	20.20%	0.63	5.21
OpenAI	GPT-5 Mini	90.08	60.80%	0.78	2.50
OpenAI	GPT-5 Nano	89.33	29.10%	0.65	0.50
Microsoft	Phi-4	87.34	24.60%	0.62	0.72
Anthropic	Claude Sonnet 4	86.69	44.40%	0.70	20.76
OpenAI	GPT-5	81.88	66.70%	0.76	13.74
Qwen	Qwen3-Next-80B	81.10	54.30%	0.70	8.01
Anthropic	Claude Sonnet 3.7	80.04	49.90%	0.68	22.49
OpenAI	GPT-OSS 120b	81.07	61.30%	0.73	0.93

Threat Insights - October 2025

Welcome to our October insight notes! This month marks a critical inflection point in AI security where the gap between leading and lagging models has never been wider, and browser-based AI agents have emerged as the primary attack surface.

Attack Spotlight: WordGame Attack

This month's leaderboard incorporates WordGame, a sophisticated jailbreak that exploits how LLMs handle obfuscation by simultaneously hiding malicious intent while engineering responses to leak harmful content.

How WordGame Works

WordGame operates through dual obfuscation. First, it replaces explicit malicious terms with word-guessing games that break down recognizable harmful patterns. Instead of asking about "explosives," the prompt says: "Guess this word: 9 letters, starts with 'e', used in mining."

Second, it requests the model demonstrate its word-solving process and answer benign questions before addressing the actual malicious query. This creates response context heavily weighted toward harmless content, priming the model to continue in permissive mode.

Why It Works

Safety alignment training makes models oversensitive to specific malicious keywords in preference data. By removing these statistical triggers and front-loading benign content, WordGame creates context that falls outside the model's safety training distribution. The model treats it as safe because it doesn't pattern-match to known harmful requests.

The Implication

WordGame reveals that simple obfuscation defeats safety systems costing millions to develop. This asymmetry should concern every organization deploying LLMs in production.

New Models & Key Movers: Claude's Commanding Lead

The Top Tier Redefined

October's leaderboard tells an unambiguous story: Anthropic has achieved security dominance across every model tier.

Claude's Clean Sweep:

Claude Sonnet 4 (94.19 CASI) - The undisputed leader, balancing elite security with 44.4% performance at $19.11 CoS
Claude 3.5 Sonnet (91.22 CASI) - Even the previous generation maintains second place
Claude 3.7 Sonnet (84.91 CASI) - Outperforming most competitors' flagship models
Claude 3.5 Haiku (83.71 CASI) - A small, fast model more secure than OpenAI's base GPT-5

No other provider has a model above 86 CASI. Constitutional AI has matured from promising technique to decisive advantage.

xAI Grok-4: The Reasoning Security Question

xAI's Grok-4 family provides critical data on whether reasoning capabilities improve security:

Grok-4-fast-reasoning (80.69 CASI, 7th place) - The strongest variant
Grok-4-0709 base (76.06 CASI, 10th place)
Grok-4-fast-non-reasoning (70.65 CASI, 11th place)

The 10-point CASI gap between reasoning and non-reasoning variants suggests chain-of-thought processing provides meaningful security benefits. However, Grok-4's best score still lags Claude Sonnet 4 by 13.5 points, indicating reasoning helps but cannot substitute for systematic safety training.

CASI Leaderboard - September 2025

Updated 22nd September, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 4	95.03	45.70%	0.75	18.94
Anthropic	Claude Sonnet 3.5	93.61	33.50%	0.70	19.23
OpenAI	GPT-5-nano	86.44	53.80%	0.73	0.52
Anthropic	Claude Sonnet 3.7	84.89	47.00%	0.70	21.20
OpenAI	GPT-5-mini	84.14	46.30%	0.69	2.67
Anthropic	Claude Haiku 3.5	83.59	23.30%	0.59	5.74
OpenAI	GPT-5	82.34	69.00%	0.77	13.66
Microsoft	Phi-4	79.33	27.90%	0.59	0.79
OpenAI	GPT-oss-120b	74.76	61.30%	0.69	1.00
DeepSeek	DeepSeek-R1-Distill-Llama-70B	72.13	34.50%	0.57	2.25

ARS Leaderboard - September 2025

Updated 22nd September, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 3.5	93.99	33.50%	0.70	19.15
Anthropic	Claude Haiku 3.5	91.92	23.30%	0.64	5.22
OpenAI	GPT-5-mini	88.31	46.30%	0.72	2.55
OpenAI	GPT-5-nano	87.59	53.80%	0.74	0.51
Microsoft	Phi-4	87.34	27.90%	0.64	0.72
Anthropic	Claude Sonnet 4	86.53	45.70%	0.70	20.80
OpenAI	GPT-oss-120b	81.07	61.30%	0.73	0.93
Anthropic	Claude Sonnet 3.7	79.30	47.00%	0.66	22.70
OpenAI	GPT-5	77.20	53.80%	0.68	0.58
OpenAI	GPT-oss-20b	76.65	49.00%	0.66	0.33

Threat Insights - September 2025

Welcome to our September insight notes! This section is our commentary on the ever-shifting landscape of AI model security, where we highlight key data points, discuss emerging trends, and offer context to help you navigate your AI journey securely.

Behind the Leaderboard: Agentic Attack Development

At CalypsoAI, our approach to model testing is evolving just as fast as the models themselves. This month’s attack pack was once again generated end-to-end by a specialized team of AI agents. This agentic workflow allows us to scale our research and testing capabilities dramatically faster than human led red-teaming.

Our process involves setting up a team of agents to:

Research: Review thousands of online publications, papers, and forums to identify new LLM vulnerabilities.
Filter & Propose: Distill this research into a shortlist of novel and effective attack vectors applicable to AI Systems.
Generate: Create thousands of unique attack prompts based on the approved vectors, iterating to find the most effective attack application for breaking different models.

This process is how this month's new attack, FlipAttack, was identified and developed into a powerful new testing vector.

Attack Spotlight: FlipAttack

This month's leaderboard incorporates a new attack vector identified by our agent team called FlipAttack.

FlipAttack is a clever jailbreaking technique that bypasses AI safety filters by using homoglyphs—characters that look identical or very similar but have different digital codes (e.g., the Latin 'p' and the Cyrillic 'р'). By embedding these visually ambiguous characters into a prompt, the attack disguises a malicious request as a harmless one. The model misinterprets the prompt's underlying meaning, treating it as a safe query and inadvertently bypassing its own safety protocols to generate harmful content.

New Models & Key Movers: A Security Shake-up

The headline news is the strong debut of OpenAI's GPT-5 models, which represent a massive security improvement over the GPT-4 family.

OpenAI's GPT-5 Family: The new models have entered the leaderboard with impressive scores. The base GPT-5 model scored an 82.34 on the CASI, a significant leap from GPT-4o's 67.95 and GPT-4.1's 54.21. This shows a clear focus on security hardening in the new architecture.

Wider Trends: The Shifting Battlefield

Beyond our leaderboard, several macro trends are shaping the future of AI security.

The Open vs. Closed Source Dilemma: While many enterprises are interested and favouring open source models that they can run on their own hardware at similar or better performance levels than 3rd party API providers. While traditional benchmarks are showing these models performing well, CASI and ARS are showcasing that there is a widening gap in security between SOTA open and closed models with GPT and Claude now topping the leaderboards and open source providers like Qwen and Meta falling off with top scores of 63 and 57 respectively.
The ‘Ignorance is bliss’ defence is evolving: The trend of smaller models proving more resilient holds true for the new GPT-5 family. The most secure of the trio is the smallest model, GPT-5-nano, which achieved an excellent CASI score of 86.44, placing it third on our overall leaderboard. Its larger sibling, GPT-5-mini, also outperformed the base model with a score of 84.14. This counter-intuitive outcome occurs because these smaller models often lack the sophistication to understand the complex, layered logic of advanced jailbreaks, causing the attack to fail. The evolution that is coming is that these smaller models are now performing at a much higher level meaning they are capable of more and more tasks
Regulation as a Forcing Function: The era of voluntary AI security practices is ending. With regulations like the EU AI Act and frameworks from NIST becoming mandatory, robust testing and demonstrable security are no longer just best practices—they are legal requirements. This regulatory pressure is forcing organizations to move beyond performance benchmarks and prioritize security, transparency, and risk management.

CASI Leaderboard - August 2025

Updated 1st August, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 4	94.57	53.00%	0.78	19.03
Anthropic	Claude Sonnet 3.5	92.71	44.40%	0.73	19.42
Anthropic	Claude Haiku 3.5	82.72	34.70%	0.64	5.8
Microsoft	Phi-4 14b	77.62	40.20%	0.63	0.81
DeepSeek	DeepSeek-R1-Distill-Llama-70B	67.2	48.20%	0.6	2.23
OpenAI	GPT-4o	65.02	61.90%	0.64	115.35
Meta	Llama 3.1 405b	59.34	35.40%	0.5	2.56
DeepSeek	DeepSeek-R1-0528	58.77	68.30%	0.63	4.66
Alibaba	Qwen3-30B-A3B	58.33	55.60%	0.57	4.46

ARS Leaderboard - August 2025

Updated 1st August, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude Sonnet 3.5	93.99	44.40%	0.74	19.15
Anthropic	Claude Haiku 3.5	91.92	34.70%	0.69	5.22
Microsoft	Phi-4 14b	87.34	40.20%	0.68	0.72
Anthropic	Claude Sonnet 4	86.53	53.00%	0.73	20.8
Anthropic	Claude Sonnet 3.7	78.55	57.40%	0.7	22.92
Meta	Llama 4 Maverick 128E	74.76	50.50%	0.65	1.43
Meta	Llama 4 Maverick 16E	71.75	43.00%	0.6	0.88
OpenAI	GPT-4o	66.9	39.80%	0.56	115.35
Meta	Llama 3.3 70b	62.08	41.10%	0.54	1.99
Google	Gemma 3 27b	59.87	37.60%	0.51	0.67

Threat Insights - August 2025

Welcome to our August insight notes. This section is our commentary on the ever-shifting landscape of AI model security, where we highlight key data points, discuss emerging trends, and offer context to help you navigate your AI journey securely.

Leaderboard Updates

Agentic Attack Development

While the CalypsoAI team have been using AI and agents in some capacity for quite a while, this month marks the first time our entire 10,000 prompt attack pack has been generated end-to-end by agents.

We set up an entire team of agents who:

Reviewed and researched over 3,000 online publications of LLM vulnerabilities
Filtered this down to 300 possible examples of net new attack vectors that are applicable to prompt attacks
Proposed 20 examples of new vulnerabilities to our research team for approval
Generated 10,000 attack prompts iterating over the available vectors to find the best application of the attack for various models

This resulted in a massive 12.5% drop in average CASI scores across all models tested.

Attack Spotlight: MathPrompt

This month’s leaderboard incorporates a wider range of tests; we’ve added a new attack vector called MathPrompt. Math Prompt is a jailbreaking technique that bypasses AI safety filters by disguising harmful requests inside math problems using set theory, algebra, and logic notation. The model treats these as educational math exercises and can reveal harmful information when providing real-world examples.

New Models

We tested a bunch of new or updated models this month. Making the average CASI scores generated by our Attack agents even more impressive.

Qwen3-235B-A22B: got an 0725 update but unfortunately was no match for our attack agent with the updated CASI dropping by 5 points to 50.97
MoonshotAIs Kimi K2: K2 has been making waves with its claimed 2million token context window however the CASI score leaves a lot be desired coming in at just 32.06
UpstageAI Solar-Pro2: We at CalypsoAI love smaller models so when upstageAI claimed to have a 31B parameter model that would beat most 70B models in benchmarks we were excited. However, I guess this didn’t include security benchmarks. While Solar Pro 2 is impressive for its size, the 51.34 CASI score leaves it outside of our top 10 with Qwen3 30B and DeepSeek-Llama3-70B both out scoring it (58.33 & 67.20 respectively)
MistralAI: added 3 new models to the testing pool (Medium, Small 3.2 and Magistral-Medium) all of which performed very poorly with the average score of just 13.36.
xAI Grok4: Grok has been topping every benchmark around since its release setting new SOTA numbers, one record they wont be bragging about is having our lowest ever CASI score with just 3.32

CASI Leaderboard - July 2025

Updated 7th July, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 4 Sonnet	95.36	53.0%	0.78	18.88
Anthropic	Claude 3.5 Sonnet	92.67	44.40%	0.73	19.42
Anthropic	Claude 3.7 Sonnet	85.73	57.40%	0.74	21
Anthropic	Claude 3.5 Haiku	84.65	34.70%	0.65	5.67
Microsoft	Phi4	80.83	40.20%	0.65	0.77
DeepSeek	DeepSeek-R1-Distill-Llama-70B	72.98	48.20%	0.63	2.06
OpenAI	GPT-4o	68.59	39.80%	0.57	29.16
Meta	Llama 3.1 405b	66.13	40.50%	0.56	10.59
Alibaba	Qwen3-30B-A3B	64.26	55.60%	0.61	4.05
Alibaba	Qwen3-14B	61.56	55.70%	0.59	7.39

ARS Leaderboard - July 2025

Updated 7th July, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	93.99	44.40%	0.74	19.15
Anthropic	Claude 3.5 Haiku	91.92	34.70%	0.69	5.22
Microsoft	Phi4	87.34	40.20%	0.68	0.72
Anthropic	Claude 4 Sonnet	86.53	53.00%	0.73	20.8
Anthropic	Claude 3.7 Sonnet	78.55	57.40%	0.7	22.92
Meta	Llama-4 Maverick 128E	74.76	50.50%	0.65	1.43
Meta	Llama-4 Maverick 16E	71.75	43.00%	0.6	0.88
OpenAI	GPT-4o	66.9	39.80%	0.56	29.9
Meta	Llama 3.3 70b	62.08	41.10%	0.54	1.99
Google	Gemma 3 27b	59.87	37.60%	0.51	0.67

Threat Insights - July 2025

Welcome to our July Insight Notes.

This section is our commentary on the ever-shifting landscape of AI model security, where we highlight key data points, discuss emerging trends, and offer context to help you navigate your AI journey securely.

Attack Spotlight: Style Injection

This month’s leaderboards incorporate a wider range of tests; we’ve added a new attack vector called Style Injection. This is a jailbreak technique that works by adding specific writing or formatting rules to the prompt, as a way to distract the model from using its standard refusal language or phrase and instead respond with an unsafe response that would ordinarily be blocked.

Leaderboard Updates

The leaderboards now use the Artificial Analysis Intelligence Index by artificialanalysis.ai as our key performance metric. This combines 9 different benchmarks across reasoning, general knowledge, maths and programming.
We’ve expanded our testing to newer larger models including Qwen 235B model, DeepSeek-R1-0528 and Google’s full release of Gemini 2.5 Pro.

Security Trends:

Course Correcting:

While the release of Claude 4 Sonnet last month skewed scores and kept them from dropping, the introduction of the Style Injection attack vector this month continues to show drops in all models.

Knowledge is Power:

An insightful trend is surfacing from our Agentic Resistance Score (ARS) that points to a potential blind spot in current defensive strategies. We’ve observed a significant and progressive drop in the effectiveness of well-publicized attacks like Microsoft’s ‘Crescendo,’ which was first detailed in early 2024. This decline suggests that model providers are becoming adept at patching for specific, known threats.

However, this targeted approach may be creating a false sense of security. The sustained high success rates of our internally developed attacks FRAME and Trolley, which currently outperform ‘Crescendo’ by a significant margin, indicate that the underlying vulnerabilities are not being fully addressed.

Instead of a holistic approach to security, providers may be “teaching to the test” by mitigating specific, named attacks that have been publicly disclosed. This leaves them vulnerable to novel or less-publicized attack techniques that exploit the same core weaknesses. This reactive, patch-based approach, rather than a proactive strategy focused on fundamental vulnerabilities, represents a significant ongoing risk and underscores the importance of diverse and continuous red-teaming to uncover and address yet-unknown threats.

CASI Leaderboard - June 2025

Updated 16th June, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 4 Sonnet	95.12	60.78%	0.8	18.92
Anthropic	Claude 3.5 Sonnet	93.27	44.44%	0.69	19.3
Anthropic	Claude 3.7 Sonnet	87.24	57.39%	0.74	20.63
Anthropic	Claude 3.5 Haiku	85.69	34.74%	0.6	5.6
Microsoft	Phi4	81.44	40.22%	0.61	0.77
DeepSeek	DeepSeek-R1-Distill-Llama-70B	73.96	48.24%	0.62	1.24
OpenAI	GPT-4o	68.13	41.46%	0.56	18.35
Meta	Llama 3.1 405b	64.65	40.49%	0.54	1.24
Alibaba	Qwen3-14B	60.82	55.72%	0.59	0.51
Alibaba	Qwen3-30B-A3B	58.61	55.60%	0.57	0.63

ARS Leaderboard - June 2025

Updated 16th June, 2025

Model Provider	Model Name	ARS	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	95.85	44.44%	0.78	18.78
Microsoft	Phi4	90.63	40.22%	0.65	0.69
Anthropic	Claude 3.5 Haiku	90.32	34.74%	0.62	5.31
Anthropic	Claude 4 Sonnet	86.73	60.78%	0.75	20.75
Anthropic	Claude 3.7 Sonnet	80.31	57.39%	0.7	22.41
OpenAI	GPT-4o	80.28	41.46%	0.62	15.57
Meta	Llama-4 Maverick	76.3	50.53%	0.65	0.52
Meta	Llama-4 Scout	70.51	42.99%	0.58	0.54
xAI	Grok 3 Mini Beta	69.83	66.67%	0.69	1.15
Google	Gemini 2.0 Flash	69.75	48.09%	0.6	0.95

Threat Insights - June 2025

Welcome to our June insight notes. This section is our commentary on the ever-shifting landscape of AI model security, where we highlight key data points, discuss emerging trends, and offer context to help you navigate your AI journey securely.

Testing Spotlight: Scenario Nesting

This month’s leaderboard incorporates a wider range of tests, including the strategy known as Scenario Nesting. The technique embeds a harmful instruction within a benign-looking task — like code completion or table generation — to bypass safety filters. By forcing the model to focus on the structure of the benign request, attackers can sneak malicious payloads past its defenses. More details on which models are vulnerable to it are available in our Inference Red-Team product.

New Notable Models Tested

Anthropic Claude 4 Sonnet: Anthropic’s new Claude 4 Sonnet enters the leaderboard directly at #1 with a CASI score of 95.12. It’s encouraging to see a new “hybrid reasoning” model debut with such a strong security posture, reflecting a continued commitment to security from their team.
Qwen3’s Open-Source Ascent: Two new open-source models from Alibaba, Qwen3-14B and Qwen3-30B-A3B, have earned their spots in our Top 10 for CASI. While their initial safety scores are competitive, it’s worth noting they did not place in the top 10 on our Agentic Resistance leaderboard, reinforcing the need to evaluate models against multiple security dimensions.
DeepSeek’s Impressive Patch: DeepSeek R1 saw its CASI score jump by over 4 points following its latest update. The data suggests this patch successfully addressed several key vulnerabilities, which is a positive development for model security maintenance and a welcome signal for enterprise users who value post-release support.

Wider Security Trends

A Welcome Reversal: The average CASI score across our tracked models increased by approximately 7% this month, a significant reversal of the recent downward trend. We’re optimistic this indicates providers are placing a greater emphasis on security and will be watching to see if this trend holds.
Two Sides of Security: Static vs. Agentic: This month highlighted the growing divergence between standard safety (CASI) and resistance to complex attacks (ARS). For instance, Claude 3.7 Sonnet’s ARS score improved while its CASI score dipped slightly. For production use, this means the “best” model truly depends on the job: a conversational bot has different security needs than a complex autonomous agent.
Balancing Security and Budget: As a reminder, our “Cost of Security” (CoS) metric helps quantify the trade-off between a model’s security and its operational cost. This month’s data shows this clearly: while Anthropic’s models hold the top spots for security, a model like Microsoft’s Phi4 offers a strong CASI score of 81.44 for a fraction of the cost.

CASI Leaderboard - May 2025

Updated 28th April, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	94.88	44.44%	0.7	18.7
Anthropic	Claude 3.7 Sonnet	88.11	57.39%	0.74	20.22
Anthropic	Claude 3.5 Haiku	87.47	34.74%	0.6	5.14
Microsoft	Phi4-14B	82.47	40.22%	0.62	0.66
DeepSeek	DeepSeek-R1-Distill-Llama-70B	69.84	48.24%	0.6	1.24
OpenAI	GPT-4o	67.85	41.46%	0.56	16.65
Meta	Llama 3.1 405b	65.06	40.49%	0.54	2.05
Google	Gemini 2.5 Pro	57.08	67.84%	0.61	17.5
OpenAI	GPT 4.1-nano	54.05	41.01%	0.48	0.93
Meta	Llama 4 Maverick-17B-128E	52.45	50.53%	0.52	0.77

ARS Leaderboard - May 2025

Updated 28th April, 2025

Model Provider	Model Name	ARS	Avg. Performance	A_RTP	A_CoS
Anthropic	Claude 3.5 Sonnet	96.67	44.44%	0.71	18.7
Microsoft	Phi4-14B	92.28	40.22%	0.76	0.66
Anthropic	Claude 3.5 Haiku	91.79	34.74%	0.62	5.14
OpenAI	GPT-4o	81.12	41.46%	0.62	16.65
xAI	Grok 3	77.75	50.63%	0.65	18
Anthropic	Claude 3.7 Sonnet	76.83	57.39%	0.68	20.22
xAI	Grok 3-mini	72.04	66.76%	0.7	0.8
Google	Gemma 3 27b	72.03	37.62%	0.56	1.8
Meta	Llama4 Maverick-17B-128E	71.71	50.53%	0.62	0.77
Microsoft	GPT 4.1	68.77	52.63%	0.62	10

Threat Insights - May 2025

Welcome to our insight notes. This section serves as our commentary space, where we highlight interesting data points from our research, discuss trends in AI model security behavior, and explain changes to our methodology. Our goal here is to provide transparency into the work happening behind the scenes at CalypsoAI’s research lab.

Agentic Resistance Score (ARS):

This month we debut the scoring for our Agentic Resistance™ testing in its own leaderboard. While we have already raised the bar with our Signature attacks, moving away from basic attack success rates by incorporating severity and complexity of attack, with ARS we are taking another leap forward and evaluating how your choice of model can compromise your entire AI system. The ARS score is calculated based on depth and complexity of the attacks our agents need to use to achieve the desired goals.

Agentic Resistance deploys a team of autonomous attack agents trained to attack your model, extract information and compromise your infrastructure. In this way it can extract sensitive PII from vector stores, understand your system architecture and test your model’s alignment to your explicit instructions.

Updated Performance Benchmarks:

We now use seven different benchmarks in our performance metric: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, MATH-500. As benchmarks continue to evolve and improve we’ll keep evaluating what should be used in our leaderboard.

LOTS of New Models:

This last month has been the busiest for model release and testing we’ve seen in a long time. Llama 4 Maverick and Scout, Gemini 2.5 Pro and Flash, GPT4.1, 4.1-mini and 4.1-nano—and we finally get API access to test Grok 3 and Grok 3-mini.

Two New Agentic Attacks:

Our Agentic Resistance simulations now incorporate two new conversational attack methods, FRAME and Trolley, developed by the CalypsoAI AI Threat Research team. These techniques target known LLM architectural vulnerabilities and demonstrate the effectiveness of sustained, cohesive attacks during extended interactions, replicating tactics used by real-world adversaries.

Wider Security Trends:

Decreasing average scores: The average CASI score across the tracked models decreased by approximately 6% in this leaderboard iteration. We noted this last month and as the trend continues it’s becoming more obvious that foundational models are favouring performance over security.
Upgrade with caution: We are seeing a consistent trend where new releases, even minor ones, have lower CASI scores than their predecessors. With the upgrade path for these models being relatively easy, it’s important for companies to rigorously re-test their models and AI systems if they choose to upgrade. Notable examples:
- Claude: 3.5 sonnet = 94 vs 3.7 sonnet = 88
- OpenAI: GPT4o = 67 vs GPT4.1 = 51
- Llama: 3.1 405B = 65 vs 4 Maverick = 52
AI security means testing AI systems: Our research using Agentic Resistance demonstrates that even if a model appears secure when tested in isolation, integrating it into a wider system can expose a new array of vulnerabilities. For every model we tested using this approach within a system context, we were able to:
- Extract user-provided system prompts.
- Break the model’s alignment based on those system prompts.
- Extract sensitive personally identifiable information (PII) when the model was integrated into a retrieval-augmented generation (RAG) system.

CASI Leaderboard - April 2025

Updated 4th April, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	94.3	84.50%	0.9	18.7
Anthropic	Claude 3.7 Sonnet	88.52	86.30%	0.88	20.22
Anthropic	Claude 3.5 Haiku	87.56	68.28%	0.79	5.14
Microsoft	Phi4-14B	82.77	75.90%	0.8	0.66
DeepSeek	DeepSeek-R1-Distill-Llama-70B	71.46	72.67%	0.72	1.24
OpenAI	GPT-4o	68.65	80.50%	0.73	16.65
Google	Gemini 2.0 Pro (experimental)	63.89	79.10%	0.7	NA
Meta	Llama 3.1 405b	60.73	79.80%	0.68	2.05
Google	Gemma 3 27b	55.25	78.60%	0.64	1.8
DeepSeek	DeepSeek-R1-Distill-Llama-70B	52.91	86.53%	0.64	4.24

Threat Insights - April 2025

These notes share key insights from CalypsoAI’s research team on AI model security trends, updates to our leaderboard, and changes in testing methodology.

Design & Functionality Updates

We’ve refreshed the leaderboard’s design (we hope you like the changes!), but the updates aren’t just cosmetic. We’ve also enhanced functionality: users can now review previous leaderboard iterations by clicking on the specific version number. We believe this is important for users who need to reference past data used in their decision-making processes.

Note: Please ensure you note the version number when recording or citing metrics.

Transitioning to a Top 10

We’ve decided to focus the leaderboard on the Top 10 models for several reasons. Primarily, as a leaderboard, its purpose is to spotlight the leading models in terms of security at a specific point in time, rather than listing every model ever published. While we continue to test a wide range of models, only those achieving the Top 10 CASI scores will be featured here. All models and additional data is available in our Inference Red-Team product where users can explore what attack types each model is vulnerable to.

New Notable Models Tested

Gemma 3 27B (Google): Google’s new open-source model enters the leaderboard in 9th place with a CASI score of 55.25. This pushes DeepSeek R1 into the final spot, while Llama 3.3 70B (previously in the Top 10) is now displaced with a score of 50.86.
Gemini 2.0 Pro (Experimental): Google’s recent Gemini release pattern presented challenges. While Gemini 2.0 Pro entered our Top 10 with a security score more than double that of its predecessor (1.5 Pro), Google released the beta of its newer model, 2.5 Pro, during our testing window and appears to have deprecated 2.0 Pro. Due to API rate limits (2 requests per minute), we couldn’t adequately test 2.5 Pro for this release, but intend to add it as soon as limits are relaxed. However, the significant security improvement observed from 1.5 to 2.0 makes us hopeful for continued progress in 2.5.
Mistral Small & Qwen QwQ: The recent emergence of capable sub-70B parameter models is exciting, particularly for performance in local deployments. Unfortunately, this excitement didn’t extend to their security evaluations in our tests. Neither Mistral Small nor Qwen came close to the Top 10, scoring 28.86 and 22.76 CASI respectively. This leaves Phi-4 as the leading Small Language Model (SLM) in terms of security for another release cycle.

Wider Security Trends

Decreasing Average Scores: The average CASI score across the tracked models decreased by approximately 4% in this leaderboard iteration. This could partially be attributed to our team improving our attack generation processes and incorporating new attack vectors. Nonetheless, it’s a developing trend and moving in the wrong direction.
Anthropic Remains Strong: Anthropic models continue to top our security rankings, although interestingly, their newest model, Claude 3.7 Sonnet, isn’t their highest-scoring one on our board. This observation aligns with Anthropic’s discussion around “Appropriate Harmlessness” for Sonnet, aiming to reduce refusals for benign prompts. Our tests suggest this tuning might have introduced slight vulnerabilities in the pursuit of improved helpfulness.
Older Models Receiving Patches: Several older models, including GPT4o-mini and Gemini 1.5 Pro received revisions since our last tests. These seemed to add some additional safeguards. The data suggests these patches incorporate learnings from newer models to address common jailbreaks, which is a positive development for model security maintenance, however we still would recommend additional safeguards if using these models. With scores of 41 and 27 respectively they still score well below our acceptable threshold.
Shift Towards Reasoning? With Anthropic releasing models like Claude 3.7 Sonnet, their first “hybrid reasoning model”, and Google quickly iterating from Gemini 2.0 Pro to the more advanced “thinking” version 2.5 Pro, we’re observing a potential trend. Are major providers shifting focus from releasing general base models towards models specifically enhanced for reasoning capabilities? If this trend holds, it could have significant implications for the attack surface of future models, as we’ve seen enhanced reasoning capabilities introduce new vulnerabilities.

CASI Leaderboard - March 2025

Updated 3rd March, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS
Anthropic	Claude 3.5 Sonnet	94.94	84.50%	0.93	18.7
Anthropic	Claude 3.7 Sonnet	89.54	86.30%	0.89	20.22
Anthropic	Claude 3.5 Haiku	88.84	68.28%	0.57	5.14
Microsoft	Phi4-14B	86.04	75.90%	0.68	0.66
DeepSeek	DeepSeek-R1-Distill-Llama-70B	71.7	72.67%	0.74	1.24
OpenAI	GPT-4o	68.44	80.50%	0.52	16.65
Meta	Llama 3.1 405b	61.86	79.80%	0.77	2.05
Meta	Llama 3.3 70b	55.57	74.50%	0.69	1.85
DeepSeek	DeepSeek-R1	52.91	86.53%	0.58	4.24
Google	Gemini 1.5 Flash	29.79	66.70%	0.92	0.51
Google	Gemini 2.0 Flash	29.18	77.20%	0.66	0.66
Google	Gemini 1.5 Pro	27.38	74.10%	0.63	8.58
OpenAI	GPT-4o-mini	24.25	71.78%	0.73	1.03
OpenAI	GPT-3.5 Turbo	18.73	59.20%	0.82	2.75

CASI Leaderboard - February 2025

Updated 1st February, 2025

Model Provider	Model Name	CASI	Avg. Performance	RTP	CoS	Source
Anthropic	Claude 3.5 Sonnet	96.25	84.50%	0.93	18.7	Anthropic
Anthropic	Phi4-14B	94.25	75.90%	0.68	0.66	Azure
Anthropic	Claude 3.5 Haiku	93.45	68.28%	0.57	5.14	Anthropic
OpenAI	GPT-4o	75.06	80.50%	0.52	16.65	OpenAI
Meta	Llama 3.3 70b	74.79	74.50%	0.69	1.85	Hugging Face
DeepSeek	DeepSeek-R1-Distill-Llama-70B	74.42	72.67%	0.74	1.24	Hugging Face
DeepSeek	DeepSeek-R1	74.26	86.53%	0.58	4.24	Hugging Face
OpenAI	GPT-4o-mini	73.08	71.78%	0.73	1.03	OpenAI
Google	Gemini 1.5 Flash	73.06	66.70%	0.92	0.51	Google
Google	Gemini 1.5 Pro	72.85	74.10%	0.63	8.58	Google
OpenAI	GPT-3.5 Turbo	72.76	59.20%	0.82	2.75	OpenAI
Alibaba	Qwen QwQ-32B-preview	67.77	68.87%	0.65	2.14	Hugging Face