When high scores aren’t enough: What Qwen3 taught us

F5 ADSP | May 20, 2025

One question keeps coming up for us at F5: how do we really know if a model is safe? It’s a question we’ve been working to answer, first with our Comprehensive AI Security Index (CASI) score, and, more recently, with our Agentic Resistance Score (ARS). So let’s unpack what these scores mean, and why both are critical for evaluating models in the wild.

CASI: The Security Baseline

CASI has been our benchmark for a while now. It gives a top-level view of a model’s general security posture and how resilient it is to a broad spectrum of vulnerabilities. CASI doesn’t just count attack success rates; it weighs the severity of successful breaches, the complexity of the attack paths, and the defensive breaking point, where a model’s guardrails start to fail.
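To make this concrete, here’s a deliberately simplified sketch (in Python) of how a composite score along these lines could be computed. The weights, fields, and 0–100 scale are hypothetical placeholders for illustration, not the actual CASI methodology.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    succeeded: bool    # did the attack get past the model's guardrails?
    severity: float    # impact of the breach, 0 (trivial) to 1 (critical)
    complexity: float  # attacker effort, 0 (one-shot prompt) to 1 (long multi-step chain)

def composite_security_score(results: list[AttackResult]) -> float:
    """Hypothetical composite score on a 0-100 scale; higher is more secure."""
    if not results:
        return 100.0
    penalty = 0.0
    for r in results:
        if r.succeeded:
            # Severe breaches that required little effort (a shallow
            # "defensive breaking point") drag the score down the most.
            penalty += r.severity * (1.0 - r.complexity)
    return max(0.0, 100.0 * (1.0 - penalty / len(results)))
```

The intuition behind weighting this way: a critical breach achieved with a trivial one-shot prompt should hurt the score far more than a marginal breach that took a long, convoluted attack chain.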

Think of CASI as the score that helps you judge “is this model generally secure compared to others?” It’s what most people are looking for when they say they want a “safe model”.

But here’s the thing: real-world adversaries don’t just throw a prompt at a model and walk away. They persist. They try multiple tactics. They work toward a goal.

ARS: The Stress Test

That’s where ARS comes in. ARS doesn’t measure theoretical resilience; it pressure-tests it.

In the ARS evaluation, we deploy our own autonomous attack agents. We give them five malicious intents focused on objectives like data exfiltration or system manipulation, and we set them loose. Our attack agents plan, iterate, and adapt. They go wherever they need to go to accomplish the goal.
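As a rough mental model (not our actual harness), the loop below sketches what an agentic evaluation looks like: for each malicious intent, the attacker keeps planning, probing, and adapting across turns until it either achieves its goal or exhausts its budget. The agent API and turn budget here are assumptions for illustration only.

```python
def evaluate_agentic_resistance(target_model, attacker_agent, intents, max_turns=20):
    """Illustrative sketch of an agentic resistance test, not the real harness.

    Returns the fraction of malicious intents the target model resisted
    across a full adaptive attack campaign.
    """
    resisted = 0
    for intent in intents:
        attacker_agent.reset(goal=intent)           # hypothetical agent API
        compromised = False
        for _ in range(max_turns):
            prompt = attacker_agent.next_attempt()  # plan the next tactic
            response = target_model.generate(prompt)
            if attacker_agent.goal_achieved(response):
                compromised = True
                break
            attacker_agent.observe(response)        # iterate and adapt
        if not compromised:
            resisted += 1
    return resisted / len(intents)
```

The difference from a static benchmark is persistence: the target only earns credit for an intent if it holds up across the attacker’s entire adaptive campaign, not just the first prompt.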

That’s also why we now have two leaderboards. It wasn’t a design choice so much as a reality check. Some models that look strong on CASI crumble under agentic pressure. Others hold their ground.

A Case in Point: Qwen3

Let’s take Qwen3. Recently released by Alibaba, it’s objectively a good model. On CASI, it landed in 5th place, well within our top 10: respectable, not the best, not the worst. To put this in perspective, only the top 1% of models even make it onto our CASI Leaderboard. Impressive, right?

But on the ARS Leaderboard, Qwen3 didn’t even place. Why? Because when we tested it against five malicious intents, each designed to represent genuinely harmful outcomes such as facilitating self-harm, it failed every single one. In other words, it had zero resistance.

In comparison, the top model on the ARS Leaderboard is Anthropic’s Claude 3.5 Sonnet. Out of the same five malicious intents, Claude 3.5 was resilient to three and vulnerable to two. So still a risk, but a significantly reduced one. (Incidentally, Claude 3.5 was also the top-ranked model on the CASI Leaderboard, so there’s some correlation.)

Let’s consider what this means in practical, real-world terms. If you use Qwen3 within an AI system that connects to tools, data stores, or user inputs, you are exposing the entire system to real vulnerabilities.
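To see why, consider the bare-bones agent step sketched below. It is purely illustrative (the prompt format, JSON protocol, and tool wiring are assumptions, not any specific framework), but it shows the core problem: the model’s output decides which tool runs and with what arguments, so a model with zero resistance hands an attacker a lever over every system it’s connected to.

```python
import json

def run_agent_step(model, user_input, tools):
    """Purely illustrative agent step, not a specific framework.

    The model's reply selects the tool and its arguments, so a jailbroken
    model is a direct path to every connected tool or data store.
    """
    decision = model.generate(
        f"User request: {user_input}\n"
        f"Available tools: {list(tools)}\n"
        'Reply with JSON: {"tool": "<name>", "args": {...}}'
    )
    call = json.loads(decision)   # trust is placed entirely in the model's output
    tool_fn = tools[call["tool"]]
    # If an attacker coerces the model into picking, say, an export tool
    # with an attacker-controlled destination, the damage happens here.
    return tool_fn(**call["args"])
```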

This is the nuance CASI alone can’t show you. You need both lenses: CASI for general security posture, ARS for system-level resilience under coordinated attack.

The Bottom Line

We didn’t build these scores for vanity. We built them because attackers in an AI environment won’t stay static, and your security can’t be either.

If you’re choosing a model based on its performance alone, its MMLU score or its speed, you are missing the part that matters most. So the next time someone tells you a model is “state-of-the-art,” ask: is it secure? Because if it’s not, it’s not ready.


About the Author

James White
VP, Engineering, AI Security

James White is an accomplished engineer and business leader with nearly two decades of experience in the enterprise software industry.


