
Introducing the CASI Leaderboard

Lee Ennis
Published September 29, 2025

AI adoption is accelerating faster than any technology before it. 

What started as a few large models and vendors has proliferated into a vast ecosystem of open-source and commercial AI models, each with its own advantages and risks. With millions of models to choose from, enterprises adopting AI need transparent risk insights that show exactly what threats each model brings into their environment.

Following F5’s acquisition of CalypsoAI, we are excited to introduce the Comprehensive AI Security Index (CASI) Leaderboard to give AI and GRC leaders detailed insights into the different risk compositions of the most prominent AI models. Founded in 2018, CalypsoAI has been a pioneer in AI security research, creating one of the largest AI vulnerability libraries and regularly updating it with 10,000+ new attack prompts each month. From this foundation, the leaderboard’s testing holistically assesses the security of base models and full AI systems, focusing on the most popular models and the models deployed by our customers.

How does the CASI testing work?

The leaderboard and its underlying tests were developed to align with the business need of selecting a production-ready model, helping CISOs and developers build with security at the forefront. It cuts through the noise in the AI space, distilling complex questions about model security into five key metrics:

  1. CASI Score - A composite metric designed to measure the overall security of a model (methodology below).
  2. AWR Score - Evaluates how an attacker can leverage a model to compromise an entire AI system. We do this by unleashing a team of autonomous attack agents trained to probe the system, extract information, and compromise infrastructure. In practice, these agents attempt to extract sensitive PII from vector stores, map the system architecture, and test model alignment with explicit instructions.
  3. Performance - The model’s average performance across popular benchmarks such as MMLU, GPQA, MATH, and HumanEval.
  4. Risk-to-Performance Ratio (RTP) - Insight into the tradeoff between model safety and performance (see the illustrative sketch after this list).
  5. Cost of Security (CoS) - The current inference cost relative to the model’s CASI score, assessing the financial impact of security.
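
To make the two ratio metrics concrete, here is a minimal sketch of how an RTP-style ratio and a CoS-style figure could be derived from a CASI score, an average benchmark score, and an inference price. The function names, scales, and arithmetic below are illustrative assumptions for this post, not the leaderboard’s published methodology.

```python
def risk_to_performance(casi_score: float, avg_benchmark: float) -> float:
    """Illustrative RTP: residual risk per unit of measured capability.

    Assumes casi_score and avg_benchmark are both on a 0-100 scale;
    a lower ratio means less risk is being traded for performance.
    """
    residual_risk = 100.0 - casi_score
    return residual_risk / max(avg_benchmark, 1e-9)


def cost_of_security(price_per_million_tokens: float, casi_score: float) -> float:
    """Illustrative CoS: blended inference price (USD per 1M tokens, assumed unit)
    per point of CASI, i.e. what each point of security effectively costs."""
    return price_per_million_tokens / max(casi_score, 1e-9)


# Made-up example: CASI 82, benchmark average 74, $3.50 per million tokens.
print(round(risk_to_performance(82.0, 74.0), 3))   # 0.243
print(round(cost_of_security(3.50, 82.0), 4))      # 0.0427
```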

The Comprehensive AI Security Index (CASI) Leaderboard on F5 Labs.

What is the CASI score and why does it matter?

CASI is a metric developed to answer a deceptively complex question: “How secure is my model?” A higher CASI score indicates a more secure model or application. Many studies on attacking or red-teaming models rely on Attack Success Rate (ASR), but ASR treats every attack as equal, which is misleading because it overlooks the impact of each attack. An attack that bypasses a bicycle lock should not be equated with one that compromises nuclear launch codes. Similarly, a small, unsecured model might be compromised by a simple request for sensitive information, while a larger model might require sophisticated techniques, such as autonomous, coordinated agentic AI attackers, to break its alignment. CASI captures this nuance by distinguishing between simple and complex attacks and by establishing a model’s Defensive Breaking Point (DBP): the path of least resistance and the minimum compute resources required for a successful attack.
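
The difference between a raw attack success rate and a severity-aware score like CASI is easiest to see in code. The sketch below is a simplified illustration assuming each attack attempt carries a severity weight and a complexity cost; the actual CASI weighting and DBP calculation are CalypsoAI’s own and are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    succeeded: bool
    severity: float    # assumed 0.0 (nuisance) to 1.0 (critical)
    complexity: float  # assumed relative effort/compute needed to run the attack

def attack_success_rate(results: list[AttackResult]) -> float:
    """Traditional ASR: every successful attack counts the same."""
    return sum(r.succeeded for r in results) / len(results)

def severity_weighted_score(results: list[AttackResult]) -> float:
    """Illustrative CASI-style score (0-100, higher is more secure): successes
    are weighted by impact, so a bicycle-lock bypass hurts the score far less
    than an attack that leaks critical secrets."""
    total_impact = sum(r.severity for r in results)
    breached_impact = sum(r.severity for r in results if r.succeeded)
    return 100.0 * (1.0 - breached_impact / total_impact)

def defensive_breaking_point(results: list[AttackResult]) -> float | None:
    """Illustrative DBP: the cheapest (lowest-complexity) successful attack,
    i.e. the path of least resistance; None if nothing broke through."""
    successes = [r.complexity for r in results if r.succeeded]
    return min(successes) if successes else None
```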

What is the AWR score?

Standard AI vulnerability scans provide a baseline view of model security, but they only scratch the surface of how an AI system might behave under real-world attack.

To address this gap, we leverage F5 AI Red Team, a sophisticated red-teaming technology that commands swarms of autonomous AI agents simulating a team of persistent, intelligent adversaries. These agents probe, learn, and adapt, executing multi-step attacks designed to reveal critical weaknesses that static tests often miss.

This rigorous testing process produces the AWR Score, a quantitative measure of an AI system’s defensive strength, rated on a scale of 0 to 100. A higher AWR score indicates that a system requires a more sophisticated, persistent, and informed attacker to compromise it. This benchmarkable number, derived from complex attack narratives, is calculated across three critical categories:

  • Required Sophistication – What is the minimum level of attacker ingenuity needed to breach the AI? Can the system withstand advanced, tailored strategies, or does it succumb to simpler, common attacks?
  • Defensive Endurance – How long can the AI system remain secure under a prolonged, adaptive assault? Does it crumble after a few interactions or endure against persistent, evolving attacks?
  • Counter-Intelligence – Is the AI unintentionally training its attackers? This vector measures whether a failed attack exposes critical intelligence, such as the nature of the system’s filters, inadvertently providing a roadmap for future exploits (a simplified scoring sketch follows this list).
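
As a rough illustration of how three category ratings could roll up into a single 0-to-100 AWR-style number, the sketch below simply takes a weighted average with equal weights. The category names mirror the list above, but the weights and aggregation are assumptions for explanation only; the actual AWR methodology is not reproduced here.

```python
def awr_style_score(required_sophistication: float,
                    defensive_endurance: float,
                    counter_intelligence: float,
                    weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Illustrative roll-up of three category ratings (each assumed 0-100)
    into one 0-100 score; equal weights are an assumption, not the published method."""
    categories = (required_sophistication, defensive_endurance, counter_intelligence)
    return sum(w * c for w, c in zip(weights, categories))

# Example: resists sophisticated attackers (85), endures long campaigns (70),
# but leaks hints about its filters when attacks fail (40).
print(round(awr_style_score(85, 70, 40), 1))  # 65.0
```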

What are the latest trends?

Our team at F5 Labs has published a detailed analysis of the trends observed in our September testing. For in-depth insights into the techniques, vulnerabilities, and exploits on the rise, check back each month to stay up to date on the state of AI security.

Keeping pace with the AI model landscape

The AI attack surface will continue to evolve, and F5 is committed to empowering organizations with the insights they need to adapt their AI security in stride. As with any new technology, AI will always carry a non-zero degree of risk. The first step toward comprehensive AI security is understanding where those risks exist, and the CASI Leaderboard will continue to shape that understanding as the AI model landscape shifts.

Interested in more insights? With F5 AI Red Team, the same agentic red-teaming we use to evaluate base models can be tailored to your own AI environment for even deeper insights.