LLM evaluation: Building trust with security scoring

F5 ADSP | June 30, 2025

As enterprise AI adoption accelerates, LLM evaluation is no longer just about performance; it’s about trust. Security teams must validate not only how well a model performs, but how safely it behaves. Yet most evaluation frameworks focus on accuracy benchmarks, leaving security and risk factors vague or unmeasured.

This lack of clarity is dangerous. Without explainable, benchmarkable security metrics, there’s no credible way to compare models, demonstrate resilience, or make risk-adjusted decisions about how and where AI should be deployed.

At F5, we’ve spent many months working with enterprise security leaders who’ve echoed a consistent concern: “We need to explain to internal stakeholders what a score means, how it was derived, and what we can do to improve it.”

This post explores how to build that trust by understanding the two complementary scores that now form the backbone of secure GenAI evaluation: the Comprehensive AI Security Index (CASI) and the Agentic Resistance Score (ARS).

The Problem with AI Security "Scoring" Today

When enterprises test AI models, they often get one of two things: a binary outcome (did the model break or not?) or a vague risk rating (Low/Medium/High) with little detail behind either.

Neither approach holds up when AI systems enter production and begin interacting with sensitive data, external users, or downstream agents.

Security leaders need more than attack success rates; they need visibility into:

  • Severity: How dangerous is a successful attack?
  • Complexity: How hard is it to exploit the model?
  • Context: Does it break under real-world, multi-turn, or application-level use?

That’s why we developed two distinct scoring systems, each addressing a different layer of AI system security.

CASI: A Model's Security DNA

The Comprehensive AI Security Index (CASI) helps teams add security rigor to their LLM evaluation process by measuring more than just jailbreak success. CASI scores foundational models on a scale of 0–100, incorporating:

  • Severity of impact: Not all failures are equal—disclosing credentials is worse than answering trivia.
  • Attack complexity: Models that break under simple phrasing aren’t as secure as those that require advanced adversarial strategies.
  • Defensive breaking point: How quickly and under what conditions does a model's alignment collapse?

This lets security teams choose models with high resilience—not just those that pass easy tests.
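
CASI’s published number comes from F5’s own methodology; the sketch below is only meant to make the intuition concrete. It shows how severity of impact, attack complexity, and a defensive breaking point could be folded into a single 0–100 figure. Every name, weight, and formula here is a hypothetical illustration, not the actual CASI calculation.

```python
from dataclasses import dataclass

@dataclass
class AttackFinding:
    """One successful red-team finding against a model (hypothetical schema)."""
    severity: float    # 0.0 (harmless) .. 1.0 (e.g., credential disclosure)
    complexity: float  # 0.0 (trivial phrasing) .. 1.0 (advanced adversarial strategy)

def casi_like_score(findings: list[AttackFinding],
                    turns_to_break: int | None,
                    max_turns: int = 20) -> float:
    """Illustrative 0-100 composite: severe, low-effort breaks cost the most.

    Weights and normalization are assumptions for illustration only.
    """
    if not findings:
        return 100.0

    # Severity of impact, discounted by how hard the attack was to pull off.
    penalty = sum(f.severity * (1.0 - f.complexity) for f in findings) / len(findings)

    # Defensive breaking point: adversarial turns endured before alignment collapsed.
    endurance = 1.0 if turns_to_break is None else min(turns_to_break, max_turns) / max_turns

    return round(100.0 * (1.0 - penalty) * (0.5 + 0.5 * endurance), 1)
```

Under this toy scheme, a model that only fails against long, sophisticated attack chains keeps most of its score, while one that leaks credentials to a simple prompt drops sharply, which is the intuition behind weighing severity against complexity.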

For example, Alibaba's Qwen3 scored competitively on CASI, suggesting strong model-level defenses. However, when tested using agentic methods, it failed to withstand more sophisticated, persistent attacks. This highlights the limits of model-only evaluation.

These results are published on the F5 Labs leaderboard, a regularly updated, public resource that ranks the most widely used LLMs based on real-world red-teaming. Unlike conventional performance charts, this leaderboard helps enterprises compare models on security, risk, cost, and system-level resilience, making it a critical tool for informed LLM evaluation.

ARS: When Models Become Systems

While CASI measures foundational model resilience, it doesn’t account for system-level behavior. That’s where the Agentic Resistance Score (ARS) comes in.

ARS measures how an AI system—including any agents, retrieval tools, or orchestration layers—holds up under persistent, adaptive attacks. These tests are executed by autonomous adversarial agents that:

  • Learn from failed attempts
  • Chain attacks across multiple turns
  • Target hidden prompts, vector stores, and retrieval logic (a simplified version of this loop is sketched below)
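
To make that behavior concrete, here is a minimal sketch of such an adaptive, multi-turn attack loop. The three callables it relies on (a `target_chat` function that returns the system’s reply, a `choose_next_probe` strategy that learns from refusals, and an `is_violation` check) are assumptions for illustration; real agentic red-teaming tooling is far more sophisticated.

```python
def adaptive_attack(target_chat, choose_next_probe, is_violation, max_turns=15):
    """Illustrative adversarial-agent loop: retry, adapt, and chain attacks across turns.

    All three callables are supplied by the caller; this is a simplified sketch,
    not a production red-teaming harness.
    """
    history, failed_attempts = [], []

    for turn in range(max_turns):
        # Learn from failed attempts: pick the next probe based on earlier refusals.
        prompt = choose_next_probe(failed_attempts)

        history.append({"role": "user", "content": prompt})
        reply = target_chat(history)
        history.append({"role": "assistant", "content": reply})

        # e.g., a leaked hidden prompt, vector-store contents, or sensitive data
        if is_violation(reply):
            return {"breached": True, "turns": turn + 1, "transcript": history}

        # Record the refusal so later probes can chain context across turns.
        failed_attempts.append({"prompt": prompt, "reply": reply})

    return {"breached": False, "turns": max_turns, "transcript": history}
```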

Scored from 0–100, ARS is built around three dimensions:

  1. Required Sophistication: How clever does the attacker need to be?
  2. Defensive Endurance: How long can the system resist?
  3. Counter-Intelligence: Does the system reveal useful attack clues even when it blocks the initial threat?

A higher ARS score means your AI system is not just secure in isolation. It can withstand contextual, agentic attacks that mimic what real threat actors are already testing in the wild.
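
As with CASI, the real ARS formula lives in the published methodology; the sketch below only shows how the three dimensions could roll up into a single 0–100 number. The field names, caps, and weights are hypothetical.

```python
def ars_like_score(attempts_needed: int,
                   turns_survived: int,
                   leaked_clues: int,
                   max_attempts: int = 50,
                   max_turns: int = 20) -> float:
    """Illustrative 0-100 roll-up of the three ARS dimensions (weights are assumptions).

    attempts_needed: distinct attack strategies tried before the first breach
                     ("required sophistication"); pass max_attempts if never breached.
    turns_survived:  adversarial turns withstood before collapse ("defensive endurance").
    leaked_clues:    refusals that still revealed useful attack hints ("counter-intelligence").
    """
    sophistication = min(attempts_needed, max_attempts) / max_attempts
    endurance = min(turns_survived, max_turns) / max_turns
    counter_intel = 1.0 / (1.0 + leaked_clues)  # every leaked clue erodes the score

    # Hypothetical weighting: sophistication and endurance dominate, clue leakage modulates.
    return round(100.0 * (0.4 * sophistication + 0.4 * endurance + 0.2 * counter_intel), 1)
```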

Why Security Leaders Need Both

Security leaders evaluating AI deployments, especially those involving RAG architectures, autonomous agents, or complex orchestration workflows, need a layered view of trust. CASI helps choose a foundation. It tells you whether a model’s built-in defenses are robust enough for enterprise-grade applications. ARS validates your deployment. It shows whether your custom workflows or integrated systems introduce new vulnerabilities, even when the base model scores well.
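
One practical way to use the two scores together is as a deployment gate: the foundation model has to clear a CASI bar before it is considered at all, and the assembled system has to clear an ARS bar before it ships. The thresholds below are illustrative placeholders, not recommended values.

```python
def deployment_gate(casi: float, ars: float,
                    casi_floor: float = 80.0, ars_floor: float = 70.0) -> str:
    """Illustrative risk-adjusted gate combining model-level and system-level scores."""
    if casi < casi_floor:
        return "reject: choose a more resilient foundation model"
    if ars < ars_floor:
        return "hold release: harden guardrails, prompts, or the retrieval layer"
    return "approve: deploy with continuous re-testing and monitoring"
```

A system built on a high-CASI model can still be held back by a low ARS if its retrieval layer or orchestration logic introduces new weaknesses, which is exactly the gap layered scoring is meant to surface.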

Recent model testing illustrates this divergence. In June, Anthropic Claude 3.7 Sonnet's ARS improved while its CASI dipped slightly, signaling that its system-level behavior had been hardened, even as some model-level vulnerabilities remained.

Transparency Drives Actionability

Scoring isn’t useful unless it drives decisions. Security leaders tell us they need:

  • Explainable scoring methodologies that can be shared with risk committees and product teams
  • Clear, numerical indicators that can be benchmarked, tracked, and improved
  • Application-aware red-teaming that exposes system-level weaknesses (not just model flaws)

That’s why we publish scoring methodologies, provide prompt-level logs in red-team reports, and support continuous security testing across both models and AI systems.

Trust in AI doesn’t come from vague assurance. It comes from scoring that reflects real risk, testing that mirrors real threats, and reporting that leads to real improvements.

Trust Is Built, Not Claimed

AI is only as trustworthy as the process you use to evaluate and monitor it. Transparent security scoring—at both the model and system level—gives security teams the language, evidence, and confidence they need to deploy GenAI safely.

And for enterprises working across regulated industries, high-risk domains, or user-facing AI agents, that confidence is mandatory.

About the Author

Jessica Brennan
Senior Product Marketing Manager
