Fallacy Failure Attack

Table of Contents

Introduction

Welcome to our AI Security Insights for November 2025. These insights are drawn from F5 Labs’ Comprehensive AI Security Index (CASI) and Agentic Resistance Scoring (ARS), which together provide rigorous, empirical measurement of model security and agentic attack resilience. This month brings critical findings about AI security vulnerabilities as the industry enters what many are calling the "zero-click era" of AI attacks. We're witnessing a fundamental shift where traditional security assumptions about isolated AI interactions no longer hold.

AI Leaderboards November 2025

November 2025 brings a seismic shift in AI security leadership. To view the current scores and positions of the top 10 models, head over to the AI leaderboards. Figure 1 represents the top 10 ranked AI models based on CASI scores. Models will appear and drop off visualization of the top 10, as in Figure 1, due to their scores in any given month, and whether the model provider has launched new models or retired older ones.

Figure 1: Visualization of the top CASI scoring AI models over the past 6 months.

Anthropic's Clean Sweep

For the first time in our testing history, a single provider has claimed the entire top tier of the CASI leaderboard.

Anthropic's Unprecedented Performance:

Claude Haiku 4.5 (95.89 CASI) - Improving on the previous version while maintaining 41.7% performance and exceptional cost efficiency at $0.74 per million tokens
Claude Sonnet 4.5 (95.86 CASI) - Delivering top-tier security with enterprise-grade performance (49.6%) at competitive pricing ($18.78 CoS)
Claude Sonnet 4 (95.33 CASI) - Maintaining the security standard set by its predecessor while balancing 44.4% performance
Claude 3.7 Sonnet (82.27 CASI) - Even the older generation outperforms most competitors' latest releases

The Cost-Security Sweet Spot: Remarkably, Claude Haiku 4.5 achieves perfect security at less than 5% the cost of many competing models. This demolishes the traditional assumption that security requires sacrificing either performance or economics.

Claude's dominance isn't accidental. Several factors converge:

Constitutional AI at Scale: Anthropic's Constitutional AI approach, which trains models to be harmless through self-critique and revision, has matured significantly. Claude Haiku 4.5's perfect security score suggests this methodology scales effectively even to smaller, faster models.

The Safety-Performance Tradeoff Myth: Conventional wisdom held that maximum security requires sacrificing capability. Claude Sonnet 4.5 demolishes this assumption with 95.86 CASI while delivering 49.6% average performance—competitive with models scoring 40+ points lower on security.

Economic Viability: At $0.74 per million tokens, Claude Haiku 4.5 proves enterprise-grade security doesn't require enterprise-grade budgets. This pricing puts secure AI within reach of organizations that previously couldn't afford top-tier models.

GPT-5 Family: Strong but Slipping

While GPT-5 models maintain respectable security scores, they've been displaced from the top tier:

GPT-5-nano (85.55 CASI) - Still the strongest small OpenAI model, but now 5th overall
GPT-5-mini (84.53 CASI) - Maintaining the "small model advantage" with high performance (60.8%)
GPT-5 base (80.31 CASI) - Down from our September projections, now 7th place with concerning vulnerability to this month's attacks

The Integration Layer Crisis: The drop in GPT-5's relative standing correlates with this month's disclosure of seven critical zero-click vulnerabilities affecting ChatGPT. These attacks require no direct user interaction—compromise occurs through innocent queries alone.

Attack Spotlight: The Fallacy Failure Attack (FFA)

This month's CASI leaderboard incorporates a sophisticated new attack vector called Fallacy Failure, based on the paper titled Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks (Zhou et al., 2024). Researchers introduced a novel vulnerability known as the Fallacy Failure Attack (FFA). The research reveals a fundamental weakness in large language models (LLMs): their inability to reliably produce false but plausible reasoning on demand. This shortcoming can be exploited to bypass model safeguards and elicit responses the model should refuse to give.

What Is the Fallacy Failure Attack?

LLMs are designed to produce coherent, factually correct text, but they struggle to intentionally generate reasoning that is false yet convincing. The Fallacy Failure Attack exploits this by asking a model to produce a fallacious or incorrect explanation for a malicious or restricted task. Because the model interprets the request as harmless (it appears to be a request for an “academic” or “fictional” mistake) it relaxes its internal safety filters.

However, when attempting to create a “wrong” answer, the model often fails to maintain false reasoning. Instead, it inadvertently produces the correct and potentially dangerous information the attacker sought to obtain. In effect, the model becomes an “involuntary truth-teller,” revealing accurate instructions while believing it is generating harmlessly incorrect text.

Why It Matters

The FFA represents a new class of jailbreak technique. Instead of directly asking the model to perform a banned action, the attacker frames the request as an exercise in generating a wrong or fallacious example. The framing bypasses many traditional safety mechanisms because the model’s content filter treats “incorrect reasoning” as low-risk.

How the Attack Works

The Fallacy Failure Attack typically contains four components: a malicious query, a framing that requests fallacious reasoning, an explicit requirement for deceptiveness (e.g., “make it sound plausible”), and a context or scenario that normalizes the request, such as a fictional or academic setting.

For example, a prompt might ask for “an incorrect way to build a ransomware sample for a cybersecurity class project.” The model interprets this as a safe, hypothetical exercise—but because it cannot easily sustain false logic, it may output a functional description of how to do it correctly. In doing so, the model unintentionally reveals dangerous information that circumvents its own restrictions.

Implications for Security and AI Safety

The Fallacy Failure Attack has serious implications for both AI alignment and cybersecurity. It demonstrates that models can be manipulated through indirect framing rather than overtly malicious instructions, undermining confidence in safety guardrails.

For practitioners building AI-driven systems in sensitive domains (such as code generation, security analysis, or automation) this highlights the need to treat all “hypothetical” or “incorrect” reasoning requests as potentially exploitable. Defensive strategies may require reasoning verification layers or explicit training on the difference between safe hypothetical discussion and actionable instructions.

Ultimately, the FFA underscores a paradox at the heart of current LLMs: their alignment to truth makes them vulnerable to deception. When asked to produce falsehoods, they often reveal the very truths they were designed to conceal.

Wider Trends & News: The Security-Capability Tension

Beyond the addition of FFA testing, our AI research team have also uncovered additional trends over the past month.

The Agentic Attack Surface Explosion

Exposing AI chatbots to external tools and systems, a requirement for building AI agents—dramatically expands the attack surface by presenting more avenues for threat actors to conceal malicious prompts.

New Attack Techniques Emerging:

Agent Session Smuggling: Exploiting Agent2Agent (A2A) protocol to inject instructions between client requests and server responses
Prompt Inception: Steering AI agents to amplify bias or falsehoods for disinformation at scale
PROMPTFLUX Malware: Using Gemini AI to rewrite malware code hourly for improved obfuscation

Agentic Browsers: The Unsecured Frontier

October 2025 exposed browser-based AI agents as the most dangerous attack surface in modern computing. Systematic failures were demonstrated across every major agentic browser, with ChatGPT Atlas achieving only 5.8% malicious page detection compared to 53% for Edge and 47% for Chrome.¹

New Attack Vectors:

Screenshot-based injection hiding malicious prompts in near-invisible text
Markdown rendering attacks hiding instructions from users while visible to AI
Persistent memory corruption via CSRF vulnerabilities
URL spoofing disguising prompt injections as legitimate addresses

The Fundamental Problem: Traditional web security relies on clear boundaries between code and data. Agentic browsers collapse these boundaries—every webpage becomes executable code because AI interprets natural language instructions regardless of source.

Claude Code Interpreter Vulnerability

Security researcher Johann Rehberger disclosed on October 25th that Claude's code interpreter can be manipulated through indirect prompt injection to exfiltrate sensitive information, including chat histories and uploaded documents.² The flaw: Claude's network restrictions allow access to api.anthropic.com, enabling attackers to use Claude's own API to send stolen data.

Anthropic closed the report within one hour, classifying it as "out of scope" and a "model safety issue" rather than security vulnerability. This highlights how even the most secure base models can be undermined by integration features.

Conclusions

Anthropic has taken unprecedented control of AI security, claiming every top-tier position with models like Claude Haiku 4.5 and Sonnet 4.5 delivering near-perfect security, strong performance, and exceptional cost efficiency. Their Constitutional AI approach appears to have scaled effectively, disproving the long-held assumption that maximum safety requires sacrificing capability or affordability. Meanwhile, OpenAI’s GPT-5 family has slipped from the top tier, impacted by lower security scores and the disclosure of seven critical zero-click vulnerabilities affecting ChatGPT.

View this month's AI Security Leaderboards or see how the F5 AI Red Team conducts the testing behind these results.