AI factories need intelligent infrastructure. New results from The Tolly Group show why.

F5 ADSP | March 17, 2026

The industry is entering a new phase of AI infrastructure.

The first wave of generative AI focused on building and training models. The next phase is about operating AI at scale. Enterprises, hyperscalers, and a new generation of neocloud providers are building what we call AI factories. These environments are designed to generate tokens as efficiently, reliably, and securely as possible.

In these environments, every millisecond of latency and every percentage of GPU utilization matters.

In The Tolly Group’s testing, F5 BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs delivered up to 40% higher token throughput, 61% faster time to first token, and 34% lower request latency compared to traditional load balancing solutions.

GPUs are among the most expensive components in the modern data center. When requests are routed inefficiently or GPUs sit underutilized, organizations are effectively leaving compute capacity and money on the table.

Yet many AI deployments still rely on traditional round-robin load balancing, which distributes requests without understanding the real-time state of the GPU infrastructure. That approach worked well for traditional web applications. It is not sufficient for AI inference.

The importance of GPU-aware infrastructure

AI inference environments are fundamentally different from traditional application architectures.

Inference clusters consist of pools of GPUs running large language models and other AI workloads. At any given moment, each GPU may be handling very different workloads depending on factors such as queue depth, model execution time, token generation length, and memory utilization.

When traffic is distributed blindly, requests can easily be routed to GPUs that are already busy while other accelerators remain underutilized. The result is higher response latency and inefficient infrastructure utilization.

This is why AI infrastructure needs to become workload aware.

At F5, we believe the infrastructure layer must evolve so it can intelligently distribute AI inference requests based on real-time conditions in the environment.

Independent validation from The Tolly Group

To better understand the impact of GPU-aware load balancing, we commissioned The Tolly Group, an independent technology validation firm, to evaluate our approach compared with traditional load balancing solutions.
The results were compelling. In The Tolly Group’s testing, F5 BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs delivered up to:

  • 40% higher token throughput
  • 61% faster time to first token
  • 34% lower request latency

These improvements were measured against widely used open source load balancing solutions.

One notable observation from the testing was that the benefits grew more pronounced with smaller models, where efficient traffic distribution has a proportionally larger impact on overall system performance.

Offloading infrastructure processing with DPUs

Another important dimension of this architecture is infrastructure offload.

In many deployments today, traffic management functions run directly on the server CPU alongside AI workloads. That means valuable CPU resources are consumed by infrastructure tasks rather than application workloads.

By running F5 BIG-IP Next for Kubernetes on NVIDIA BlueField DPUs, traffic management functions can be offloaded from the server CPU to the DPU.

In The Tolly Group’s testing, the difference was significant. BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs required about 2 CPU cores, while traditional CPU-based load balancing required roughly 12 cores.

That frees roughly 10 of the 12 cores, an approximately 80 percent reduction in CPU consumed by traffic management, leaving those resources available for application workloads.

The future of AI infrastructure

As AI factories scale, infrastructure efficiency will become one of the most important factors determining the success of AI deployments.

Organizations will need to maximize GPU utilization, minimize latency, improve token throughput, and ensure infrastructure security and isolation. Achieving those goals requires infrastructure that understands the dynamics of AI workloads, and GPU-aware load balancing and DPU-accelerated infrastructure are important steps toward that future.

At F5, we believe intelligent infrastructure will play a central role in enabling the next generation of AI factories. The results validated by The Tolly Group represent an exciting step in that direction. You can get a full copy of the report here.

Also, be sure to visit the F5 BIG-IP Next for Kubernetes webpage.


About the Author

Scott Calvet
Director, Product Marketing | F5
