AI factories need intelligent infrastructure. New results from The Tolly Group show why.

F5 ADSP | March 17, 2026

The industry is entering a new phase of AI infrastructure.

The first wave of generative AI focused on building and training models. The next phase is about operating AI at scale. Enterprises, hyperscalers, and a new generation of neocloud providers are building what we call AI factories. These environments are designed to generate tokens as efficiently, reliably, and securely as possible.

In these environments, every millisecond of latency and every percentage of GPU utilization matters.

In The Tolly Group’s testing, F5 BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs delivered up to 40% higher token throughput, 61% faster time to first token, and 34% lower request latency compared to traditional load balancing solutions.

GPUs are among the most expensive components in the modern data center. When requests are routed inefficiently or GPUs sit underutilized, organizations are effectively leaving compute capacity and money on the table.

Yet many AI deployments still rely on traditional round-robin load balancing, which distributes requests without understanding the real-time state of the GPU infrastructure. That approach worked well for traditional web applications. It is not sufficient for AI inference.

The importance of GPU-aware infrastructure

AI inference environments are fundamentally different from traditional application architectures.

Inference clusters consist of pools of GPUs running large language models and other AI workloads. At any given moment, each GPU may be handling very different workloads depending on factors such as queue depth, model execution time, token generation length, and memory utilization.

When traffic is distributed blindly, requests can easily be routed to GPUs that are already busy while other accelerators remain underutilized. The result is higher response latency and inefficient infrastructure utilization.

This is why AI infrastructure needs to become workload aware.

At F5, we believe the infrastructure layer must evolve so it can intelligently distribute AI inference requests based on real-time conditions in the environment.

Independent validation from The Tolly Group

To better understand the impact of GPU-aware load balancing, we commissioned The Tolly Group, an independent technology validation firm, to evaluate our approach compared with traditional load balancing solutions.
The results were compelling. In The Tolly Group’s testing, F5 BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs delivered up to:

  • 40% higher token throughput
  • 61% faster time to first token
  • 34% lower request latency

These improvements were measured against widely used open source load balancing solutions.

One notable observation from the testing was that the benefits grew more pronounced with smaller models, where efficient traffic distribution has a proportionally larger impact on overall system performance.

Offloading infrastructure processing with DPUs

Another important dimension of this architecture is infrastructure offload.

In many deployments today, traffic management functions run directly on the server CPU alongside AI workloads. That means valuable CPU resources are consumed by infrastructure tasks rather than application workloads.

By running F5 BIG-IP Next for Kubernetes on NVIDIA BlueField DPUs, traffic management functions can be offloaded from the server CPU to the DPU.

In The Tolly Group’s testing, the difference was significant. BIG-IP Next for Kubernetes running on NVIDIA BlueField DPUs required about 2 CPU cores, while traditional CPU-based load balancing required roughly 12 cores.

That frees roughly 10 of the 12 cores, an approximately 80 percent reduction in CPU consumed by traffic management, leaving those resources available for application workloads.

The future of AI infrastructure

As AI factories scale, infrastructure efficiency will become one of the most important factors determining the success of AI deployments.

Organizations will need to maximize GPU utilization, minimize latency, improve token throughput, and ensure infrastructure security and isolation. Achieving those goals requires infrastructure that understands the dynamics of AI workloads, and GPU-aware load balancing and DPU-accelerated infrastructure are important steps toward that future.

At F5, we believe intelligent infrastructure will play a central role in enabling the next generation of AI factories. The results validated by The Tolly Group represent an exciting step in that direction. You can get a full copy of the report here.

Also, be sure to visit the F5 BIG-IP Next for Kubernetes webpage.


About the Author

Scott Calvet
Director, Product Marketing | F5
