The hidden cost of unmanaged AI infrastructure

F5 ADSP | January 20, 2026

AI infrastructure teams are making one of the largest technology investments in their organization’s history. Some already have GPU clusters in production. Others are still preparing to bring them online. In both cases, the challenge is the same. Once AI infrastructure is deployed, instability, inefficiency, and poor traffic control quickly translate into lost value and operational risk.

When something goes wrong, the impact is immediate. Services slow down. Capacity drops. Teams scramble to recover systems that were working just moments before. Even environments that are not yet fully utilized are exposed. Early instability, misconfiguration, or uneven traffic patterns can undermine confidence in platforms that are expected to support future growth.

F5 BIG-IP Next for Kubernetes approaches AI inference as a systems problem. Networking, load balancing, and security are integrated into a unified data plane that adapts to real-time conditions. This allows inference platforms to scale without sacrificing stability as demand grows.

As AI platforms move into production, many organizations are discovering that pushing GPUs harder is not always the answer. Unmanaged traffic, uneven load, and sudden demand spikes can drive GPUs into unstable operating conditions once utilization ramps. Recovery is rarely instant. Capacity can be unavailable for extended periods, and repeated stress increases operational risk over time.

When GPU investments reach into the tens of millions of dollars, these events are not just technical inconveniences. They are business problems.

This is why conversations about AI efficiency often turn to tokens.

In simple terms, a token is a unit of text processed or generated by an AI model. Tokens can represent parts of words, full words, or punctuation. When a user submits a prompt and an AI system generates a response, it produces tokens one after another. The number of tokens generated over time is a direct measure of how much useful work an AI platform is delivering.

Tokens matter because they connect infrastructure performance to real outcomes. More tokens per second mean more users served, faster responses, and greater revenue potential from the same underlying GPU investment. This is why tokens have become a common shorthand for AI efficiency at an executive level.
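
To make the relationship concrete, here is a minimal sketch of how tokens are counted and how a sustained rate turns into daily output. It assumes the open-source tiktoken tokenizer purely for illustration; any tokenizer, and any throughput figure, could be substituted.

    # Minimal sketch: counting tokens and turning a sustained rate into daily output.
    # Assumes the open-source tiktoken tokenizer (pip install tiktoken); the encoding
    # name and the throughput figure below are illustrative, not measured values.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    response_text = "Intelligent traffic management keeps GPUs busy doing useful work."
    tokens = enc.encode(response_text)
    print(f"{len(tokens)} tokens in this response")

    # A cluster that sustains 120,000 tokens per second delivers roughly:
    tokens_per_second = 120_000
    print(f"{tokens_per_second * 86_400:,} tokens per day")  # about 10.4 billion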

But tokens are not the root cause. They are the result.

Tokens are downstream of infrastructure health

Inference traffic behaves very differently from traditional application traffic. Requests are long-lived, bursty, and highly variable in how much GPU time they consume. When traffic is distributed evenly in theory but unevenly in practice, GPUs swing between idle and overloaded states.

Under sustained load, these imbalances compound. Queues grow. Latency spikes. Retries add pressure instead of relieving it. In production environments, recovery from these conditions is rarely as simple as restarting a process. GPU resets often involve driver reloads, fabric reinitialization, scheduler reconciliation, and application warmup. During this time, inference capacity is effectively unavailable.
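
A back-of-the-envelope queueing sketch shows why retries amplify the problem. The numbers below use a textbook M/M/1 approximation and are illustrative only; they are not drawn from F5 testing.

    # Back-of-the-envelope M/M/1 sketch (illustrative numbers, not F5 test data).
    # Mean requests in the system is rho / (1 - rho), where rho = arrivals / service rate.
    # Retries raise the arrival rate, which is what pushes a busy GPU over the edge.
    def mean_in_system(arrival_rate: float, service_rate: float) -> float:
        rho = arrival_rate / service_rate
        if rho >= 1.0:
            return float("inf")  # the queue grows without bound
        return rho / (1.0 - rho)

    service_rate = 100.0           # requests/sec one GPU can sustain
    base_arrivals = 90.0           # offered load without retries (rho = 0.90)
    retry_arrivals = 90.0 * 1.15   # 15% of requests time out and are resubmitted

    print(mean_in_system(base_arrivals, service_rate))   # ~9 requests queued or in service
    print(mean_in_system(retry_arrivals, service_rate))  # inf: retries pushed rho past 1.0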

In real environments, this recovery window is measured not in seconds, but in meaningful operational time. Even partial loss of capacity across a cluster can reduce service availability for extended periods, especially when workloads need to rebalance and warm back up.

The important realization for many teams is that token throughput is downstream of infrastructure behavior. Stability and traffic control determine whether GPUs spend their time generating value or recovering from overload.

Sustained inference efficiency is an infrastructure problem

One of the clearest lessons from real-world AI deployments is that peak benchmarks are a poor predictor of production performance. What matters instead is sustained, predictable behavior under real load conditions.

Production inference platforms succeed or fail based on their ability to:

  • Handle uneven and bursty demand without creating hot spots
  • Maintain consistent latency as traffic volumes fluctuate
  • Deliver steady throughput hour after hour, not just in short benchmark runs

In controlled performance testing, F5 BIG-IP Next for Kubernetes, running on and accelerated by NVIDIA BlueField-3 DPUs, showed how infrastructure behavior directly shapes inference efficiency. By making traffic decisions using live CPU, GPU, host, and network telemetry, the platform delivered measurable gains over traditional data plane approaches.

The DPU accelerates these gains, but the underlying traffic management benefits apply broadly across AI inference environments.

In these tests, AI inference workloads achieved up to 47% higher token throughput, up to 86% faster time to first token, and up to 45% lower end-to-end latency.

These gains were achieved without changing models or GPU hardware. They were the result of improving how traffic was distributed and managed across the infrastructure, allowing GPUs to operate more consistently under load.

In practical terms, these improvements translate into fewer stalled requests, fewer retries amplifying load, and more time spent doing useful inference work instead of recovering from instability.

These results are documented in F5’s validated performance testing for AI inference at scale. The testing evaluated real inference traffic under sustained load and measured token throughput, time to first token, and end-to-end latency across multiple data plane implementations. The findings reinforce a critical point: infrastructure behavior, not peak benchmarks alone, plays the decisive role in sustained AI performance.
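
For teams instrumenting their own platforms, the three metrics named above are straightforward to capture at the client. The sketch below measures them against a simulated token stream; fake_token_stream is a hypothetical stand-in for a real streaming inference client.

    # Sketch: measuring time to first token, end-to-end latency, and token throughput.
    # fake_token_stream is a hypothetical stand-in for a real streaming inference client.
    import time

    def fake_token_stream(n_tokens: int = 50, delay_s: float = 0.02):
        for i in range(n_tokens):
            time.sleep(delay_s)          # simulated per-token generation delay
            yield f"token{i} "

    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for token in fake_token_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

    total = time.perf_counter() - start
    ttft = first_token_at - start
    print(f"time to first token: {ttft * 1000:.0f} ms")
    print(f"end-to-end latency:  {total * 1000:.0f} ms")
    print(f"token throughput:    {token_count / total:.1f} tokens/sec")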

The takeaway is straightforward. Intelligent traffic management increases sustained token throughput by preventing instability rather than reacting to it after the fact. For organizations running AI at scale, this translates directly into avoided downtime, protected GPU lifespan, and millions of dollars in infrastructure value that would otherwise be lost to inefficiency and recovery events.

How intelligent traffic management changes GPU behavior

Many AI platforms still rely on static rules or simple round-robin approaches to distribute inference traffic. These methods assume all requests behave similarly and that all GPUs respond the same way under load. That assumption breaks down quickly at scale.

Some inference requests complete quickly. Others consume GPU resources for extended periods. When traffic is distributed without awareness of real-time system conditions, a subset of GPUs becomes overloaded while others remain underutilized. This imbalance is one of the most common root causes of instability in production AI environments.

F5 BIG-IP Next for Kubernetes addresses this challenge by making traffic decisions based on live telemetry rather than static assumptions. It continuously evaluates CPU, GPU, host, and network signals to understand where capacity exists and where stress is building. When deployed on NVIDIA BlueField DPUs, this logic runs inline with the traffic itself, offloading networking and security processing from host CPUs and reducing contention for shared resources.
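
The product’s routing logic is far more sophisticated than anything that fits in a blog post, but the basic contrast between static and telemetry-aware decisions can be sketched in a few lines. The telemetry fields and weights below are invented for illustration and do not represent BIG-IP Next for Kubernetes internals.

    # Illustrative contrast between round-robin and telemetry-aware backend selection.
    # The telemetry fields and weights are invented for this sketch; they do not
    # represent BIG-IP Next for Kubernetes internals.
    from dataclasses import dataclass
    from itertools import cycle

    @dataclass
    class Backend:
        name: str
        gpu_util: float        # 0.0-1.0, from GPU telemetry
        queue_depth: int       # pending requests, from host telemetry
        net_saturation: float  # 0.0-1.0, from network telemetry

    def stress_score(b: Backend) -> float:
        # Lower is better; the weights are arbitrary for illustration.
        return 0.5 * b.gpu_util + 0.3 * min(b.queue_depth / 32, 1.0) + 0.2 * b.net_saturation

    backends = [
        Backend("gpu-node-a", gpu_util=0.95, queue_depth=40, net_saturation=0.7),
        Backend("gpu-node-b", gpu_util=0.40, queue_depth=3,  net_saturation=0.2),
        Backend("gpu-node-c", gpu_util=0.65, queue_depth=12, net_saturation=0.4),
    ]

    round_robin = cycle(backends)
    print("round-robin sends the next request to:", next(round_robin).name)  # ignores stress
    print("telemetry-aware choice:", min(backends, key=stress_score).name)   # gpu-node-b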

The result is not only faster responses in the moment, but more stable behavior over time. GPUs spend more time doing useful work and less time recovering from overload conditions. This consistency enables higher sustained token throughput without pushing hardware into unstable operating zones.

Protecting GPU investment through stability

For infrastructure and platform teams, one of the most compelling outcomes of intelligent traffic management is control. Preventing overload conditions reduces the likelihood of extended recovery events and lowers the operational overhead associated with incident response.

More importantly, stability protects capital investment. Modern AI clusters often represent millions of dollars in deployed GPU capacity. Even modest inefficiencies or recurring recovery events can compound quickly when multiplied across dozens or hundreds of GPUs running continuously.

Intelligent traffic management helps address several root causes of infrastructure stress and their downstream business impact:

  • Uneven traffic distribution that creates GPU hot spots and overload conditions
  • Saturation events that trigger recovery cycles and take inference capacity offline
  • Lost availability that affects user experience and service commitments
  • Increased operational effort as teams are pulled into recovery and incident response
  • Accelerated degradation of GPU reliability and predictability over time

Most large operators already plan multi-year depreciation cycles not because GPUs immediately fail, but because sustained stress reduces their usable service life. Preventing these conditions improves both uptime and long-term return on GPU investment.

The same validated performance testing highlights this broader shift. Rather than optimizing for short-lived peak performance, the focus moves to sustained efficiency under load. By smoothing traffic behavior and avoiding overload conditions, platforms can improve token output while maintaining predictable performance over time.

Building durable AI platforms for production

As AI becomes foundational infrastructure, the criteria for success increasingly resemble those of other mission-critical systems. Reliability, predictability, and cost control matter just as much as raw performance.

BIG-IP Next for Kubernetes treats AI inference as a systems problem, integrating networking, load balancing, and security into a unified data plane that adapts to real-time conditions. That integration lets inference platforms scale without sacrificing stability as demand grows.

Tokens remain an important metric. They are the visible outcome of a well-run system, not the starting point. The real leverage comes from managing traffic intelligently so that performance gains do not come at the expense of uptime or hardware longevity.

For AI platform teams building for production, sustained performance is what ultimately determines success. Intelligent traffic management makes that possible by keeping GPUs stable, available, and productive over time.

To learn more, read our whitepaper, “Validated performance for AI inference at scale with F5 BIG-IP Next for Kubernetes.”

About the Author

Scott Calvet, Director, Product Marketing
