Have you ever used an AI-powered app to draft content or generate an image—typed your request, hit enter, and then waited? And waited? Only to have the response finally arrive, slow and off the mark, filled with irrelevant details?
As frustrating as that feels, the real story is what’s happening behind the scenes. Companies that deliver those AI experiences either have to build highly optimized infrastructure themselves, or rely on GPU-as-a-Service and LLM-as-a-Service providers to do it for them.
Making everything look simple on the surface is a massive challenge for those providers. They’re shouldering the burden behind the scenes—keeping GPUs busy, response times tight, and token usage under control—so that we get a fast, reliable experience.
And to complicate things further, in the world of AI infrastructure only one thing is constant: change. Models evolve rapidly. Workloads spike without warning. New security, compliance, or routing needs often emerge faster than release cycles.
That’s why intelligent and programmable traffic management isn’t a “nice-to-have.” It’s a necessity.
With F5 BIG-IP Next for Kubernetes 2.1 deployed on NVIDIA BlueField-3 DPUs, we’re taking traffic management to the next level, combining intelligent load balancing and expanded programmability to meet the unique demands of AI infrastructure.
Traditional load balancing spreads traffic evenly. That works well for web apps, but for AI workloads, even isn't always efficient. A small prompt can't be treated the same way as a massive, token-heavy request; otherwise GPUs overload, inference pipelines stall, or resources sit idle.
BIG-IP Next for Kubernetes 2.1 makes load balancing smarter by drawing on real-time NVIDIA NIM telemetry: pending request queues, key-value (KV) cache usage, GPU load, video random-access memory (VRAM) availability, and overall system health. Using those signals, it intelligently and quickly routes each request to its optimal processing destination.
The impact is clear: requests land on the GPUs best positioned to serve them, so utilization stays high, queues stay short, and response times stay tight.
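To make that concrete, here is a minimal sketch in plain Tcl (the language iRules are based on) of how per-endpoint telemetry signals might be folded into a single routing score. The signal names, weights, and formula are illustrative assumptions, not the product's actual algorithm.

```tcl
# Illustrative only: signal names, weights, and the scoring formula are
# assumptions, not BIG-IP Next for Kubernetes internals.
proc score_endpoint {queue_depth kv_cache_pct gpu_load_pct vram_free_gb} {
    # Lower queue depth, KV cache pressure, and GPU load score better;
    # more free VRAM scores better.
    set penalty [expr {$queue_depth * 2.0 + $kv_cache_pct + $gpu_load_pct}]
    return [expr {$vram_free_gb * 10.0 - $penalty}]
}

# Sample telemetry per endpoint: queue depth, KV cache %, GPU load %, free VRAM (GB).
set telemetry {
    gpu-node-a {12 70.0 85.0 8.0}
    gpu-node-b {3 40.0 55.0 24.0}
}

# Route each request to the endpoint with the highest score.
set best ""
set best_score -1e300
dict for {node signals} $telemetry {
    set s [score_endpoint {*}$signals]
    if {$s > $best_score} {
        set best $node
        set best_score $s
    }
}
puts "route to: $best"
```

The specific weights aren't the point; the point is that routing decisions incorporate live backend state rather than simple round-robin.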
Intelligence gives you efficiency, but programmability gives you control. With enhanced programmability via F5 iRules on BIG-IP Next for Kubernetes 2.1, we’re putting customers in the driver’s seat so they can adapt instantly instead of waiting for the next feature release.
Today that means access to capabilities like LLM routing (steering requests across models and versions in real time), token governance (enforcing quotas and billing directly in the data path), and MCP traffic management (scaling and securing Model Context Protocol traffic between AI agents).
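As a flavor of what that programmability looks like, here is a hypothetical LLM-routing iRule (iRules are Tcl-based). The X-Model header and pool names are made up for illustration, and the exact iRule command subset available on BIG-IP Next for Kubernetes may differ from classic BIG-IP.

```tcl
# Hypothetical sketch: steer each request to a pool backing the requested
# model. Header and pool names are illustrative assumptions.
when HTTP_REQUEST {
    switch -- [HTTP::header "X-Model"] {
        "llama-3-70b" { pool pool_llama3_70b }
        "llama-3-8b"  { pool pool_llama3_8b }
        default       { pool pool_default_llm }
    }
}
```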
And this is just the beginning. The real value of programmability lies in its flexibility: as new models, service level agreements, and compliance requirements emerge, providers can craft their own policies without being limited to out-of-the-box features.
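For example, a simple token-governance policy might be sketched as an iRule like the one below. As a simplification it counts requests per API key in the session table rather than doing true token accounting; the header name, limit, and the table-based approach are assumptions for illustration.

```tcl
# Hypothetical sketch: reject callers that exceed a per-key request count.
# A production policy would track actual token counts parsed from requests
# or responses; the session-table entry's timeout acts as the quota window.
when HTTP_REQUEST {
    set key "quota_[HTTP::header "X-API-Key"]"
    # table incr creates the entry if absent and returns the new count.
    if { [table incr $key] > 1000 } {
        HTTP::respond 429 content "quota exceeded" "Retry-After" "60"
        return
    }
}
```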
The combination of intelligence and programmability in BIG-IP Next for Kubernetes 2.1 isn’t just about performance—it’s designed to help make AI infrastructure more predictable, more adaptable, and more cost efficient.
Whether an AI cloud provider is delivering GPU capacity for compute, AI models, or both, they can now scale without overbuilding, monetize without complexity, secure without slowing down, and customize without rewrites.
For providers, this means less time wasted putting out fires and more focus on innovation and growth. For customers, it means responses that are faster, sharper, and more reliable. These are the behind-the-scenes infrastructure wins that make every AI interaction feel effortless—and deliver the kind of AI experiences that keep users coming back.
Check out these short demos to learn how BIG-IP Next for Kubernetes powers AI workloads:
AI Token Reporting and Security with BIG-IP Next for Kubernetes
Scaling and Managing Traffic for MCP with BIG-IP Next for Kubernetes
You can also learn more on the F5 AI solutions page.