Inference: The most important piece of AI you’re pretending isn’t there

F5 Research | September 29, 2025

Lori Mac Vittie, Distinguished Engineer and Chief Evangelist

Everyone wants to talk about AI like it begins and ends with APIs. With models. With shiny dashboards that say, "inference complete." But that illusion only holds if you never lift the hood.

Underneath every chatbot, agent, RAG pipeline, and orchestration layer, there’s an inference server. Not a metaphor. Not a buzzword. A literal application server that happens to be running a model instead of a JAR file. And just like traditional application servers, inference engines are where performance breaks, where observability matters, and where your security surface actually lives.

The problem? Almost no one is treating them that way.

Inference in the enterprise isn’t theoretical

According to the Uptime Institute's 2025 AI Infrastructure Survey, 32% of data center operators are already supporting inference workloads. Another 45% say they’ll be doing so in the next few months. That’s not experimental. That’s a shift in the compute substrate. And it’s a shift we’re still mostly blind to.

Inference servers aren’t theoretical. They have names. vLLM. TGI. Triton. Ollama. And they are not interchangeable. vLLM, for example, has been shown to outperform Hugging Face Transformers by up to 24x, and beats TGI by more than 3x in sustained throughput thanks to architectural improvements like PagedAttention and batched scheduling. These aren’t optimization quirks. They’re infrastructure consequences.

We’re talking real numbers: vLLM sustains over 500 tokens per second in batch mode versus TGI’s sub-150. Prompt evaluation durations drop by over 40%, which translates directly into faster response times and better GPU utilization. In a production loop, that’s the difference between scaling inference and stalling under load.

And it doesn’t stop at performance. Tools like vLLM and Ollama expose detailed telemetry: total duration, token-level evaluation windows, prompt-vs-response splits. Not just token counts, but when, where, and how long each token took to compute. That level of granularity is how you troubleshoot drift. It’s how you enforce guardrails. And if you don’t have it, you’re scaling blind.
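As a rough illustration of what that telemetry makes possible, the sketch below derives prompt-vs-response splits and token rates from an Ollama-style response. The field names (total_duration, prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, all durations in nanoseconds) follow Ollama's documented generate response. The sample values and the summarize_timing helper are hypothetical, and other engines expose similar data under different names.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_timing(resp: dict) -> dict:
    """Derive per-request rates and splits from an Ollama-style response.

    `summarize_timing` is a hypothetical helper, not part of any library.
    """
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    gen_s = resp["eval_duration"] / NS_PER_S
    total_s = resp["total_duration"] / NS_PER_S
    return {
        # How fast the prompt was evaluated
        "prompt_tokens_per_s": resp["prompt_eval_count"] / prompt_s if prompt_s else 0.0,
        # How fast the response was generated
        "response_tokens_per_s": resp["eval_count"] / gen_s if gen_s else 0.0,
        # Fraction of total time spent on the prompt
        "prompt_share": prompt_s / total_s if total_s else 0.0,
        # Time not accounted for by prompt or generation (model load, queuing, etc.)
        "overhead_s": total_s - prompt_s - gen_s,
    }


# Made-up sample values, for illustration only
sample = {
    "total_duration": 4_000_000_000,
    "prompt_eval_count": 50,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 300,
    "eval_duration": 3_000_000_000,
}
summary = summarize_timing(sample)
```

Tracked per request over time, a rising prompt_share or overhead_s is the kind of signal that points to a bottleneck before users report it.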

Like the application servers that came before them, inference servers are where application delivery and security meet AI. They are where traffic steering and load balancing happen; where payloads are inspected, analyzed, and acted on to ensure security and privacy; where prompts are sanitized, responses are filtered, and performance is optimized. Inference is the strategic point of control in AI architectures at which organizations can address the top ten delivery challenges that always plague applications and APIs, whether legacy, modern, or AI.
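To make "prompts are sanitized, responses are filtered" concrete, here is a minimal sketch of a guard wrapped around an inference call. Everything in it is an assumption for illustration: the function names, the single injection pattern, and the email-redaction rule are hypothetical and nowhere near production-grade detection. The point is only where the control sits, not how it is implemented.

```python
import re
from typing import Callable

# Illustrative patterns only; real policies would be far broader.
INJECTION_PATTERNS = [re.compile(r"ignore (all )?previous instructions", re.I)]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def sanitize_prompt(prompt: str) -> str:
    """Reject prompts matching a known injection pattern, then trim whitespace."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            raise ValueError("prompt rejected by policy")
    return prompt.strip()


def filter_response(text: str) -> str:
    """Redact email addresses before the response leaves the boundary."""
    return EMAIL.sub("[redacted]", text)


def guarded_call(prompt: str, infer: Callable[[str], str]) -> str:
    """Run inference with inspection on the way in and on the way out.

    `infer` stands in for whatever client calls the inference server.
    """
    return filter_response(infer(sanitize_prompt(prompt)))


# Usage with a stand-in for the model call
fake_infer = lambda p: "Contact alice@example.com for help"
result = guarded_call("  Who owns this ticket?  ", fake_infer)
```

Because the guard sits in front of the inference server rather than inside the application, the same policy applies to every chatbot, agent, or pipeline that shares that server.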

Why inference gets left behind

Inference is often overlooked because we’re still stuck in API-land. But if you think inference is just another service behind an ingress, you haven’t tried debugging a RAG loop under load. Or tracing misfires across concurrent agent chains. Or dealing with prompt injection against a large language model (LLM) in a regulated environment, where every decision has to be logged for audit.

That’s not a theoretical problem. That’s a network bottleneck waiting to happen.

Inference servers are the container for your model. They are the runtime. The choke point. The security boundary. The place where you actually scale AI. A model is math. It’s a dataset, a fancy Excel spreadsheet. You don’t scale that; you load it into an inference server and that’s what you scale.

So if you’re serious about operationalizing AI, stop talking about abstract architecture diagrams and start asking harder questions:

What inference engines are we running?
Where are they deployed?
Who can access them?
What telemetry do we collect per request?
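One way to keep those four questions answerable is to record the answers on every request. The sketch below is a hypothetical per-request log record; the class name, field names, and sample values are assumptions for illustration, not a schema from any particular engine.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InferenceRecord:
    """Hypothetical per-request record covering the four questions above."""
    engine: str           # which inference engine served the request
    deployment: str       # where that engine is deployed
    caller: str           # who accessed it
    prompt_tokens: int    # telemetry collected for this request
    response_tokens: int
    total_ms: float

    def to_log_line(self) -> str:
        # Sorted keys keep log lines stable and easy to diff
        return json.dumps(asdict(self), sort_keys=True)


# Made-up values, for illustration only
record = InferenceRecord(
    engine="vllm",
    deployment="dc-east/gpu-pool-1",
    caller="svc-rag",
    prompt_tokens=120,
    response_tokens=340,
    total_ms=912.5,
)
line = record.to_log_line()
```

If a field in a record like this cannot be filled in for a given request, that gap is itself the answer to one of the questions.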

These aren’t academic concerns. They’re infrastructure truths. And the longer we ignore them, the more brittle our AI deployments become. Models matter. APIs help. But inference is where reality asserts itself. If you’re not scaling inference, you’re not scaling AI.

Inference is a critical component of AI infrastructure

Most organizations are still hybrid when it comes to AI, relying on SaaS-based tools for convenience while cautiously exploring self-hosted inference. The problem is, SaaS hides the hard parts. Inference is abstracted behind slick APIs and polished UIs. You don’t see the engine misfire, the GPU choke, or the prompt timing drift. But the minute you step into self-hosted territory (and you will) you inherit all of it. Performance, observability, and security aren’t just “nice to haves.” They’re prerequisites.

If your organization doesn’t understand how inference actually works under the hood, you’re not building an AI strategy. You’re just hoping someone else got it right.

Tags: Office of the CTO
