If cloud-native apps were a handful of garden hoses, AI inference and agent systems are an industrial sprinkler array controlled by a toddler with a fire helmet and a juice box. You think you know where the water is going, but you don’t. By the time you realize it, something expensive is already soaked.
Inference workloads and agent-driven execution don’t behave like traditional services. They don’t follow predictable request paths, they don’t repeat workflows consistently, and they don’t fail politely. They shape their own runtime, redirect based on partial results, retry when they think it’s wise, and consume resources in highly variable bursts. If your observability strategy is built on averages, rollups, or 30-second dashboards, that’s operational malpractice.
Impact on performance
Inference latency isn’t a single metric, and it isn’t tied to a single tier. It’s shaped by model selection, token count, GPU queue depth, upstream data access time, and whatever prompt gymnastics an agent performs mid-flight. Agents make this even more chaotic by branching, chaining, or retrying without announcing their plans.
If you can’t see performance at the level of intent, token generation, model pathway, and execution lineage, you will blame the wrong component, optimize the wrong tier, or scale the wrong resource. Meanwhile, users (human or automated) will simply conclude: “AI is slow.”
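To make "latency isn't a single metric" concrete, here is a minimal sketch in Python that records latency per stage instead of as one number. Everything in it is an assumption for illustration: the `InferenceTrace` class, the stage names, the intent label, and the sample values are hypothetical, not taken from any particular product or tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceTrace:
    """Latency for one inference call, split by stage (all values in milliseconds)."""
    intent: str          # hypothetical label for what the caller was trying to do
    model: str           # which model served the call
    output_tokens: int
    stages_ms: dict = field(default_factory=dict)

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

    def dominant_stage(self) -> str:
        # The stage that contributed the most latency to this call.
        return max(self.stages_ms, key=self.stages_ms.get)

    def ms_per_output_token(self) -> float:
        # Generation time normalized by output length, so a long answer
        # is not mistaken for a slow model.
        return self.stages_ms.get("generation", 0.0) / max(self.output_tokens, 1)


# Illustrative values only.
trace = InferenceTrace(
    intent="summarize_ticket",
    model="model-a",
    output_tokens=200,
    stages_ms={"data_access": 120.0, "gpu_queue": 900.0, "generation": 400.0},
)
print(trace.total_ms(), trace.dominant_stage(), trace.ms_per_output_token())
```

In this made-up trace the call looks slow end to end, but most of the time was spent waiting in the GPU queue rather than generating tokens. An average over total latency would hide exactly that distinction.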
Impact on availability
Traditional health checks only answer, “Did it respond?” AI workloads need, “Did it respond within expectations for this task?”
Inference systems rarely go down; instead, they degrade into something polite but unusable. Slow answers, stale cache hits, or quiet hallucinations can all look like success unless you have visibility into task type, expected latency floor, and quality thresholds. With agents piling on decisions, retries become invisible loops that look like normal traffic right up until something collapses.
The operational nightmare here isn’t downtime; it’s incorrect, incomplete, or context-inappropriate success.
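The difference between "Did it respond?" and "Did it respond within expectations for this task?" can be sketched as a task-aware check. This is a minimal Python illustration under stated assumptions: the task types, thresholds, and `check_response` function are hypothetical, and the quality score is assumed to come from some separate evaluator that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskExpectation:
    max_latency_ms: float   # latency budget for this task type
    min_quality: float      # minimum acceptable quality score, 0.0 to 1.0


# Hypothetical per-task expectations; real values would come from your SLOs.
EXPECTATIONS = {
    "autocomplete": TaskExpectation(max_latency_ms=300, min_quality=0.6),
    "report_summary": TaskExpectation(max_latency_ms=8000, min_quality=0.8),
}


def check_response(task_type: str, status_code: int,
                   latency_ms: float, quality: float) -> str:
    """Classify a response as 'failed', 'degraded', or 'healthy' for its task type."""
    if status_code >= 500:
        return "failed"
    expected = EXPECTATIONS[task_type]
    if latency_ms > expected.max_latency_ms or quality < expected.min_quality:
        # It responded, but not within expectations for this task.
        return "degraded"
    return "healthy"
```

The same two-second response with a 200 status is healthy for `report_summary` and degraded for `autocomplete`. A traditional health check would report both as success.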
Impact on scalability
You can’t scale what you can’t see, and you definitely can’t cost-optimize what you can’t attribute. AI capacity planning must account for:
- Token throughput
- Model concurrency and the cost of model switching
- GPU saturation
- Agent retry multiplication
- Cost per call
- Execution stretch across chained steps
Without that level of visibility, you oscillate between over-provisioning (financial regret) and under-provisioning (support calls, angry PMs, escalation war rooms). Neither proves maturity; both prove observability debt.
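Attribution is the piece that makes the list above usable. The following Python sketch rolls individual calls up to the workflow that caused them, including how much agent retries multiplied the call count. The record fields, model names, and prices are all hypothetical placeholders, not real vendor pricing or a real telemetry schema.

```python
from collections import defaultdict

# Hypothetical prices in dollars per 1,000 tokens; not real vendor pricing.
PRICE_PER_1K_TOKENS = {"small-model": 0.5, "large-model": 4.0}


def attribute_cost(calls):
    """Aggregate tokens, cost, and retry multiplication per workflow.

    Each call is a dict with the keys: workflow_id, model, tokens, is_retry.
    """
    summary = defaultdict(lambda: {"calls": 0, "retries": 0, "tokens": 0, "cost": 0.0})
    for call in calls:
        row = summary[call["workflow_id"]]
        row["calls"] += 1
        row["retries"] += 1 if call["is_retry"] else 0
        row["tokens"] += call["tokens"]
        row["cost"] += call["tokens"] / 1000 * PRICE_PER_1K_TOKENS[call["model"]]
    for row in summary.values():
        first_attempts = row["calls"] - row["retries"]
        # Retry multiplier: total calls per intended call (1.0 means no retries).
        row["retry_multiplier"] = row["calls"] / max(first_attempts, 1)
    return dict(summary)


# Illustrative input: one workflow, three calls, one of them a retry.
sample_calls = [
    {"workflow_id": "wf-1", "model": "large-model", "tokens": 1000, "is_retry": False},
    {"workflow_id": "wf-1", "model": "large-model", "tokens": 1000, "is_retry": True},
    {"workflow_id": "wf-1", "model": "small-model", "tokens": 2000, "is_retry": False},
]
report = attribute_cost(sample_calls)
print(report["wf-1"])
```

With this made-up data, one retry on the expensive model accounts for nearly half of the workflow's cost. That kind of detail disappears when cost is tracked only per endpoint.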
Best practices
Complete observability for AI isn’t about collecting more logs; it’s about capturing faster, richer, behavior-aligned telemetry that understands what the AI was trying to do, what it actually did, and what it consumed to get there. That requires instrumentation that can explain not only what happened, but why, under what assumptions, and at what cost.
This means tracing execution lineage for agent-assembled workflows, elevating cost and latency to first-class operational signals, correlating model behavior to intent rather than endpoints, and incorporating agent-declared expectations into observability metadata. Metadata evolves from “request info” to “runtime contract” with latency budgets, retry preferences, data-sensitivity labels, agent priority tiers, and cost ceilings included at the call level.
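One way to picture metadata as a "runtime contract" is a small structure attached to each call and checked against what actually happened. This Python sketch is an assumption-heavy illustration: the `RuntimeContract` class, its field names, the priority convention, and the `x-ai-*` header names are all hypothetical, and no standard defines them.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RuntimeContract:
    latency_budget_ms: int
    max_retries: int
    data_sensitivity: str   # e.g. "public", "internal", "restricted"
    priority_tier: int      # lower number = higher priority (assumed convention)
    cost_ceiling_usd: float

    def to_headers(self) -> dict:
        # Hypothetical header names; no standard defines these.
        return {f"x-ai-{k.replace('_', '-')}": str(v) for k, v in asdict(self).items()}

    def violations(self, latency_ms: float, retries: int, cost_usd: float) -> list:
        """Return the contract terms that the observed call exceeded."""
        found = []
        if latency_ms > self.latency_budget_ms:
            found.append("latency_budget_ms")
        if retries > self.max_retries:
            found.append("max_retries")
        if cost_usd > self.cost_ceiling_usd:
            found.append("cost_ceiling_usd")
        return found


contract = RuntimeContract(
    latency_budget_ms=2000,
    max_retries=2,
    data_sensitivity="internal",
    priority_tier=1,
    cost_ceiling_usd=0.05,
)
print(contract.to_headers())
print(contract.violations(latency_ms=2500, retries=1, cost_usd=0.10))
```

Because the contract travels with the call, the telemetry can report which declared expectation was missed, not only that something was slow or expensive.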
As this evolves, we will see the rise of a high-speed telemetry plane: a purpose-built, sub-millisecond data stream capable of capturing per-token, per-step, and per-agent signals without melting storage or threatening SLOs. This telemetry plane will likely sit close to the inference layer, support semantic compression, and provide continuous, low-latency insight into cost, capacity, health, and intent alignment. Without this, AI observability will either overwhelm existing pipelines or arrive too late to be useful.
Systems will need to ingest, correlate, and reason over this telemetry in near real time, not batch mode, and use it to drive adaptive routing, agent throttling, selection of lower-cost inference paths, and early detection of runaway workflows before they become outages or invoices.
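Early detection of runaway workflows can start very simply: count steps and spend per workflow as telemetry arrives, and signal a throttle once either exceeds a limit. The sketch below is a minimal in-memory Python version. The `RunawayDetector` class and its limits are hypothetical, and a real system would also need expiry of old workflows and shared state across instances.

```python
from collections import defaultdict


class RunawayDetector:
    """Flags workflows whose step count or spend exceeds a limit.

    The default limits are hypothetical, chosen only for illustration.
    """

    def __init__(self, max_steps: int = 20, max_cost_usd: float = 1.0):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = defaultdict(int)
        self.cost = defaultdict(float)

    def record_step(self, workflow_id: str, cost_usd: float) -> str:
        """Record one agent step and return 'allow' or 'throttle'."""
        self.steps[workflow_id] += 1
        self.cost[workflow_id] += cost_usd
        over_steps = self.steps[workflow_id] > self.max_steps
        over_cost = self.cost[workflow_id] > self.max_cost_usd
        return "throttle" if over_steps or over_cost else "allow"


# Illustrative run: a workflow that keeps looping past its step limit.
detector = RunawayDetector(max_steps=3, max_cost_usd=10.0)
decisions = [detector.record_step("wf-loop", cost_usd=0.1) for _ in range(4)]
print(decisions)
```

The decision is made per step as each event arrives, which is the point of near-real-time processing: a batch job would report the same loop only after it had finished spending.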
Delivery is different when AI is in the room
AI doesn’t break cleanly, predictably, or loudly, and incomplete observability ensures you won’t notice until the business does.
If intelligence is the new application tier, then observability must shift from “What happened?” to: “What was supposed to happen, why did we take this path, and was it worth the latency, risk, and cost?”
When you can answer that in real time, you’re ready for AI.
Read more about the Top 10 Application Delivery challenges faced by organizations across the globe.
About the Author

Lori MacVittie is a Distinguished Engineer and Chief Evangelist in F5’s Office of the CTO with deep expertise in application delivery, automation strategy, and infrastructure. She is known for turning complexity into clarity whether she’s defining guardrails for AI agents, dissecting brittle multicloud architectures, or probing the limits of scalable systems. She brings more than thirty years of industry experience across application development, IT architecture, and network and systems operations. Before joining F5, she served as an award-winning technology editor. MacVittie holds an M.S. in Computer Science and is a prolific author whose publications span security, cloud, and enterprise architecture. She is also an avid tabletop and video gamer with unapologetically strong opinions about cheese.