BLOG | OFFICE OF THE CTO

Managing context windows in AI isn’t magic. It’s architecture.

Lori MacVittie
Published August 27, 2025

So let’s clear that up, because it turns out that understanding where context (session) lives in AI architectures is pretty dang important. And you know that because I used the word “dang” in a blog. Yeah, I really mean it.

If you’re building your own LLM-powered app and using the OpenAI API to do it, you’re not using ChatGPT. You’re talking to a model. A raw one. And if you think your app is going to behave like ChatGPT just because it’s calling GPT-4, you’re already off the rails.

ChatGPT isn’t a model; it’s a runtime.

People get this wrong all the time, and OpenAI's branding hasn't helped. The “ChatGPT” experience—the memory, the tone, the context, the way it gracefully tracks what you said six questions ago and doesn’t suddenly hallucinate a PhD in marine botany—isn’t magic. It’s architecture. And that architecture doesn’t come for free when you hit the API.

The ChatGPT app is a layered, managed experience. It stitches together:

  • Prompt engineering
  • Context management
  • Persistent memory
  • Guardrails and fallbacks
  • Summarization, truncation, and control logic

You don’t get any of that by default with the API. None.

When you call GPT-4 (or any other model, for that matter) directly, you’re feeding it a raw sequence of messages in a stateless format and hoping your context window doesn’t overflow or fragment along the way. It’s a blank model, not a digital butler.
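
Here’s what that looks like in practice, using the official openai Python package (a minimal sketch; the model name and messages are placeholders, not a recommendation). Note what’s missing: there is no session object anywhere. The only “memory” is whatever you cram into the messages list on every single call.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# There is no session here. The model only "knows" what's in this list,
# so every turn you want remembered has to be re-sent on every call.
messages = [
    {"role": "system", "content": "You are a terse support assistant."},
    {"role": "user", "content": "My name is Dana and my order is late."},
    {"role": "assistant", "content": "Sorry about that, Dana. What's the order number?"},
    {"role": "user", "content": "It's 8675309. What's my name, by the way?"},
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model you've provisioned
    messages=messages,
)

print(response.choices[0].message.content)
# Drop the earlier turns from `messages` and the model has no idea who Dana is.
```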

The context window is on you.

In ChatGPT, the runtime manages the full context window. You type, it remembers (until it doesn’t), and you can rely on the app to mostly keep track of what’s relevant. There's a memory system layered on top, and OpenAI quietly curates which pieces get passed back in.

If you’re using the API? You are the runtime. That means, as sketched below:

  • You build the message stack.
  • You decide what to include.
  • You manage the token budget.
  • You truncate, summarize, or lose coherence.
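
In code, the job looks roughly like this. It’s a sketch, not a library: the four-characters-per-token estimate and the budget number are stand-ins, and in production you’d count tokens with a real tokenizer (tiktoken, for instance) and summarize instead of silently dropping turns.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in: ~4 characters per token. Use a real tokenizer
    # (e.g. tiktoken) if you care about accuracy. You care about accuracy.
    return max(1, len(text) // 4)

def build_message_stack(system_prompt, history, user_input, budget=6000):
    """Assemble the messages for one turn without blowing the token budget.

    Keeps the system prompt and the newest turns; silently drops the oldest.
    This is the simplest possible policy -- summarization would be smarter.
    """
    stack = [{"role": "system", "content": system_prompt}]
    spent = estimate_tokens(system_prompt) + estimate_tokens(user_input)

    kept = []
    for msg in reversed(history):          # walk newest turn first
        cost = estimate_tokens(msg["content"])
        if spent + cost > budget:
            break                          # oldest turns fall off the end
        kept.append(msg)
        spent += cost

    stack.extend(reversed(kept))           # restore chronological order
    stack.append({"role": "user", "content": user_input})
    return stack
```

Crude, but it makes the trade-off visible: the line where the loop breaks is the line where your model starts “forgetting.”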

And if you get it wrong (and you will; we all do early on), it’s on you when the model forgets the user’s name halfway through the session or starts inventing data you never fed it.

Memory isn’t a feature. It’s infrastructure.

ChatGPT has memory. You don’t. Not unless you build it. When you use the API, it’s stateless. You manage all context, memory, and flow. When you use the ChatGPT app, OpenAI does it all for you. It’s baked in.

The “memory” in ChatGPT isn’t just a note stuck to a prompt. It’s a system that stores user facts, preferences, goals, and constraints across sessions, and injects them contextually into the prompt at the right time. If you want something like that in your app, you’ll need:

  • A data store
  • Schema for memory entries
  • Control logic to determine when to include what
  • A way to keep it from bloating your token count

In other words: infrastructure.
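
To make that concrete, here’s a deliberately tiny sketch of the shape: a schema for memory entries, a store, and control logic that decides what to inject and caps how much of the token budget memory is allowed to eat. The names and the “newest first” relevance heuristic are invented for illustration; a real system would sit on a proper data store and do semantic retrieval.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryEntry:
    # Schema for one remembered item: who it's about, what kind it is,
    # the content itself, and when we learned it.
    user_id: str
    kind: str          # "preference" | "fact" | "goal" | "constraint"
    content: str
    created_at: float = field(default_factory=time.time)

class MemoryStore:
    """In-memory stand-in for a real data store (Postgres, Redis, a vector DB...)."""

    def __init__(self):
        self._entries: list[MemoryEntry] = []

    def remember(self, entry: MemoryEntry) -> None:
        self._entries.append(entry)

    def select(self, user_id: str, token_budget: int = 400) -> str:
        """Control logic: pick entries worth injecting, newest first,
        and stop before memory bloats the prompt."""
        relevant = [e for e in self._entries if e.user_id == user_id]
        relevant.sort(key=lambda e: e.created_at, reverse=True)

        lines, spent = [], 0
        for e in relevant:
            cost = max(1, len(e.content) // 4)   # crude token estimate again
            if spent + cost > token_budget:
                break
            lines.append(f"- ({e.kind}) {e.content}")
            spent += cost
        return ("Known about this user:\n" + "\n".join(lines)) if lines else ""
```

Whatever select() returns gets prepended to the system prompt each turn, which is roughly the job ChatGPT’s memory system is quietly doing for its own users.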

This is why most homegrown AI apps feel brittle: they’re talking to an LLM like it’s a chatbot, not like it’s a tool in a system. The result? Lost threads, repetitive prompts, weird regressions, and user frustration.

So what?

If you’re building on the API, you’re building infrastructure whether you meant to or not. And if your users expect a ChatGPT-like experience, you’re going to have to deliver:

  • Persistent memory
  • Smart context compression (sketched after this list)
  • Turn-based state management
  • Guardrails, fallbacks, and injection logic
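
Smart context compression deserves its own sketch, because it’s the piece teams skip until the context window overflows. The idea: when the history gets long, collapse the oldest turns into one short summary message instead of dropping them. (Hedged as before: the model name, thresholds, and summarization prompt are placeholders.)

```python
from openai import OpenAI

client = OpenAI()

def compress_history(history: list[dict], keep_recent: int = 6) -> list[dict]:
    """Collapse everything older than the last `keep_recent` turns into
    one summary message, so long sessions stay inside the context window."""
    if len(history) <= keep_recent:
        return history

    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)

    summary = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; any cheap, fast model will do
        messages=[
            {"role": "system",
             "content": "Summarize this conversation in under 150 words. "
                        "Keep names, decisions, and open questions."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    return [{"role": "system",
             "content": f"Summary of the conversation so far: {summary}"}] + recent
```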

But that’s just the beginning. Because if you're building this on top of an enterprise-grade app delivery stack—and you are, right? RIGHT?—then the rest of the platform better show up, too.

This is where application delivery and security teams come into play.

App delivery: You’re not just serving HTML or JSON anymore

AI-native apps are stateful, chat-driven, and often real-time. Which means:

  • Session affinity suddenly matters again. You can't shard conversation state across stateless backends unless you're managing session tokens like it's 2009.
  • Latency is UX death. You're streaming multi-turn interactions, not serving static pages. Your load balancers and edge logic better know how to prioritize low-latency conversational flows.
  • Token cost = bandwidth + compute + cash. Efficient delivery matters now. Payload size isn’t just a network concern; it’s a billing event.

This isn’t a brochure site. This is a living interface. You can’t slap a CDN in front of it and call it a day.

Security: If you don’t own the prompt, you own the risk

LLMs are programmable via prompt. That means the prompt is now the attack surface.

  • Injection attacks are real, and they're already happening. If your input sanitization is lax, you’re letting users reprogram your AI on the fly (see the sketch after this list).
  • Prompt manipulation can leak sensitive memory, derail workflows, or spoof intent.
  • Model output can be exfiltrated, manipulated, or poisoned if you don’t wrap it in policy enforcement and observability.
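
The bare-minimum shape of that input guard looks something like this: keep user text strictly in the user role and screen it before it ever reaches the model. The patterns below are illustrative only; real prompt-injection defense takes much more than a regex list.

```python
import re

# Illustrative patterns only. Attackers will not be this polite;
# treat this as a tripwire, not a defense.
SUSPICIOUS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"reveal (your )?(system prompt|instructions)",
    r"disregard (the )?rules",
]

def screen_input(user_text: str) -> tuple[bool, str]:
    """Return (allowed, reason). Flag obvious injection attempts before
    they're appended to the message stack."""
    lowered = user_text.lower()
    for pattern in SUSPICIOUS:
        if re.search(pattern, lowered):
            return False, f"blocked: matched '{pattern}'"
    return True, "ok"

def add_user_turn(messages: list[dict], user_text: str) -> list[dict]:
    allowed, reason = screen_input(user_text)
    if not allowed:
        # Log it, rate-limit the user, return a refusal -- your call.
        raise ValueError(reason)
    # Never concatenate user text into the system prompt; keep roles separate
    # so instructions and data stay distinguishable.
    return messages + [{"role": "user", "content": user_text}]
```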

Security controls must evolve:

  • You need WAFs that understand AI traffic. Blocking bad JSON isn’t enough. You need guards on instruction sets, not just payload structure.
  • You need audit trails for LLM decisions, especially in regulated environments.
  • You need rate limiting and abuse protection, not just for API calls, but for prompt complexity and cost spikes (a rough sketch follows).
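
Most gateways only count requests, so “rate limiting for cost spikes” needs its own logic. The sketch below throttles by estimated token spend per user per hour; the numbers are invented.

```python
import time
from collections import defaultdict

class TokenBudgetLimiter:
    """Throttle by estimated token spend per user per hour, not by request count.
    A 50-token prompt and a 50,000-token prompt are not the same event."""

    def __init__(self, tokens_per_hour: int = 100_000):
        self.limit = tokens_per_hour
        self.spent = defaultdict(int)      # user_id -> tokens this window
        self.window_start = time.time()

    def allow(self, user_id: str, prompt_tokens: int) -> bool:
        now = time.time()
        if now - self.window_start > 3600:   # roll the window hourly
            self.spent.clear()
            self.window_start = now
        if self.spent[user_id] + prompt_tokens > self.limit:
            return False                      # cost spike: reject or queue
        self.spent[user_id] += prompt_tokens
        return True
```

Call allow() before you build the message stack and send the request; rejecting an oversized prompt after the model call defeats the purpose.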

And that’s not even touching the data governance nightmare that comes with injecting user records into context windows.

Build the right stack now before agentic AI runs amok.

Basically, ChatGPT is a polished product. The OpenAI API is a raw tool. Confuse the two, and your stack’s going to pop.

Build wisely. Build an enterprise-grade app delivery and security stack to support your burgeoning AI portfolio. Trust me, it’s going to be even more important when you start building AI agents and then eagerly embrace agentic architecture. Because if you think context drift and hallucinations are problems for end-user-focused AI applications, just wait until you see what agentic AI can do to your business and operational AI applications.