Okay, kids, full stop right now. I’m hearing terms like “tokens” and “models” thrown around like Gen Z slang on a TikTok video, and most of the time it’s completely wrong.
I said what I said.
It’s (past) time to lay out how inference-based applications are built, where tokens are created and consumed, and how the various pieces of the puzzle fit together. So grab some coffee and let’s dive right in.
I shouldn’t have to start here but I will. When we say “inference” we’re really talking about a trained large language model (LLM) applying what it learned from its voluminous training data to generate a response to your input. You don’t call an LLM directly. There is really no such thing as an “LLM API,” and I will give you “the mom glare” if you say it.
The APIs used to invoke inference go through, wait for it, an inference server.
Inference servers are runtimes for AI models, just like app servers are runtimes for Java and a host of other development languages. You generally don’t run a model locally without an inference server. Popular choices include Ollama, vLLM, NVIDIA Triton, and Hugging Face Text Generation Inference (TGI).
Now, like app servers—on which you’d typically deploy an application package—you deploy an “AI package” (a model) on an inference server. The server exists to serve AI model capabilities via an API with a minimal set of API endpoints (e.g., /generate, /completions, or /embeddings in REST or gRPC) focused on model inference tasks like text generation, tokenization, or embeddings.
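Here’s roughly what that looks like from the calling side. This is a minimal sketch, assuming an OpenAI-compatible /v1/completions endpoint (vLLM and several other inference servers expose one); the host, port, and model name are placeholders, not gospel.

```python
# A minimal sketch of calling an inference server directly, assuming an
# OpenAI-compatible /v1/completions endpoint. The server address and model
# name are placeholders for whatever you actually deployed.
import requests

INFERENCE_SERVER = "http://localhost:8000"  # wherever your inference server listens

response = requests.post(
    f"{INFERENCE_SERVER}/v1/completions",
    json={
        "model": "my-deployed-model",  # the model you deployed on the server
        "prompt": "Explain inference servers in one sentence.",
        "max_tokens": 64,
    },
    timeout=30,
)
response.raise_for_status()

body = response.json()
print(body["choices"][0]["text"])  # the generated text comes back as plain JSON
```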
To interact with an inference server, you build an app. Yeah, a traditional application (which might be thoroughly modern because, of course, it might), and it will communicate with the inference server.
Users get an app, too, either in a browser or on a phone, and it generally uses a plain old API to talk to that app you built. And voilà! You have a complete message flow from client to app to inference server and back.
Yes, it’s really just a new take on the three-tier web application with the data tier being replaced by the inferencing tier. I know, terribly disappointing, isn’t it?
The simplest form of an AI app is a three-tier web application in which the data tier is replaced by an inferencing server.
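To make that concrete, here’s a minimal sketch of the middle tier: a plain web app that takes a question from the client tier and relays it to the inference tier. Flask, the /ask route, and the server address are illustrative choices on my part, not anything your stack requires.

```python
# A minimal sketch of the middle tier in the three-tier picture: a plain web
# app that accepts a client request and relays it to the inference server.
# Flask, the route name, and the addresses are illustrative assumptions.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
INFERENCE_SERVER = "http://localhost:8000"  # placeholder address for the inference server

@app.post("/ask")
def ask():
    user_question = request.get_json()["question"]

    # The app tier talks to the inference server, not to "the LLM" directly.
    upstream = requests.post(
        f"{INFERENCE_SERVER}/v1/completions",
        json={"model": "my-deployed-model", "prompt": user_question, "max_tokens": 128},
        timeout=60,
    )
    upstream.raise_for_status()

    answer = upstream.json()["choices"][0]["text"]
    return jsonify({"answer": answer})

if __name__ == "__main__":
    app.run(port=5000)
```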
That is inference. Just a new kind of application workload that rewards the same operational discipline we have always valued, now expressed in tokens, context, and streams instead of sessions, cookies, and queries.
This is usually where folks get tripped up and use words in ways that violate the Prime Directive of terminology. Tokens are the units of text an LLM processes, usually chunks of words rather than whole words. They are officially generated by a tokenizer, on the inference server.
You can include a tokenizer in your app to count tokens ahead of time for cost prediction and local rate limits, but the definitive token count happens on the inference stack. Tokens are a model representation, not network traffic; the wire carries JSON text. You’ll only see token IDs if the API explicitly returns them.
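Here’s a minimal sketch of that ahead-of-time counting, using the tiktoken library and assuming the target model uses the cl100k_base encoding; swap in your model’s own tokenizer otherwise, and note the price constant is a made-up placeholder, not a rate card. The definitive count still comes from the inference stack.

```python
# A minimal sketch of counting tokens in your app before you ever hit the
# inference server, for cost estimates and local rate limits. Assumes the
# target model uses the cl100k_base encoding; the price is a placeholder.
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # placeholder pricing, not a real rate card

def estimate_prompt_cost(prompt: str) -> tuple[int, float]:
    encoding = tiktoken.get_encoding("cl100k_base")
    token_count = len(encoding.encode(prompt))
    estimated_cost = (token_count / 1000) * PRICE_PER_1K_INPUT_TOKENS
    return token_count, estimated_cost

tokens, cost = estimate_prompt_cost("Summarize this quarter's incident reports.")
print(f"{tokens} tokens, roughly ${cost:.6f} of input spend")
```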
By default, infrastructure does not see tokens. It sees JSON. Tokens aren’t on the wire. If you want infra to act on tokens, you must count them at the gateway with the same tokenizer or consume counts the model server emits. Tokens are the currency of AI, but they mostly live inside the inference stack unless you surface them.
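For example, here’s a minimal sketch of consuming the counts a server emits, assuming an OpenAI-compatible response that carries a “usage” block; the print is a stand-in for whatever metering or logging you actually run.

```python
# A minimal sketch of consuming the token counts a model server emits, assuming
# an OpenAI-compatible response body with a "usage" block. Only the JSON travels
# the wire; the token accounting is metadata the server chose to surface.
def record_usage(response_body: dict) -> None:
    usage = response_body.get("usage", {})
    print(
        f"prompt_tokens={usage.get('prompt_tokens')} "
        f"completion_tokens={usage.get('completion_tokens')} "
        f"total_tokens={usage.get('total_tokens')}"
    )

# Example: the kind of JSON body your app or gateway actually sees on the wire.
record_usage({
    "choices": [{"text": "Hi!"}],
    "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12},
})
```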
Inference is not mystical. It is an application workload with new knobs. Traffic is JSON. Tokens are model units. Embeddings are vectors. Mix those up and you will design the wrong controls, price the wrong thing, and debug the wrong layer. For example, routing based on JSON size instead of token count could overload a model with long-context requests.
If you want infrastructure to act on tokens, teach it how. Put a tokenizer at the gateway or consume the counts the model server emits. Otherwise, your routers see text, not tokens, and they will make text-level decisions while your bills are racked up at the token level. Note that tokenizers must match the model’s (e.g., GPT-4’s tokenizer differs from LLaMA’s) to ensure accurate counts. Mismatched tokenizers could lead to incorrect cost or limit calculations.
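A minimal sketch of that gateway-side counting, under some loud assumptions: the cl100k_base encoding stands in for “the same tokenizer as the model,” the budget numbers are made up, and the in-memory tally is a toy where a real gateway would persist state.

```python
# A minimal sketch of "teaching" a gateway about tokens: count the request with
# the same tokenizer the model uses and enforce a per-client token budget.
# Encoding choice, quota, and client IDs are assumptions for illustration;
# a mismatched tokenizer here gives you wrong counts.
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")
TOKEN_BUDGET_PER_CLIENT = 10_000          # placeholder quota per billing window
spent: dict[str, int] = {}                # in-memory tally; a real gateway persists this

def admit(client_id: str, prompt: str) -> bool:
    """Return True if the request fits the client's remaining token budget."""
    cost = len(ENCODING.encode(prompt))
    if spent.get(client_id, 0) + cost > TOKEN_BUDGET_PER_CLIENT:
        return False                      # reject before it ever reaches the model
    spent[client_id] = spent.get(client_id, 0) + cost
    return True

print(admit("team-a", "A short prompt"))  # True until team-a burns its budget
```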
Treat availability as usable and correct, not just up. Track tokens per second, time to first token, and context failures (such as exceeding context window limits or incoherent outputs from saturated context) the same way you used to track queries per second and cache hits. Keep logs like you mean it. Inference traffic is not training data unless you say so. If you keep prompts and outputs, protect them.
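Here’s a minimal client-side sketch of measuring time to first token and throughput, assuming an OpenAI-style streaming endpoint that sends “data: {json}” lines; counting chunks is only a proxy for tokens per second, so lean on the usage block or server-side metrics when you need exact numbers.

```python
# A minimal sketch of measuring time to first token and a rough throughput
# figure from the client side, assuming an OpenAI-style streaming endpoint
# that emits "data: {json}" lines. Chunk count approximates token count.
import json
import time
import requests

INFERENCE_SERVER = "http://localhost:8000"  # placeholder

start = time.monotonic()
first_token_at = None
chunks = 0

with requests.post(
    f"{INFERENCE_SERVER}/v1/completions",
    json={"model": "my-deployed-model", "prompt": "Tell me a short story.",
          "max_tokens": 128, "stream": True},
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks += 1
        _ = json.loads(payload)  # each chunk carries a small slice of the output

elapsed = time.monotonic() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.3f}s")
    print(f"throughput: ~{chunks / elapsed:.1f} chunks/s (a rough proxy for tokens/s)")
```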
Name things correctly. Measure the right layer. Put limits where they matter, like token-based quotas at the gateway to prevent cost overruns. Do that and inference behaves like the rest of your stack: predictable, affordable, and boring in all the best ways. Tokens buy decisions inside the model, not bandwidth on your network. Call the thing by its name and you will keep your users happy and your invoices short.