Once upon a time, back when applications were simple, networks were slow, and everyone swore that SOAP was the future, compression meant survival. You compressed because bandwidth was expensive, storage was limited, and users valued rapid response. Smaller payloads meant faster delivery, happier customers, and fewer surprise invoices from your cloud provider.
Back then, compression was noble. It made everything better.
But the AI era has a remarkable talent for taking our comfortable assumptions, holding them up to the light, and revealing that the world has rotated while we weren’t looking. And nowhere is that more evident than in how we think about compression today.
Traditionally, compression was about performance. Then it was about bandwidth. Today?
Compression is about not bankrupting yourself on inference.
Let’s start with the obvious: yes, bandwidth still costs money. Cloud egress is infamous, and data transfer bills can still produce heart palpitations. But be honest and compare the cost of moving a megabyte across the wire with the cost of generating 10,000 tokens on a top-shelf large language model (LLM). One is a forgotten rounding error on the monthly bill. The other is a sternly worded message from finance asking why you’ve suddenly consumed the budget for Q3.
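To put rough numbers on that comparison (both prices below are assumptions for illustration, not anyone’s actual rate card):

```python
# Back-of-the-envelope comparison: moving a megabyte vs. generating tokens.
# Both prices are illustrative assumptions, not quotes from any provider.

EGRESS_PER_GB = 0.09        # assumed cloud egress rate, USD per GB
PRICE_PER_M_TOKENS = 10.00  # assumed output price for a premium LLM, USD per 1M tokens

transfer_cost = (1 / 1024) * EGRESS_PER_GB                  # 1 MB across the wire
inference_cost = (10_000 / 1_000_000) * PRICE_PER_M_TOKENS  # 10,000 generated tokens

print(f"1 MB of egress:    ${transfer_cost:.5f}")  # ~$0.00009
print(f"10k output tokens: ${inference_cost:.2f}")  # $0.10
print(f"inference costs ~{inference_cost / transfer_cost:,.0f}x more")
```

Three orders of magnitude, give or take. That’s the gap finance is noticing.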
In the AI world, we don’t compress just to make things smaller. Every token generated is an act of cognition, and cognition, for machines, is expensive. So we compress to make things cheaper to “think about.”
The new economics of compression
LLMs have redefined bottlenecks in ways that feel almost disrespectful to the past three decades of systems engineering. It used to be that you optimized network paths, minimized payloads, and pre-compressed assets so your app wouldn’t take six days to load on a 3G connection.
Now the slowest, most expensive component in the system isn’t the network at all.
It’s the brain.
Every token an LLM emits demands GPU cycles, VRAM, energy, latency, and money. Lots of money, depending on which model you’ve fallen in love with this quarter. The cost of generating text now dwarfs the cost of transporting it, which means we’ve inverted the compression value chain:
We compress not to shrink the data, but to reduce the number of thoughts an AI has to think.
And honestly? That’s a very funny sentence, but it’s also the operational truth.
Where compression lives now
In the olden times, when we traversed networks uphill, both ways, compression lived at the edge of the network in specialized devices. Eventually, it consolidated on application delivery controllers and took on names like “minification” and “HTTP compression.” For a time, it was specialized functionality. Today? It’s just part and parcel of application delivery.
But, thanks to AI, we’re seeing the emergence of new compression techniques. Probably because we have to. We’re no longer just compressing text using well-known algorithms. We’re striking out words like a Chicago- or AP-style editor with a pen full of red ink and something to prove.
Prompt compression
This is the new heavyweight champion. You shrink the prompt to shrink the invoice. Irrelevant details? Gone. Redundant context? Deleted. Overly chatty instructions? Trimmed like an overgrown hedge. The shorter the prompt, the fewer tokens consumed, and the happier your procurement department.
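What does that look like in practice? Here’s a minimal sketch, not any vendor’s prompt compressor, just a toy heuristic that deduplicates repeated context and strikes the chatty filler, with tiktoken doing the before-and-after token count (the encoding choice is an assumption; match it to your model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; use the one your model expects

# Toy filler list; a real compressor (or a real editor) is far more aggressive.
FILLER = ("Please note that ", "It is important to remember that ",
          "As previously mentioned, ")

def compress_prompt(context_chunks: list[str], instructions: str) -> str:
    seen, kept = set(), []
    for chunk in context_chunks:
        key = chunk.strip().lower()
        if key and key not in seen:  # redundant context? deleted.
            seen.add(key)
            kept.append(chunk.strip())
    text = "\n".join(kept) + "\n" + instructions
    for phrase in FILLER:            # overly chatty instructions? trimmed.
        text = text.replace(phrase, "")
    return text

chunks = ["Order #42 shipped.", "Order #42 shipped.",
          "Please note that the customer is angry."]
before = "\n".join(chunks) + "\nDraft a one-line apology."
after = compress_prompt(chunks, "Draft a one-line apology.")
print(len(enc.encode(before)), "->", len(enc.encode(after)), "tokens")
```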
Output compression
“Be concise” has quietly graduated from a writing preference to a cost-control strategy. Short answer = cheap answer. Long answer = someone’s paying for that verbosity.
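In code, the strategy is mostly a blunt instruction plus a hard cap. A sketch using the OpenAI Python SDK (the model name is illustrative, and some newer models spell the cap max_completion_tokens):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        # "Be concise" as policy, not preference: short answer = cheap answer.
        {"role": "system", "content": "Answer in three sentences or fewer."},
        {"role": "user", "content": "Why did the deployment fail?"},
    ],
    max_tokens=150,  # hard ceiling on the verbosity someone would otherwise pay for
)

print(response.choices[0].message.content)
```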
Embedding compression
You’re not reducing bytes here; you’re reducing dimensionality, which reduces memory footprint, which reduces retrieval cost, which reduces everything your vector store is quietly billing you for every minute.
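A sketch of the idea with scikit-learn and NumPy: project vectors from 1,536 dimensions down to 256 with PCA, then scalar-quantize float32 to int8. The dimensions, corpus size, and quantization scheme are all assumptions; production systems typically use calibrated or product quantization instead:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((10_000, 1536)).astype(np.float32)  # stand-in corpus

# Step 1: reduce dimensionality. 1536 -> 256 cuts storage and distance math ~6x.
reduced = PCA(n_components=256).fit_transform(embeddings).astype(np.float32)

# Step 2: crude scalar quantization, float32 -> int8, for another 4x.
scale = np.abs(reduced).max() / 127.0
quantized = np.clip(np.round(reduced / scale), -127, 127).astype(np.int8)

print(f"{embeddings.nbytes / 1e6:.1f} MB -> {quantized.nbytes / 1e6:.1f} MB")
# ~61.4 MB -> ~2.6 MB, before your vector store adds its own overhead
```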
Model compression
Pruning, quantization, distillation. In another era, these were academic curiosities. Today they serve one purpose: run it cheaper.
If it also runs faster? Wonderful. If it fits on a smaller GPU? Miraculous. But the point is, and always has been, to lower the compute burn.
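Of the three, quantization is the easiest to demonstrate in a few lines. A sketch with PyTorch’s dynamic quantization, which stores Linear weights as int8 and quantizes activations on the fly; the toy model here is a placeholder, not a recipe for compressing an actual LLM:

```python
import io

import torch
import torch.nn as nn

# Placeholder model standing in for something much larger.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Dynamic quantization: int8 weights, activations quantized at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {serialized_mb(model):.0f} MB -> int8: {serialized_mb(quantized):.0f} MB")
# Roughly a 4x shrink: same architecture, a quarter of the bytes, cheaper to serve.
```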
Compression as control
When your system’s most expensive operation is thinking, you start treating thoughts like a limited resource. This is the inverse of every performance model we’ve been taught. Network is cheap. Storage is cheap. CPU is cheap. Memory is cheap enough that we barely pretend to manage it anymore.
But GPU inference? That’s the new oil. And yes, I hate that cliché, but if the shoe fits, wear it.
And like oil, we now have a global economy dedicated to extracting every last drop efficiently. Compression is no longer a nicety; it’s a pillar of operational AI.
It’s how you stay inside budget, scale responsibly, avoid accidental million-dollar token overruns, and keep agents from rewriting War and Peace because you forgot to set max_tokens.
We compress now not because our networks can’t handle the load, but because our AIs can’t handle the invoice. The future isn’t about making data smaller; it’s about making thinking cheaper.
Compression no longer serves the network. It serves the ledger. And if that gives rise to operational accounting, then don’t be surprised. I won’t.