Can Engineers Build Networks Too Complicated for Humans to Operate? Part II: Making Sense of Network Activities and System Behaviors

In part I of this series, I explored some of the issues surrounding the fact that we have managed to build networks so large and complex that it is essentially impossible to grasp any significant fraction of network activities without asking for help from… the network itself. In this installment, I delve into some actual techniques for acquiring, analyzing, and managing the Hydra’s head of firehoses aimed at your head if you are the one in charge of making sense of network activities and system behaviors.

But first, I’d like to mention another reason that solving this problem is critical, and one of the most challenging puzzles of my professional life. Attackers or red-team folks have it easy. For them to plant a flag, declare victory, and do their silly-looking celebration dance, they need to succeed once. They will fail several thousand times for each success, but the failures are Internet noise. Their failures are how I check to see if an Internet device is functioning properly. If I don’t see an attacker failing to take over the connected system in 5 to 10 seconds, I can assume the device either isn’t configured or isn’t on the Internet. Defenders, on the other hand, have to succeed 100% of the time. Any failure on the part of a defender is notable and potentially disastrous. If we think about that in relation to the “rock-in-a-rock-stack-on-planet-rock” problem we have with seeing anything on a busy network, we start to see why and how figuring out a way to capture, parse, interpret, and generally process network activities at scale is such a challenging problem.

In part I, I mentioned that in some circumstances the only absolute truth is the detailed packet contents: 100% of the metadata surrounding an event can point one way while examination of the packet contents shows that the truth is otherwise. My previous example was something that looked bad but wasn’t, so I’ll provide a counterexample (and the more common type) of something that looks fine but indicates that your system is compromised.

Malware beaconing (essentially the method and act of some malware reaching out to its operator for new commands) comes in many forms, and some are detectable as highly periodic, aimed at a specifically known bad reputation site, or detectable in some other way. Clever malware authors use command and control (C&C) channels that look more like normal data, for example, DNS lookups or posting and reading comments in Britney Spears’ Instagram feed.1 The only way you know the difference between a “legitimate” access of Britney’s Instagram and C&C activities is by examining the actual packet contents. As it turns out, a Russian-generated bot’s communication with its master is measurably different from random folks posting comments. I know, I was surprised too.

Sooo, you need to capture contents for analysis, but a full day of packet capture for your 10G network would use more disk space than your current datacenter. You could go out and buy a lot of disk space, or you could find a way to identify just the stuff you want and capture that. The rest of this entry is about how to accomplish this.
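To put a rough number on that, here is a back-of-the-envelope sketch in shell arithmetic. It assumes the link is saturated in one direction for the entire day and ignores capture-file overhead, so treat it as a worst case; the function name bytes_per_day is made up for illustration.

```shell
#!/bin/sh
# Worst-case bytes captured per day on a link of the given speed.
# Assumption: the link runs at line rate for all 86,400 seconds of the day.
bytes_per_day() {
    gbits_per_sec=$1
    # gigabits/s -> bytes/s (divide by 8), then multiply by seconds per day
    echo $(( gbits_per_sec * 1000000000 / 8 * 86400 ))
}

day_bytes=$(bytes_per_day 10)
echo "$day_bytes bytes, or about $(( day_bytes / 1000000000000 )) TB per day"
```

For a 10G link this works out to roughly 108 TB per day, per direction, before you keep a second day.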

First, I need to provide a brief explanation of ring buffers—what they are, how they work (high level) and what you might use one for. A buffer is generally a memory area where you are temporarily storing data before it’s used. The shelf between the short order cook and the wait staff is a kind of buffer, where you store food briefly until it can be picked up by a server for delivery to a table. This is a buffer that’s being operated as what’s known as a “first in, first out” queue, but the queueing mechanism is not as important to our example as the way in which the buffer (shelf) fills and what happens when it’s full. When the short order cook fills the shelf and has new things to put on it, there’s much yelling and pointing and wait staff schlepping food to tables as fast as they can. Essentially, everything has to stop until there’s more buffer space on the shelf to place freshly cooked food.

When dealing with packet capture, this is the equivalent of filling up your disk. No more packets will be captured until you delete some of what you have or move it elsewhere. Instead of having Nagios start yelling at us about the full disk, we can adopt another method for dealing with the buffer, called ring or circular buffering. The best way to imagine a ring buffer, oddly enough, is as a straight tube that’s open on both ends. The length of the tube is equivalent to the storage size of the buffer. You push things in one end of the tube/buffer, and once it’s full, anything that can’t fit into the allocated storage size falls out the other end (gets deleted).

The reason this is called a ring buffer is that the reality of operating one means that you don’t actually move things down a pipe, you create a circular linked list of storage areas and move the pointer to the next area to be deleted/overwritten. It’s far more efficient than moving things, and we need efficiency to capture packets at the speeds we are talking about.
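As a toy illustration of the pointer-moving idea, here is a minimal sketch in shell. It is not how any real capture tool is implemented: it uses a fixed set of small files as the storage areas and a counter as the pointer, and the names ring_init, ring_put, RING_DIR, RING_SLOTS, and RING_NEXT are all invented for this example.

```shell
#!/bin/sh
# Toy ring buffer: a fixed number of slots stored as files slot0..slotN-1.
# All names here are hypothetical and exist only for illustration.

ring_init() {
    RING_DIR=$1      # directory holding the slots
    RING_SLOTS=$2    # fixed capacity of the ring
    RING_NEXT=0      # index of the slot that will be overwritten next
    mkdir -p "$RING_DIR"
}

ring_put() {
    # Overwrite the slot under the pointer; nothing is shifted or moved.
    printf '%s\n' "$1" > "$RING_DIR/slot$RING_NEXT"
    # Advance the pointer, wrapping back to 0 at the end of the ring.
    RING_NEXT=$(( (RING_NEXT + 1) % RING_SLOTS ))
}

ring_init "$(mktemp -d)" 3
for item in a b c d e; do
    ring_put "$item"
done
cat "$RING_DIR/slot0" "$RING_DIR/slot1" "$RING_DIR/slot2"
```

After five writes into three slots, the slots hold d, e, and c: the two oldest items (a and b) were overwritten in place, and the pointer now sits on the slot holding c, which is next to go.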

At the end of this post, I’ll explain exactly how to accomplish this with a real packet capture tool, tcpdump, which is widely available and simple to operate. Before that, though, I should say a bit about what this method means. It means that you really only have access to a window of packet capture at any one time. It’s not optimal, but there is a way for us to make the most of it: run these packets through whatever detection techniques we have, and use a detection to trigger an archival copy of the interesting packets for later (possibly human) examination.

The way it works is that you really have two indexes into the ring buffer, the normal one that watches where the end of the buffer is and what to overwrite next, and one that allows you to quickly copy a time-bounded slice out of the buffer, for example, “I want all of the packets from 2:34 to 2:37,” which yanks a few hundred thousand packets out of the buffer for further examination. The key here is that the trigger is something like an IDS hit, or optimally something more sophisticated like an event-correlated score tagging. The result, though, is that now you have a little chunk of packet capture which may have something in it that warrants further analysis. These smaller chunks are something you can store for a lot longer than the several firehoses worth crossing your network every few seconds.
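Here is a rough sketch of the time-bounded copy, working at the granularity of whole capture files rather than individual packets. It assumes the ring is a directory of files whose names start with Capture, and it selects files by their last-modified time; the function name copy_window and its argument order are my own invention. Times are given in the format accepted by touch -t ([[CC]YY]MMDDhhmm[.SS]).

```shell
#!/bin/sh
# copy_window SRC_DIR DEST_DIR START END
# Copy every ring file last modified between START and END (inclusive).
# Hypothetical helper; START and END use the touch -t time format.
copy_window() {
    src=$1
    dest=$2
    # Two empty marker files carry the window boundaries as their mtimes.
    lo=$(mktemp)
    hi=$(mktemp)
    touch -t "$3" "$lo"
    touch -t "$4" "$hi"
    mkdir -p "$dest"
    for f in "$src"/Capture*; do
        [ -e "$f" ] || continue
        # Keep the file if it is not older than START and not newer than END.
        if [ ! "$f" -ot "$lo" ] && [ ! "$f" -nt "$hi" ]; then
            cp -p "$f" "$dest/"
        fi
    done
    rm -f "$lo" "$hi"
}
```

Something like copy_window /captures /analysis_pool 202310031558 202310031559 would pull out the files last written during those two minutes (both paths are placeholders). Because a file’s modification time marks when it was last written, a file stamped just after the end of the window can still hold packets from inside it, so in practice you would pad the window on both sides.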

There are dozens of tools out there for capturing packets, searching through the packets, and managing large numbers of packets. For our example, I’m going to explain how to perform packet capture to a ring buffer using two popular tools so you don’t have to code up something of your own.

TCPDUMP

tcpdump is the old standard for UNIX and UNIX-like systems. It’s been the go-to for command-line ad hoc packet capture for a few decades and, like vi, it’s something you learn because it’s nearly always available. (And some of us just like vi as an editor; don’t judge me.) With tcpdump and with Wireshark, the way to accomplish this is to limit each capture file by size or time, and tell the tool to overwrite older files based on another criterion (file count or time). For tcpdump, we use the “-C” parameter to cap each capture file at a specified size and the “-W” parameter to limit the number of files generated and rotate through them (per the ring buffer requirement). The “-G” parameter rotates files by time period instead, but when “-W” is combined with “-G”, tcpdump exits once the file limit is reached rather than wrapping around.

A typical command line would look like:

sudo tcpdump -i en0 -w Capture -W 10 -C 10 &

-i: What interface on the system to capture packets from

-w: Write the packets to this filename

-C: File size, each unit being 1,000,000 bytes

-W: How many files to write before going back and overwriting the oldest

The command line above results in a file list like this:

$ ls -lth
total 190032
-rw-r--r-- 1 root staff 6.9M Oct 3 16:00 Capture8
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture7
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture6
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture5
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture4
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture3
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture2
-rw-r--r-- 1 root staff 9.5M Oct 3 15:58 Capture1
-rw-r--r-- 1 root staff 9.5M Oct 3 15:57 Capture0
-rw-r--r-- 1 root staff 9.5M Oct 3 15:57 Capture9

Note that this capture has been running for a while, so the file currently being overwritten is Capture8. To illustrate, I’ve produced this listing sorted by date (newest on top) and showing sizes (around 10 MB each, except for the one that’s being written).
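It helps to know how much time a ring of this size actually holds, since that is the window you have to react in. A quick sketch, assuming a steady average traffic rate (real traffic is bursty, so this is only an estimate); the function name ring_window_ms is made up for illustration.

```shell
#!/bin/sh
# Milliseconds of traffic a -C/-W ring can hold at a given average rate.
# Hypothetical helper; assumes a constant rate, which real links never have.
ring_window_ms() {
    files=$1          # the value passed to -W
    mb_per_file=$2    # the value passed to -C (units of 1,000,000 bytes)
    mbits_per_sec=$3  # average traffic rate in megabits per second
    # total megabits in the ring / megabits per second, scaled to ms
    echo $(( files * mb_per_file * 8 * 1000 / mbits_per_sec ))
}

ring_window_ms 10 10 5       # a quiet 5 Mbit/s link
ring_window_ms 10 10 10000   # a saturated 10G link
```

The first call prints 160000 (about 160 seconds, in the same ballpark as the few minutes spanned by the listing above) and the second prints 80. In other words, ten 10 MB files are fine for a lightly loaded interface but would hold well under a second of a saturated 10G link, so size -C and -W to match the link you are actually watching.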

So, now that we have our ring capture going, we can start considering what we can do with that. The simplest thing we can imagine is that based on a detection, say from an anti-virus tool, we can save a chunk of packet capture off for later examination by an analyst. For example, if your edge-based AV tool kicks off an alert that a C&C host associated with Exploit-SWFRedirector.b is getting hit from an inside source at exactly 15:59, you can trigger the following command:

cp Capture7 /analysis_pool/Capture7`date +%s`

cp Capture8 /analysis_pool/Capture8`date +%s`

This will save about 20 MB of packet capture to a place where it can be examined at a later date. Depending on how many triggers you are getting, you might have to adjust the file naming: the timestamp only has one-second resolution, so if the same ring file is copied twice within the same second, the two copies get the same name and the second overwrites the first.
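One way to sidestep name collisions is to add a counter after the timestamp and bump it until the name is free. The sketch below assumes triggers are handled one at a time (two copies racing each other could still pick the same name), and the function name archive_capture is hypothetical.

```shell
#!/bin/sh
# archive_capture FILE DEST_DIR
# Copy FILE into DEST_DIR under a name that does not already exist there,
# then print the name that was chosen. Hypothetical helper.
archive_capture() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    base=$(basename "$src")
    stamp=$(date +%s)
    n=0
    target="$dest_dir/$base.$stamp.$n"
    # Same file, same second: bump the counter until the name is unused.
    while [ -e "$target" ]; do
        n=$(( n + 1 ))
        target="$dest_dir/$base.$stamp.$n"
    done
    cp -p "$src" "$target" && printf '%s\n' "$target"
}
```

Calling archive_capture Capture7 /analysis_pool twice in the same second then produces two files whose names differ only in the trailing counter, rather than one file silently replacing the other.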

What we’ve done here is to trap a lot of rock-colored items on a mountain of rocks on planet rock. Once we look inside the trap, we can see if we found the rock-colored lizard, and release everything else back into the wild.
