The Importance of Calibrating Application Health Measurements

Lori MacVittie
Published May 11, 2020

Recently, I was reminded of the importance of calibrating measurements when I re-entered the realm of reef keeping. Much like the application landscape, reef keeping has changed dramatically in the past five years.

Today, connected monitors and probes automatically track conditions and alert me if something is awry. Web and mobile dashboards enable monitoring, programming, and an at-a-glance view of the health of my tank and the status of the myriad devices that keep it running.

Astute readers will note that pH appears to be "below minimum." No need to worry; all is well. You see, the probe (sensor) is not properly calibrated. It's a common problem; calibrating probes is a process fraught with failure. At the moment I know that the probe reads low, and I adjust the measurement myself based on that knowledge.

Like the health of a reef tank, the health of applications requires careful attention to key metrics. Deviations, especially wild fluctuations, can indicate a problem. Manual adjustment of metrics is not a process you want to mimic when it comes to applications. Manual adjustments may work for one application or even two, but the average organization has between 100 and 200 applications in its digital portfolio. You need accurate measurements calibrated against typical health patterns.

As with most network and application metrics, this means taking samples for a period of time and learning the "highs" and "lows." Thresholds can then be used to determine anomalous behavior.
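As a rough illustration of that idea, here is a minimal Python sketch that learns a "normal" band from a set of samples and then checks new readings against it. The function names, the sample values, and the choice of mean plus or minus three standard deviations are all assumptions made for this example; real calibration would likely account for time of day, seasonality, and much more data.

```python
import statistics


def calibrate(samples, k=3.0):
    """Learn a 'normal' band from samples: mean +/- k standard deviations.

    k=3.0 is an illustrative default, not a recommendation.
    """
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return mean - k * stdev, mean + k * stdev


def is_anomalous(value, band):
    """Return True when a reading falls outside the calibrated band."""
    low, high = band
    return value < low or value > high


# Hypothetical response-time samples in milliseconds.
baseline = [120, 118, 125, 130, 122, 119, 127, 124, 121, 126]
band = calibrate(baseline)

print(is_anomalous(125, band))  # a reading close to the learned "normal"
print(is_anomalous(400, band))  # a reading far outside it
```

The point of the sketch is that the thresholds come from the data rather than from an operator's memory of how far off a given probe happens to be.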

The issue is not the principle, but rather execution.

First, we're generally focused on only one measurement point: the application. Interestingly, the health of a reef tank requires measuring salinity and pH along with temperature because both values are impacted by temperature. Measuring application health is much the same; it is impacted by other measures, such as network performance and load. Unfortunately, many organizations aren't taking a holistic view of application health. The application itself may be fine, but the customer experience might be abysmal thanks to a poorly performing connected device or network.

We need to broaden our view of application health by expanding what we measure. Moreover, we need to calibrate additional measurements to make sure we can identify what's typical and what is not. Because "what is not typical" can be indicative of a problem or, worse, an attack.

Second, the challenge of scale arises from the need to calibrate across multiple points for every application we need to monitor (spoiler: that's all of them). We can't expect operators to manually calibrate that many data points. It's not humanly possible.

That's where machines come in.

Advanced Analytics

Advanced analytics and machine learning are one answer to the issue of scale. Machines can, and do, process vast volumes of telemetry at significant rates. They can ingest, normalize, and analyze patterns and relationships in quantities of data that we, as human beings, simply cannot manage. In this way, machine learning provides the ability to calibrate "normal" across a range of related data points and immediately detect deviant patterns that indicate a problem.
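To make "calibrating normal across a range of related data points" a little more concrete, the following Python sketch learns a baseline per metric and reports which metrics in a new reading deviate from it. It is a simple per-metric z-score check, not machine learning, and it stands in for the far richer analysis a real platform would perform. The metric names, sample values, and threshold are hypothetical.

```python
import statistics


def learn_baselines(history):
    """Map each metric name to the (mean, stdev) of its historical samples."""
    return {
        name: (statistics.fmean(values), statistics.stdev(values))
        for name, values in history.items()
    }


def deviations(reading, baselines, k=3.0):
    """Return metrics whose z-score magnitude exceeds k, with that z-score."""
    flagged = {}
    for name, value in reading.items():
        mean, stdev = baselines[name]
        if stdev == 0:
            continue  # a constant history gives no scale to compare against
        z = (value - mean) / stdev
        if abs(z) > k:
            flagged[name] = round(z, 1)
    return flagged


# Hypothetical telemetry from the application and the network in front of it.
history = {
    "app_latency_ms": [120, 118, 125, 130, 122, 119, 127, 124, 121, 126],
    "network_rtt_ms": [20, 22, 19, 21, 20, 23, 21, 20, 22, 21],
    "requests_per_s": [300, 310, 295, 305, 298, 302, 307, 299, 301, 304],
}
baselines = learn_baselines(history)

# The application looks fine, but the network round-trip time does not.
reading = {"app_latency_ms": 124, "network_rtt_ms": 95, "requests_per_s": 303}
flagged = deviations(reading, baselines)
print(flagged)
```

In this made-up reading only the network metric is flagged, which mirrors the earlier point: the application itself may be healthy while the experience in front of it is not.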

It's easy enough to correlate an application's performance problems with a Monday morning surge of logins. What isn't easy is to recognize that Bob usually doesn't log in until Monday afternoon. And yet today, he is logging in with everyone else. That's an anomaly that isn't readily recognizable by human operators because we don't have that level of visibility. With enough telemetry emitted by the application, client, and application services that comprise the code-to-customer experience, advanced analytics can detect that anomaly. It can also flag it or push a new business flow that verifies Bob is actually trying to log in.
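The Bob scenario can be sketched in a few lines of Python. This is only an illustration under simple assumptions: it tracks the hours at which each user has logged in before and treats a login as unusual when no past login falls within an hour of it. The class name, parameters, and sample data are hypothetical, and a production system would weigh many more signals.

```python
from collections import Counter, defaultdict


class LoginBaseline:
    """Tracks the hours of day at which each user typically logs in."""

    def __init__(self, min_history=5, tolerance_hours=1):
        self.hours = defaultdict(Counter)
        self.min_history = min_history
        self.tolerance_hours = tolerance_hours

    def observe(self, user, hour):
        """Record a login by `user` at `hour` (0-23)."""
        self.hours[user][hour] += 1

    def is_unusual(self, user, hour):
        """True when the user has enough history and none of it is near `hour`."""
        seen = self.hours[user]
        if sum(seen.values()) < self.min_history:
            return False  # too little data to call anything unusual
        t = self.tolerance_hours
        nearby = sum(seen[(hour + d) % 24] for d in range(-t, t + 1))
        return nearby == 0


baseline = LoginBaseline()
for hour in [13, 14, 14, 15, 13, 14]:  # Bob's past logins: early afternoon
    baseline.observe("bob", hour)

print(baseline.is_unusual("bob", 8))   # a morning login
print(baseline.is_unusual("bob", 14))  # his usual time
```

An unusual result would not by itself mean an attack; it is the kind of signal that could flag the event or trigger a verification flow.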

That capability is like what many applications do today on a device level. Many digital processes push verification codes and ask us to prove we're human by identifying all the cars in a blurry image. But it's the device details that trigger the new business flow, not the behavior of logging in at an unusual time of day. In the future, we'll need to be able to trigger flows based on both, especially if we continue to support a distributed workforce.

That makes calibration a critical part of the process. And calibration is achieved by taking (a lot of) measurements and coming up with "normal." That, too, is a process that challenges human scale and requires machines to ingest and analyze significant amounts of telemetry.

Advanced analytics will ultimately enable observability and give rise to new services able to uncover hidden application insights (DEM), enable smarter app service orchestration (AI Ops), and produce previously undiscoverable business value (AI-Enhanced Services).

To do that we need to generate copious amounts of telemetry so we can calibrate "normal" behavior for applications, users, and everything in the data path in between.