Forget uptime. A low MTTR is the new '5 9s' for IT

F5 Ecosystem | May 30, 2017

Outages are expensive. Whether they’re ultimately the result of an attack or a failure in software or hardware isn’t that relevant. The costs per minute of downtime are increasing, thanks to the growing reliance on APIs and web apps of the modern, digital economy.

For some, those costs are staggering. It’s estimated that Amazon’s 40 minutes of downtime back in 2013cost them $2.64M. That’s $1100 per second for those disinclined to do the math. If you think that’s horrifying, consider Google, whose 5-minute downtime in the same year cost them $109K per minute (or $1816.67 per second) for a whopping total of $545K. For 5 minutes. Technically, if that was all they suffered, that’s the vaunted “5 9s” IT is tasked with achieving.

How often do outages happen? Too often, apparently. If you’ve never seen this one, take a gander at pingdom’s live outage map. It’s built from data culled from its over 700,000 global users. This morbidly fascinating map displays outages occurring in the past hour across the globe. The bright flashes depicting outages are a nice touch; really drives home the splash they make with users.

Which is to say an unwanted one.

The digital economy exacerbates this problem. Earlier this year an S3 outage at Amazon knocked out a whole spate of customers’ apps and web sites. But lest you pin this problem on public cloud providers, a quick dive into the site builtwith.com will quickly erase that belief. The percentages of sites taking advantage of CDNs and APIs is perhaps alarmingly high if you consider the dependency that incurs on someone else’s uptime. It’s hard to find a site that doesn’t rely on at least one external API or service, which increases the possibility of downtime because if that external service is down, so are you.

Basically, IT settled on “5 9s” because it is impossible to achieve 100% availability. The key today, when per second costs are skyrocketing thanks to the shift of the economy into the digital realm, is to minimize downtime. In other words, setting goals that require a low mean-time to resolution (MTTR), is just as critical – maybe more – than trying to eliminate downtime.

One of the key measures of “high performing organizations” in Puppet Labs’ 2016 State of DevOps Report is MTTR, defined as the time it takes to restore service when a service incident occurs (e.g. unplanned outage or service impairment). The highest performing organizations (based on the report’s assessment) take less than one hour while medium and low performing organizations take “less than one day”. “In other words” the report notes, “high performers had 24 times faster MTTR than low performers.”

You’ll note the question wasn’t “if” there is a service incident. It was “when” there is a service incident. The assumption is that an incident will occur, and thus the key is to minimize the time to resolution. A 2016 survey by IHS reported that “on average, survey respondents experience 5 downtime events per month, and 27 hours of downtime per month” cost the average mid-sized organization $1 and their larger counterparts up to $60M.

If we assume Murphy’s Law still presides over Moore and Conway, the answer is to try to minimize MTTR in order to reduce the time (and costs) associated with inevitable downtime.

That means visibility is critical, which means monitoring. Lots and lots of monitoring. But not just the website, or the web app, or the API – we need to monitor the full stack. From the network to the app services to the application itself. That’s something not everyone does, and when they do, they appear to do it inconsistently.

alt="atlassian-incident-response"

Consider the 2017 xMatters|Atlassian DevOps Maturity surveyin which 50% of respondents declared they “wait for operations to declare a major incident” before responding. A frightening 1/3 of companies “learn about service interruptions from their customers.”

In a digital economy, every second matters. Not just because it costs money but because it negatively impacts future revenues, as well. Decreasing brand value and trust with customers results in fewer purchases, users, and eventually stagnating growth. That’s not a direction organizations’ should going.

Monitoring is the first step to detecting issues that cause outages. But monitoring alone doesn’t help MTTR. Communication does. Alerting the relevant stakeholders as soon as possible and arming them with the information they need to troubleshoot the issue will assist in a faster time to resolution. That means sharing – one of the four key pillars of DevOps – is key to improving MTTR. Even if you aren’t embracing other aspects of DevOps at a corporate level yet, sharing is one you should consider elevating to a top level initiative. Whether it’s through ChatOps or e-mail, a mobile app or a dynamically updated wiki page, it’s imperative that the information gleaned through monitoring be shared widely across the organization.

A hiccup in a switch or server may seem innocuous, but left alone it might wind up knocking out half the services a critical app depends on. In the 2017 State of the Network study conducting by Viavi, 65% of network and systems administrators cite “determining whether problem is caused by network, system, or apps” as their number one challenge when troubleshooting application issues. Greater visibility and full-stack monitoring is one way to address this challenge, by ensuring that those responsible for finding the root cause have at hand as much information about the status and health of all components in the data path as possible.

Visibility is key to the future of IT. Without it, we can’t achieve the level of automation necessary to redress outages before they occur. Without visibility we can’t reduce MTTR in a meaningful way. Without it, we really can’t keep the business growing at a sustainable rate.

Visibility, like security, should be a first class citizen in the strategy stack driving IT forward. Because outages happen, and it is visibility that enables organizations to recover quickly and efficiently, with as little damage to their brand and bottom line as possible.

Share
Tags: 2017

About the Author

Related Blog Posts

F5 accelerates and secures AI inference at scale with NVIDIA Cloud Partner reference architecture
F5 Ecosystem | 10/28/2025

F5 accelerates and secures AI inference at scale with NVIDIA Cloud Partner reference architecture

F5’s inclusion within the NVIDIA Cloud Partner (NCP) reference architecture enables secure, high-performance AI infrastructure that scales efficiently to support advanced AI workloads.

F5 Silverline Mitigates Record-Breaking DDoS Attacks
F5 Ecosystem | 08/26/2021

F5 Silverline Mitigates Record-Breaking DDoS Attacks

Malicious attacks are increasing in scale and complexity, threatening to overwhelm and breach the internal resources of businesses globally. Often, these attacks combine high-volume traffic with stealthy, low-and-slow, application-targeted attack techniques, powered by either automated botnets or human-driven tools.

F5 Silverline: Our Data Centers are your Data Centers
F5 Ecosystem | 06/22/2021

F5 Silverline: Our Data Centers are your Data Centers

Customers count on F5 Silverline Managed Security Services to secure their digital assets, and in order for us to deliver a highly dependable service at global scale we host our infrastructure in the most reliable and well-connected locations in the world. And when F5 needs reliable and well-connected locations, we turn to Equinix, a leading provider of digital infrastructure.

Volterra and the Power of the Distributed Cloud (Video)
F5 Ecosystem | 04/15/2021

Volterra and the Power of the Distributed Cloud (Video)

How can organizations fully harness the power of multi-cloud and edge computing? VPs Mark Weiner and James Feger join the DevCentral team for a video discussion on how F5 and Volterra can help.

Phishing Attacks Soar 220% During COVID-19 Peak as Cybercriminal Opportunism Intensifies
F5 Ecosystem | 12/08/2020

Phishing Attacks Soar 220% During COVID-19 Peak as Cybercriminal Opportunism Intensifies

David Warburton, author of the F5 Labs 2020 Phishing and Fraud Report, describes how fraudsters are adapting to the pandemic and maps out the trends ahead in this video, with summary comments.

The Internet of (Increasingly Scary) Things
F5 Ecosystem | 12/16/2015

The Internet of (Increasingly Scary) Things

There is a lot of FUD (Fear, Uncertainty, and Doubt) that gets attached to any emerging technology trend, particularly when it involves vast legions of consumers eager to participate. And while it’s easy enough to shrug off the paranoia that bots...

Deliver and Secure Every App
F5 application delivery and security solutions are built to ensure that every app and API deployed anywhere is fast, available, and secure. Learn how we can partner to deliver exceptional experiences every time.
Connect With Us