NetOps Take Note of SRE Focus on MTTR to Realize Availability

F5 Ecosystem | May 14, 2018

Site Reliability Engineer (SRE) is a relatively new role – usually within engineering or operations – focused on maintaining, unsurprisingly, site reliability. That generally means availability of applications, but it also includes performance. That makes sense, as unresponsive apps are considered, by most end-users today, unavailable.

Perusing Catchpoint’s 2018 SRE Report, I was struck by charts comparing service level indicators and SRE metrics. Usually when you see “availability” as a top-level service level indicator, you also see “uptime” or “downtime” as a metric. Not so for SREs in this survey.

When asked what service level indicators were most important for their services, an overwhelming 84% declared “end-user availability” number one. Latency came in second with 61% and error rate a close third with 60%. Performance – described as end-user response time in the report – grabbed an impressive 58% of answers.

Now note the metrics used to define success. In the world of infrastructure and operations, we’re more accustomed to seeing metrics like “uptime” and catchy phrases like “5 9s”.

SREs instead tend to view success in terms of incident rates and mean time to resolution (MTTR). Given that the same report noted that 41% of SREs were in a “DevOps Engineer” role prior to becoming an SRE, this should be unsurprising. DevOps itself is more concerned with MTTR than with calculating uptime because it assumes there will be downtime. The key is to minimize downtime by resolving incidents quickly rather than spending all your effort trying to avoid them altogether.

Still, astute readers will note that by minimizing MTTR you are minimizing downtime. Faster resolution, less downtime.
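To make that relationship concrete, here is a minimal sketch of the arithmetic. It assumes downtime over a period is simply the number of incidents multiplied by MTTR; the function names, the 30-day window, and the incident counts are illustrative assumptions, not figures from the Catchpoint report.

```python
def total_downtime_minutes(incident_count: int, mttr_minutes: float) -> float:
    """Downtime over a period: number of incidents times mean time to resolution."""
    return incident_count * mttr_minutes


def availability(period_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the period during which the service was available."""
    return (period_minutes - downtime_minutes) / period_minutes


MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Same hypothetical incident rate, two different MTTR values.
slow = total_downtime_minutes(incident_count=4, mttr_minutes=60)
fast = total_downtime_minutes(incident_count=4, mttr_minutes=15)

print(f"MTTR 60 min: {slow:.0f} min down, "
      f"{availability(MINUTES_PER_30_DAYS, slow):.4%} available")
print(f"MTTR 15 min: {fast:.0f} min down, "
      f"{availability(MINUTES_PER_30_DAYS, fast):.4%} available")
```

With the incident rate held constant, cutting MTTR from 60 to 15 minutes cuts downtime from 240 to 60 minutes in this example. The number of failures did not change; only the speed of resolution did.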

The subtle difference between the two is that human beings tend to optimize for what they’re measured on. If you’re measured on lines of code, you’re going to write a lot of lines of code – whether you need them or not. If you’re measured on security incidents, you’re going to lock everything down and scream NO to any changes that might encourage a breach. If you measure folks on uptime, operations will focus on keeping systems up and available – but not on architecting and instrumenting systems and applications that decrease MTTR.

This is one of those ‘cultural’ aspects of DevOps – a change in the way we approach operations – that needs to carry over into NetOps. If we keep optimizing for uptime, we miss the opportunity to put into place the alerting and observability (like monitoring and robust logging) that reduce mean time to resolution and achieve our goal of minimizing downtime.

Digging through logs – even centralized ones – is not an efficient means of getting to the heart of a problem and resolving it. Real-time monitoring and alerting on key variables that impact availability, such as capacity, connectivity, and performance across the entire data path (network, infrastructure, application), will invariably reduce the time it takes to resolve issues, because you find out right away which systems or services are degrading or have abruptly failed.
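As a rough illustration of alerting on those variables, the sketch below checks one sample per layer of the data path against a small set of threshold rules. Everything in it is an assumption made for the example: the metric names, the thresholds, and the `Rule` and `evaluate` helpers do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Rule:
    metric: str                         # key to look up in a layer's sample
    breached: Callable[[float], bool]   # returns True when the value is unhealthy
    message: str                        # text used in the alert


# Hypothetical rules: one each for capacity, connectivity, and performance.
RULES = [
    Rule("capacity_used_pct", lambda v: v > 85.0, "capacity above 85%"),
    Rule("connect_success_pct", lambda v: v < 99.0, "connection success below 99%"),
    Rule("p95_response_ms", lambda v: v > 500.0, "p95 response time above 500 ms"),
]


def evaluate(samples: Dict[str, Dict[str, float]], rules: Sequence[Rule]) -> List[str]:
    """Return one alert string per breached rule, per layer of the data path."""
    alerts = []
    for layer, metrics in samples.items():
        for rule in rules:
            value = metrics.get(rule.metric)
            # A layer that does not report a metric is skipped for that rule.
            if value is not None and rule.breached(value):
                alerts.append(f"{layer}: {rule.message} (observed {value})")
    return alerts


# Made-up samples for the three layers named in the text.
samples = {
    "network": {"capacity_used_pct": 42.0, "connect_success_pct": 97.5, "p95_response_ms": 12.0},
    "infrastructure": {"capacity_used_pct": 91.0, "connect_success_pct": 99.9},
    "application": {"capacity_used_pct": 60.0, "connect_success_pct": 99.95, "p95_response_ms": 180.0},
}

alerts = evaluate(samples, RULES)
for alert in alerts:
    print(alert)
```

The point of the design is that every alert names both the layer and the variable, so whoever picks it up starts from "connectivity is degrading in the network" rather than from a pile of logs.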

NetOps needs to adopt this approach with respect to reliability in the production pipeline because it’s an overall better approach to dealing with inevitable failure and it aligns with their DevOps counterparts. After all, there’s a reason we only reached for 5 9s, isn’t there? Because we recognized that failure happens no matter how hard we try and perfection is impossible.
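For a sense of how little room “5 9s” actually leaves, the following sketch converts a number of nines into an allowed-downtime budget. The function name is made up for this example, and a 365-day year is assumed.

```python
def downtime_budget_minutes(nines: int, period_minutes: float) -> float:
    """Allowed downtime for an availability target of the given number of nines.

    Two nines means 99%, three means 99.9%, and so on, so the allowed
    unavailability is 10 to the power of minus the number of nines.
    """
    unavailability = 10.0 ** (-nines)
    return period_minutes * unavailability


MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

for n in range(2, 6):
    budget = downtime_budget_minutes(n, MINUTES_PER_YEAR)
    print(f"{n} nines: about {budget:.2f} minutes of downtime per year")
```

At five nines the budget is a little over five minutes per year, so a single incident that takes 15 minutes to resolve already exceeds it. That is one way to see why time to resolution, not the absence of failure, is what a team can realistically control.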

Moving from uptime/downtime to MTTR as a metric for success encourages cross-team collaboration and the use of observability tools across the full length of the production pipeline. There’s a reason monitoring and alerting tools were at the top of “must have tools” for SREs in the survey. Observability (with the goal of alerting on error/incident) plus collaboration is a better formula for assuring that everyone – NetOps, DevOps, and App Dev, too – can meet their goal of keeping apps both fast and available.
