SRE Brings Modern Enterprise Architectures into Focus

F5 Ecosystem | January 12, 2023

Being ‘on’ is not the measure of performance. Just because power is getting to a lamp doesn’t mean it’s giving the user enough light to reliably see what they’re doing. Many factors beyond power affect whether the user can see. The bulb could be dim, either because it’s dying or because it isn’t getting enough power. If the bulb isn’t getting enough power to stay bright, there could be an issue with the wiring, or a dimmer could be restricting the flow of electricity. And even if the bulb is bright, the lamp shade could be too dark, or the space too large for a single light. In short, many factors affect the lamp’s performance and, subsequently, the user’s experience. Similarly, there’s more to assessing the performance and reliability of systems and applications than the traditional measure of uptime. Reliability also depends on the level of service.

Systems and applications are made up of many components (infrastructure, APIs, security, workflows, logic, data, etc.) brought together for a purpose, and just being on doesn’t ensure reliability. As with the lamp, you must be able to evaluate and assess all the components to confirm optimum performance and experience. In a brick-and-mortar business, determining a shortcoming in the level of service can be as easy as walking the ‘customer path’ to assess the overall experience; in a digital business, that can be a significant challenge. With the business and IT silos created by traditional enterprise architectures, identifying an issue and finding its root cause isn’t always easy or efficient. Business leaders may think there’s a problem, but the IT teams managing individual components may not if everything is ‘on.’ Site reliability engineering (SRE) is the bridge between the business and IT that ensures business commitments are met by means of service level objectives (SLOs).

What is SRE?

Site reliability engineering originated at Google in the early 2000s and, according to Google, “is what you get when you treat operations as if it’s a software problem.” In our terms, it’s a set of processes, practices, and tools, as well as a culture and a mindset, employed to create reliable, efficient, and scalable systems that support business objectives.

SRE focuses on systems that are reliable and scalable, not just available. We add that it’s a mindset and a culture because, as with security, everyone should be expected to contribute to quality, reliable systems no matter their role. In practice, SRE is often embedded in a service team that delivers the whole service from end to end. These teams are generally responsible for improving the core system and enabling innovation by monitoring availability, latency, performance, and recovery, while driving continuous improvement through automation and efficiency. In essence, they’re looking at the whole room, not just verifying that the lamp is on.

How SRE uses SLIs to meet SLOs

Site reliability engineering defines the SLOs, and the SLIs (service level indicators) used to measure them, that are needed to meet business outcomes. More simply put, SRE unites the needs and goals of the development, security, and operations teams to reliably deliver on the promises the business makes to its customers.

If the business commitment is that users will reliably have enough light to see what they are doing (service level), one SLO could be that one brightly lit lamp is maintained for every 10 square feet of space (availability). Another SLO could be a defined MTTR (mean time to recover): in this example, the amount of time within which dead or dying light bulbs will be replaced. SLIs are then the measurements, each with a threshold, that site reliability engineers and IT define to ensure SLOs are achieved, such as the luminous flux, the electricity flow to each lamp, or the marginal changes in lamp location caused by users bumping or moving them around. In application delivery systems, these could look like CPU utilization, API call time, database query time, etc. It’s up to the site reliability engineers to define the SLI measures that impact the business SLOs, and to decide what responses will be taken, by adjusting operating policies and configuration, when a measure crosses its threshold.
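As a rough illustration of how indicators relate to objectives, the following Python sketch compares two measured SLIs (availability and MTTR) against SLO targets. Everything in it is an assumption made for the example: the `Slo` class, the helper functions, the 99.9% availability target, the 30-minute MTTR target, and the sample numbers are hypothetical and not taken from any particular product or from the book.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Slo:
    """A target for one indicator. Names and numbers are hypothetical."""
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, sli_value: float) -> bool:
        # Availability should stay at or above its target;
        # recovery time should stay at or below its target.
        if self.higher_is_better:
            return sli_value >= self.threshold
        return sli_value <= self.threshold


def availability(successes: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return successes / total if total else 1.0


def mean_time_to_recover(recovery_minutes: list[float]) -> float:
    """SLI: average minutes from detection to recovery across incidents."""
    return mean(recovery_minutes) if recovery_minutes else 0.0


# Assumed objectives for the illustration.
slos = {
    "availability": Slo("availability", 0.999, higher_is_better=True),
    "mttr_minutes": Slo("mttr_minutes", 30.0, higher_is_better=False),
}

# Assumed measurements for one reporting period.
slis = {
    "availability": availability(successes=99_950, total=100_000),
    "mttr_minutes": mean_time_to_recover([12.0, 45.0, 48.0]),
}

report = {name: slos[name].is_met(value) for name, value in slis.items()}
print(report)  # {'availability': True, 'mttr_minutes': False}
```

In this made-up period the service is ‘on’ often enough to meet its availability objective, yet it still misses the recovery objective, which is the kind of gap that uptime alone would hide.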

SRE’s benefit in modern enterprise architectures

The measures, thresholds, and responses are the intersection of SRE with the other domains of a modern enterprise architecture designed for the application delivery of a digital business. Operational data—telemetry—feeds the observability of the defined measures and thresholds set forth by SRE. Automation is the combined application of tools, technologies, and practices to enable site reliability engineers to scale defined responses with less toil, thus enabling the efficient satisfaction of the SLOs of a digital service. And the system reliability of digital services improves the likelihood of a positive user experience with your digital business.
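The flow from telemetry to threshold to automated response can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `LatencyWatch` class, the 300 ms threshold, the 10-sample window, and the `scale_out` response are hypothetical, and a real deployment would typically rely on its monitoring and automation tooling instead of hand-written code like this.

```python
import math
from collections import deque
from typing import Callable, Deque, List


class LatencyWatch:
    """Keeps a sliding window of latency telemetry and fires a response on breach."""

    def __init__(self, threshold_ms: float, window: int,
                 respond: Callable[[float], None]) -> None:
        self.threshold_ms = threshold_ms
        self.window = window
        self.samples: Deque[float] = deque(maxlen=window)
        self.respond = respond

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the samples currently in the window."""
        ordered = sorted(self.samples)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]

    def record(self, latency_ms: float) -> None:
        """Ingest one sample; respond once the window is full and p95 is too high."""
        self.samples.append(latency_ms)
        if len(self.samples) < self.window:
            return  # not enough data to judge yet
        current = self.p95()
        if current > self.threshold_ms:
            self.respond(current)


actions: List[str] = []


def scale_out(p95_ms: float) -> None:
    # Hypothetical response: a real one might adjust a policy or configuration.
    actions.append(f"scale out: p95={p95_ms:.0f} ms")


watch = LatencyWatch(threshold_ms=300.0, window=10, respond=scale_out)
for sample in [100.0] * 10 + [900.0]:
    watch.record(sample)

print(actions)  # ['scale out: p95=900 ms']
```

The response is passed in as a function so that the measurement logic stays separate from whatever action the team decides to automate.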

To reiterate, SRE acts as a bridge unifying the efforts of IT and the business by using all the tools, technologies, and processes available to go beyond just having systems ‘on’ to ensuring they are performing reliably. By adopting SRE into the enterprise architecture, businesses can be proactive in the care of their systems and applications and notice drops or irregularities sooner, so that site reliability engineers can investigate and resolve them before the user experience is impacted.

To learn how to integrate SRE into your business and support the transformation journey to an efficient and scalable digital business, read “The Need for Speed,” a chapter by Julia Renouard in our O’Reilly book, Enterprise Architecture for Digital Business.
