Being ‘on’ is not the measure of performance. Just because power is getting to a lamp doesn’t mean it’s giving the user enough light to reliably see what they’re doing. Many factors beyond power affect whether the user can see. The bulb could be dim, either because it’s dying or because it’s getting power but not enough. If the bulb isn’t getting enough power to stay bright, there could be an issue with the wiring, or a dimmer could be restricting the electricity flow. And even if the bulb is bright, the lamp shade could be too dark, or the space too large for only one light. In short, many factors affect the lamp’s performance and, subsequently, the user’s experience. Similarly, there’s more to assessing the performance and reliability of systems and applications than the traditional measure of uptime. Reliability also depends on the level of service.
Systems and applications are made up of many components (infrastructure, APIs, security, workflows, logic, data, etc.) brought together for a purpose, and just being on doesn’t ensure reliability. As with the lamp, you must be able to evaluate and assess all the components to confirm optimum performance and experience. In a brick-and-mortar business, determining a shortcoming in the level of service can be as easy as walking the ‘customer path’ to assess the overall experience; in a digital business, that can be a significant challenge. With the business and IT silos created by traditional enterprise architectures, identifying an issue and finding its root cause isn’t always easy or efficient. Business leaders may think there’s a problem, but the IT teams managing their components may not if everything is ‘on.’ Site reliability engineering (SRE) is the bridge between the business and IT to ensure the execution of business commitments by means of service level objectives (SLOs).
Site reliability engineering originated at Google in the early 2000s and, according to Google, “is what you get when you treat operations as if it’s a software problem.” In our terms, it’s a set of processes, practices, and tools, as well as a culture and a mindset, employed to create reliable, efficient, and scalable systems that support business objectives.
SRE focuses on systems that are reliable (not just available) and scalable. We add that it’s a mindset and a culture because, as with security, everyone should be expected to contribute positively to quality, reliable systems no matter their role. In practice, SRE is often embedded in a service team that delivers the whole service from end to end. These teams are generally responsible for improving the core system and enabling innovation by monitoring availability, latency, performance, and recovery, while driving continuous improvement through automation and efficiency. In essence, they’re looking at the whole room, not just verifying that the lamp is on.
Site reliability engineering defines the SLOs, and the service level indicators (SLIs) that measure them, needed to meet business outcomes. More simply put, SRE unites the needs and goals of the development, security, and operations teams to reliably deliver on the promises the business has made to its customers.
If the business commitment is that users will reliably have enough light to see what they are doing (service level), an SLO could be that one brightly lit lamp is maintained for every 10 square feet of space (availability). Another SLO could be a defined MTTR (mean time to recover), in this example the time within which dead or dying light bulbs will be replaced. SLIs are the measures that site reliability engineers and IT define and monitor to ensure SLOs are achieved, such as the luminous flux, the electricity flow to each lamp, or the marginal changes in lamp location caused by users bumping or moving them around. In application delivery systems, these could be CPU utilization, API call time, database query time, etc. It’s up to the site reliability engineers to define the SLI measures that impact the business SLOs, the thresholds for each, and the responses, such as adjusting operating policies and configuration, that will be taken when a measure crosses its threshold.
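As a rough illustration of how SLI measures and thresholds relate, the sketch below checks a few readings against their thresholds and reports which ones are out of bounds. Every name and number in it (`Sli`, `breaches`, the 400-lumen floor, the 80% CPU ceiling, the 250 ms API limit) is hypothetical and chosen only for illustration; none of it comes from a particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    """A hypothetical service level indicator with a single threshold."""
    name: str
    threshold: float
    higher_is_better: bool  # True for brightness, False for latency or CPU load

    def breached(self, value: float) -> bool:
        """Return True when a reading is on the wrong side of the threshold."""
        if self.higher_is_better:
            return value < self.threshold
        return value > self.threshold


def breaches(slis, readings):
    """Return the names of SLIs whose latest reading crosses its threshold."""
    return [
        s.name
        for s in slis
        if s.name in readings and s.breached(readings[s.name])
    ]


# Illustrative thresholds only; real values would be derived from the business SLOs.
SLIS = [
    Sli("lamp_lumens", threshold=400.0, higher_is_better=True),
    Sli("cpu_utilization_pct", threshold=80.0, higher_is_better=False),
    Sli("api_call_ms", threshold=250.0, higher_is_better=False),
]

readings = {"lamp_lumens": 320.0, "cpu_utilization_pct": 65.0, "api_call_ms": 410.0}
print(breaches(SLIS, readings))  # ['lamp_lumens', 'api_call_ms']
```

In a real system, each breach would map to a defined response, such as raising an alert or adjusting an operating policy, rather than simply being printed.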
The measures, thresholds, and responses are the intersection of SRE with the other domains of a modern enterprise architecture designed for the application delivery of a digital business. Operational data—telemetry—feeds the observability of the defined measures and thresholds set forth by SRE. Automation is the combined application of tools, technologies, and practices to enable site reliability engineers to scale defined responses with less toil, thus enabling the efficient satisfaction of the SLOs of a digital service. And the system reliability of digital services improves the likelihood of a positive user experience with your digital business.
To reiterate, SRE acts as a bridge unifying the efforts of IT and the business by using all the tools, technologies, and processes available to go beyond just having systems ‘on,’ to also ensuring they are performing reliably. By adopting SRE into the enterprise architecture, businesses can be proactive in the care of their systems and applications and notice drops or irregularities sooner, so that site reliability engineers can investigate and resolve them before the user experience is impacted.
To learn how to integrate SRE into your business and support the transformation journey to an efficient and scalable digital business, read “The Need for Speed,” a chapter by Julia Renouard in our O’Reilly book, Enterprise Architecture for Digital Business.