Fail fast is the mantra of speed today. Whether the driver is DevOps or the business, operating in a digital economy demands uptime as close to perfect as you can get.
While the theory behind this philosophy is sound, in practice the result is often just more failure. By focusing on assuring availability (uptime) instead of finding the root cause (MTTR), we're losing valuable data at unprecedented rates. Face it - when uptime is all you care about, MTTR becomes Mean-Time-to-Reboot instead of Mean-Time-to-Resolution. And without a resolution - a reason for the downtime - you can't prevent it from happening again.
This approach is detrimental to the business.
You see, you aren't dropping packets, you're dropping parts of pennies. And as the classic criminal trope of siphoning fractions of pennies off transactions to build up millions teaches us, every fraction of a penny counts. Every second in which a component, a service, or a server fails to respond, you're losing value - both experiential and existential. Consumers won't stand for poor performance or downtime, and business ledgers can't tolerate them, either.
And if you know anything about throughput and bandwidth, you know that the basis for both calculations lies in the packets per second that can be processed by the underlying system. That's not just true in the network, but for every component that interacts with a transaction. The app. The application services. Routers. Switches. Databases. If it has a network connection, it is bound to this same calculation and constrained by its capacity to pass packets.
WARNING: MATHS AHEAD
The speed of today's networks ensures that we're passing those packets at a rate of millions per second. Business transactions, of course, are primarily conducted via a (literal) web of HTTP transactions, each one carrying information crucial to conducting business. The number of packets required to conduct a transaction depends on the amount of data involved. A standard packet carries at most 1500 bytes of data (that's the MTU). So if an HTTP-based message carrying a JSON payload that represents a transaction requires 4500 bytes (after encryption, of course), that's about three packets. Let's be generous and say a typical digital business transaction requires five packets. A 10Gbps network can process just under 15 million packets per second at its theoretical line rate. Assuming enough compute capacity is available, that works out to roughly 3 million transactions per second. Let's assume every transaction is worth a fraction (one-third) of a penny. That's about $10,000 flowing through the network every second.
Now, no one actually processes transactions at that speed or volume. Even Visa - which inarguably processes data at rates most enterprises don't require - claims its capacity is about 24,000 transactions per second. At the same value per transaction - one-third of a penny - that's still $80 every second, or nearly $7 million a day.
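If you want to check the back-of-envelope math yourself, here's a minimal sketch in Python. The inputs - minimum-size packets for the theoretical line-rate ceiling, five packets per transaction, one-third of a penny per transaction, and Visa's claimed 24,000 transactions per second - are the same illustrative assumptions used above, not measurements.

```python
# Back-of-envelope: how much value moves through a 10Gbps link per second,
# using the same illustrative assumptions as the paragraphs above.

LINK_BPS = 10_000_000_000      # 10 Gbps
MIN_FRAME_BYTES = 84           # 64-byte minimum frame + ~20 bytes of framing overhead
PACKETS_PER_TXN = 5            # the "generous" assumption above
VALUE_PER_TXN = 0.01 / 3       # one-third of a penny, in dollars

# Theoretical line rate with minimum-size packets: just under 15M packets/second
pps = LINK_BPS / (MIN_FRAME_BYTES * 8)
txns_per_sec = pps / PACKETS_PER_TXN

print(f"{pps / 1e6:.1f}M packets/s -> {txns_per_sec / 1e6:.1f}M txns/s "
      f"-> ${txns_per_sec * VALUE_PER_TXN:,.0f} per second at stake")

# Visa's claimed capacity of ~24,000 transactions/second, at the same value per txn
visa_tps = 24_000
print(f"Visa: {visa_tps:,} txns/s -> ${visa_tps * VALUE_PER_TXN:,.2f} per second "
      f"(~${visa_tps * VALUE_PER_TXN * 86_400:,.0f} per day)")
```

Swap in your own packet counts and per-transaction values; the point isn't the exact dollar figure, it's that every second a packet isn't passed has one.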
The point being that failure in the transaction chain of routers, switches, network and application service infrastructure, app infrastructure, and components is A Very Bad Thing™. It's costly, because a failure means packets aren't being processed, and neither are the pennies they represent. And there is no part of the digital economy that does not rely on packets being passed.
The answer thus far is in the "fail fast" mantra - just spin up a new instance of X or Y or Z or whatever component failed. But that component failed *for a reason* and it is of the utmost importance that the reason is uncovered and addressed. Quickly. Because there are still expensive seconds between failure and restoration that cost business value. If it failed once, it's likely to fail again. And again.
VISIBILITY: THE KEYSTONE ON WHICH MTTR RESTS
This is why visibility is so critical to success in the digital economy. Because it is visibility that enables all the ops teams to find and remediate the cause of failure. Unfortunately, it is visibility that is often sacrificed for speed. Not the literal speed of transactions, but time to value. In our rush to get apps to market faster and more frequently, we have not adequately invested in enabling the visibility necessary to mitigate failure.
In fact, one might argue that the "fail fast" philosophy of DevOps is a response to that failure. Without the ability to find and address the cause of failure, DevOps has determined it's better to restore availability than waste time. That ability is growing more and more elusive as organizations adopt multi-cloud approaches to deploying applications.
In 2018, visibility was cited as a multi-cloud challenge by fewer than one-third (31%). In 2019, that jumped to more than one-third (39%), tying with performance and security as the top challenges of multi-cloud. Visibility is a critical component of the overarching "observability" that brings together monitoring, analytics, and alerting to provide valuable insights into the state of a system at any time. That's particularly important during a failure, because the state of the system is spread across multiple IT fiefdoms that may or may not enable the sharing of information that can quickly lead to a resolution instead of just a reboot.
The ability of a service mesh to add value through distributed tracing is an excellent example of enabling visibility. But we need to extend that to include the entire chain of application services that scale and secure the applications executing in a containerized world. And that includes distributed components and applications running in public cloud that may be part of the execution chain. Visibility across environments, infrastructure, and applications is required to find and address issues that cause downtime or poor performance.
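To make that concrete, here's a minimal, hypothetical sketch of app-level distributed tracing using the OpenTelemetry Python SDK. The service, span, and attribute names are invented for illustration, and in a service mesh the sidecar proxies would typically propagate the trace context between services so every hop lands in a single trace.

```python
# A minimal sketch of application-level distributed tracing with the
# OpenTelemetry Python SDK (pip install opentelemetry-sdk). Names are
# hypothetical; a real deployment would export spans to a collector,
# not the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that exports every finished span.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def process_order(order_id: str) -> None:
    # Each hop in the transaction chain contributes a span, so a slow or
    # failing component shows up in the trace instead of hiding behind a reboot.
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-payment"):
            pass  # call out to the (hypothetical) payment service here

process_order("12345")
```

Spans like these, correlated with metrics and logs from the surrounding infrastructure and application services, are what turn "something is down" into "this component failed, and here's why."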
Visibility is imperative if organizations are to return to measuring success by MTTResolution rather than MTTReboot.