Single points of failure are the bane of engineering, and engineers put great effort into eliminating them from the systems they design. Increasingly, however, companies are handing over large parts of their IT infrastructure and application portfolios to third-party providers. This reveals an interesting form of the single point of failure: if an organization uses one cloud platform or managed service exclusively, that provider becomes the very weak point the organization was trying to avoid.
N+1, or Is It?
The general approach to eliminating single points of failure is to implement at least one of the following:
- Have more than N of a thing. If you need Internet connectivity, use two different providers coming into your data center on two different circuits, ideally through different entry points in your building. That way, if someone manages to accidentally dig up one of your fiber cables, you still have another to use.
- Have more than N capacity. If you typically use 100 Mbps of bandwidth, provision an extra X%, so that if your needs rise unexpectedly, you have some buffer.
- Have an N2. If your Internet is fiber, have another connection via satellite or some other medium. If one of them goes down, the other may keep you connected. Alternatively, have an entire additional backup site, a hot spare, that you can use if your main site is destroyed, for example, in an earthquake.
While the specifics of how this looks for a given company will vary, these general methods are the ways organizations have kept up with growth and avoided outages for decades.
They are, unfortunately, all very expensive. They increase complexity in our environments, require a lot of staff, and rely heavily on capital expenditure (CapEx). They do work, although they are often not economically feasible.
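As a rough illustration of the first two approaches, the following Python sketch estimates the availability gained from redundant components and the capacity to provision for a given headroom. The function names are hypothetical, and the availability formula assumes that failures are independent, which shared dependencies (a common conduit, a common upstream carrier) can easily violate.

```python
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that at least one redundant component is up.

    Assumption: components fail independently of one another.
    Each availability is a fraction between 0 and 1 (0.99 means 99%).
    """
    p_all_down = 1.0
    for a in availabilities:
        # The whole set is down only if every component is down.
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def provisioned_capacity(typical_mbps: float, headroom_pct: float) -> float:
    """Capacity to provision: typical usage plus a percentage of headroom."""
    return typical_mbps * (1 + headroom_pct / 100)


if __name__ == "__main__":
    # Two circuits from different providers, each assumed to be 99% available.
    print(f"{combined_availability([0.99, 0.99]):.4%}")
    # 100 Mbps typical usage with 50% headroom (the 50% is an arbitrary example).
    print(provisioned_capacity(100, 50))
```

Under these assumptions, two 99% circuits together come to about 99.99%, and 100 Mbps with 50% headroom means provisioning 150 Mbps. The gain from the second circuit shrinks quickly if the two circuits share a failure mode.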
As Much N as You Need, and None That You Don’t!
The expense and complexity of building scalable, reliable environments are the reason why the cloud in general and Software as a Service (SaaS), along with Functions as a Service (FaaS), Infrastructure as a Service (IaaS), and all the other services that go along with them, have become so popular. They allow you to change CapEx into operational expenditure (OpEx), where you pay only for what you use, or you pay a relatively small subscription fee for a service at an established rate.
The cloud and SaaS allow you to scale up or down as needed and to leverage the infrastructure of huge cloud providers for fast, scalable, reliable, secure, and often less expensive service for your company. You don’t have to hire as many staff, you don’t need to spend oodles of money on servers and networking gear, and with a few cloud-savvy engineers, you can go from a small startup to a multinational presence in almost no time. You can even hand off running a lot of the applications you need to external vendors: email, teleconferencing, communication systems, customer management, even human resources.
Behind the scenes, those third parties still need to do all those things mentioned earlier: massively provision, ensure redundancy, and maintain distinct systems. Ideally, they can do that better than you could by leveraging economies of scale.
If things go wrong, you have a service-level agreement and a contract. If they screw up, well, hopefully it won’t be too bad, and you might get some of your money back. And since providing these services is their business, you can believe they’re going to take it seriously. Moving to a provider shifts risk from things under your control to things outside of your control, but it often makes a great deal of sense to do so.
N = 0
As businesses with “as a Service” offerings proliferate, there are going to be clear winners: companies that dominate the market and capture large sections of it. As of July 2021, Amazon Web Services (AWS) has a 32% share of the Platform as a Service, IaaS, and hosted private cloud markets, followed by Microsoft Azure at 20%. Google Cloud sits in third place with 9%.1
Additionally, as the market matures, mergers and acquisitions will tend to concentrate market share in a few companies. Depending on other factors, such as region, country, or industry vertical, some companies may reach near 100% market share, becoming the only game in town.
While this is not new, it does expose a distinct issue. If an organization relies on one provider, a problem with that provider will affect the company, which will have little recourse other than to wait it out.
One might argue that such outages should be avoidable, and they often are. But there will always be edge cases and unexpected events. AWS’s hours-long control plane interruption is only one recent example.2
N+1 All Over Again
Ultimately, companies face a decision that involves tradeoffs. The advantages of moving from a CapEx model to an OpEx model are too significant to ignore entirely. But relying on one provider comes with new risks.
Perhaps the most balanced view would be to find a way to have redundancy in providers again, like the redundancy companies used to have in their on-premises environments, at least for those services that cannot be allowed to fail. Running duplicate applications in more than one public cloud is difficult but achievable with containerization and microservices. While taking this approach can provide redundancy, it also puts some of the benefits of the public cloud out of reach, and it may also cost more. As an example, cloud service providers want you to run your code in their FaaS frameworks, using their managed data stores, and leveraging their API gateways. Their pricing models reflect this, and arguably, a "provider native" approach will provide higher performance. Given these issues, a fully redundant multi-cloud approach might be best limited to those services that absolutely must remain available.
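One way to picture provider redundancy for a critical service is a client that tries each deployment in order and falls back when one fails. The sketch below is a minimal illustration, not a production pattern: the provider names and the callables are hypothetical stand-ins for real requests to deployments in different clouds, and a real system would also need timeouts, health checks, and data replication between providers.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return the first success.

    Returns the name of the provider that answered and its response.
    """
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # any failure triggers the next provider
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def cloud_a() -> str:
        # Hypothetical primary deployment, simulated as being down.
        raise ConnectionError("control plane unavailable")

    def cloud_b() -> str:
        # Hypothetical secondary deployment in a different provider.
        return "ok"

    print(call_with_failover([("cloud_a", cloud_a), ("cloud_b", cloud_b)]))
```

The ordering of the list encodes which provider is preferred. Keeping the service portable enough to answer from either provider is where most of the cost described above comes from.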
N = ?
For less critical pieces, an approach (that may already be happening organically) might be to run each application in the cloud provider that suits it best. This can limit the impact of an outage to a subset of noncritical applications if the overall organization of the application portfolio is done with some care.
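To reason about how far a single provider outage would reach, it can help to write the application portfolio down as data. The following sketch assumes a hypothetical mapping from application name to the set of providers it runs on; the application and provider names are invented for illustration.

```python
from typing import Dict, Set


def apps_affected(placement: Dict[str, Set[str]], failed_provider: str) -> Set[str]:
    """Return the applications that go down when one provider fails.

    An application is counted as down only if it runs on the failed
    provider and on no other provider.
    """
    return {
        app
        for app, providers in placement.items()
        if providers == {failed_provider}
    }


if __name__ == "__main__":
    # Hypothetical portfolio: one critical app is redundant across two
    # providers, the less critical ones each run in a single provider.
    placement = {
        "billing": {"provider_a", "provider_b"},
        "wiki": {"provider_a"},
        "ci": {"provider_b"},
    }
    print(apps_affected(placement, "provider_a"))
```

With this portfolio, an outage at `provider_a` takes down only `wiki`, while `billing` keeps running on `provider_b`. The same table makes it easy to spot a provider whose failure would take out too many applications at once.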
Whatever the approach taken, it will come at some cost in money, engineering complexity, and staffing. Hopefully, such an approach will not be as costly as doing it all yourselves. But we are going to have to keep engineering, keep finding those single points of failure, and keep coming up with better approaches to minimize them, even as we shift the locations of those failures outside of our own organizations and onto third-party providers.