Is DevOps ‘Build to Fail’ Philosophy a Security Risk?

Lori MacVittie
Published February 08, 2018

Breaking Betteridge’s Law of Headlines, the short answer is yes. But as with all things involving technology today, the long answer is a bit more involved than that.

DevOps has become, I think, fairly pervasive across all industries. While not every organization adopts every aspect of the approach, or applies the same zealous adherence to its principles as, say, Netflix, it’s definitely a ‘thing’ that’s happening.

While not directly a proof point, when we asked the more than 3000 respondents how digital transformation was impacting their application decisions for our State of Application Delivery 2018, two of the top three answers were ‘employing automation and orchestration to IT systems and processes’ and ‘changing how we develop applications (for example, moving to agile)’. To me, both are inferred reactions to adopting at least portions of a DevOps approach to developing and delivering applications in modern architectures.

So, if orgs are adopting some of the tools and techniques related to DevOps, one might assume they’re also adopting others. One of those might even be (cue dramatic music): building to fail.

Now, that phrase is somewhat imprecise, as no one sits around designing systems to fail. What they do do, however, is design systems that are resilient to failure. That means, for example, if an instance (server) crashes, the system should be able to automatically handle the situation by removing the dead instance and starting a new one to take its place.

Voila! Built to fail.

And while this is certainly a desirable reaction – particularly when a system is under heavy load and demand – there is a risk in the approach that needs to be considered and, one hopes, subsequently addressed.

Consider the Cloudflare vulnerability of early 2017. Cloudflare – which has been admirably transparent in its own reporting of the issue – notes that basically, the problem was a memory leak (resulting in potential data leakage) caused by a defect in an extension of an HTTP parser. Long story short, a bug caused a memory leak, which caused instances to crash. Those instances were killed and restarted because, built to fail.

For the record, this isn’t a ‘bash Cloudflare for a bug’ post. As a developer, I am highly sympathetic to having one’s defects exposed so publicly. I am less sympathetic in situations where there’s little regard for discovering why something is crashing or leaking memory or just outright failing.

Which is the point of today’s post. Because sometimes the DevOps philosophy leaves its adherents with a laissez-faire approach to post-failure investigation.

It is perfectly reasonable to react to a service/app failure by killing and restarting the service to ensure availability – as long as you then investigate the crash to determine what caused it. Apps don’t crash for no reason. If it fell over, something pushed it. Nine times out of ten, it’s likely a non-exploitable error. Nothing to write a blog post about. But the one time it’s a serious vulnerability waiting to be exploited makes it worth the ostensibly wasted effort on the other nine. Because that is something to write a blog post about.
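To make "restart it, but keep the evidence" concrete, here is a minimal sketch in Python. It is not how any particular orchestrator does it; `CrashRecord` and `run_with_restart` are hypothetical names invented for illustration, and a real system would ship the record to a log store or ticketing system rather than keep it in a list.

```python
import subprocess
import sys
import time
from dataclasses import dataclass, field


@dataclass
class CrashRecord:
    """Evidence kept from one abnormal exit so a human can investigate later."""
    command: list
    exit_code: int          # negative on POSIX means "killed by signal -N"
    stderr_tail: str        # last part of stderr, often where the stack trace is
    timestamp: float = field(default_factory=time.time)


def run_with_restart(command, max_restarts=3, crash_log=None):
    """Run `command`, restarting it after each crash, up to `max_restarts` times.

    Every non-zero exit is appended to `crash_log` before the restart, so
    availability is restored without throwing away the evidence.
    """
    crash_log = [] if crash_log is None else crash_log
    for _attempt in range(max_restarts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            break  # clean exit: nothing to investigate, nothing to restart
        crash_log.append(
            CrashRecord(
                command=list(command),
                exit_code=result.returncode,
                stderr_tail=result.stderr[-2000:],
            )
        )
    return crash_log


if __name__ == "__main__":
    # A stand-in "service" (hypothetical) that always crashes with exit code 3.
    flaky = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    for record in run_with_restart(flaky, max_restarts=2):
        print(record.exit_code, record.stderr_tail)
```

The design choice worth noticing is the order of operations: the crash is recorded before the replacement starts, so the restart never erases the reason it was needed.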

It is not reasonable to ignore it.

Monitoring and alerting on failures and other issues is also a key component of a well-rounded DevOps program. That’s the “S” in the CAMS that make up a holistic DevOps approach: Culture, Automation, Measurement and Sharing. Damon Edwards and John Willis (who coined the acronym back in 2010) were not just talking about pizza and beer (though that’s a good way to encourage the “C” for Culture of DevOps). It’s also about data and the state of systems. It’s about ensuring that those who can benefit from knowing, know. And that includes a failure in the system.

A failure – particularly a crash – should not go unchecked. If a system in the pipeline crashes, someone should know about it and someone ought to check it out. To ignore it is a security risk. Worse, it’s an avoidable risk because it’s your environment, your systems, and your code. You have complete control, and thus no excuse to ignore it.
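One way to picture "someone should know about it" is a small piece of bookkeeping that never stays silent: every crash produces at least a notification, and repeated crashes of the same service inside a short window get escalated. The sketch below assumes that policy; `CrashMonitor`, its threshold, and the window length are hypothetical choices for illustration, not a recommendation from any monitoring product.

```python
from collections import deque


class CrashMonitor:
    """Tracks recent crashes and decides how loudly to tell someone."""

    def __init__(self, threshold=3, window_seconds=300.0):
        self.threshold = threshold            # crashes in the window that trigger escalation
        self.window_seconds = window_seconds  # how far back "recent" reaches
        self._crashes = deque()               # (timestamp, service) pairs, oldest first

    def record_crash(self, service, now):
        """Record a crash at time `now` (seconds) and return the alert level.

        Returns "notify" for any crash and "escalate" once `service` has
        crashed `threshold` or more times within the window. No crash is
        ever ignored.
        """
        self._crashes.append((now, service))
        # Drop crashes that have aged out of the window.
        while self._crashes and now - self._crashes[0][0] > self.window_seconds:
            self._crashes.popleft()
        recent = sum(1 for _, name in self._crashes if name == service)
        return "escalate" if recent >= self.threshold else "notify"


if __name__ == "__main__":
    monitor = CrashMonitor(threshold=3, window_seconds=60.0)
    for t in (0.0, 10.0, 20.0):
        print(t, monitor.record_crash("api", now=t))
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the sketch easy to test; a real deployment would more likely feed it timestamps from the event that reported the crash.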

So yes, in a nutshell, ‘build to fail’ can expose your apps – and business – to security risks. The good news is those risks are completely manageable, as long as you ensure that a philosophy that reads ‘built to fail’ on paper doesn’t wind up as ‘built to ignore a fail’ in practice.

Pay attention to things that crash – even if you restart them to keep availability high. You may save yourself (and your business) from trending on Twitter for all the wrong reasons.