
The Art of Scaling Containers: Retries

Lori MacVittie
Published January 11, 2018

Scaling containers is more than just slapping a proxy in front of a service and walking away. There’s more to scale than just distribution, and in the fast-paced world of containers there are five distinct capabilities required to ensure scale: retries, circuit breakers, discovery, distribution, and monitoring.

In this post on the art of scaling containers, we’ll dig into retries.

Retries

When you’re a kid playing a game, the concept of a retry is common to a lot of games. “Do over!” is commonly called out after a failure, in the hopes that the other players will let you try again. Sometimes they do. Sometimes they don’t. That rarely stops a kid from trying, I’ve noticed. 

When scaling apps (or services if you prefer), the concept of a retry is much the same. A proxy, upon choosing a service and attempting to fulfill a request, receives an error. In basic load balancing scenarios this is typically determined by examining the HTTP response code. Anything other than a “200 OK” is an error. That includes network and TCP-layer timeouts. The load balancer can either blindly return the failed response to the requestor or, if it’s smart enough, it can retry the request in the hopes that a subsequent request will result in a successful response.
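To make that concrete, here is a minimal sketch of what a retrying proxy does, written in Python purely for illustration. The upstream addresses, the timeout, and the helper name are all hypothetical – real proxies implement this logic internally and make the behavior configurable.

    # Illustrative only: try each upstream in turn, treating anything other
    # than a 200 OK (or a network/TCP-level failure) as a reason to retry.
    import requests

    UPSTREAMS = ["http://app-1:8080", "http://app-2:8080", "http://app-3:8080"]

    def fetch_with_retry(path):
        last_error = None
        for base in UPSTREAMS:
            try:
                response = requests.get(base + path, timeout=2)
                if response.status_code == 200:
                    return response                    # success: hand it back to the requestor
                last_error = f"{base} returned {response.status_code}"
            except requests.RequestException as exc:   # timeout or connection failure
                last_error = f"{base} failed: {exc}"
        raise RuntimeError(f"all upstreams failed; last error: {last_error}")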

This sounds pretty basic, but in the beginning of scale there was no such thing as a retry. If it failed, it failed and we dealt with it. Usually by manually trying to reload from the browser. Eventually, proxies became smart enough to perform retries on their own, saving many a keyboard from wearing out the “CTRL” and “R” buttons.

On the surface, this looks like a textbook example of the definition of insanity. After all, if the request failed the first time, why should we expect it to succeed the second (or even the third) time?

The answer lies in the reason for the failure. When scaling apps, it is important to understand the impact connection capacity has on failures. The load on a given resource at any given time is not fixed. Connections are constantly being opened and closed. The underlying web app platform – whether Apache or IIS, a Node.js engine or some other stack – has defined constraints in terms of capacity. It can only maintain X number of concurrent connections. When that limit is reached, attempts to open new connections will hang or fail.

If this is the cause of a failure, then in the microseconds it took for the proxy to receive a failed response a different connection may have closed, thereby opening up the opportunity for a retry to be successful.
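A toy model of that connection ceiling, and of why a retry moments later can succeed, might look like the sketch below. The limit, the 503 response, and the handler are made up for illustration; in practice the ceiling comes from the platform’s worker or thread pool configuration.

    import threading

    MAX_CONCURRENT = 100                        # the platform's (hypothetical) connection ceiling
    slots = threading.BoundedSemaphore(MAX_CONCURRENT)

    def handle_request(do_work):
        # If every slot is taken, the request fails immediately: this is the
        # error the proxy sees.
        if not slots.acquire(blocking=False):
            return 503
        try:
            do_work()
            return 200
        finally:
            slots.release()                     # a slot frees up; a retry arriving now can succeed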

In some cases, a failure is internal to the platform. The dreaded “500 Internal Server Error”. This status is often seen when the server is not overloaded but has made a call to another (external) service that failed to respond in time. Sometimes this is the result of a database connection pool reaching its limits. The reliance on external services can result in a cascading chain of errors that, like a connection overload, is often resolved by the time you receive the initial error.

You might also see the oh-so-helpful “503 Service Unavailable” error. This might be in response to an overload, but as is the case with all HTTP error codes, it is not always a good predictor of what is actually going wrong. You might see it in response to a DNS failure or, in the case of IIS and ASP, a full queue. The possibilities are really quite varied. And again, the underlying condition might be resolved by the time you receive the error, so a retry should definitely be your first response.
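One practical consequence: a retry policy usually keys off a small set of status codes that tend to indicate a transient, server-side condition. The set below is a common choice, not a standard – adjust it to what your services actually return.

    # Illustrative classification: retry 5xx conditions that are often transient,
    # but not 4xx errors, which are the client's fault and will fail the same way again.
    RETRYABLE_STATUSES = {500, 502, 503, 504}   # internal error, bad gateway, unavailable, gateway timeout

    def should_retry(status_code):
        return status_code in RETRYABLE_STATUSES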

Of course you can’t just retry until the cows come home. Like the unintended consequences of TCP retransmission – which can overload networks and overwhelm receivers – retries eventually become futile. There is no hard and fast rule regarding how many times you should retry before giving up, but between 3 and 5 is common.
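A bounded retry policy with a short, growing delay between attempts might look like this sketch. The attempt count and delays are illustrative, not prescriptive.

    import time

    MAX_ATTEMPTS = 3            # somewhere in the common 3-to-5 range
    BASE_DELAY = 0.05           # 50 ms before the first retry

    def retry_with_backoff(send_request):
        for attempt in range(MAX_ATTEMPTS):
            response = send_request()
            if response.status_code == 200:
                return response
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(BASE_DELAY * (2 ** attempt))   # back off: 50 ms, then 100 ms
        return response         # give up and let the circuit breaker take over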

At that point, it’s time to send regrets to the requestor and initiate a circuit breaker. We’ll dig into that capability in our next post.