Scaling containers is more than just slapping a proxy in front of a service and walking away. There’s more to scale than just distribution, and in the fast-paced world of containers there are five distinct capabilities required to ensure scale: retries, circuit breakers, discovery, distribution, and monitoring.
In this second post on the art of scaling containers, we’ll dig into circuit breakers.
Thomas Edison, who famously invented thousands of gadgets that didn’t work and a few famous ones that did, gave us the concept of a circuit breaker back in an 1879 patent application. Yes, patents were a thing even then. While Edison’s version used fuses, which must be replaced (some of us might remember frantically searching for them in our old-timey cars), more modern versions are built to “trip” and stop the flow of electricity. They can then be reset, restoring normal flow and operation.
In the context of scale, circuit breakers operate on the same principle. They detect an “overflow” and deliberately cut it off to avoid overwhelming services on the other end of the connection. They can also be reset, subsequently restoring normal flow of requests and responses.
Circuit breakers have been a part of load balancing proxies for quite some time. The premise has been that if, after X tries, you still can’t reach a given service, it’s out of commission. There’s no reason to keep asking it for something it can’t give you, and doing so only wastes resources on the proxy and the network. So after a (typically) configurable number of failures, a proxy will “break” the circuit and refuse to attempt further connections.
This is not the same as a retry, though the process appears similar. Retries operate on the premise that the request will eventually succeed. A circuit breaker operates on the premise that the request will fail, so spending more time and resources on further attempts should be avoided.
Once the problem has been resolved, the circuit breaker can be “reset” and normal flow can resume.
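To make the trip-and-reset behavior concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular proxy implements it: the names CircuitBreaker, CircuitOpenError, and max_failures are hypothetical, and counting consecutive failures is just one possible policy.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt a call."""


class CircuitBreaker:
    """Hypothetical breaker that trips after N consecutive failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures  # assumed, configurable threshold
        self.failures = 0
        self.is_open = False

    def call(self, func, *args, **kwargs):
        # While the circuit is open, fail fast without touching the service.
        if self.is_open:
            raise CircuitOpenError("circuit is open; call not attempted")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.is_open = True  # "trip" the breaker
            raise
        self.failures = 0  # a success clears the consecutive-failure count
        return result

    def reset(self):
        # Manual reset, as an operator would do once the service is back.
        self.failures = 0
        self.is_open = False
```

Note the difference from a retry loop: once tripped, the breaker does not attempt the call at all, so no further time is spent waiting on a service presumed to be down.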
In the early days, this process was manual. An operator was required to perform the reset after ensuring that the target service was indeed back in service. In more recent years, this process has become automated through the use of health monitoring. Typically, the proxy makes periodic attempts to reach the service and, upon success, resets the circuit breaker to allow normal operations once again.
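The automated reset can be sketched in Python as well. This is a simplified illustration: AutoResetBreaker, health_check, and probe_interval are hypothetical names, and the health probe here runs lazily when a call arrives, whereas a real proxy would more likely probe on a background timer.

```python
import time


class AutoResetBreaker:
    """Hypothetical breaker that resets itself after a successful health check."""

    def __init__(self, health_check, max_failures=3, probe_interval=5.0,
                 clock=time.monotonic):
        self.health_check = health_check      # assumed callable: True when healthy
        self.max_failures = max_failures
        self.probe_interval = probe_interval  # seconds to wait between probes
        self.clock = clock                    # injectable so it can be tested
        self.failures = 0
        self.opened_at = None                 # None means the circuit is closed

    @property
    def is_open(self):
        return self.opened_at is not None

    def _maybe_reset(self):
        if not self.is_open:
            return
        if self.clock() - self.opened_at < self.probe_interval:
            return  # too soon to probe again
        if self.health_check():
            self.failures = 0
            self.opened_at = None             # service is back: close the circuit
        else:
            self.opened_at = self.clock()     # still down: wait another interval

    def call(self, func, *args, **kwargs):
        self._maybe_reset()
        if self.is_open:
            raise RuntimeError("circuit is open; call not attempted")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip and record when
            raise
        self.failures = 0
        return result
```

The probe interval is the knob to watch: a short interval restores service sooner but sends more probe traffic to a service that may still be struggling.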
Circuit breakers are particularly important in a containerized, microservices setting because of the high volume of traffic flowing to, from, and between services. While some failures are recognized quickly, others go unnoticed until a lengthy TCP timeout expires because of issues in the network stack. Timeouts incur undesirable latency, so circuit breaking and retry settings should take into consideration the application’s overall tolerance (or intolerance, as the case may be) for latency. Configuring those values means weighing timeout values against the business’s expectations with respect to performance. A low tolerance for latency may require fewer retries and faster circuit breaking behavior.
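A quick back-of-the-envelope calculation shows how timeouts and attempt counts interact with a latency budget. The Python sketch below assumes the worst case, where every attempt runs to its full timeout, plus an optional fixed pause between attempts. The function names and the numbers in the usage note are illustrative, not recommendations.

```python
def worst_case_latency(timeout_s, attempts, backoff_s=0.0):
    """Seconds spent if every attempt runs to its timeout before giving up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # One pause between each pair of consecutive attempts.
    return attempts * timeout_s + (attempts - 1) * backoff_s


def max_attempts_within(budget_s, timeout_s, backoff_s=0.0):
    """Largest number of attempts whose worst case still fits the budget."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    if backoff_s < 0:
        raise ValueError("backoff_s must not be negative")
    attempts = 0
    while worst_case_latency(timeout_s, attempts + 1, backoff_s) <= budget_s:
        attempts += 1
    return attempts
```

For example, with a 2-second timeout and a 0.5-second pause, a 5-second latency budget leaves room for only two attempts (2 + 0.5 + 2 = 4.5 seconds); a third would push the worst case to 7 seconds. A breaker threshold higher than that would never trip within the budget.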