Distributed Circuit Breakers ✈︎

We live in an ever-evolving universe where the only constant is change. We at Hipmunk are especially sensitive to this fact of life. As a meta-search site for flights, hotels, cars and packages, the successful functioning of our site relies on the success rates of our API calls to our partners and data providers.

In the travel industry, the freshness of results is very important. Whether we're talking about hotels, flights or any other travel-related product information, pricing or availability data cannot reliably be cached (beyond, say, a few minutes, to debounce page refreshes). We ensure that our users are shown the most accurate results by retrieving fresh data at every search request.

We also flesh out this raw results data with richer product-related data from other third-party services. We might connect to APIs to geocode locations, get weather information, or determine extra amenities, ratings and images; that is not an exhaustive list, and we're adding more all the time.

Our reliance on external services exposes us to the risk of any one of them failing or timing out. In the past, we would experience a third-party failure as a burst of warnings in our logs. Of course, our search-runner infrastructure is designed to continue with as much of the user's query as possible and gracefully degrade the user experience within our products.

However, coding around dependency failures required a large amount of boilerplate. Furthermore, imagine a scenario where one of our providers goes down at 5 a.m. The resulting flood of warnings in the logs cannot be ignored and pages whoever is on call, because amidst this noise we could miss something more critical.

For the engineer on call responding to such a page, the standard playbook is to log in to our control panel and disable the offending service. In most cases, issues with the upstream provider naturally resolve themselves after a few hours, at which point she would verify the service's health and re-enable it.

Although this process is as easy as a few clicks within an internal tool at Hipmunk, this is a manual endeavour that can be frustrating for the person on call. Read on to find out how Circuit Breakers help us automate these steps, freeing up valuable developer time and leading to page-free sleep.

Introduction to Circuit Breakers

What is a circuit breaker? It's a guard on all calls to a distinct service. This guard monitors invocations to the service, and tracks their successes and failures. Logically, the guard state can be viewed as a state machine.

Now, the conventional literature on Circuit Breakers is awash with terms like "open", "closed" and "half-open", which are inspired by the metaphor of an actual solenoid circuit breaker that trips to open a circuit.

A diagram of the inside of a (real) circuit breaker.
Image source: Wikimedia.org. Used under the Creative Commons Attribution ShareAlike 2.5 Generic Licence.

We at Hipmunk found this terminology to be more confusing than helpful in communicating and reasoning about the state of a circuit. Calling a circuit "open" if it has failed and "closed" if it is working makes sense if you really buy into the circuit metaphor. It's equally possible for developers unfamiliar with the literature to subconsciously carry a valve metaphor instead, and expect exactly the reverse to be true.

So, we came up with the following terminology for the states of the state machine.

Healthy: This is the "default" state for a Circuit Breaker. A Healthy circuit represents a service that is responding successfully to a sufficiently high proportion of requests. We're confident in sending that service 100% of our request load.

Unhealthy: The mirror opposite of Healthy. An Unhealthy circuit has recently had too many failures, and all remote calls to this service are disabled. Any code which attempts to call this service will instantly receive a fallback value.

Forced Healthy and Forced Unhealthy: These are two auxiliary states that we added to allow us manual, fine-grained control over services. Developers can manually use these states if there are conditions where they want to override the normal behaviour of the monitor. The intent here is not to use these as feature flags for services. Instead, we can use these states to more accurately adapt to temporary conditions.

Recovering: When a circuit is in the recovering state, it means that it has spent enough time in the unhealthy state, and now we want the circuit to attempt to self-heal. To fully understand how recovery works, we need to understand the transitions between the various states.
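
For illustration, these five states map naturally onto a small enumeration. The names below follow our terminology, but the actual constants (and their string values) in our codebase may differ:

from enum import Enum

class CircuitState(Enum):
    # Illustrative names for the five circuit states; the real constants
    # in our codebase may be spelled differently.
    HEALTHY = 'healthy'
    UNHEALTHY = 'unhealthy'
    FORCED_HEALTHY = 'forced_healthy'
    FORCED_UNHEALTHY = 'forced_unhealthy'
    RECOVERING = 'recovering'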

The states of a Circuit Breaker, and the transitions between them

The easiest transitions to understand are between healthy and the forced states, since they are always prompted by the input of an engineer at Hipmunk. An engineer can force a circuit from any state to forced-healthy or forced-unhealthy (the transition lines from unhealthy and recovering to the forced states have been left out of the diagram for simplicity). A forced state can always be reset to healthy.

Next comes the transition from healthy to unhealthy. When the ratio of failed calls to successful ones exceeds a critical threshold, the decision is made to transition the service to the 'unhealthy' state.
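
As a rough sketch of that trip condition (the threshold and the minimum sample size here are made-up numbers, not our production defaults):

# Illustrative trip check; both constants are assumptions, not Hipmunk's defaults.
FAILURE_TO_SUCCESS_RATIO_THRESHOLD = 1.0
MIN_REQUESTS_IN_WINDOW = 20

def should_trip(successes, failures):
    total = successes + failures
    if total < MIN_REQUESTS_IN_WINDOW:
        return False  # too little traffic to judge the service fairly
    if successes == 0:
        return True   # every recent call failed
    return failures / float(successes) > FAILURE_TO_SUCCESS_RATIO_THRESHOLD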

After a certain number of requests has been skipped, or enough time has passed in the 'unhealthy' state, a circuit automatically transitions into the recovering state. Here, we once again route live traffic through this circuit, but only for a small number of requests.

The success of these live requests is used to judge if the service downtime has been resolved. The margin of proof to make this judgement is higher than the level we require to take a service from healthy to unhealthy. This prevents most services from "flapping" by transitioning from healthy to unhealthy to recovering to healthy due to random variation in their results. If the service meets this high bar, the circuit transitions to healthy again, and the incident has ended.

Conversely, if the service is still undergoing difficulties, we send the circuit back to the Unhealthy state. We use a backoff algorithm to ensure that we wait longer than the previous time before we check the service again.
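
A minimal sketch of both ideas, the stricter success bar and the backoff between recovery attempts, might look like this (the constants and function names are ours, not taken from our codebase):

# Illustrative recovery logic; the constants are assumptions.
RECOVERY_FAILURE_TO_SUCCESS_RATIO = 0.1  # much stricter than the trip threshold above
INITIAL_UNHEALTHY_WAIT_SECONDS = 60

def next_unhealthy_wait(previous_wait_seconds):
    # Exponential backoff: each failed recovery attempt doubles the wait,
    # capped here (arbitrarily) at one hour.
    return min(previous_wait_seconds * 2, 3600)

def recovery_succeeded(successes, failures):
    if successes == 0:
        return False
    return failures / float(successes) <= RECOVERY_FAILURE_TO_SUCCESS_RATIO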

Architecture of the Circuit Breaker System

The architecture of our Circuit Breakers system is informed by the general architecture of Hipmunk's backend services. We wanted to build circuit breakers while reusing many of the components already present in our infrastructure. For this reason, it makes sense for us to start our discussion with a broad view of Hipmunk's architecture. If you want to get a deeper view of the Hipmunk tech stack, check out this article by our VP of Engineering, Navin Lal.

Hipmunk's backend servers are organized according to a differentiated monolith pattern. The same Python codebase is deployed on all our app servers, but we segment the types of requests each server handles at the load balancer. This allows us the simplicity of development and deployment that a monolith brings, while allowing us to tune server numbers and their specific hardware to the workloads we expect them to serve.

All of Hipmunk's servers use a distributed event-driven messaging system to publish server events. These include info, warning and error logs, but also encompass other more semantic events. Any server can publish events into this system, and any other server can subscribe to these events in (near) real-time.

Hipmunk's application servers and other services also have access to Redis for short-term storage. We use Redis heavily to cache database queries, expensive computations, and, as I mentioned earlier, to debounce third-party queries.
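
As an example of that debouncing pattern, a simplified version using redis-py might look like this (the key naming and TTL are illustrative, not our production values):

import json
import redis

r = redis.StrictRedis()

def debounced_search(cache_key, run_search, ttl_seconds=120):
    # Serve a recent cached result if one exists; otherwise run the real
    # search and cache it briefly to absorb page refreshes.
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    result = run_search()
    r.setex(cache_key, ttl_seconds, json.dumps(result))
    return result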

Finally, we have Apache Zookeeper as a distributed store of configuration for the various services within Hipmunk. We use Zookeeper as the source of truth for configuration that we would like to change at runtime without necessitating a deploy—such as feature flags.
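
Reading such a flag at runtime might look roughly like this; the kazoo client and the znode path below are used purely for illustration:

from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Hypothetical znode path; the real configuration layout is internal.
data, _stat = zk.get('/config/feature_flags/hotel_nlp')
flag_enabled = data == b'1'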

These are the main pieces of infrastructure that played a part in the implementation of distributed Circuit Breakers.

The Architecture of Circuit Breakers

We conceptually decomposed the circuit breaker system into two components: a monitor and a processor.

The monitor is made available to all the other code running at Hipmunk as a library, and may run on any application server. Whenever a call is made to a service that is protected by a circuit breaker, it is the monitor that retrieves the current state for that service's key from ZooKeeper. If ZooKeeper says that the circuit's state is healthy, the monitor runs the code.

One requirement we set ourselves was that if ZooKeeper was unavailable (or if for some other reason the circuit breaker code was unable to complete its book-keeping), the monitor would default to assuming the circuit was healthy. It is far safer to err on the side of conservatism and run all the services all the time than to accidentally disable all our services because some component is temporarily unreachable.
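
A sketch of that fail-open lookup, with a made-up znode layout and helper name:

def get_circuit_state(zk, key):
    # If anything goes wrong while consulting ZooKeeper, treat the circuit
    # as healthy and let the real code run.
    try:
        data, _stat = zk.get('/circuit_breakers/%s' % key)
        return data.decode('utf-8')
    except Exception:
        return 'healthy'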

The monitor then calls the actual code that interacts with the remote service. The code has several ways to signal to the monitor whether it was successful. The simplest is by raising an exception, which the monitor intercepts: it fires an event signifying a failure, and then re-raises the exception. If no exception is raised, the monitor fires a success event instead.

If the monitor detects that the state within ZooKeeper has changed to unhealthy, it skips running the inner code and instead calls the fallback. It still logs an event that the function was called, so that the processor is aware the call was attempted. This helps the processor determine when a sufficient number of requests have been skipped, so that it can move the circuit into recovery.
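
Putting those paths together, the heart of the monitor has roughly this shape. The helper names and event labels below are placeholders for our internal library, and the sampling logic for the recovering state is omitted for brevity:

def guarded_call(key, func, fallback, *args, **kwargs):
    state = get_circuit_state(zk, key)  # defaults to 'healthy' on errors, as above
    if state in ('unhealthy', 'forced_unhealthy'):
        publish_event(key, 'skipped')   # the processor still counts skipped calls
        return fallback(*args, **kwargs)
    try:
        result = func(*args, **kwargs)
    except Exception:
        publish_event(key, 'failure')
        raise                           # re-raise so callers see the original error
    publish_event(key, 'success')
    return result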

In this way, the monitor code is kept very simple. Its only two operations are to read from ZooKeeper, and to write to our distributed event system. This is highly appropriate for code that might be called all over our servers in a number of different contexts, some performance-sensitive.

The management of the state of the different circuits is handled by the circuit breaker processor. This process subscribes to the success and failure events logged from all the monitors. For each circuit, the processor maintains in Redis a running count of success and failure events that occurred within the last window, as well as a timestamp of the last time that circuit changed state.

All of the logic around state transitions is thus encapsulated within the processor. It is the processor's job to react to events by consulting Redis, updating the counts, and then updating ZooKeeper if a state transition is required. The processor also makes sure to send an email out to our alerts mailing list if a circuit breaker either trips or recovers.
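
A rough sketch of the processor reacting to a single event; the Redis key names, window length and alerting helper are placeholders:

WINDOW_SECONDS = 300  # assumed window length

def handle_event(r, zk, circuit_key, outcome):
    # Count the event and let the counter expire so old history ages out
    # (a coarse approximation of a sliding window).
    count_key = 'cb:%s:%s' % (circuit_key, outcome)
    r.incr(count_key)
    r.expire(count_key, WINDOW_SECONDS)

    successes = int(r.get('cb:%s:success' % circuit_key) or 0)
    failures = int(r.get('cb:%s:failure' % circuit_key) or 0)

    if should_trip(successes, failures):  # see the earlier sketch
        zk.set('/circuit_breakers/%s' % circuit_key, b'unhealthy')
        send_alert_email(circuit_key, 'tripped')  # hypothetical alerting helper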

Of course, having just one circuit breaker processor makes for a single point of failure. We get around this by running a backup processor which is configured for failover if the primary processor ever goes down. We also have a pager alert set to fire if circuit breaker events are no longer being processed.

The last part of this picture is the Circuit Breaker admin panel or dashboard, which engineers within Hipmunk can use to view the state of all circuit breakers and make adjustments in real time.

The API

We wanted to keep the API as simple as possible to encourage the adoption of circuit breakers across the engineering organization.

A user of the monitor may instantiate their Circuit Breaker with a unique string key that identifies the service. We have a convention that these strings are dotted and follow the structure 'company.product.service'. e.g.

# This circuit breaker monitors our own Natural Language Processing for hotels service.
circuit_breaker = CircuitBreaker('hipmunk.nlp.hotels')

While other circuit breaker libraries might allow for a myriad of bewildering configuration options (trip on number of concurrent failures, trip on ratio of failures, time to recovery, history window size and so forth), we defined a very small set of customization options, and carefully picked defaults for the other options so that, to date, no user has asked to change them.

That circuit breaker object can then be used as a decorator for your function:

from datetime import timedelta

@circuit_breaker
def my_func(self):
    # the following piece of code might raise an exception if it fails
    response = self.make_remote_service_call(timeout=timedelta(seconds=2))

    if not response.json['success']:
        raise Exception('Any exception raised here will log a failure.')

    return response.json

When this circuit breaker has tripped, you may want to run a piece of code as a fallback—to return a sane default or attempt to fulfil the request locally, if possible. This is precisely how easy it is to set that up:

@my_func.fallback
def my_func_fallback(self):
    log.debug(
        'The circuit has tripped, and so we are returning something safe.')
    return {'success': False}

And that, for the most part, is all there is to it. We also define a couple of additional methods, circuit_breaker.async and circuit_breaker.async_fallback, since our codebase makes extensive use of Tornado's coroutines, but that is almost an implementation footnote at this point.

Takeaways

We are still in the process of rolling out Circuit Breakers across all the services with which we integrate at Hipmunk, but we have already seen a great return on investment on the project. Since we no longer have to respond manually to intermittent failures from our partners, bursts of warnings stay out of our logs and we can focus on the errors that matter.

The key factor that allowed us to rapidly build this tool was to reuse the technologies we were already using within Hipmunk. To facilitate easy adoption, we focused on making the interface for developers using this technology as simple as possible, even at the expense of customizability.

Finally, the alert emails and the control panel increase the visibility of the tool's internal state, and grow the confidence and reliance that other developers place in it.