Common microservice failures and how DoorDash mitigates them (2024)

DoorDash is an on-demand food delivery service that connects customers with local restaurants through its app and website. It’s currently one of the largest food marketplaces in the US, with almost 37 million users. The platform enables users to browse menus, place orders, and have meals delivered directly to their doorstep.

In 2020, with the constant increase in their user base, the team decided to move from a Django monolith to a microservice architecture. This allowed for better scalability options, shorter waits for test completion, faster deployment times, and increased developer velocity. They wrote a great blog post on how they managed the transition. But this change also brought a lot of complexity with it.


The new architecture introduced other types of issues, which we’re going to talk about in this article. We’re going to have a look at some of the common pitfalls and anti-patterns that appear in a microservice architecture, how DoorDash solved them at a local level, and how they’re attempting to mitigate them at a global level.

Common pitfalls with microservice architectures

1. Cascading Failure - A cascading failure happens when the failure of one service leads to the failure of other dependent services. This can cause a chain reaction, potentially bringing down the entire system.

DoorDash had an outage of this kind that they talked about in this blog post. In their case, the chain of failure started with seemingly innocuous database maintenance, which increased database latency. The latency then bubbled up to upstream services, causing errors from timeouts and resource exhaustion. The increased error rates triggered a misconfigured circuit breaker, which stopped traffic between many unrelated services, resulting in an outage with a wide blast radius.

Why it happens:

  • Tight Coupling: Services are too dependent on each other, leading to a domino effect.

  • Lack of Isolation: Failures in one service are not contained and propagate to others.

  • Resource Exhaustion: Failure in a critical service can lead to resource exhaustion (e.g., CPU, memory) in dependent services.


2. Retry Storm - A retry storm occurs when a service failure triggers multiple retries from dependent services, which overwhelm the failing service even further. When downstream services are unavailable or slow, retries lead to work amplification: each failed request is retried multiple times at every layer, pushing an already degraded service deeper into failure (the sketch below shows how quickly this compounds).

Why it happens:

  • Uncontrolled Retries: Services automatically retry failed requests without considering the state of the failing service.

  • Lack of Backoff: Retries happen too frequently, without appropriate delay, exacerbating the problem.
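
To make the work amplification concrete, here is a minimal sketch in Python. It is illustrative only (a toy call chain, not DoorDash's code): each layer naively retries a failing call, so the number of attempts that reach the failing bottom service grows exponentially with call depth.

```python
# Toy model: each layer retries a failed call up to MAX_RETRIES times.
# With a failing bottom service, the attempts that reach it grow as
# (MAX_RETRIES + 1) ** depth -- the work amplification behind a retry storm.

MAX_RETRIES = 3          # retries per layer, on top of the initial attempt
attempts_at_bottom = 0   # how many requests the failing service actually receives


def call(depth: int) -> bool:
    """Simulate a call through `depth` layers; the bottom service always fails."""
    global attempts_at_bottom
    if depth == 0:
        attempts_at_bottom += 1
        return False  # the degraded service keeps failing
    for _ in range(MAX_RETRIES + 1):   # initial attempt + naive retries
        if call(depth - 1):
            return True
    return False


call(depth=3)
print(attempts_at_bottom)  # 64 attempts for a single user request: (3 + 1) ** 3
```

A single user request turns into 64 requests against the already struggling service, which is why uncontrolled retries without backoff make outages worse rather than better.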

3. Death Spiral - A death spiral happens when the system starts to fail under load, and the attempts to handle the failure (like retries or additional resource allocation) further degrade the system’s performance, leading to a vicious cycle of deteriorating performance.

We’ve seen earlier how issues can spread vertically through different services dependent on one another. But they can also spread horizontally, inside the service cluster, from one node to another.

Why this happens:

  • Resource Contention: Excessive retries and fallback operations consume more resources, reducing availability for normal operations.

  • Unbalanced Load: Efforts to handle failures, like adding more instances, might lead to other parts of the system becoming overloaded.


4. Metastable Failure - A metastable failure occurs in open systems with an uncontrolled source of load: the system appears stable under normal load, but once a trigger pushes the load beyond a certain threshold, it enters a bad state that persists even after the trigger is removed. A feedback loop sustains the failure, keeping the system in this degraded state until a significant (usually manual) corrective action is applied.

For example, an initial trigger, such as a surge in users, might cause one of the backend services to load shed and start responding to certain requests with 429 (rate limit). The callers then retry their calls, but these retries, combined with requests from new users, overwhelm the backend service even more, leading to further load shedding. The result is a positive feedback loop in which calls are continuously retried (along with new calls), get rate limited, and are retried again, perpetuating the cycle.

The above is called a Thundering Herd problem and is one example of a Metastable failure.
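
The feedback loop can be illustrated with a toy queueing model. Everything here is an assumption for illustration (the capacities, arrival rates, and the idea that overload halves effective throughput are made-up numbers, not DoorDash measurements): a short spike pushes the service past its capacity, timed-out requests are retried, and the wasted work keeps throughput below the arrival rate, so the backlog never drains after the spike ends.

```python
# Toy model of a metastable failure: a brief spike pushes the service into
# overload; once overloaded, retries of unserved requests keep the offered
# load high, and work wasted on requests whose callers already timed out
# keeps effective throughput below the arrival rate. All numbers are made up.

CAPACITY = 100    # requests/tick the service handles when healthy
DEGRADED = 60     # effective throughput when overloaded (wasted work)
ARRIVALS = 80     # steady-state new requests per tick
SPIKE = 200       # new requests per tick while the trigger lasts

backlog = 0
for tick in range(20):
    arrivals = SPIKE if 3 <= tick < 6 else ARRIVALS   # the trigger lasts 3 ticks
    offered = arrivals + backlog                       # new load + retried load
    served = min(offered, CAPACITY if offered <= CAPACITY else DEGRADED)
    backlog = offered - served                         # unserved requests retry next tick
    print(f"tick={tick:2d} offered={offered:4d} served={served:3d} backlog={backlog:4d}")

# The backlog keeps growing long after the spike ends: the system is stuck in
# the bad state until something (load shedding, dropping the backlog) breaks
# the loop.
```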

Why this happens:

  • Insufficient Capacity: The system is not designed to handle peak loads, leading to instability.

  • Hidden Bottlenecks: Bottlenecks that are not apparent under normal conditions become critical under high load.

  • Latency Sensitivity: Small increases in latency can have a disproportionate effect on the system's stability.

  • Request Retries: Retrying failed requests is widely used to mask transient issues. However, it also results in work amplification, which can lead to additional failures.


Solutions at a local level

There are several well-known techniques used to address the problems above, some of which DoorDash uses as well:

  • Exponential Backoff - by gradually increasing the delay between retries to reduce the load (a combined sketch of backoff with a retry limit follows after this list).

  • Retry Limits - by setting a maximum limit on the number of retries to prevent endless retry loops.

  • Circuit Breakers - by tripping when failures accumulate and cutting off, or sharply reducing, the calls a dependent service makes to a failing one (see the circuit-breaker sketch after this list). Some examples of how circuit breakers are used by other companies:

    • Netflix's Hystrix library: Netflix developed Hystrix as a library to manage failures within their distributed system. When Hystrix identifies that a remote service is down or unresponsive, it activates the circuit breaker, stopping further requests from being sent to the problematic service. It can also offer alternative responses or retry the request after a designated period.

    • AWS ECS: ECS employs a circuit breaker pattern to automatically isolate failing services, thus preventing cascading failures within applications. This pattern also reduces latency and resource consumption during recovery from a service failure. It effectively ensures high availability and reliability for containerized applications.

    • SoundCloud: SoundCloud also uses circuit breakers to handle failures in their distributed architecture.

  • Bulkheads - by isolating critical resources (like thread pools, database connections, or service instances), ensuring that failures or high resource usage in one microservice do not impact others sharing the same resources (see the bulkhead-and-timeout sketch after this list).

  • Timeouts & Fallbacks - by defining the maximum acceptable duration for a microservice to respond to a request, preventing indefinite waiting and resource consumption when a service is slow or unresponsive.

  • Load Shedding - by prioritising and limiting incoming requests to prevent overload (see the load-shedding sketch after this list).

  • Graceful Degradation - by minimising the amount of work that needs to be done: degrading non-essential functionalities or reducing service levels during times of high load, resource scarcity, or service unavailability. This can be achieved by identifying the essential functionalities that must remain operational even under degraded conditions, simplifying non-critical features to prioritise core functionality, and implementing fallback strategies to maintain basic service levels when primary functionalities are unavailable.

  • Capacity Planning - by analysing historical data, user patterns, and trends to forecast future demand for each microservice and determining the scalability requirements based on expected growth, seasonal fluctuations, and special events.
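
To ground the first two techniques, here is a minimal retry helper with capped exponential backoff, full jitter, and a hard retry limit. It is an illustrative sketch, not DoorDash's implementation; the `TransientError` type and parameter defaults are assumptions.

```python
import random
import time


class TransientError(Exception):
    """Raised by the wrapped call for errors worth retrying (timeouts, 503s, ...)."""


def call_with_backoff(request_fn, max_retries=3, base_delay=0.1, max_delay=5.0):
    """Call `request_fn`, retrying transient failures with capped exponential
    backoff and full jitter, and give up once the retry limit is reached."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_retries:
                raise                                  # retry limit reached: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))       # full jitter spreads retries out
```

The jitter matters as much as the exponential growth: it prevents a fleet of clients from retrying in lockstep and hammering the recovering service at the same instant.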
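
A bare-bones circuit breaker can be sketched as below. This is in the spirit of libraries like Hystrix but is not their actual API; thresholds and timeouts are placeholder values.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; while open, calls
    fail fast. After `reset_timeout` seconds a single trial call is allowed
    (half-open); a success closes the circuit again."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # half-open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()      # (re)open the circuit
            raise
        self.failures = 0                              # success: close the circuit
        self.opened_at = None
        return result
```

A dependent service wraps its outbound calls in `breaker.call(...)`; while the circuit is open those calls fail immediately instead of piling more work onto the struggling dependency.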
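
Bulkheads, timeouts, and fallbacks are often combined. The sketch below isolates calls to one dependency behind a small, dedicated thread pool and returns a fallback value when the call is slow; the pool size, timeout, and `fetch_fn` are illustrative assumptions rather than recommended settings.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Bulkhead: a small pool dedicated to one dependency, so a slow dependency can
# only exhaust these 8 threads, not the whole service.
recommendations_pool = ThreadPoolExecutor(max_workers=8)


def get_recommendations(user_id, fetch_fn, timeout_s=0.2, fallback=()):
    """Run `fetch_fn(user_id)` inside the bulkhead; on timeout or error, degrade
    gracefully by returning `fallback` (e.g. an empty or cached list)."""
    future = recommendations_pool.submit(fetch_fn, user_id)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()          # best effort; the worker thread may still finish
        return fallback
    except Exception:
        return fallback
```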
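
Finally, load shedding and graceful degradation both come down to doing less work when the system is under pressure. A simple sketch of priority-based shedding follows; the priority tiers, thresholds, and the commented-out handler are hypothetical examples, not DoorDash's actual policy.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0      # e.g. checkout, order status
    NORMAL = 1        # e.g. menu browsing
    BEST_EFFORT = 2   # e.g. recommendations, analytics


def should_shed(priority: Priority, utilization: float) -> bool:
    """Shed the lowest-value work first as utilization climbs; keep critical
    requests flowing for as long as possible."""
    if utilization > 0.95:
        return priority != Priority.CRITICAL
    if utilization > 0.85:
        return priority == Priority.BEST_EFFORT
    return False


# In a request handler: reject early (e.g. with HTTP 503/429 and a Retry-After
# header) instead of queueing work the service cannot finish.
# if should_shed(request_priority, current_cpu_utilization()):
#     return Response(status=503, headers={"Retry-After": "2"})
```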

Shortcomings of the local countermeasures

These localised mechanisms share similar limitations:

  • They rely on metrics specific to the individual service to determine its health. However, many types of failures involve interactions across multiple components, requiring a comprehensive system-wide perspective to effectively address overload conditions.

  • They use general metrics and heuristics to assess the system health, which may lack precision. For instance, high latency alone may not indicate service overload; it could stem from slow downstream services.

  • Their corrective actions are constrained. Operating within the local service, these mechanisms can only take local actions which may not be optimal for restoring the system health, as the root cause of the issue might lie elsewhere.

Solutions at a global level

One limitation of load shedding, circuit breakers, and graceful degradation is their narrow perspective within the system. These tools assess factors like their own resource usage, immediate dependencies, and incoming request volume. However, they lack the ability to adopt a global view of the entire system and decide based on that.

Aperture, an open-source system for reliability management, goes beyond local solutions by implementing centralised load control. It provides a unified system for managing load across multiple services during outages. It has three main components:

  • Observe: Aperture gathers reliability metrics from each node and consolidates them using Prometheus.

  • Analyse: A standalone Aperture controller continuously monitors these metrics and detects deviations from Service Level Objectives (SLOs).

  • Actuate: Upon detecting anomalies, the Aperture controller triggers policies tailored to observed patterns and applies actions on each node, such as load shedding or distributed rate limiting.

Aperture uses YAML-based policies that guide its actions during system disruptions. When an alert is triggered, Aperture automatically executes actions based on these configured policies. Some of the actions it offers include distributed rate limiting and concurrency control (also known as load shedding). By maintaining centralised oversight and control of the entire system, Aperture enables various strategies for mitigating outages. For instance, it can be configured with a policy that throttles traffic at upstream services when a downstream service is overwhelmed, preventing excessive requests from reaching the problematic area and thereby improving system responsiveness and cost-efficiency.
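
Conceptually, the control loop looks something like the sketch below. This is a simplified illustration of the observe / analyse / actuate cycle under assumed names and thresholds; it is not Aperture's policy engine or API, and the helper functions are placeholders for the real Prometheus queries and per-node actuators.

```python
import time

SLO_P99_LATENCY_MS = 300   # assumed Service Level Objective for the example


def query_latency_p99(service: str) -> float:
    """Placeholder for an Observe step, e.g. a Prometheus query for p99 latency."""
    raise NotImplementedError


def set_admission_fraction(service: str, fraction: float) -> None:
    """Placeholder for an Actuate step: tell every node of `service` to admit
    only `fraction` of incoming traffic (rate limiting / load shedding)."""
    raise NotImplementedError


def control_loop(service: str, interval_s: float = 10.0) -> None:
    admit = 1.0
    while True:
        p99 = query_latency_p99(service)        # Observe
        if p99 > SLO_P99_LATENCY_MS:            # Analyse: SLO violated
            admit = max(0.1, admit * 0.8)       # Actuate: throttle upstream traffic
        else:
            admit = min(1.0, admit * 1.1)       # slowly restore traffic when healthy
        set_admission_fraction(service, admit)
        time.sleep(interval_s)
```

The key difference from the local countermeasures is that the decision is made from aggregated, system-wide metrics and pushed out to all nodes, rather than each node guessing at system health from its own vantage point.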


DoorDash deployed Aperture within one of their core services and conducted artificial request simulations to evaluate its performance. They discovered that Aperture effectively operated as a robust and user-friendly global rate limiter and load shedding solution, providing a concurrency limiting algorithm which minimises the impact of unexpected load or latency.

The Aperture blog also provides good examples of how their solution can be used to solve production problems.

Conclusion

We had a look at the different pitfalls that arise in distributed systems, the triggers that might cause them, some of the localised solutions that prevent them from happening, and what can be done at a more global level.

In the next articles, we’re going to take a deeper look into how Aperture works, how we can configure it and how to define a set of policies for a given use case.
