3. Metastable Failures in Distributed Systems
Introduction
Metastable failures: a failure pattern in distributed systems.
Currently, metastable failures manifest themselves as black swan events:
- they are outliers, because nothing in the past points to their possibility
- they have a severe impact
- they are much easier to explain in hindsight than to predict.
Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework.
By reviewing experiences from a decade of operating hyperscale distributed systems, we identify a class of failures that can disrupt them, even when there are no hardware failures, configuration errors, or software bugs.

Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
Sustaining effect:
- work amplification
- decreased overall efficiency
These two effects prevent the system from leaving the bad state.
We call this bad state a metastable failure state.
Leaving a metastable failure state requires a strong corrective push, such as rebooting the system or dramatically reducing the load.
The lifecycle of a metastable failure involves three phases (shown in a figure in the paper):
- A system starts in a stable state.
- Quantitative change: the load rises above a certain threshold (implicit and invisible), and the system enters a vulnerable state.
- Qualitative change: the vulnerable system is still healthy, but a trigger may push it into a metastable state from which it cannot recover on its own.
In fact, many production systems choose to run in the vulnerable state all the time because it has much higher efficiency than the stable state.
When one of many potential triggers causes the system to enter the metastable state, a feedback loop sustains the failure, causing the system to remain in the failure state until a big enough corrective action is applied.
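As a rough illustration of this feedback loop, here is a minimal toy simulation (all numbers are made up): clients retry unserved requests on the next tick, and overload reduces goodput, so once the backlog pushes load past capacity the system never drains on its own.

```go
package main

import "fmt"

// A toy discrete-time model of a retry feedback loop. All numbers are made
// up for illustration: the system serves up to `capacity` requests per tick
// when healthy, but under overload timeouts and retries waste work, so
// useful throughput (goodput) drops. Unserved requests are retried next tick.
func main() {
	const capacity = 1000.0
	const offered = 800.0 // steady client load: below capacity ("vulnerable" but healthy)
	backlog := 0.0        // timed-out requests that clients will retry

	for tick := 0; tick < 15; tick++ {
		load := offered + backlog
		if tick == 3 || tick == 4 {
			load += 400 // transient trigger: a brief load spike
		}
		goodput := capacity
		if load > capacity {
			goodput = 0.7 * capacity // sustaining effect: overload reduces efficiency
		}
		served := load
		if served > goodput {
			served = goodput
		}
		backlog = load - served
		fmt.Printf("tick=%2d load=%6.0f served=%6.0f backlog=%6.0f\n",
			tick, load, served, backlog)
	}
	// After the trigger ends, offered + backlog stays above capacity, so
	// goodput stays at 700 < 800 offered and the backlog keeps growing:
	// the system is stuck in a metastable failure state.
}
```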
It is common for an outage that involves a metastable failure to be initially blamed on the trigger, but the true root cause is the sustaining effect.
The goal of this vision paper is to change that by
- establishing metastable failures as a class of failures
- analyzing their common traits and characteristics
- proposing new research directions in identifying, preventing, and recovering from metastable failures
Case Studies
The sustaining effect is almost always associated with exhaustion of some resource
Feedback loops associated with resource exhaustion are often created by features that improve efficiency and reliability in the steady state
QPS (queries per second): a measure of the load rate
Request Retries
Retrying failed requests is widely used to mask transient issues. However, it also results in work amplification, which can lead to additional failures.
Closely related to request retry is request failover, where a failure detector is used to route requests only to healthy replicas. Failover doesn't result in request amplification on its own, because each request is processed only once, but it can cause failures to cascade. When replicas are sharded differently, a particularly pernicious form of this contagion causes a transient point failure to grow into a total outage.
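One common client-side mitigation is to cap work amplification with a retry budget. The sketch below is a hypothetical illustration (names and the 10% budget are not from the paper): retries are allowed only while they stay under a fixed fraction of recent requests.

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget caps client-side work amplification: retries are allowed only
// while they remain under a fixed fraction of the requests in the current
// window. A real implementation would also decay or reset the window.
type retryBudget struct {
	mu       sync.Mutex
	requests int
	retries  int
	ratio    float64 // e.g. 0.1 allows roughly 10% extra load from retries
}

func (b *retryBudget) onRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if float64(b.retries+1) > b.ratio*float64(b.requests) {
		return false // budget exhausted: fail fast instead of amplifying load
	}
	b.retries++
	return true
}

func main() {
	b := &retryBudget{ratio: 0.1}
	for i := 0; i < 100; i++ {
		b.onRequest()
	}
	fmt.Println(b.allowRetry()) // true: still within the 10% budget
}
```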
Look-aside Cache
Slow Error Handling
Metastable failure states can also arise when the processing of a request is less efficient in the failure state.
If a trigger causes the system to run out of any of the resources that are used by the error handling code, then error handling will make the shortage more severe.
Link Imbalance
Metastable failures can hinge on a confluence of implementation details, such that no single person has enough knowledge to figure them out. This can make them challenging to diagnose even after they appear.
It turns out that the sustaining effect matches the same pattern as the other metastable failures we’ve examined. The key is that there is a mechanism by which resource exhaustion on the congested link causes it to be preferred for future requests, leading to more congestion.
Approaches to Handling Metastability
Trigger vs. Root Cause (the trigger matters less than the root cause)
We consider the root cause of a metastable failure to be the sustaining feedback loop, rather than the trigger.
There are many triggers that can lead to the same failure state, so addressing the sustaining effect is much more likely to prevent future outages.
Change of Policy during Overload
One way to weaken or break the feedback loops is to ensure that goodput remains high even during overload. This can be done by changing routing and queueing policies during an overload.
e.g., switching the queueing discipline from FIFO to LIFO during overload, so the server spends its limited capacity on fresh requests rather than on ones whose clients have likely already timed out (see the sketch below).
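A minimal sketch of such an adaptive policy, with illustrative names and thresholds: a queue that serves requests FIFO under normal load but switches to LIFO once its depth crosses a threshold.

```go
package main

import "fmt"

// adaptiveQueue serves requests FIFO under normal load, but switches to LIFO
// once the queue depth crosses lifoThreshold. Under overload the oldest
// requests have likely already timed out on the client side, so serving the
// newest ones first keeps goodput higher. Names and threshold are illustrative.
type adaptiveQueue struct {
	items         []string
	lifoThreshold int
}

func (q *adaptiveQueue) push(req string) { q.items = append(q.items, req) }

func (q *adaptiveQueue) pop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	if len(q.items) > q.lifoThreshold {
		// Overloaded: LIFO, serve the most recent request first.
		req := q.items[len(q.items)-1]
		q.items = q.items[:len(q.items)-1]
		return req, true
	}
	// Normal load: FIFO, serve the oldest request first.
	req := q.items[0]
	q.items = q.items[1:]
	return req, true
}

func main() {
	q := &adaptiveQueue{lifoThreshold: 3}
	for i := 1; i <= 5; i++ {
		q.push(fmt.Sprintf("req%d", i))
	}
	for {
		r, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println(r) // req5, req4 (LIFO while deep), then req1, req2, req3
	}
}
```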
A major challenge with adaptive policies is coordination, since retry and failover decisions are made independently by each client. (Side note: "coordination" emphasizes how decisions and resources are allocated among parties, whereas "scheduling" emphasizes ordering in time.)
The best decisions are made using global information, but the communication required to distribute status information can be a new way in which a failure can have a sustaining effect.
Another fundamental challenge for adaptive policies lies in accurately differentiating persistent overload from load spikes.
- Circuit breaker: when a backend is severely overloaded, reject all requests to it for a while instead of piling on retries (sketched below).
- Load shedding: the server intentionally discards some normal requests (e.g., responding only to paying users) to preserve capacity for the most important work.
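The circuit-breaker bullet above can be sketched roughly as follows (thresholds and names are illustrative, not from the paper): after a few consecutive failures the breaker opens and rejects calls immediately for a cooldown period, shedding load from the overloaded backend instead of piling on retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// breaker is a minimal circuit-breaker sketch: after maxFails consecutive
// failures it "opens" and rejects all calls immediately until the cooldown
// expires, so clients fail fast instead of amplifying the overload.
type breaker struct {
	fails     int
	maxFails  int
	openUntil time.Time
	cooldown  time.Duration
}

var errOpen = errors.New("circuit open: request rejected")

func (b *breaker) call(f func() error) error {
	if time.Now().Before(b.openUntil) {
		return errOpen // fail fast while the breaker is open
	}
	if err := f(); err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openUntil = time.Now().Add(b.cooldown) // trip the breaker
			b.fails = 0
		}
		return err
	}
	b.fails = 0 // success resets the failure count
	return nil
}

func main() {
	b := &breaker{maxFails: 3, cooldown: 5 * time.Second}
	failing := func() error { return errors.New("backend overloaded") }
	for i := 0; i < 5; i++ {
		fmt.Println(b.call(failing)) // after 3 failures, calls are rejected fast
	}
}
```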
Prioritization
Another way to retain efficiency when a resource is exhausted is to use priorities.
Using a lower priority for retried queries would avoid perpetuating the feedback loop.
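A minimal sketch of that idea, with hypothetical names: the client tags retries, retries map to a lower priority class, and an overloaded server sheds low-priority (retried) work first, which breaks the retry feedback loop.

```go
package main

import "fmt"

// Priority classes, assumed for illustration: retries always get a lower
// class than first attempts, so an overloaded server sheds them first.
type priority int

const (
	prioHigh priority = iota // first attempts
	prioLow                  // retries
)

type request struct {
	id      string
	isRetry bool
}

func classify(r request) priority {
	if r.isRetry {
		return prioLow
	}
	return prioHigh
}

// admit sketches server-side admission control: above a utilization
// threshold, low-priority (retried) work is rejected so that retries
// cannot sustain the overload.
func admit(r request, utilization float64) bool {
	if utilization > 0.9 && classify(r) == prioLow {
		return false
	}
	return true
}

func main() {
	fmt.Println(admit(request{id: "a", isRetry: false}, 0.95)) // true
	fmt.Println(admit(request{id: "a", isRetry: true}, 0.95))  // false: shed retries first
}
```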
Problems that Prioritization Cannot Resolve
The challenge here is that priority systems only manage some of the resources in the system, and they can allow or even encourage policies with high work amplification.
Perhaps more importantly, not all architectures are equally amenable to implementing a priority system, which takes experience to realize.
Another lesson is that the software structure encodes implicit priorities.
Stress Tests
A good stress test should focus on a portion of the production infrastructure; this also makes it possible to safely drain the target clusters if a metastable failure occurs.
Organizational Incentives
Cause: Optimizations that apply only to the common case
Result: they exacerbate feedback loops, because they lead to the system being operated at a larger multiple of the threshold between the stable and vulnerable states.
Incentivizing application changes that reduce cold cache misses, on the other hand, yields a true capacity win.
Fast Error Paths
Distributed systems should also have highly optimized error paths, so that error handling itself is efficient and cheap (see the sketch after this list):
- Use a dedicated error-logging thread
- feed errors to it through a bounded, lock-free queue
- if the queue is full, just increment a counter of dropped errors
- Throttle stack traces (they are expensive to collect)
- when errors are frequent, a sample is enough
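A sketch of such a fast error path (buffer size and sample rate are illustrative): errors are handed to a dedicated logging goroutine through a bounded channel without blocking; when the channel is full only a dropped-error counter is incremented, and stack traces are captured for only a sample of errors.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync/atomic"
)

// errorLogger is a sketch of a cheap error path: callers never block on the
// hot path, and expensive work (formatting, printing, stack traces) is
// deferred or throttled.
type errorLogger struct {
	ch      chan string   // bounded queue to a dedicated logging goroutine
	dropped atomic.Uint64 // errors discarded because the queue was full
	seen    atomic.Uint64 // used to sample expensive stack traces
}

func newErrorLogger() *errorLogger {
	l := &errorLogger{ch: make(chan string, 1024)}
	go func() {
		for msg := range l.ch {
			fmt.Println("ERROR:", msg) // slow work happens off the hot path
		}
	}()
	return l
}

func (l *errorLogger) report(err error) {
	msg := err.Error()
	// Throttle stack traces: capture one for every 1000th error only.
	if l.seen.Add(1)%1000 == 1 {
		msg += "\n" + string(debug.Stack())
	}
	select {
	case l.ch <- msg: // enqueue without blocking
	default:
		l.dropped.Add(1) // queue full: just count the drop
	}
}

func main() {
	l := newErrorLogger()
	l.report(fmt.Errorf("backend timeout"))
	fmt.Println("dropped so far:", l.dropped.Load())
}
```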
Outlier Hygiene
Small errors may already be warning signs of larger failures to come.
Autoscaling
Elastic systems are not immune to metastable failure states, but scaling up to maintain a capacity buffer reduces the vulnerability to most triggers.
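A minimal sketch of the capacity-buffer idea, assuming per-replica capacity has been measured and a target utilization is chosen well below the stable/vulnerable threshold (all numbers are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas scales so that steady-state utilization stays at
// targetUtil (e.g. 0.6), keeping a buffer below the threshold where the
// system becomes vulnerable. Inputs are assumed to be measured elsewhere.
func desiredReplicas(currentQPS, perReplicaQPS, targetUtil float64) int {
	return int(math.Ceil(currentQPS / (perReplicaQPS * targetUtil)))
}

func main() {
	// 90k QPS, each replica handles 10k QPS at full load, target 60% utilization.
	fmt.Println(desiredReplicas(90000, 10000, 0.6)) // 15 replicas instead of 9
}
```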
Discussion and Research Directions
We must learn to operate in the vulnerable state by achieving two separate goals:
- designing systems that avoid metastable failures while operating efficiently
  - this requires a comprehensive approach, ranging from detecting vulnerable states and potential failures to curtailing the impact of sustaining effects
  - detecting vulnerable states is difficult due to the sheer size of the systems and all the different processes affecting them
  - predicting failures is even harder, since we need to identify the vulnerable state correctly and foresee the potential trigger and its intensity
- developing mechanisms to recover from metastable failures as quickly as possible in cases that cannot be avoided
How to Avoid (open research directions, not a review of existing solutions)
Can we develop software frameworks for building distributed systems that make problematic feedback loops impossible, or at least discoverable?
Work Amplification
Designing systems to avoid metastable failures will require a systematic understanding of where the largest instances of work amplification occur.
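One way to make amplification visible is to track how many backend operations each user-facing request ultimately causes (including retries and failovers) and watch the ratio; a rough sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// amplification tracks total backend work per unit of user-facing work.
// A ratio that grows under load points at a potential sustaining effect.
type amplification struct {
	userRequests atomic.Uint64 // user-facing requests completed
	backendCalls atomic.Uint64 // backend calls issued on their behalf (incl. retries)
}

func (a *amplification) onUserRequest() { a.userRequests.Add(1) }
func (a *amplification) onBackendCall() { a.backendCalls.Add(1) }

func (a *amplification) ratio() float64 {
	u := a.userRequests.Load()
	if u == 0 {
		return 0
	}
	return float64(a.backendCalls.Load()) / float64(u)
}

func main() {
	var a amplification
	a.onUserRequest()
	for i := 0; i < 3; i++ { // one user request fanned out into 3 backend calls
		a.onBackendCall()
	}
	fmt.Println(a.ratio()) // 3: each user request costs 3 backend operations
}
```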
Feedback Loops
The strength of the loop depends on a host of constant factors from the environment, such as cache hit rate.
We don’t need to eliminate every loop, just weaken the strongest ones.
A related direction is accurately identifying these vulnerabilities in existing systems.
Characteristic Metric
There is often a metric that is affected by the trigger and that only returns to normal after the metastable failure resolves. We call such a metric characteristic and visualize it as a dimension in which it is unsafe to significantly deviate.
These metrics spike during the request surge that follows a resource outage.
A characteristic metric can give insight into the state of the feedback loop (the memory component of a metastable failure) directly or indirectly.
Characteristic metrics we have observed in production are queueing delay, request latency, load level, working set size, cache hit rate, page faults, swapping, timeout rates, thread counts, lock contention, connection counts, and operation mix.
We expect that research into a systematic way to find unknown metastable failures will involve identifying the important characteristic metrics of a system.
Ideally, this would also give a meaningful estimate of the probability that a novel metastable failure will occur.
Warning Sign
A characteristic metric can be assigned a range of safe values; leaving that range triggers an alarm and possibly an automated intervention (see the sketch below).
The idea of alerting on internal metrics is not new, but the framework of metastability can allow us to learn the right metrics and thresholds without experiencing major outages.
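A sketch of this kind of guard rail (metric name and thresholds are illustrative): a characteristic metric such as queueing delay is checked against a safe range, and leaving the range raises an alarm and could trigger an automated intervention such as shedding load.

```go
package main

import "fmt"

// safeRange defines the band of values for a characteristic metric
// (e.g. queueing delay in ms) inside which the system is considered
// safely away from its vulnerability threshold. Numbers are illustrative.
type safeRange struct {
	name     string
	min, max float64
}

// check returns an alert message when the metric leaves its safe range;
// a real system might also kick off an automated intervention here,
// such as shedding load or pausing retries.
func (r safeRange) check(value float64) (alert string, ok bool) {
	if value < r.min || value > r.max {
		return fmt.Sprintf("ALERT: %s=%.1f outside safe range [%.1f, %.1f]",
			r.name, value, r.min, r.max), false
	}
	return "", true
}

func main() {
	queueDelay := safeRange{name: "queueing_delay_ms", min: 0, max: 50}
	if msg, ok := queueDelay.check(120); !ok {
		fmt.Println(msg) // here we might also trigger load shedding
	}
}
```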
Hidden Capacity
