2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
When a failure occurs in production systems, the highest priority is to mitigate it quickly.
Traditionally, Failure Mitigation (FM) is done in a reactive and ad-hoc way, namely by taking some fixed actions only after a severe symptom is observed.
The paper proposes a preventive and adaptive failure mitigation service, Narya, that is integrated into a production cloud, Microsoft Azure's compute platform
- Narya predicts imminent host failures based on multi-layer system signals
- then decides smart mitigation actions
- goal: avert VM failures
- Narya’s decision engine takes a novel online experimentation approach to continually explore the best mitigation action.
- enhances the adaptive decision capability through reinforcement learning.
Introduction
key question
once a failure occurs, how to quickly detect and mitigate it so the system can continue running?
Mitigating a failure here means attempting to make the failure symptom disappear without necessarily diagnosing and fixing the underlying bugs first.
Problem: for large cloud infrastructure (like Ali-Cloud), employing only post-failure detection and mitigation techniques is insufficient.
Cloud systems should also design techniques to address the question: is a failure imminent, and if so, what preventive actions should be taken to avert it?
Current Work and Limits
Several recent works tackle the failure prediction problem in the context of disk failures. But they focus on prediction alone, with the goal of alerting operators or providing allocation hints
Narya
An end-to-end service with predictive and smart failure mitigation fully integrated in the Azure compute platform for its Virtual Machine (VM) host environment.
Goal of Narya: prevent VM failures ahead of time and enhance the self-managing capability of the Azure compute platform for providing smooth VM experience to customers
Challenge:
- (Background:) With multi-tenancy, heterogeneous infrastructure components, and diverse customer workloads, it is difficult to comprehensively categorize different failure scenarios in a large cloud system beforehand and to determine good mitigation actions (or their parameters), especially without trying them
- the cloud system is constantly changing (software/hardware updates, customer workload changes), so some mitigation action that worked well in the past may no longer be optimal.
Nothing can be assessed before we try it: exploration with production workloads is indispensable to determine the (near-)optimal failure mitigation action.
We should ensure that the actions taken maximize the expected effectiveness (i.e., minimize the potential customer impact) over time
A/B Testing
Narya predicts whether host nodes in the production fleet will likely fail and then leverages A/B testing to continually experiment with different mitigation actions, measure the benefits, and discover optimal actions.
Background and Motivation
Traditional System
A traditional system’s operation cycle is as follows: a failure is detected; developers diagnose the failure and find out the root cause; a patch is written; the system is re-deployed.
Cloud System
Operating in this exact sequence is problematic,
because the time it takes to identify the root cause and develop a fix is usually long and exceeds the downtime budget.
Instead, once a failure is detected, some mitigation action like a restart (rebooting once fixes 90% of problems, rebooting twice fixes 99%) will be applied first without necessarily knowing the bug.
Target System and Goal
Our specific target system is the VM host environment, i.e., a node that hosts customer VMs.
The node is backed by locally attached disks and remote virtual disks. Each node is connected to various compute services, collectively referred to as the controller, which is responsible for provisioning resources and performing management actions such as creating and destroying VMs.
Azure already has monitoring that detects failures in real time; the goal here is to predict failures before they happen.
The end goal is to avoid future VM failure events.
Are Failures Predictable?
2 Basic Requirements:
- the imminent failure is not abrupt
- there is telemetry recorded to indicate the degradation
Hardware Wear-Out: predict the failure from the age of the hardware, combined with other system signals such as workload patterns
Software Resource Leak: leaks of memory, file handles, or network ports are a common type of predictable software failure. Predict them using the resource usage trend (see the sketch below). If failures are correlated with certain hidden factors such as timeout settings, timer-related bugs, or release schedules, they may also occur on a predictable basis.
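A minimal sketch of trend-based leak prediction, assuming hourly usage samples and a known capacity; the function name, thresholds, and data are illustrative, not Narya's implementation.

```python
import numpy as np

def predict_leak_exhaustion(usage, capacity, horizon_hours=168):
    """Fit a linear trend to hourly resource-usage samples and estimate whether
    the resource (memory, file handles, ports, ...) would be exhausted within
    the prediction horizon (168 hours = 7 days, matching the notes' horizon)."""
    hours = np.arange(len(usage))
    slope, _intercept = np.polyfit(hours, usage, deg=1)
    if slope <= 0:
        return False  # flat or decreasing usage: no leak trend detected
    hours_to_exhaustion = (capacity - usage[-1]) / slope
    return hours_to_exhaustion <= horizon_hours

# Example: a leak of ~30 handles/hour against a 10,000-handle limit is flagged
# because exhaustion falls within the 7-day horizon.
samples = [2000 + 30 * h for h in range(100)]
print(predict_leak_exhaustion(samples, capacity=10_000))  # True
```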
Why Static Mitigation is Insufficient
If we use a fixed plan for predicted node failures, we follow these steps
- block allocation on the node;
- problem here: blocking allocation results in capacity pressure; for some predicted failures, the weaker "avoid allocation" hint (described later) may be better than a hard block.
- try to live migrate VMs;
- Some failures may be too severe for live migration (e.g., broken disks). Forced migration causes unnecessary customer impact if nodes are still healthy after 7 days.
- wait for 7 days for short-lived VMs to be destroyed by customers;
- force migration of remaining VMs;
- mark the node offline and send it for repair;
- Marking nodes offline is also suboptimal when capacity is low
Overview
Narya advances the current practice of failure mitigation in two ways:
- replacing the existing static and ad-hoc mitigation assignment with adaptive and systematic decision algorithms;
- transforming the traditionally reactive, post-failure mitigation activity into proactive failure avoidance mechanisms.
Workflow

Optimization Metric
The objective is to minimize the overall customer impact caused by node failures on the fleet
- the metric is computed over a measurement interval of a given duration in days
- VM interruption in this paper mainly refers to reboots or loss of heartbeats
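The notes dropped the metric's symbol and exact formula; as a hedged illustration only (my own notation, not necessarily the paper's definition), an annualized, per-100-VM interruption-rate style metric over a measurement interval of d days could look like:

```latex
% Hedged illustration, not the paper's verbatim formula:
% an annualized interruption rate per 100 VMs over a d-day interval.
\[
  \text{InterruptionRate}(d) \;=\; 100 \times \frac{365}{d} \times
  \frac{\#\,\text{VM interruptions observed in the $d$-day interval}}
       {\#\,\text{VMs running during the interval}}
\]
```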
Optimization Background in Azure
First, long-duration incidents are now rare in Azure; shorter VM interruptions have become the more common problem that requires addressing.
Second, short VM interruptions can significantly disrupt user experiences, e.g., for gaming type applications.
Third, for VMs that run applications like databases, even if the VM only experiences a short interruption, the applications take time to recover, which translates into a longer user-perceived interruption.
Fourth, based on communications with customers, customers can be more annoyed if their VMs get frequently interrupted than by a single, longer interruption.
Challenges
First, failure mitigation has to act with incomplete information since the underlying root cause is not known.
Second, due to the massive scale of a cloud system, there are many factors to consider in the decision logic.
Third, our experience suggests that when incorporating failure prediction into a production cloud system, false positives are unavoidable due to the complex system environment, large number of noisy signals, unexpected customer workloads, etc.
Lastly, failure mitigation is a mission critical procedure. If not designed well, a decision engine may do more harm than good. Ensuring safety should be a top priority for Narya.
Predicting Node Failures
describe two prediction methods Narya uses:
(1) static threshold rules written by domain experts;
(2) machine learning model-based prediction
Input Signals
Narya consumes telemetry signals from the entire stack of the host environment to make informed prediction.
Hardware/Firmware: the monitoring agents collect low-level logs from disk SMART attributes, memory (e.g., uncorrectable errors), CPU (e.g., machine check error), motherboard (e.g., bus error), etc.
A higher-level source of signals comes from device drivers, e.g., timeout events
Repetition of such events is often an indicator of an imminent failure
Control-Plane Operations: For example, repetitive VM creation operation errors could indicate serious host issues even if the host still appears to be running. Such signals help reduce the observability gap
Rule-based Prediction
Analyze the common failure patterns and the available telemetry signals to predict failures that have significant customer impact.
Rules are manually written (in JSON/Python/C++); they are simple and easy to understand
This approach works best for definitive signals that indicate some severe issue with high confidence; a hedged example rule is sketched below.
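A minimal sketch of what such a rule could look like, assuming telemetry arrives as a per-host dictionary of counters; the signal names and thresholds are hypothetical, not Azure's actual rules.

```python
# Hypothetical threshold rule in the spirit of the rule-based predictor.
# Signal names and thresholds are illustrative, not Azure's actual rules.
def rule_based_prediction(signals: dict) -> bool:
    # Repeated uncorrectable memory errors are a definitive hardware signal.
    if signals.get("uncorrectable_ecc_errors_24h", 0) >= 3:
        return True
    # A high SMART reallocated-sector count suggests a failing disk.
    if signals.get("smart_reallocated_sectors", 0) >= 50:
        return True
    # Repetition of device-driver timeout events often precedes a failure.
    if signals.get("disk_driver_timeouts_1h", 0) >= 5:
        return True
    return False

print(rule_based_prediction({"uncorrectable_ecc_errors_24h": 4}))  # True
```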
Drawbacks/Limitations
Since many failure signals are not definitive, rule-based prediction cannot cover a wide range of imminent failures. In addition, the prediction may come late and not provide enough lead time for the mitigation engine.
Learning-based Prediction
It can predict many complex host failures. It also can predict earlier, thus leaving longer time for the mitigation engine to react.
Use supervised learning, similar to previous work, but a main difference is that Narya focuses on overall host health and failures that result in customer impact, instead of failures of individual components.
Learning Labels
- Negative Label: normal state
- Positive Label: degraded state heading toward failure
Short time scale: for Narya, the host view of a failure is different from individual components' views. Host failures include an unresponsive host, VM creation failures, host OS crashes, etc. These transient host-level failures can happen much earlier than the permanent failure of a component.
System-View: certain faults might not be a problem from the source component's view but could be problematic from the host's view.
To get accurate and useful prediction results, we only use host failures that result in customer impact and are later confirmed to be caused by some hardware component fault during diagnosis.

For training, the time unit is one hour: samples within the prediction horizon before a failure (i.e., from the failure time back to the failure time minus the horizon) are labeled positive, and samples earlier than that are labeled negative. This window length is called the horizon (in production it is usually set to 7 days); a labeling sketch follows.
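A labeling sketch under the stated scheme; the helper name and the assumption that each sample carries a timestamp and an optional failure time are mine, not the paper's.

```python
from datetime import datetime, timedelta
from typing import Optional

HORIZON = timedelta(days=7)  # production horizon per the notes

def label_sample(sample_time: datetime, failure_time: Optional[datetime]) -> int:
    """Label an hourly host sample: 1 (positive) if it falls within the horizon
    before a confirmed customer-impacting failure, otherwise 0 (negative)."""
    if failure_time is None:
        return 0  # the node never failed in the observation window
    if failure_time - HORIZON <= sample_time <= failure_time:
        return 1
    return 0

failure = datetime(2020, 5, 10)
print(label_sample(datetime(2020, 5, 7), failure))   # 1: 3 days before failure
print(label_sample(datetime(2020, 4, 30), failure))  # 0: outside the 7-day horizon
```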
we looked at the feature distribution of failed nodes and measured the same distribution of healthy nodes.
Machine Learning Model
Binary Classifier: given signals, labels, and host metadata, Narya's predictor outputs the failure probability of a host (0.5 is used as the cutoff); a minimal classifier sketch follows the model list below.
- Gradient boosted tree model (commonly used supervised learning method)
- combines decisions from a sequence of simple decision trees with a model ensembling technique called gradient boosting
- Attention-based deep learning model
- reducing the feature engineering efforts by directly learning the features
- aim to learn both spatial features and temporal features
- Spatial: compare one component to its neighbors, e.g., one host often has multiple disks configured under RAID 0 (data striped across the disks simultaneously, with no replica), so they are expected to perform similarly -> if one disk is noticeably slower, this is taken as a sign of imminent host failure
- The temporal features characterize changes in components over time

- dimension adapter layer to unify the dimension of signals from different sources
- employ a spatial information encoder based on self-attention
- calculates weights of a component’s neighbors
- the weighted sum of the neighbors' feature vectors represents its spatial information
- the temporal information encoder
- consists of positional encoding, self-attention, and location-based attention layers.
- employ a fusion layer to do binary classification.
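A minimal sketch of the gradient-boosted-tree path using LightGBM (the notes state LightGBM is used for the ML modeling); the features, synthetic data, and hyperparameters here are illustrative, not Azure's.

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
# Toy features standing in for host telemetry, e.g., hourly ECC error counts,
# disk-driver timeouts, and I/O latency; labels are synthetic.
X_train = rng.random((1000, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.2).astype(int)

clf = lgb.LGBMClassifier(
    n_estimators=200,          # number of boosted trees
    learning_rate=0.05,
    class_weight="balanced",   # failures are rare, so re-weight the classes
)
clf.fit(X_train, y_train)

# Output the failure probability of a host and apply the 0.5 cutoff.
p_fail = clf.predict_proba(rng.random((1, 3)))[0, 1]
print(p_fail, p_fail >= 0.5)
```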
Mitigation Actions

Table 2 lists the main primitive actions in Azure. Mitigating a failure often requires multiple primitive actions
An aggressive goal for Narya is to explore the actions arbitrarily and figure out the optimal combination, but this could potentially bring significant customer impact. Instead, Narya's mitigation engine focuses on exploring pre-defined composite actions.
The following are implementations of the primitive actions
Live Migration
Goal: moves a running VM from one host to another with minimum disruptions.
The migration process involves transferring the VM's memory, processor, and virtual device state (a sketch of the pre-copy loop follows the list below)
- The LM engine iteratively copies the VM's memory pages while maintaining a dirty page set for the VM on the source host.
- Based on the dirty page rate and network bandwidth, the engine determines the maximum number of iterations before stopping the VM
- After the VM is stopped, the LM engine synchronizes the dirty state with the target and resumes the VM on the target host
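A simplified sketch of the pre-copy loop described in the list above; the VM object and its methods (get_dirty_pages, copy_pages_to_target, ...) are hypothetical placeholders, not the LM engine's actual interface.

```python
def live_migrate(vm, pages_per_round_budget, max_rounds=30):
    """Iteratively copy dirty memory pages while the VM runs, then briefly stop
    it to synchronize the remaining dirty state and resume it on the target."""
    dirty = vm.get_all_pages()            # first round copies all memory pages
    for _ in range(max_rounds):
        vm.copy_pages_to_target(dirty)    # copy while the VM keeps running
        dirty = vm.get_dirty_pages()      # pages re-dirtied during the copy
        # Stop once the remaining dirty set can be drained within the pause
        # budget, which depends on the dirty-page rate vs. network bandwidth.
        if len(dirty) <= pages_per_round_budget:
            break
    vm.stop()                             # brief blackout begins
    vm.copy_pages_to_target(dirty)        # synchronize the final dirty state
    vm.resume_on_target()                 # VM continues on the new host
```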
VM Preserving Soft Reboot
Used to preserve the VM state across a reboot of the host OS
The host OS kernel is reloaded into memory, and the VM memory and device state are persisted across to the newly loaded kernel.
The persisted state is then restored, and the rest of the state in the prior kernel is discarded
The restored VM experiences a brief pause, similar to live migration
Service Healing
used to restore the service availability of unhealthy or faulty VMs
Live Migration can fail or may not be applicable due to certain constraints such as network boundaries
Service healing works for more general scenarios: the faulted VMs are isolated by powering them down or disconnecting them from the network, and the controller generates a new assignment of the VMs to healthy nodes
Mark Unallocatable
blocks allocation of new VMs to a host for some time T
often used before other primitive actions (like LM) are taken (recall the eecs482 final's last problem, which used pin/unpin to protect the disk I/O lock).
Avoid
informs the allocator to try to avoid adding new VMs on this host (weaker constraint)
Blocking allocation has a strong impact on capacity since the host is not eligible for getting new VMs.
Thus, the number of hosts that can be marked unallocatable at the same time is limited (also recall the eecs482 final: pin had an upper bound, though not for exactly the same reason here).
NoOp
The controller does not take any action. This is the baseline to measure the benefits of prediction and of taking actions.
Decision Logic for Adaptive Mitigation
Online Experimentation with A/B Testing
Narya System Design and Implementation
describe the system support for Narya

The Narya system must be able to process the massive volume of signals and requests from the entire fleet with low latency and high reliability.
Failure Predictor
Azure deploys various agents in each node to monitor the health of the host environment. The Narya predictor ingests health signals from these monitoring agents and runs rule-based prediction and ML-based prediction
Rule-based prediction has low cost and high priority; thus its prediction logic is executed directly on the host
ML-based prediction inspects many more signals, such as performance counters, and runs more complex prediction logic. Thus, the ML predictor is implemented as a centralized service.
The ML predictor collects raw signals from the monitoring agents in micro-batches (small groups) and incrementally processes them. Open-source technologies are used for the ML modeling (LightGBM + PyTorch).
Pub/Sub Service (Publish/Subscribe pattern)
A mitigation request is created if a node is predicted to fail with high probability.
The predictor publishes the request along with metadata information about the host (e.g., hardware generations, OS version) to a central pub/sub service, which we implement on top of Kafka.
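A hedged sketch of publishing a mitigation request with the kafka-python client; the broker address, topic name, and payload schema are assumptions, since the notes only state that the pub/sub service is built on top of Kafka.

```python
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

mitigation_request = {                        # hypothetical payload schema
    "node_id": "node-1234",
    "failure_probability": 0.87,
    "predictor": "ml",                # "rule" or "ml"
    "hardware_generation": "Gen7",    # host metadata carried with the request
    "os_version": "2020.05.01",
}
producer.send("narya-mitigation-requests", mitigation_request)
producer.flush()
```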
Mitigation Engine
Internally, it is composed of four major microservices. These microservices communicate with each other and other services in Azure using REST APIs.
Create Mitigation Job
The Request Handler microservice consumes (as a Kafka consumer) the mitigation requests from the Pub/Sub service.
Upon receiving a mitigation request, it creates a mitigation job with a job ID. This job ID is used by other microservices to track the mitigation and query its progress.
Instantiate Mitigation Policy
For a new mitigation job, the Policy Generator creates a mitigation policy, which maps the information from the request to the action to take (the policy is represented as a decision tree)
2 types of tree nodes:
a Selection node, which chooses the tree node to visit next based on some C# predicate, and an Action node, which executes a user-defined C# function.
The decision tree structure allows us to easily specify the decision logic.
The policy generator then applies safety constraints to the retrieved exploration setting to obtain an adjusted action probability distribution.
Additionally, the mitigation policy allows imposing rate limits on a tree node to avoid excessive mitigation that could cause capacity issues or cascading failures
Walk Policy Tree
The generator traverses the policy tree in DFS order and creates an action plan (a sketch of such a tree and its traversal follows below).
The generator performs many steps such as checking predicates, checking rate limit conditions, etc.
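A Python analogue of the Selection/Action policy tree and its DFS walk (the real implementation uses C# predicates and functions); the class names and example policy are illustrative, not Narya's actual API.

```python
class ActionNode:
    def __init__(self, name, execute):
        self.name = name
        self.execute = execute            # user-defined function to run

class SelectionNode:
    def __init__(self, children):
        self.children = children          # list of (predicate, child) pairs

def walk_policy_tree(node, request, plan):
    """DFS traversal: evaluate predicates at Selection nodes and append the
    Action nodes that fire to an ordered action plan."""
    if isinstance(node, ActionNode):
        plan.append(node.name)
        return plan
    for predicate, child in node.children:
        if predicate(request):
            walk_policy_tree(child, request, plan)
    return plan

# Example policy: severe predictions get live migration, others only "Avoid".
policy = SelectionNode([
    (lambda r: r["failure_probability"] >= 0.9,
     ActionNode("LiveMigrate", execute=lambda r: None)),
    (lambda r: r["failure_probability"] < 0.9,
     ActionNode("Avoid", execute=lambda r: None)),
])
print(walk_policy_tree(policy, {"failure_probability": 0.95}, []))  # ['LiveMigrate']
```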
Carry out Action Plan
The Action Orchestrator microservice is responsible for carrying out the action plan asynchronously from the policy tree walk session
This step involves making API calls to the corresponding compute managers since different actions may be implemented by different managers
Log Action
The logging format for Bandit learning is special since it requires not only recording the chosen action but also the associated probability.
In particular, the mitigation engine will log the action timestamp, experiment name, model type, model name, model version, action distributions, chosen action, chosen action parameters, etc.
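An illustrative log record carrying the fields listed above; the concrete schema and values are assumptions, not Azure's actual format.

```python
import json
import time

action_log = {
    "timestamp": time.time(),
    "experiment_name": "predicted-disk-failure-mitigation",  # hypothetical
    "model_type": "bandit",
    "model_name": "mitigation-policy",
    "model_version": "v12",
    # The full action distribution is logged, not just the choice, so the
    # learner can estimate the cost of actions it did not take (e.g., via
    # inverse-propensity weighting).
    "action_distribution": {"LiveMigrate": 0.6, "SoftReboot": 0.3, "NoOp": 0.1},
    "chosen_action": "LiveMigrate",
    "chosen_action_probability": 0.6,
    "chosen_action_parameters": {"max_blackout_ms": 500},    # hypothetical
}
print(json.dumps(action_log))
```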
Track Node Health
The Health Tracker tracks node and VM health information during the mitigation process
Learner
a centralized component in Narya that learns the effect of mitigation actions across different data center regions
Advantage:
- observing more data points and hence more confidence in the cost estimation
- a change in a mitigation action's effect in one region due to software/firmware updates can be quickly learned and applied to other regions rolling out the same updates.
2 main jobs:
- Cost collection
- retrieves the mitigation engine's decisions from the logs
- then correlates them with the VM availability measurements and other important information
- determines the cost of each mitigation action for training
- Bandit model training
- The output model of the learner is a categorical distribution, which the model server can easily draw samples from
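A sketch of how a model server might draw a mitigation action from the learner's categorical distribution; the action names and probabilities are illustrative.

```python
import random

def sample_action(distribution: dict) -> str:
    """Draw one action according to the learned categorical distribution."""
    actions = list(distribution.keys())
    weights = list(distribution.values())
    return random.choices(actions, weights=weights, k=1)[0]

learned_distribution = {"LiveMigrate": 0.55, "SoftReboot": 0.35, "NoOp": 0.10}
print(sample_action(learned_distribution))
```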
