4. Metastable Failures in the Wild
Introduction
This paper builds on Metastable Failures in Distributed Systems; "in the wild" refers to the uncontrolled real world.
In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies.
In this paper, we make four contributions that extend the work of Bronson et al. and increase our understanding of metastable failures:
A study of metastable failures in the wild that confirms metastable fai ...
0. Kubernetes (multipass + k3s + helm) + ChaosMesh
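This post draws on the GeekHours-Kubernetes notes. The environment is macOS Sequoia on an M3 (Apple Silicon) machine; if your setup differs, follow the site above for download and configuration.
Environment setup and basic principles
Single-node k8s deployment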
brew install minikube
minikube start
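Multi-node k8s deployment
To run multiple nodes on one physical machine, you can use either Docker containers or virtual machines. Since Kubernetes itself is not a Docker derivative, this guide takes the virtual-machine route (for a Docker-based setup, see the kind project). The VMs only need to be as lightweight as containers and reachable from the command line, so we use the multipass and k3s projects.
multipass: lightweight virtual machines
A project developed by Canonical (the company behind Ubuntu) that lets you configure virtual machines and query the state of a VM cluster from the command line.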
# Download this command b ...
5. ZooKeeper: Wait-free Coordination for Internet-scale Systems
Overview
ZooKeeper is a service for coordinating processes of distributed applications.
It aims to provide a simple, high-performance kernel for building more complex coordination primitives at the client.
The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service.
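The combination of non-blocking reads and writes with one-shot change notifications can be sketched with a toy in-memory registry. This is only an illustration: `ToyRegistry`, its `get`/`set` methods, and the `/config` path are hypothetical names, not the ZooKeeper client API, and the sketch ignores replication, ordering, and sessions.

```python
from typing import Callable, Dict, List, Optional


class ToyRegistry:
    """Hypothetical in-memory stand-in for a coordination service."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._watches: Dict[str, List[Callable[[str], None]]] = {}

    def get(self, path: str,
            watch: Optional[Callable[[str], None]] = None) -> Optional[bytes]:
        # Reads return immediately; a caller may register a one-shot watch.
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data.get(path)

    def set(self, path: str, value: bytes) -> None:
        # Writes never wait on readers; they only fire the pending watches.
        self._data[path] = value
        for callback in self._watches.pop(path, []):
            # The notification carries the path, not the new value,
            # much like a cache invalidation.
            callback(path)


registry = ToyRegistry()
events: List[str] = []

registry.set("/config", b"v1")
first = registry.get("/config", watch=events.append)
registry.set("/config", b"v2")  # fires the watch once
registry.set("/config", b"v3")  # watch already consumed, no second event
```

After the notification, a client would typically read the value again and re-register the watch if it still cares about changes.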
Configuration
Configuration is one of the most basic forms of coordina ...
2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
When a failure occurs in production systems, the highest priority is to quickly mitigate it.
Failure Mitigation (FM) is done in a reactive and ad-hoc way, namely taking some fixed actions only after a severe symptom is observed.
The paper proposes a preventive and adaptive failure mitigation service, NARYA, that is integrated in a production cloud, Microsoft Azure’s compute platform
Narya predicts imminent host failures based on multi-layer system signals
then decides smart mitigation actions
go ...
1. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis
Traditional: addressing isolated operational tasks
LLM and AI agents: enabling end-to-end and multitask automation
Target: self-healing cloud systems, a paradigm we term AgentOps
AIOpsLab
a framework that not only deploys micro-service cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provid ...
0. Hadoop Distributed File System
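Consistency
CAP Consistency
All nodes have the same view of the same data at the same moment
Transaction Consistency
The database must be in a valid state both before a transaction begins and after it ends
Consistency models in data replication
See the table below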
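| Consistency type | Definition | Characteristics |
| --- | --- | --- |
| Strong Consistency | Every read returns the most recently written data | Behaves like a single machine; simple for users but costly in performance |
| Linearizability | Operations appear to take effect in a single global real-time order | A stricter form of strong consistency |
| Sequential Consistency | All nodes observe operations in the same order, but that order need not match real time | Slightly weaker; the agreed order may differ from the real-time order of writes, as long as every reader sees the same order |
| Causal Consistency | If one operation is caused by another, the two must be applied in causal order | Unrelated operations may be reordered, which improves concurrency |
| Session Consistency | All operations by one client within one session are seen in a consistent order | Better user experience; suits systems with transient connections such as mobile clients |

Eventual ...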
1. Kafka
Introduction
Event Streaming
the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events
routing the event streams to different destination technologies as needed
ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time
Kafka’s Event Stream Purpose
To publish (write) and subscribe to (read) streams of events, in ...
1. MapReduce: Simplified Data Processing on Large Clusters
link for this paper: mapreduce
link for mit cs6.824 lecture: lecture 1
"MapReduce is a programming model and an associated implementation for processing and generating large data sets."
contribution:
Runs on commodity machines; scalable
Processes large amounts of data (many terabytes) on thousands of machines
Programmers do not need to know much about parallelism; easy to use
Hides the details of parallelization, fault-tolerance, locality optimization, and load balancing
many problems are easily ...
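As a rough illustration of the programming model, here is a single-process sketch of word count, the example the MapReduce paper itself uses. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for this note, and the in-memory shuffle stands in for what the real system does across many machines; there is no parallelism or fault tolerance here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    # Reduce: merge all intermediate values that share the same key.
    return sum(counts)


def run_mapreduce(documents: Dict[str, str]) -> Dict[str, int]:
    # Shuffle: group intermediate values by key before reducing.
    groups: Dict[str, List[int]] = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


counts = run_mapreduce({
    "a.txt": "the quick fox",
    "b.txt": "the lazy dog the end",
})
```

The user only writes `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a real framework would distribute and hide.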
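9. Digital Forensics
Digital Forensics Process
Four core phases
Identification
Determine which physical or digital objects hold the key data (computers, hard drives, mobile devices, external media, and so on);
Collection
Preserve evidence integrity, establish a chain of custody, and record the hash of the image to verify its provenance;
Analysis
Examine file systems, logs, memory, and so on; recover deleted files or residual data in slack space;
Reporting
Produce an expert report and provide legal evidence;
Data Collection & Preservation
Data sources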
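Recording the hash of an image during collection can be sketched as follows. This is a minimal example under assumptions: SHA-256 is chosen here for illustration (the notes do not name an algorithm), `hash_image` is a hypothetical helper, and a small temporary file stands in for a real disk image.

```python
import hashlib
import os
import tempfile


def hash_image(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        # Chunked reads keep memory use flat even for very large images.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for an acquired disk image (assumption: real images are far larger).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"forensic image bytes")
    image_path = tmp.name

acquired = hash_image(image_path)              # recorded at collection time
verified = hash_image(image_path) == acquired  # re-checked before analysis
os.remove(image_path)
```

If the digest computed later differs from the one recorded at collection, the image has changed and its integrity can no longer be vouched for.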
computer
other hard drives
monitor
keyboard and mouse
media (dvd, cd, usb)
printer
digital forensics did not replace traditional (physical) ...
