1. Boot and System Management Daemons
Boot Process Overview
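In recent years, boot procedures have gradually moved from the more complicated legacy BIOS to simpler UEFI firmware.
Most recent systems also use a system manager daemon, systemd, instead of the traditional UNIX init. systemd streamlines the boot process by adding dependency management: because the dependencies between services are known, independent services can be started concurrently.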
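To make the dependency idea concrete, here is a minimal sketch of how a dependency graph yields groups of services that can start in parallel. It is not how systemd is implemented; the unit names and the `startup_waves` helper are hypothetical, and only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical unit names and dependencies, chosen only for illustration.
# Each key maps to the set of units that must be started before it.
DEPENDENCIES = {
    "network": set(),
    "logging": set(),
    "database": {"network", "logging"},
    "web": {"network", "database"},
}


def startup_waves(deps):
    """Group units into waves; units in the same wave have no unmet
    dependencies on each other and could be started concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as started, unlocking dependents
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(startup_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {wave}")
```

A strictly sequential init would start these four units one after another; with the dependency graph, `network` and `logging` fall into the same wave and can start together.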
During bootstrapping (that is, booting), the kernel is loaded into memory and begins to execute.
![[Pasted image 20250523131345.png]]
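Before the system is fully booted, the filesystems are checked once and the system daemons are started.
The commands that carry out these steps (shell scripts) are collectively called init scripts.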
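System Firmware: Hardware Basics
When a machine is powered on, the CPU executes boot code supplied by the firmware (stored in ROM). In virtual environments such as virtual machines, this ROM is virtual as well, but the concept is the same.
System firmware (system ...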
3. Metastable Failures in Distributed Systems
Introduction
Metastable failures
a failure pattern in distributed systems
Currently, metastable failures manifest themselves as black swan events:
they are outliers, because nothing in the past points to their possibility
they have a severe impact
they are much easier to explain in hindsight than to predict
Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework.
By reviewing experiences fr ...
4. Metastable Failures in the Wild
Introduction
This paper builds on Metastable Failures in Distributed Systems; "in the wild" refers to the uncontrolled real world.
In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies
In this paper, we make four contributions that extend the work of Bronson et al. and increase our understanding of metastable failures:
A study of metastable failures in the wild that confirms metastable fai ...
0. Kubernetes (multipass + k3s + helm) + ChaosMesh
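This post draws on the notes from 极客网 (GeekHours-Kubernetes). The environment used here is macOS Sequoia on an M3 (Apple silicon) machine; if your setup differs, follow the site above for download and configuration.
Environment Setup and Basic Principles
Single-node k8s deployment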
brew install minikube
minikube start
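Multi-node k8s deployment
To run multiple nodes on a single physical machine, you can use either Docker containers or virtual machines. Since Kubernetes itself is not a Docker derivative, this post takes the virtual machine approach (for a Docker-based setup, see the kind project). The virtual machines only need to be lightweight, container-like, and reachable from the command line, so we use the multipass and k3s projects.
multipass: lightweight virtual machines
multipass is developed by Canonical (the company behind Ubuntu). It lets you configure virtual machines and query the state of a VM cluster from the command line.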
# Download this command
b ...
5. ZooKeeper: Wait-free Coordination for Internet-scale Systems
Overview
ZooKeeper, a service for coordinating processes of distributed applications.
aims to provide a simple and high performance kernel for building more complex coordination primitives at the client
The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service.
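The combination of register-style reads and writes with invalidation-style notifications can be sketched with a toy in-memory model. This is not the ZooKeeper client API: `ToyRegistry`, `CachingClient`, and the `/config` path are hypothetical names, and the one-shot, value-free notification is an assumption modeled on how cache invalidations behave.

```python
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._data = {}
        self._watches = defaultdict(list)

    def read(self, path, watch=None):
        """Return the current value; optionally register a one-shot callback."""
        if watch is not None:
            self._watches[path].append(watch)
        return self._data.get(path)

    def write(self, path, value):
        """Store the value, then fire and clear the watches on this path."""
        self._data[path] = value
        for callback in self._watches.pop(path, []):
            callback(path)  # only says "changed"; the client must re-read


class CachingClient:
    """Caches one path and re-reads it only after an invalidation."""

    def __init__(self, registry, path):
        self.registry = registry
        self.path = path
        self.valid = False
        self.cached = None
        self.remote_reads = 0

    def _invalidate(self, path):
        self.valid = False

    def get(self):
        if not self.valid:
            self.cached = self.registry.read(self.path, watch=self._invalidate)
            self.remote_reads += 1
            self.valid = True
        return self.cached


if __name__ == "__main__":
    registry = ToyRegistry()
    registry.write("/config", "v1")
    client = CachingClient(registry, "/config")
    print(client.get(), client.get(), client.remote_reads)
    registry.write("/config", "v2")
    print(client.get(), client.remote_reads)
```

The client never blocks waiting on another client: reads return immediately from the cache, and a write by someone else only flips a flag that triggers the next re-read.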
Configuration
Configuration is one of the most basic forms of coordina ...
2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
When a failure occurs in production systems, the highest priority is to quickly mitigate it.
Failure Mitigation (FM) is done in a reactive and ad-hoc way, namely taking some fixed actions only after a severe symptom is observed.
The paper proposes a preventive and adaptive failure mitigation service, NARYA, that is integrated in a production cloud, Microsoft Azure’s compute platform
Narya predicts imminent host failures based on multi-layer system signals
then decides smart mitigation actions
go ...
1. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis
traditional: addressing isolated operational tasks
LLM and AI agents: enabling end-to-end and multitask automation
Target: self-healing cloud systems, a paradigm we term AgentOps
AIOpsLab
a framework that not only deploys micro-service cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provid ...
0. Hadoop Distributed File System
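Consistency
CAP Consistency
All nodes have the same view of the same piece of data at the same moment.
Transaction Consistency
The database must be in a valid state both before a transaction begins and after it ends.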
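As a small illustration of transaction consistency, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `accounts` table, the non-negative balance rule, and the helper names are assumptions made for this example only; the point is that a transaction which would break the rule is rolled back as a whole.

```python
import sqlite3


def make_db():
    """Create an in-memory database whose validity rule is balance >= 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move amount from src to dst in one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK failed, so the credit above was undone too


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


if __name__ == "__main__":
    conn = make_db()
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

In the failing transfer the credit to `bob` succeeds first and the debit from `alice` then violates the rule; the rollback removes the credit as well, so the database is valid both before and after.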
Consistency models in data replication
See the table below.
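| Consistency type | Definition | Characteristics |
| --- | --- | --- |
| Strong Consistency | Every read always returns the most recently written data | Behaves like a single machine; simple from the user's point of view, but costly in performance |
| Linearizability | The results of operations appear to be ordered by global time | A stricter form of strong consistency |
| Sequential Consistency | All nodes observe operations in the same order, but global real-time order is not guaranteed | Slightly weaker: the agreed order need not match the real-time order in which writes happened |
| Causal Consistency | If one operation is caused by another, they must be applied in causal order | Unrelated operations may be reordered, which improves concurrency |
| Session Consistency | All operations of one client within one session are sequentially consistent | Better user experience; suited to systems with transient connections such as mobile clients |
| Eventual ...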
1. Kafka
Introduction
Event Streaming
the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events
routing the event streams to different destination technologies as needed
ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time
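The capture-and-route idea can be sketched with a toy in-memory stream. This is not the Kafka client API: `EventLog`, `Subscriber`, and the sample events are hypothetical, and the per-subscriber offset is an assumption used to show that each destination consumes the same stream at its own pace.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only, in-memory stream of events (a toy, not a Kafka client)."""

    events: list = field(default_factory=list)

    def publish(self, event):
        """Append an event and return its position in the stream."""
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset):
        """Return every event at or after the given position."""
        return self.events[offset:]


@dataclass
class Subscriber:
    """Reads the stream independently, remembering how far it has got."""

    log: EventLog
    offset: int = 0

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = EventLog()
    dashboard = Subscriber(log)
    archive = Subscriber(log)

    log.publish({"source": "sensor-1", "temp": 21.5})
    print(dashboard.poll())  # the dashboard sees the first event right away

    log.publish({"source": "sensor-1", "temp": 22.0})
    print(dashboard.poll())  # only the new event
    print(archive.poll())    # the archive catches up on both events at once
```

Publishing does not depend on who is reading, and adding another destination is just another subscriber with its own offset.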
Kafka’s Event Stream Purpose
To publish (write) and subscribe to (read) streams of events, in ...
