Yuchen You

Updated 2025-05-28 | operating_system • distributed_sys • chaos_system • system_failure

Introduction: This paper is a follow-up to Metastable Failures in Distributed Systems; "in the wild" refers to the uncontrolled real world. In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies. In this paper, we make four contributions that extend the work of Bronson et al. and increase our understanding of metastable failures: A study of metastable failures in the wild that confirms metastable fai ...

0. Kubernetes (multipass + k3s + helm) + ChaosMesh

Updated 2025-05-27

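This post draws on the notes from 极客网 (GeekHours-Kubernetes). The environment used here is macOS Sequoia on an M3 (Apple Silicon) machine; if your setup differs, refer to the site above for downloads. Environment setup and basic principles. Single-node k8s deployment: brew install minikube, then minikube start. Multi-node k8s deployment: to run several nodes on one physical machine, you can use either Docker containers or virtual machines. Since Kubernetes itself is not a Docker derivative, this post takes the virtual-machine approach (for a Docker-based setup, see the kind project). The VMs only need to be as lightweight as containers and reachable from the command line, so we use the multipass and k3s projects. multipass lightweight VMs: this project is developed by Canonical (the company behind Ubuntu) and lets you configure VMs and query the state of a VM cluster from the command line. # Install this command b ...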

5. ZooKeeper: Wait-free Coordination for Internet-scale Systems

Updated 2025-06-28

Overview: ZooKeeper is a service for coordinating processes of distributed applications. It aims to provide a simple and high-performance kernel for building more complex coordination primitives at the client. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service. Configuration: Configuration is one of the most basic forms of coordina ...

2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

Updated 2025-05-16 | distributed_sys

When a failure occurs in production systems, the highest priority is to quickly mitigate it. Failure Mitigation (FM) is typically done in a reactive and ad-hoc way, namely taking some fixed actions only after a severe symptom is observed. The paper proposes a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure’s compute platform. Narya predicts imminent host failures based on multi-layer system signals, then decides smart mitigation actions go ...

1. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

Updated 2025-05-16 | distributed_sys

AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis. Traditional approaches: addressing isolated operational tasks. LLMs and AI agents: enabling end-to-end and multitask automation. Target: self-healing cloud systems, a paradigm we term AgentOps. AIOpsLab: a framework that not only deploys micro-service cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provid ...

0. Hadoop Distributed File System

Updated 2025-06-23 | distributed_sys

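Consistency
CAP consistency: all nodes have the same view of the same data at the same moment.
Transaction consistency: the database must be in a valid state before a transaction begins and after it ends.
Consistency models in data replication (type | definition | characteristics):
Strong Consistency | Every read always returns the most recently written data | Behaves like a single machine; simple from the user's point of view, but costly in performance
Linearizability | Operation results appear to be ordered by global time | A stricter form of strong consistency
Sequential Consistency | All nodes agree on the order of operations, but global time order is not guaranteed | Slightly weaker; the agreed order may differ from real-time order, but every reader sees the same order
Causal Consistency | If one operation is caused by another, they must be applied in causal order | Unrelated operations may be reordered, which improves concurrency
Session Consistency | All operations by one client within one session are sequentially consistent | Better user experience; suits systems with transient connections such as mobile clients
Eventual ...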

2. The Design of a Practical System for Fault-Tolerant Virtual Machines

Updated 2025-05-11 | distributed_sys

1. Kafka

Updated 2025-05-12 | distributed_sys

Introduction. Event Streaming: the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; routing the event streams to different destination technologies as needed; ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time. Kafka’s Event Stream Purpose: To publish (write) and subscribe to (read) streams of events, in ...
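The publish (write) and subscribe (read) idea in this excerpt can be pictured with a toy in-memory stream. This is only a sketch under stated assumptions: the `EventStream` class, its `publish` and `poll` methods, the per-subscriber read position, and the sensor events are hypothetical names chosen for illustration. None of it is the Kafka client API.

```python
from collections import defaultdict


class EventStream:
    """Toy in-memory stand-in for an event stream (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._events = []                 # append-only list of published events
        self._offsets = defaultdict(int)  # next index to read, kept per subscriber

    def publish(self, event):
        """Append an event to the end of the stream and return its position."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, subscriber):
        """Return the events this subscriber has not read yet, in publish order."""
        start = self._offsets[subscriber]
        batch = self._events[start:]
        self._offsets[subscriber] = len(self._events)
        return batch


stream = EventStream()
stream.publish({"sensor": "s1", "temp": 21.5})
stream.publish({"sensor": "s2", "temp": 19.0})
first = stream.poll("dashboard")       # both events published so far

stream.publish({"sensor": "s1", "temp": 22.0})
second = stream.poll("dashboard")      # only the event published since the last poll
archive = stream.poll("archiver")      # a new subscriber reads from the beginning
```

In this toy model each subscriber tracks its own position, so a slow or late reader does not affect the others; events are never removed when read.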

1. MapReduce: Simplified Data Processing on Large Clusters

Updated 2025-05-09 | distributed_sys

Link for this paper: mapreduce. Link for the MIT 6.824 lecture: lecture 1. "MapReduce is a programming model and an associated implementation for processing and generating large data sets." Contributions: runs on commodity machines; scalable, processing large amounts of data (terabytes) across thousands of machines; programmers do not need much knowledge of parallel programming, so it is easy to use; hides details of parallelization, fault-tolerance, locality optimization, and load balancing; many problems are easily ...
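A word count is a common way to illustrate the programming model described in this excerpt. The sketch below is a single-process stand-in, not the distributed implementation: `map_fn`, `reduce_fn`, `run_mapreduce`, and the sample documents are hypothetical names for illustration, and the parallelization, fault tolerance, and locality handling that the framework hides are left out entirely.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all counts that were emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; everything in `run_mapreduce` is the part a real framework would spread across machines.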

9. Digital Forensics

Updated 2025-04-03 | cybersecurity | cyber_security

Digital Forensics Process. Four core phases: Identification: determine the physical/digital objects that store key data (computers, hard drives, mobile devices, external media, etc.); Collection: preserve the integrity of the evidence, establish a chain of custody, and record the hash of the image to verify its origin; Analysis: examine file systems, logs, memory, etc., and recover deleted files or residual data (slack space); Reporting: produce an expert report and provide legal evidence. Data Collection & Preservation. Data sources: computer, other hard drive, monitor, keyboard and mouse, media (DVD, CD, USB), printer. Digital forensics did not replace traditional (physical) ...
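Recording the hash of an image at collection time, and recomputing it later, can be sketched as follows. This is a minimal illustration, assuming SHA-256 as the hash function (the excerpt does not name one); `hash_image`, `verify_image`, and the temporary file standing in for a disk image are hypothetical names used only for this example.

```python
import hashlib
import os
import tempfile


def hash_image(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path, recorded_hash):
    """Check that the file still matches the hash recorded at collection time."""
    return hash_image(path) == recorded_hash


# A throwaway file stands in for a disk image (assumption for the demo).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example evidence bytes")
    image_path = tmp.name

recorded = hash_image(image_path)              # recorded when evidence is collected
unchanged = verify_image(image_path, recorded)

with open(image_path, "ab") as image:          # simulate a later modification
    image.write(b"!")
tampered = verify_image(image_path, recorded)

os.remove(image_path)
```

Reading in chunks keeps memory use constant, which matters because real disk images are often far larger than available RAM.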