Articles
113
Tags
52
Categories
8

Home
Archives
Tags
Categories
About
Yuchen You
Search

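6. Fail-Stutter Fault Tolerance
Updated 2026-03-26 | distributed_sys • operating_system • system_failure
Introduction. Background and shortcomings of existing models: the Byzantine model assumes components may exhibit arbitrary, even malicious, behavior, which makes it too complex to apply in practice; the fail-stop model assumes a component either works correctly or stops completely, which is simple to use but ignores the "performance faults" common in modern hardware. What the paper contributes: the fail-stutter model, "a realistic and yet tractable fault model that accounts for both absolute failure and a new range of performance failures common in modern components." It is a more realistic yet still workable fault model that combines fail-stop with performance faults. A performance fault means a component has not failed outright but behaves erratically or slows down, for example a drop in disk throughput or a lower cache hit rate. Under this model a component can either fail or "stutter" and behave abnormally. Required model properties ( ...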
5. Fail-Slow at Scale
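Updated 2026-03-26 | distributed_sys • operating_system • system_failure
Introduction. The paper Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, by Haryadi S. Gunawi et al., was published at the 16th USENIX Conference on File and Storage Technologies (FAST 2018). It systematically documents the existence and impact of "fail-slow" hardware in large production systems: hardware that is still running but whose performance has degraded severely and falls short of the expected level. The hardware types covered are disk, SSD, CPU, memory, and network components. Over several decades the systems community has established the following mature hardware fault models: • Fail-stop: the device stops working entirely and sends an explicit error signal; • Fail-partial: part of the device fails while the rest still works; • Fail-transient: errors occur occasionally but the device recovers on its own; • Corruption fault and Byzantine fau ...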
3. Metastable Failures in Distributed Systems
Updated 2026-03-26 | distributed_sys • operating_system • chaos_system • system_failure
Introduction. Metastable failures are a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events: they are outliers because nothing in the past points to their possibility, they have a severe impact, and they are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework. By reviewing experiences fr ...
4. Metastable Failures in the Wild
Updated 2026-03-26 | distributed_sys • operating_system • chaos_system • system_failure
Introduction. This paper builds on Metastable Failures in Distributed Systems; "in the wild" refers to the real, uncontrolled world of production systems. "In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies." "In this paper, we make four contributions that extend the work of Bronson et al. and increase our understanding of metastable failures": A study of metastable failures in the wild that confirms metastable fai ...
2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
Updated 2026-03-26 | distributed_sys
When a failure occurs in production systems, the highest priority is to quickly mitigate it. Failure mitigation (FM) is usually done in a reactive and ad hoc way, namely taking some fixed actions only after a severe symptom is observed. The paper proposes a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals, then decides smart mitigation actions go ...
1. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Updated 2026-03-26 | distributed_sys
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis. Traditional approaches: addressing isolated operational tasks. LLMs and AI agents: enabling end-to-end and multitask automation. Target: self-healing cloud systems, a paradigm we term AgentOps. AIOpsLab is a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provid ...
9. Digital Forensics
Updated 2026-03-26 | cybersecurity | cyber_security
Digital forensics process: four core stages. Identification: determine the physical or digital objects that hold key data (computers, hard drives, mobile devices, external media, and so on). Collection: protect the integrity of the evidence, establish a chain of custody, and record the hash of the image so that its origin can be verified. Analysis: examine file systems, logs, memory, and so on, and recover deleted files or residual data in slack space. Reporting: produce an expert report that provides legal evidence. Data collection and preservation. Data sources: computer, other hard drive, monitor, keyboard and mouse, media (DVD, CD, USB), printer. Digital forensics did not replace traditional (physical) ...
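The collection stage mentions recording a hash of the evidence image. As a minimal sketch of that step, assuming SHA-256 as the hash function (the excerpt does not name one) and using the hypothetical helper names `hash_stream` and `matches_recorded_hash`, the image can be hashed in chunks and later compared with the recorded value:

```python
import hashlib
import io


def hash_stream(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks.

    SHA-256 is an assumption here; the excerpt only says "hash of image".
    """
    digest = hashlib.sha256()
    # Read fixed-size chunks so a large disk image never has to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(stream, recorded_hex):
    """Check a stream against the digest recorded at collection time."""
    return hash_stream(stream) == recorded_hex.strip().lower()


if __name__ == "__main__":
    # Stand-in for an evidence image; a real one would be opened with open(path, "rb").
    image = io.BytesIO(b"example evidence bytes")
    recorded = hash_stream(image)
    image.seek(0)
    print("unchanged:", matches_recorded_hash(image, recorded))
```

Recomputing the digest at each hand-over and comparing it with the recorded one is one way to show the image has not changed along the chain of custody.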
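4. Data Link Layer
Updated 2026-03-26 | cybersecurity | network
Properties. Basic functions. Framing: encapsulate network layer data. The data link layer groups the data coming from the network layer (layer 3) and adds header and trailer information to form frames. Link access: the medium access control (MAC) protocol defines when to transmit frames. On multi-access media (such as shared Ethernet or wireless networks), several devices may need to transmit at the same time, so this layer decides which device sends on the link and when. Reliable delivery: primarily for media with high error rates (wireless). In some network environments interference can cause frames to be lost or corrupted; the data link layer can provide retransmission, acknowledgment, and similar mechanisms so that frames reliably reach the receiver. Although in the TCP/IP stack reliability is mostly guaranteed by the transport layer (TCP), some link-layer protocols (such as PPP) also provide basic reliability features. Error detection and corre ...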
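The framing step (a header and a trailer wrapped around network-layer data) can be illustrated with a toy frame layout. This is only a sketch under assumed conventions: a 2-byte length header and a CRC-32 trailer are hypothetical choices for illustration and do not match Ethernet, PPP, or any other real link-layer format.

```python
import struct
import zlib

HEADER = struct.Struct("!H")   # hypothetical header: payload length, 2 bytes, big-endian
TRAILER = struct.Struct("!I")  # hypothetical trailer: CRC-32 over header + payload


def build_frame(payload: bytes) -> bytes:
    """Wrap a network-layer payload in a header and a checksum trailer."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    header = HEADER.pack(len(payload))
    trailer = TRAILER.pack(zlib.crc32(header + payload))
    return header + payload + trailer


def parse_frame(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the frame is damaged."""
    if len(frame) < HEADER.size + TRAILER.size:
        raise ValueError("frame too short")
    (length,) = HEADER.unpack(frame[:HEADER.size])
    if len(frame) != HEADER.size + length + TRAILER.size:
        raise ValueError("length field does not match frame size")
    (received_crc,) = TRAILER.unpack(frame[-TRAILER.size:])
    # Error detection: recompute the checksum over everything before the trailer.
    if zlib.crc32(frame[:-TRAILER.size]) != received_crc:
        raise ValueError("checksum mismatch")
    return frame[HEADER.size:HEADER.size + length]


if __name__ == "__main__":
    frame = build_frame(b"network layer packet")
    print(parse_frame(frame))
```

A receiver that gets a `ValueError` here would drop the frame; a link-layer protocol with reliable delivery would then rely on acknowledgments and retransmission to recover it.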
6. File System
Updated 2026-03-26 | cs_basic • operating_system
File System Abstraction
attribute | hardware reality | OS abstraction
interface | heterogeneous | uniform
storage objects | a few (disks) | many (files)
name structure | simple numeric name (id) | rich name (symbolic, hierarchical, unified)
access speed | slow | fast
crash resilience | unreliable | reliable
How the OS gets from the hardware reality to the abstraction. Heterogeneity. Cause: many I/O devices, each with its own idiosyncrasy. Solution: abstraction; build a common interface (a unified POSIX interface between the application layer and the file system layer); write device d ...
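8. Privacy
Updated 2026-03-26 | cybersecurity | cyber_security
Overview of data privacy. Data generation and collection. Data explosion: roughly 2.5 quintillion bytes of data are generated every day (for example social media interactions and online shopping records). Data aggregators such as Acxiom and Oracle combine data from many sources to build user profiles and sell them. Definitions of privacy. Classic definitions: the right to be let alone (Louis Brandeis), that is, freedom from intrusion by others; control, the ability to choose when and how personal information is shared; secrecy (Richard Posner), concealing information that could be used to one's disadvantage. Foundation of freedom: privacy is a precondition for free speech and personal autonomy. Privacy violation cases. Medical data leaks: Jane Doe was fired after her employer learned that she carries the Huntington's disease gene; Kate Smith's genetic mutation test result caused her health insurance premiums to soar. Commercial data misuse: Target pregnancy prediction, where shopping patterns were used to predict that customers were pregnant and baby product coupons were sent to an underage girl; Strava heat map, where users' workout tracks exposed the locations of US military bases in Syria and Afghanistan. Failure of anonymization: the GIC linkage attack, where public Massachusetts voter registration data was linked with anonymized medical data to re-identify Governor William Weld. Key fact: date of birth + 5-digit ZIP code can identify 69% of Americans ...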
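The linkage attack described above amounts to a join on quasi-identifiers. The following is a minimal sketch on synthetic, made-up records; the field names `dob`, `zip`, `name`, and `diagnosis` and the function `link_records` are hypothetical and are not taken from the GIC dataset.

```python
def link_records(public_rows, anonymized_rows, keys=("dob", "zip")):
    """Re-identify anonymized rows whose quasi-identifiers match exactly one public row."""
    # Index the public dataset (e.g. a voter roll) by its quasi-identifier tuple.
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in anonymized_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # Only a unique match re-identifies someone with confidence.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


if __name__ == "__main__":
    # Synthetic data for illustration only.
    voters = [
        {"name": "Alice Example", "dob": "1970-01-01", "zip": "02138"},
        {"name": "Bob Example", "dob": "1980-05-05", "zip": "02139"},
        {"name": "Carol Example", "dob": "1980-05-05", "zip": "02139"},
    ]
    medical = [
        {"dob": "1970-01-01", "zip": "02138", "diagnosis": "condition A"},
        {"dob": "1980-05-05", "zip": "02139", "diagnosis": "condition B"},
    ]
    print(link_records(voters, medical))
```

In this toy data the second medical record stays ambiguous because two voters share its date of birth and ZIP code, which is why coarsening or suppressing quasi-identifiers, and not only removing names, matters for anonymization.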
Recent Post
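VPN from WireGuard Impl: From Paper Critique to a Hand-Written Go Tunnel (2026-05-31)
Tree of Thought: Beyond the Left-to-Right, One-Directional Reasoning Architecture (2026-05-26)
0. From Minimax to MCTS - Foundations of Classic Game Tree Search (2026-05-26)
Unix Power Management (2026-05-25)
ReAct + Reflexion - Reasoning Acting and Verbal Reinforcement Learning (2026-05-21)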
Categories
  • agentsys (6)
  • cs_basic (25)
  • cybersecurity (12)
  • eecs281 (7)
  • math (9)
  • mlsys (4)
  • network (1)
  • os (1)
Tags
os p_np distributed_sys chaos_system security memory cloud_incidents kernel self_improvement log_analysis gpu system_failure go mcts algorithm virtual_machine ml_training reasoning llm computer_composition planning pl llm_agent power-management unix rca cuda memory_management cyber_security java icmp gc database vpn search reflection schedule network structure computability
Archives
  • May 2026 (12)
  • April 2026 (3)
  • March 2026 (2)
  • January 2026 (1)
  • December 2025 (4)
  • November 2025 (3)
  • October 2025 (5)
  • September 2025 (16)
©2020 - 2026 By Yuchen You (Wesley)
Framework Hexo|Theme Butterfly