Yuchen You

5. Fail-Slow at Scale
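Updated 2026-03-26 | distributed_sys • operating_system • system_failure
Introduction. The paper Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems, by Haryadi S. Gunawi et al., was published at the 16th USENIX Conference on File and Storage Technologies (FAST 2018). It systematically documents the existence and impact of "fail-slow" hardware in large-scale production systems: hardware that is still running but whose performance has degraded severely, falling short of its expected specification. The hardware types covered include disk, SSD, CPU, memory, and network components. Over several decades, the systems community has settled on the following well-established hardware fault models: • Fail-stop: the device stops working entirely and sends an explicit error signal; • Fail-partial: part of the device fails while the remaining functions still work; • Fail-transient: errors occur occasionally but clear up on their own; • Corruption fault and Byzantine fau ...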
3. Metastable Failures in Distributed Systems
Updated 2026-03-26 | distributed_sys • operating_system • chaos_system • system_failure
Introduction. Metastable failures are a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events: they are outliers because nothing in the past points to their possibility, they have a severe impact, and they are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different at the surface, deeper analysis shows that they can be understood within the same framework. By reviewing experiences fr ...
4. Metastable Failures in the Wild
Updated 2026-03-26 | distributed_sys • operating_system • chaos_system • system_failure
Introduction. This paper builds on Metastable Failures in Distributed Systems; "in the wild" refers to the uncontrolled real world. In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies. In this paper, we make four contributions that extend the work of Bronson et al. and increase our understanding of metastable failures: A study of metastable failures in the wild that confirms metastable fai ...
2. Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
Updated 2026-03-26 | distributed_sys
When a failure occurs in production systems, the highest priority is to quickly mitigate it. Failure Mitigation (FM) is done in a reactive and ad-hoc way, namely taking some fixed actions only after a severe symptom is observed. The paper proposes a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals, then decides smart mitigation actions go ...
1. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Updated 2026-03-26 | distributed_sys
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis. Traditional approaches: addressing isolated operational tasks. LLMs and AI agents: enabling end-to-end and multitask automation. Target: self-healing cloud systems, a paradigm we term AgentOps. AIOpsLab is a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data, but also orchestrates these components and provid ...
9. Digital Forensics
Updated 2026-03-26 | cybersecurity | cyber_security
Digital Forensics Process. Four core phases. Identification: determine the physical or digital objects that hold key data (computers, hard drives, mobile devices, external media, etc.); Collection: preserve evidence integrity, establish a chain of custody, and record the evidence hash (hash of image) to verify provenance; Analysis: examine file systems, logs, memory, etc., and recover deleted files or residual data (slack space); Reporting: produce an expert report and provide legal evidence. Data Collection & Preservation. Data sources: computer, other hard drive, monitor, keyboard and mouse, media (DVD, CD, USB), printer. Digital forensics did not replace traditional (physical) ...
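The "Hash of Image" step in the collection phase can be sketched as below. This is a minimal illustration, assuming SHA-256 as the hash (the excerpt does not name an algorithm); the names `sha256_of_file`, `verify_image`, `demo`, and the file `evidence.img` are hypothetical and stand in for a real disk image produced by an acquisition tool.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns b"" so large images never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path, recorded_hash):
    """True if the file still matches the hash recorded at collection time."""
    return sha256_of_file(path) == recorded_hash


def demo():
    """Hash a stand-in image, then show that a one-byte change is detected."""
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "evidence.img")  # hypothetical file name
        with open(image, "wb") as f:
            f.write(b"sector data" * 1000)
        recorded = sha256_of_file(image)  # value noted in the custody record
        unchanged = verify_image(image, recorded)
        with open(image, "ab") as f:
            f.write(b"\x00")  # simulate tampering after collection
        still_matches = verify_image(image, recorded)
    return unchanged, still_matches
```

Recording the digest alongside the chain-of-custody entry lets anyone who later receives the image recompute it and confirm that the copy under analysis is the one that was collected.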
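4. Data Link Layer
Updated 2026-03-26 | cybersecurity | network
Properties. Basic functions. Framing: encapsulate network layer data. The data link layer groups data from the network layer (layer 3) and adds header and trailer information to form frames. Link Access: the medium access control (MAC) protocol defines when to transmit frames. On multi-access media (e.g., shared Ethernet, wireless networks), multiple devices may need to transmit at the same time, so this layer decides which device sends data onto the link and when. Reliable Delivery: primarily for mediums with high error rates (wireless). In some network environments, interference can cause frames to be lost or corrupted; the data link layer can provide retransmission, acknowledgment, and similar mechanisms to ensure frames reach the receiver reliably. Although reliability in the TCP/IP stack is mostly handled by the transport layer (TCP), some link-layer protocols (such as PPP) also provide basic reliability features. Error detection and corre ...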
6. File System
Updated 2026-03-26 | cs_basic | operating_system
File System Abstraction. Attribute (hardware reality vs. OS abstraction): interface: heterogeneous vs. uniform; storage objects: a few (disks) vs. many (files); name structure: simple numeric name (id) vs. rich name (symbolic, hierarchical, unified); access speed: slow vs. fast; crash resilience: unreliable vs. reliable. How the OS bridges the gap from the hardware layer. Heterogeneity. Cause: many I/O devices, each with its own idiosyncrasy. Solution: abstraction, i.e., build a common interface (a uniform POSIX interface between the application layer and the file system layer); write device d ...
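8. Privacy
Updated 2026-03-26 | cybersecurity | cyber_security
Overview of data privacy. Data generation and collection. Data explosion: about 2.5 quintillion bytes of data are generated every day (e.g., social media interactions, online shopping records). Data aggregators: companies such as Acxiom and Oracle integrate data from multiple sources to build user profiles and sell them. Definitions of privacy. Classic definitions: the right to be let alone (Louis Brandeis): freedom from intrusion by others; the right to control: the ability to choose when and how personal information is shared; the right to secrecy (Richard Posner): concealing information that may be to one's own disadvantage. Foundation of freedom: privacy is a precondition for free speech and personal autonomy. Privacy violation cases. Medical data leaks. Cases: Jane Doe was fired after her employer learned that she carried the Huntington's disease gene; Kate Smith's genetic mutation test result caused her health insurance premium to soar. Commercial data misuse. Target pregnancy prediction: Target predicted customers' pregnancy from shopping patterns and mistakenly sent baby-product coupons to an underage girl. Strava heatmap exposing military bases: users' activity-track data revealed the locations of US military bases in Syria and Afghanistan. Anonymization failure. GIC linkage attack: linking (public) Massachusetts voter registration data with anonymized medical data re-identified Governor William Weld. Key fact: date of birth + 5-digit ZIP code can identify 69% of Americans ...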
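7. Access Control Attacks
Updated 2026-03-26 | cybersecurity | cyber_security
Access Control Basics. Core concepts. Security Model: a system abstraction used to describe and formulate security policies. Three elements: Subject, Object, Operation. Subject: users, Android apps, web origins. Object: resources, including files, directories, database tables, and devices (e.g., UNIX files, processes). Operation: read, write, execute, invoke, etc. Security Policy: defines the access control matrix, which states each subject's permissions on each object. Access control matrix (subject: File 1, File 2): Alice: read, read/write; Bob: read, no access. Security Mechanism: the technology that implements the security policy (e.g., the operating system kernel, encryption). 2. Core principles. Principle of Least Privilege: a user or program holds only the permissions necessary to complete its task. Benefit: limits the impact of accidental or malicious operations. Complete mediation principle (Principl ...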
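The Alice/Bob access control matrix in the excerpt can be written out as a small lookup structure. This is only a sketch: the names `ACCESS_MATRIX`, `is_allowed`, `file1`, and `file2` are hypothetical, and the default-deny behaviour is an assumption chosen to match the least-privilege principle rather than something the post specifies.

```python
# Access control matrix: subject -> object -> set of permitted operations.
# Entries mirror the excerpt: Alice (read; read/write), Bob (read; no access).
ACCESS_MATRIX = {
    "Alice": {"file1": {"read"}, "file2": {"read", "write"}},
    "Bob": {"file1": {"read"}, "file2": set()},
}


def is_allowed(subject, obj, operation):
    """Default-deny check: any subject, object, or operation not listed is refused."""
    return operation in ACCESS_MATRIX.get(subject, {}).get(obj, set())
```

For example, `is_allowed("Alice", "file2", "write")` is true, while `is_allowed("Bob", "file2", "read")` and any query for an unlisted subject are false.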
©2020 - 2026 By Yuchen You (Wesley)