5. ZooKeeper: Wait-free Coordination for Internet-scale Systems
Overview
ZooKeeper is a service for coordinating processes of distributed applications.
It aims to provide a simple, high-performance kernel for building more complex coordination primitives at the client.
The interface exposed by ZooKeeper combines the wait-free aspects of shared registers with an event-driven mechanism similar to the cache invalidations of distributed file systems, yielding a simple yet powerful coordination service.
Configuration
Configuration is one of the most basic forms of coordination.
In its simplest form, configuration is just a list of operational parameters for the system processes, whereas more sophisticated systems have dynamic configuration parameters.
Group membership and leader election are also common in distributed systems (recall Kafka leader failure): processes often need to know which other processes are alive and what those processes are in charge of.
Locks constitute a powerful coordination primitive.
One approach to coordination is to develop a separate service for each coordination need (one for queuing, another for leader election, another for configuration, and so on). Services that implement more powerful primitives can be used to implement less powerful ones; locks, for example, can then be used to implement leader election, group membership, etc.
- moved away from implementing specific primitives on the server side
- we opted for exposing an API that enables application developers to implement their own primitives.
Such a choice led to the implementation of a coordination kernel that enables new primitives without requiring changes to the service core.
- moved away from blocking primitives, such as locks
- Blocking primitives for a coordination service can cause, among other problems, slow or faulty clients to negatively impact the performance of faster clients
- implements an API that manipulates simple wait-free data objects organized hierarchically as in file systems
- Implementing wait-free data objects, however, differentiates ZooKeeper significantly from systems based on blocking primitives such as locks.
Wait-Free + Order Guarantees
Guaranteeing both FIFO client ordering of all operations and linearizable writes enables an efficient implementation of the service, and it is sufficient to implement the coordination primitives of interest to our applications.
FIFO on Client Side
ZooKeeper is implemented with a simple pipelined architecture that allows hundreds or thousands of requests to be outstanding while still achieving low latency.
Guaranteeing FIFO client order enables clients to submit operations asynchronously.
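As a toy illustration (not the real ZooKeeper client, which speaks the ZooKeeper wire protocol), asynchronous submission with FIFO completion can be sketched as a queue drained by a single worker: callers fire off operations without waiting, yet operations always take effect in submission order.

```python
import queue
import threading

class PipelinedClient:
    """Toy model of FIFO client ordering: operations are submitted
    asynchronously but applied strictly in submission order."""

    def __init__(self):
        self._ops = queue.Queue()
        self._log = []                      # order in which ops took effect
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, name):
        done = threading.Event()            # caller may wait on this later
        self._ops.put((name, done))         # returns immediately (async)
        return done

    def _drain(self):
        while True:
            name, done = self._ops.get()
            self._log.append(name)          # "apply" the operation
            done.set()

client = PipelinedClient()
events = [client.submit(f"op{i}") for i in range(5)]   # fire and forget
for e in events:
    e.wait()                                # collect completions afterwards
print(client._log)                          # ops applied in FIFO order
```

Because a single worker drains the queue, pipelining many outstanding requests never reorders any one client's operations.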
Read/Write Linearizability
we implement a leader-based atomic broadcast protocol, called Zab
A typical workload of a ZooKeeper application, however, is dominated by read operations, so it becomes desirable to scale read throughput.
Servers process read operations locally, and we do not use Zab to totally order them
Client Caching
Caching data on the client side increases the performance of reads.
For example, it is useful for a process to cache the identifier of the current leader instead of probing ZooKeeper every time it needs to know the leader.
ZooKeeper uses a watch mechanism to enable clients to cache data without managing the client cache directly.
Major Contributions
- Coordination Kernel: wait-free coordination service with relaxed consistency guarantees for use in distributed systems.
- Coordination Recipes: how ZooKeeper can be used to build higher level coordination primitives, even blocking and strongly consistent primitives.
- Experience with Coordination: share some of the ways that we use ZooKeeper and evaluate its performance.
ZooKeeper Service
In this section, we first provide a high-level view of the ZooKeeper service. We then discuss the API that clients use to interact with ZooKeeper.
Terminology
- client: user of the ZooKeeper service
- server: a process providing the ZooKeeper service
- znode: an in-memory data node in the ZooKeeper data, which is organized in a hierarchical namespace referred to as the data tree.
- session: clients establish a session when they connect to ZK and obtain a session handle
Service Overview
ZooKeeper provides to its clients the abstraction of a set of data nodes (znodes), organized according to a hierarchical namespace (similar to a Unix file system).
ZNodes
There are two types of znodes that a client can create:
- Regular: clients manipulate regular znodes by creating and deleting them explicitly
- Ephemeral: clients create such znodes, and they either delete them explicitly or let the system remove them automatically when the session that created them terminates (deliberately or due to a failure).
When creating a new znode, a client can set a sequential flag; if set, a monotonically increasing counter is appended to the znode's name. If n is the new znode and p is the parent znode, then the sequence value of n is never smaller than the value in the name of any other sequential znode ever created under p.
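The naming rule can be sketched with a toy parent znode (illustrative only, not the server's actual code): the parent hands out a monotonically increasing counter that is appended, zero-padded, to each sequential child's name, so later children always get larger numbers even if earlier ones are deleted.

```python
class Parent:
    """Toy model of a parent znode handing out sequence numbers."""

    def __init__(self):
        self._next_seq = 0                  # monotonically increasing

    def create_sequential(self, prefix):
        # Append a zero-padded 10-digit counter, as ZooKeeper does.
        name = f"{prefix}{self._next_seq:010d}"
        self._next_seq += 1
        return name

p = Parent()
print(p.create_sequential("lock-"))         # lock-0000000000
print(p.create_sequential("lock-"))         # lock-0000000001
# Even if lock-0000000001 is later deleted, the next sequential child
# created under this parent still receives a larger sequence value:
print(p.create_sequential("guid-"))         # guid-0000000002
```

This monotonicity is what makes sequential znodes useful for building queues and fair locks.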
Watches
ZooKeeper implements watches (enabled by a watch flag on read operations) to allow clients to receive timely notifications of changes without polling.
Watches are one-time triggers associated with a session; they are unregistered once triggered or once the session closes.
Watches indicate that a change has happened but do not provide the changed data; multiple changes between notifications are collapsed into a single notification.
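These semantics can be sketched with a toy watched node (hypothetical names; not ZooKeeper's implementation): a watch fires at most once, carries no payload, and any further changes before re-registration are collapsed into that single notification.

```python
class WatchedNode:
    """Toy model of ZooKeeper watch semantics: one-time triggers,
    no payload, and event collapsing between notifications."""

    def __init__(self, data=b""):
        self.data = data
        self._watchers = []                 # callbacks, fired at most once

    def get_data(self, watch=None):
        if watch is not None:
            self._watchers.append(watch)    # register for the NEXT change
        return self.data

    def set_data(self, data):
        self.data = data
        watchers, self._watchers = self._watchers, []
        for w in watchers:                  # unregistered once triggered
            w()                             # signals "changed", not the data

events = []
node = WatchedNode(b"v1")
node.get_data(watch=lambda: events.append("changed"))
node.set_data(b"v2")                        # fires the watch
node.set_data(b"v3")                        # no watch registered: collapsed
print(events)                               # ['changed']  (one notification)
```

A client that wants to keep watching must re-read (and re-register) after each notification, which is exactly when it would refresh its cache anyway.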
Linearizability
This definition comes from the famous linearizability paper by Herlihy and Wing, or you may recall Morris' explanation:
A history is linearizable if there exists a total (global) order of operations that matches real time, in which every read sees the most recent write.
In other words: the linearizable order must be consistent with the real-time order of the requests; if one request starts only after another request has completed, then in the total order we construct to prove linearizability, the later request must come after the earlier one.
Overview
A simple and efficient kernel for building more complex client coordination, incorporating elements of:
- group messaging
- shared registers
- distributed lock services
in a replicated, centralized service.
Overall, the interface achieves wait-free behavior through shared registers, combined with an event-driven mechanism (similar to a distributed file system's cache invalidation).
For each client it guarantees FIFO execution order of requests, plus linearizability of writes.
Introduction
Background of the Wait-Free Approach
In practice, the most common and serious problems are:
- Configuration (distributed, dynamic)
- Group membership and election
  - including knowing whether some other process is alive and what it is responsible for
To achieve distributed coordination, locks are often used, and there is plenty of research on queue design and leader-election optimization. ZooKeeper's approach, however, is to avoid building high-level coordination primitives on the server side, and instead to expose an API so that users can design their own primitives on the client side.
This idea gave rise to the coordination kernel design: new primitives can be supported without changing the system core.
Blocking primitives such as locks were deliberately left out of the design, because blocking causes various problems, such as slow or faulty clients degrading faster clients. In other words, in a traditional blocking design the coupling between clients leads to all sorts of unexpected negative effects, so ZooKeeper ensures the whole process appears wait-free from the client's point of view.
On the server side, each client's requests must be handled FIFO, with the following ordering guarantees:
- FIFO client ordering: holds for all operations
- Linearizable writes: holds for write operations only
ZooKeeper uses multiple replicated servers to achieve high availability and high performance. It adopts a pipelined architecture to handle large numbers of requests at low latency; the pipeline naturally preserves FIFO order and also lets a local client submit operations asynchronously, which is of course desirable.
To make update operations linearizable, we use a leader-based atomic broadcast protocol called Zab.
In practice, ZooKeeper workloads are dominated by read operations, so improving read performance means scaling read throughput. One part of this is caching: ZK uses a watch mechanism to support client-side caches without managing them directly; that is, the client stays in sync via watch notifications. In the traditional design of Chubby, an update triggers an invalidation sent to every cache holder, and the update is effectively blocked until all acks arrive, so a slow or faulty client stalls updates; ZooKeeper is designed to avoid this problem.
The ZooKeeper Service
Service Overview
ZK provides the abstraction of a set of data nodes (znodes), managed through a hierarchical namespace (the data tree).
Znodes created by clients fall into two classes:
- Regular: clients control these nodes explicitly (create/delete)
- Ephemeral: clients only create these nodes, then either delete them explicitly or let the system remove them automatically (when the session ends or a failure occurs)
Also, when creating a new znode, a client can set a sequential flag, which appends a monotonically increasing counter to the name and so uniquely names the znode.
ZK's watch mechanism lets a client receive timely update notifications without any active polling; that is, when a client reads a file it registers a watch on that data.
Data Model
Overall, ZK's model is a file system plus a simplified API for reading and writing data, together with a key/value table supporting hierarchical keys; the hierarchical namespace allows new subtrees to be allocated with their own access permissions.
Znodes themselves are not meant for general data storage (unlike an HDFS datanode); rather, they map to abstractions of the client application: every accessed path has exactly one corresponding znode (the dynamic naming of ephemeral znodes helps ensure strong consistency). When two processes access the same znode and both only read, they access the same znode data (i.e. directly share it); if one process writes, it overwrites the node's content, the version number increments, and the other process is notified asynchronously.
A znode is really best thought of as a Unix inode strengthened for distributed, asynchronous synchronization.
Session
A client starts a session when it connects to ZK. Each session has a timeout; if the server receives nothing from the client within the timeout, it considers the client faulty. A session ends when the client explicitly closes it or when ZK detects the client is faulty.
Within a session the client receives a sequence of state-change notifications, and sessions allow a client to move transparently between servers of the same ZooKeeper ensemble.
Client API
- create(path, data, flags): creates a znode; flags lets the client choose the znode type (regular/ephemeral) and set the sequential flag
- delete(path, version): deletes the znode at path if its version matches
- exists(path, watch): returns true if the path exists; watch sets a watch on the given znode
- getData(path, watch): returns the data and metadata (including the version); watch behaves as in exists()
- setData(path, data, version): writes data if the path exists and the version matches
- getChildren(path, watch): returns the set of children of the given znode
- sync(path): waits until all updates pending at the start of the operation have reached the server before proceeding
ZK uses a handle-free access model: accessing a file does not begin with open() and end with close(), which means every access must carry the file's full path rather than operating on a file object.
Every update must pass a version check: if the version does not match, the operation is not performed. So if two processes submit updates simultaneously, only the one executed first succeeds and the later one fails; this kind of synchronization is called optimistic concurrency. Passing version = -1 skips the version check.
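The version check amounts to a compare-and-set on the znode. A toy in-memory sketch (illustrative, not ZooKeeper's implementation):

```python
class ZNode:
    """Toy znode with ZooKeeper-style conditional (versioned) updates."""

    def __init__(self, data=b""):
        self.data = data
        self.version = 0

    def set_data(self, data, version):
        # version == -1 means "skip the check", as in the real API.
        if version != -1 and version != self.version:
            return False                    # lost the race: caller must retry
        self.data = data
        self.version += 1
        return True

n = ZNode(b"old")
v = n.version                               # both writers read version 0
assert n.set_data(b"writer A", v)           # first update succeeds
assert not n.set_data(b"writer B", v)       # second fails: stale version
assert n.set_data(b"writer B", -1)          # -1 bypasses the check
print(n.data, n.version)                    # b'writer B' 2
```

The losing writer typically re-reads the znode to get the new version and retries, which is the classic optimistic-concurrency loop.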
ZooKeeper Guarantees
There are two ordering guarantees:
- Linearizable writes: write operations obey linearizability
- FIFO client order: all requests from a given client execute in the order they were sent
Linearizability
What ZooKeeper actually provides is called A-linearizability (asynchronous linearizability): in the original definition of linearizability a client may have only one outstanding operation at a time, but ZK allows a client to have multiple operations in flight, so we must choose between:
- guaranteeing no specific order for the operations of a single client, or
- guaranteeing FIFO order for the operations of a single client (ZooKeeper's choice).
Case Study: Deadlock under a Traditional Lock Scheme
When the leader of a replicated system fails and a new leader is elected, the new leader must broadcast a configuration update. If the new leader also fails partway through this update, processes that have not yet been updated can never be unlocked and the system wedges; worse, it is hard to tell which version the remaining processes ended up with (in particular, writing the configuration itself is not atomic, so a failed write can leave the configuration partially updated, which is dangerous).
Recall the log technique from EECS482 (essentially a start marker and an end marker) for making updates atomic; ZK achieves the same goal with a /config/ready file. For example, when updating the whole configuration:
- delete /config/ready
- update all config files (non-atomic)
- recreate /config/ready
At the same time, every replica must set a watch on the ready file (exists(/config/ready) with the watch flag) so that it is promptly notified when ready appears or changes.
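The recipe can be sketched against a toy in-memory path store (hypothetical helper names; the real recipe uses delete/create plus exists with a watch on /config/ready): readers treat the whole multi-file update as atomic by refusing to read while ready is absent.

```python
class ConfigStore:
    """Toy tree of paths -> data, enough to show the /config/ready recipe."""

    def __init__(self):
        self.tree = {}

    def exists(self, path):
        return path in self.tree

# Leader side: make a multi-file config update look atomic to readers.
def update_config(store, new_config):
    store.tree.pop("/config/ready", None)       # 1. delete the ready znode
    for path, data in new_config.items():       # 2. non-atomic file updates
        store.tree[path] = data
    store.tree["/config/ready"] = b""           # 3. recreate ready last

# Replica side: only trust the configuration when ready exists.
def read_config(store, paths):
    if not store.exists("/config/ready"):
        return None                             # update in progress: back off
    return {p: store.tree[p] for p in paths}

store = ConfigStore()
update_config(store, {"/config/a": b"1", "/config/b": b"2"})
print(read_config(store, ["/config/a", "/config/b"]))
```

In the real system a replica would not poll but would watch /config/ready and re-read the configuration when notified of its recreation.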
The scheme above still has a flaw: a replica that has just learned of the newly elected leader may check the ready file before the leader has finished updating the config directory, and thus read data in an unknown state mid-update. This is remedied by constraining notification order: the data must be fully updated before the watch notification is delivered.
One more problem: a client may have communication channels to servers outside of ZK, so two replicas may still observe inconsistent state; this, too, can be handled with the method of the previous paragraph (in the paper, by issuing sync before the read).
