## CONSPRO[7] - Providing High Availability Using Lazy Replication

Published on:

Rivka Ladin, Barbara Liskov, Liuba Shrira, and Sanjay Ghemawat. "Providing High Availability Using Lazy Replication." ACM Transactions on Computer Systems, 10(4):360–391, Nov. 1992.

### Symbol table

• $uid$: the unique id the service assigns to each update
• $label$: a collection of uids, used to capture causal dependencies
• log: the set of update records a replica has received (from front ends and via gossip)
• rep_ts: a replica's local timestamp, summarizing the updates it has received
• val: a replica's state, generated by the updates executed so far
• val_ts: the timestamp summarizing the updates reflected in val
• inval: the set of cids of updates already processed at a replica (for duplicate suppression)

### Motivation

To achieve high availability, most large-scale services are implemented as replicated state machines. Consistency of replicated state can be guaranteed as long as service requests are performed in the same order at all replicas, but this approach is too expensive. Meanwhile, some services can tolerate a weaker, causal ordering. A prime example is conversations in social networks, which consist of multiple causal chains linked by the "Reply" relation. The paper aims to support such causal ordering using lazy replication:

A service request (update or query) is executed at just one replica; updating of other replicas happens by lazy exchange of "gossip" messages.

### The Replicated Causal Operations Service

#### System model

The replicated service comprises a fixed number of replicas and an arbitrary number of clients. For each client, a proxy called the front end (residing at the same client node) is responsible for all communication between it and the replicas. Thus, a client issues a request by making a local call to its front end, which in turn sends a call message to one replica (in normal operation; things get complicated in case of node failures and network delays). The replica executes the request and sends a reply message to the front end. Replicas communicate among themselves by lazily exchanging gossip messages.

• Question 1: What is the lifespan of the front end? The whole application, a session, or defined by users (e.g., using pairs of begin_fe and end_fe)?
• Ans: Please refer to the client library component in [Lloyd@SOSP'11-COPS].

#### Specification

The service supports two kinds of requests: update and query. Each update request is associated with a unique id $uid$, generated by the service and included in its reply message. A $label$ is simply a collection of uids of updates. Causal relations among requests are explicitly captured by assigning a label to each request. In addition, the result value of a query may reflect updates other than those enforced in its prev:label parameter. The interfaces of update and query requests are thus defined as follows:

• update(prev:label, op) returns (uid)
• query(prev:label, op) returns (new-label:label, value)

The (sequential) specification of the causal operations service is defined by the following requirements on query requests.

Let $q$ be a query.

1. q.new-label $\supseteq$ q.prev
2. If u.uid $\in$ q.new-label, then $\forall$ update $v$ such that dep(u,v), v.uid $\in$ q.new-label
3. q.value = q.op(Val(q.new-label))

The first clause formalizes the fact that a query may "see" updates other than those enforced in its prev:label parameter.
The second clause states that the returned new-label is dependency-complete: if $u$ is reflected in it, then so are all the updates that $u$ depends on. The predicate dep(u,v), meaning that update $u$ causally depends on update $v$, can be evaluated by $dep(u,v) \equiv v.uid \in u.prev$.
The third clause lies at the heart of the correctness of the causal operations service: it specifies how to compute the value to return.

• Question 2: What's the use of new-label returned by a query?
• Ans: It can be used as the prev:label parameter of another request to implement, for instance, monotonic reads.

#### Implementation

Causality tracking and causality checking are two key issues in implementing a causal operations service. As mentioned above, both are achieved by the mechanisms of $uid$ and $label$. For efficiency, this paper employs multipart timestamps as a compact representation for them. A multipart timestamp here is actually a vector clock, with one entry per replica, supporting the standard compare and merge operations. In the following, the implementation of the service is described both in terms of multipart timestamps and in terms of $uid$ and $label$. We first consider normal operation, in which a front end always contacts the same "preferred" replica. We then handle duplication of requests due to node failures and network delays.
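As a concrete illustration of the compare and merge operations on multipart timestamps, here is a minimal Python sketch (the function names `merge` and `leq` are my own, not the paper's):

```python
# A multipart timestamp (vector clock) is a list with one entry per replica.

def merge(ts1, ts2):
    """Component-wise maximum of two timestamps."""
    return [max(a, b) for a, b in zip(ts1, ts2)]

def leq(ts1, ts2):
    """ts1 <= ts2 iff every component of ts1 is at most that of ts2."""
    return all(a <= b for a, b in zip(ts1, ts2))
```

Note that two timestamps can be incomparable (`leq` false in both directions), which is what lets the mechanism distinguish concurrent updates from causally ordered ones.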

• Question 3: What is the ratio of replicas to front ends?
• Ans: In normal operation, #replica : #front end = 1 : $m \ge 1$; otherwise, #replica : #front end = $n \ge 1$ : $m \ge 1$.

##### In normal operation

Each replica keeps a record of all the updates it has received from both front ends and other replicas. It also maintains a local timestamp, rep_ts, that summarizes its knowledge about these updates. The state of each replica, val, is generated by a collection of executed updates, reflected in another timestamp, val_ts. Once an update is executed, its $uid$ is merged into val_ts. Each replica guarantees to execute all updates in dependency order.

Processing an update call message:
Replica $i$ performs the following actions when it receives a call message for update $u$ from a front end:

1. Update knowledge: rep_ts[i]++;
2. Compute $uid$ (ts) for $u$: u.prev[i] := rep_ts[i]; ts := u.prev;
3. Construct record for $u$ and add it to the local log: r := <U,i,ts>;
4. Execute $u$ if possible: If u.prev <= val_ts, then val := apply(val, u.op); val_ts := merge(val_ts, r.ts);.
5. Send a reply message with the uid (i.e., r.ts).
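The five steps above can be sketched in Python as follows (the `Replica` class and its field layout are illustrative assumptions, not the paper's code; records are plain tuples):

```python
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

class Replica:
    def __init__(self, i, n):
        self.i = i                # this replica's index
        self.rep_ts = [0] * n     # knowledge of received updates
        self.val_ts = [0] * n     # updates reflected in val
        self.val = []             # state: here, just the list of applied ops
        self.log = []             # update records <type, node, ts, op, prev>

    def handle_update(self, prev, op):
        self.rep_ts[self.i] += 1                 # 1. advance own knowledge
        ts = list(prev)
        ts[self.i] = self.rep_ts[self.i]         # 2. uid = prev with own slot bumped
        self.log.append(('U', self.i, ts, op, list(prev)))  # 3. log the record
        if leq(prev, self.val_ts):               # 4. execute now if deps are in
            self.val.append(op)
            self.val_ts = merge(self.val_ts, ts)
        return ts                                # 5. reply with the uid
```

For example, a fresh replica 0 of 3 replies to `handle_update([0, 0, 0], 'op1')` with uid `[1, 0, 0]` and applies the update immediately, while an update whose prev it has not yet covered is only logged.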

Implementation issue 1: These five actions should be performed atomically. Notice that we do not wait on the condition u.prev <= val_ts in step (4); instead, we defer that to the part "Processing a gossip message".
Q1: Construct a scenario in which the local replica cannot process an update immediately (i.e., the condition u.prev <= val_ts of (4) does not hold).
Q2: Why do we try to execute an update (if possible) here, since it will be performed when gossip messages are processed anyway?

Ans2: What if no gossip messages ever arrive?

Because the uid (r.ts) of an update is computed from its prev:label, it also captures the causal dependencies.

Processing a query call message:
When a replica receives a call message for query $q$ from a front end, it checks whether $q$ is ready to be performed by comparing q.prev with val_ts:
Wait until q.prev <= val_ts, then return <val_ts, apply(val, q.op)>.

Implementation issue 2: Start a separate thread for each query and wait on the condition q.prev <= val_ts in parallel.
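One way to realize this per-query waiting is with a condition variable, sketched below in Python (the `QueryHandler` name and the `advance` hook, called whenever updates are applied, are my own assumptions):

```python
import threading

class QueryHandler:
    def __init__(self, n):
        self.val_ts = [0] * n
        self.val = []
        self.cond = threading.Condition()

    def advance(self, new_val_ts, new_val):
        # Called when updates have been applied; wakes all waiting queries.
        with self.cond:
            self.val_ts = new_val_ts
            self.val = new_val
            self.cond.notify_all()

    def query(self, prev, op):
        # Block until q.prev <= val_ts, then evaluate the query against val.
        with self.cond:
            self.cond.wait_for(
                lambda: all(p <= v for p, v in zip(prev, self.val_ts)))
            return list(self.val_ts), op(self.val)
```

Each query thread blocks inside `wait_for` until its causal dependencies are reflected in val_ts, matching the specification's wait condition.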

Processing a gossip message:
A gossip message contains m.ts, the sender's timestamp (i.e., rep_ts), and m.log, the sender's log. Since gossip messages are used to exchange information about update requests received by all replicas, it is not necessary to include val or val_ts, which represent the updates that have already been processed.
Replica $i$ performs the following actions when it receives a gossip message $m$ from replica $j$:

1. Merge logs: $log := log \cup \{ r \in m.log \mid \lnot (r.ts \le rep\_ts) \}$
2. Update knowledge: rep_ts := merge(rep_ts, m.ts)
3. Asynchronously apply update requests:
1. Identify the updates that are ready: $comp := \{ r \in log \mid type(r) = U \land r.prev \le rep\_ts \land r.isComputed = F \}$
2. Apply them in dependency order: execute val := apply(val, r.op); val_ts := merge(val_ts, r.ts); for each update $r$ in comp in topological order. (Note: $r$ is applied only if it has not been applied yet; See the "Optimization" section.)
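The steps above can be sketched in Python as follows. The dict-based `state` and the sum-of-components ordering are my own illustrative choices; summing works as a topological order here because an update's timestamp componentwise dominates its prev, so dependencies always have strictly smaller sums.

```python
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def handle_gossip(state, m_ts, m_log):
    # state: dict with keys rep_ts, val_ts, val, log, applied (set of uid tuples)
    # 1. Merge logs: keep only records not already covered by our knowledge.
    state['log'] += [r for r in m_log if not leq(r['ts'], state['rep_ts'])]
    # 2. Update knowledge.
    state['rep_ts'] = merge(state['rep_ts'], m_ts)
    # 3. Apply ready, not-yet-applied updates in dependency order.
    ready = [r for r in state['log']
             if leq(r['prev'], state['rep_ts'])
             and tuple(r['ts']) not in state['applied']]
    ready.sort(key=lambda r: sum(r['ts']))   # a valid topological order
    for r in ready:
        if leq(r['prev'], state['val_ts']):  # dependencies already in val
            state['val'].append(r['op'])
            state['val_ts'] = merge(state['val_ts'], r['ts'])
            state['applied'].add(tuple(r['ts']))
```

For instance, gossiping two records where the second depends on the first applies both in one pass, in dependency order.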

Implementation issue 3: The size of m.log. Please refer to Question 7.

The following question concerns the correctness of the implementation of the replicated service.

• Question 4: Does the multipart timestamp mechanism introduce extra (false) causal dependencies between two independent requests sent to the same replica from different front ends, due to the way uids are computed?
• Ans: No. Although the uids of these two requests (say, $u$ and $v$) are totally ordered (in terms of multipart timestamps), the causality check of whether $u$ depends on $v$ is evaluated by comparing $v.uid$ with $u.prev$ (not $u.uid$). Going further: Is it possible that $v.ts \le u.prev$ but $u$ does not depend on $v$? [TBD] How to prove this? (First, $uid$ and $prev$ do not change once they are generated. Second, a new $uid$ is not smaller than any older one.)

##### With duplicates

Due to node failures and network delays, the response to a request may be too slow (detected by timeout), in which case the front end resends the request to other replicas. Thus, there may be multiple call messages (and $uids$) for a single update in the system at a time. Despite these duplicates, each update must be performed at most once at each replica. To this end, each update is assigned a unique call identifier, $cid$, by its front end.

• Question 5: Queries can also be duplicated, so why don't we assign a (system-wide) unique identifier to them?
• Ans: Because query requests do not modify the state of replicas and do not need to be propagated.

Each replica maintains a set (e.g., a hashtable in the implementation) of $cids$, inval, consisting of the updates that have been processed at this replica. The inval set is updated whenever a new update $r$ has been processed: val := apply(val, r.op); inval := inval + {r.cid}; val_ts := merge(val_ts, r.ts);.

Before applying a ready update $r$, a duplication check is done: $r.cid \notin inval$. However, whether or not $r$ is a duplicate, its $uid$ is merged into val_ts.
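A minimal sketch of this duplicate check, assuming a dict-based replica state (the names are mine, not the paper's):

```python
def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def apply_record(state, r):
    # state: dict with val (list), val_ts (timestamp), inval (set of cids)
    if r['cid'] not in state['inval']:   # not a duplicate: apply the op once
        state['val'].append(r['op'])
        state['inval'].add(r['cid'])
    # Merge the uid into val_ts regardless, so dependents of any duplicate
    # of this update can still become ready (see Question 6).
    state['val_ts'] = merge(state['val_ts'], r['ts'])
```

Applying two duplicates (same cid, different uids from different replicas) changes val only once, yet val_ts covers both uids.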

• Question 6: What if we don't merge the $uids$ of duplicates of an update into val_ts?
• Ans: Consider the scenario illustrated in the following figure. If replica $r_1$ does not merge the duplicate of $U$ sent from replica $r_2$ into its val_ts, another update $U'$ that depends on $U$ will never be processed by $r_1$.

#### Optimization

The logs in the log-exchange-based implementation would grow without bound. An important performance issue is controlling the size of the log. An update record can be deleted from the log at some replica if (1) it is known everywhere and (2) it has been reflected in val locally.

To check the first condition, a replica should know what other replicas know. For this purpose, a replica maintains a table of timestamps, ts_table, of which ts_table[p] is the latest rep_ts of replica $p$ received via gossip messages. Thus, the predicate that replica $i$ knows that an update record $r$ is known everywhere can be defined (letting $c$ be the replica at which $r$ was created) as $isknown(r) \equiv \forall p: ts\_table[p][c] \ge r.ts[c]$.

Remark: The parameter $r$ in the predicate $isknown(r)$ is an update record, instead of an update which may have multiple records.

For the second condition, we require that the gossip protocol guarantees that when the receiver receives record $u$ from replica $i$, it has also received all records processed at $i$ before $u$. This requirement constrains when and what to gossip.

• Question 7: When and what to gossip to meet the requirement of gossip protocol?
• Ans: When: at any time; What: If we assume TCP reliable communication, we can gossip a record with all its nearest dependencies as those in [Lloyd@SOSP'11-COPS].

Combining nearest deps in [Lloyd@SOSP'11-COPS] with $uid$ and $label$ in [Ladin@TOCS'92-Lazy replication]: at the front end, only the label and the uids of nearest deps are sent to replicas. The update records themselves are stored at replicas and propagated among them.

With this in mind, we can now discard obsolete update records in the part "Processing a gossip message":

(4). Update ts_table: ts_table[j] := m.ts
(5). Discard obsolete records: $log := log - \{ r \in log \mid type(r) = U \land isknown(r) \}$
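A sketch of the isknown check and log truncation, assuming (as an illustration, consistent with the way uids are assigned in step 2 of update processing) that it suffices for every row of ts_table to cover the component the record's creating replica assigned to it:

```python
def isknown(r, ts_table):
    """r was created at replica r['node'], which assigned r['ts'][r['node']];
    r is known everywhere once every replica's reported timestamp covers it."""
    c = r['node']
    return all(row[c] >= r['ts'][c] for row in ts_table)

def truncate(log, ts_table):
    # Discard the update records that are known everywhere.
    return [r for r in log if not isknown(r, ts_table)]
```

For example, with ts_table rows [2, 0] and [1, 1], a record created at replica 0 with timestamp [1, 0] is known everywhere, but one with [2, 0] is not (the second replica has only reported 1 in component 0).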

• Implementation issue 4: In this paper, the proof of the correctness of the implementation relies on the assumption that gossip is processed atomically. However, I am not sure whether this is mandatory.

• Question 8: Comparison between ts_table and matrix clocks.

## [Re-Blogger] Using TLA+ for Teaching Distributed Systems

Published on:

This is a reblogged article published on Blogger by Murat on August 12, 2014.

I am teaching CSE 4/586 Distributed Systems class again this Fall (Fall 2014). This is the course I have most fun teaching. (I would like to think my students also feel that way :-) I teach the course with emphasis on reasoning about the correctness of distributed algorithms. Here are the topics I cover in sequence:

1. Introduction, Syntax and semantics for distributed programs, predicate calculus
2. Safety and progress properties
3. Proof of program properties
4. Time: logical clocks, State: distributed snapshots
5. Mutual exclusion, Dining philosophers
6. Consensus, Paxos
7. Fault-tolerance, replication, rollback recovery, self-stabilization
8. Programming support for distributed systems
9. Data center computing and cloud computing
10. CAP theorem and NOSQL systems
11. Distributed/WAN storage systems

I put emphasis on reasoning about distributed algorithms because concurrency is very tricky; it truly humbles the human brain. With more than 3 actions in a distributed program, your intuitions will fail; you won't be able to hand-wave and apply operational reasoning to the program. You may think you could, but you would be very wrong (I know from first-hand experience).

I use invariant-based reasoning of program properties for the first 4 weeks exclusively. But this becomes less applicable when we move into more involved protocols in weeks 5 and beyond. This is where I give up being rigorous and tell the class: "We could push things down to the most rigorous invariant-based reasoning and predicate calculus level, but we don't. Instead we give arguments in English, with an appreciation of how these arguments correspond to the proof rules in previous chapters." Yes, this is not very satisfying, but I didn't have much choice.

### TLA+

So for these reasons, the AWS TLA+ article got my attention recently. The article talked about how AWS successfully used invariant-based reasoning and formal methods (in particular TLA) for building robust distributed systems. TLA is a tool for specifying distributed algorithms/protocols and model checking them. AWS used TLA in many key projects: S3, DynamoDB, EBS, and a distributed lock manager. Here is the technical report by AWS. It is a very good read.

TLA+ is Leslie Lamport's brainchild. Of course you know Lamport if you are working on distributed systems. Lamport got a Turing award in 2013; he is famous for logical clocks, Paxos, and several other influential results in distributed systems. As a side-project, he wrote a wrapper around Knuth's TeX, called LaTeX ("La" for Lamport?), which is still the typesetting tool for almost all math/CSE academic papers. Lamport has always been a firm proponent of invariant-based reasoning for distributed algorithms, and it seems he has been dedicating most of his effort to proselytizing TLA in recent years.

There are other successful model checkers (Spin, with its Promela modeling language, and SMV), but TLA is more focused on supporting reasoning about distributed algorithms. In addition, the PlusCal language (in the TLA+ toolkit) provides a high-level pseudo-language for writing distributed algorithms easily.

### How I went about learning TLA

This was a straightforward and easy process. This is the main page for TLA, where the other pages can be reached. To download the toolkit, I first went to this page which forwards to this download page.

Then I downloaded the Hyperbook and started following it. The chapters were all straightforward for me, because this is very similar to the material I teach in my 486/586 class for invariant-based reasoning of distributed algorithms. The hyperbook has a lot of examples and is the best place to start learning TLA.

After I got the hang of it, I decided to get my hands dirty with my own toy programs. I wrote TLA+ specifications for some simple coffee bean problems. Then using PlusCal, I wrote specifications for Dijkstra's stabilizing token ring algorithm. First without using process abstraction, then with the process abstraction when I finished Chapter 7 in Hyperbook. Finally I wrote specifications for Dijkstra's 3-state and 4-state token ring algorithms, which progressed very smoothly. Next, I will use it on Paxos (here is a TLA+ specification of epaxos) and my own work.

### Verdict

The guarded-command language I use for teaching 4/586 translates very easily to PlusCal, so TLA+ is a good fit for my course. I will start using it in my 4/586 class this coming semester. I think the students will enjoy having hands-on experience with reasoning about non-toy distributed protocols.

## [Alg] Quicksort: Recursive and Non-recursive Versions

Published on:

The non-recursive Quicksort algorithm is due to Leslie Lamport. You can find it in this video lecture (title: Thinking for Programmers, from time: 32:35).

### Introducing Quicksort

Quicksort is a well-known sorting algorithm developed by Tony Hoare in 1960. It is a typical "divide-and-conquer" algorithm with a key sub-procedure called partition for the "divide" part.

The informal specification of partition(A:array, lo:int, hi:int) is as follows: it rearranges the subarray A[lo..hi] in place around a chosen pivot element and returns the pivot's final index $p$, such that every element of A[lo..p-1] is at most A[p] and every element of A[p+1..hi] is greater than A[p].

### Recursive Quicksort

With the partition sub-procedure, the quick-sort algorithm can be written as
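A minimal recursive sketch in Python, assuming the Lomuto variant of partition (the exact partition strategy is not fixed by the text above):

```python
def partition(A, lo, hi):
    # Lomuto partition: use A[hi] as the pivot; return its final index.
    pivot = A[hi]
    i = lo
    for j in range(lo, hi):
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]
            i += 1
    A[i], A[hi] = A[hi], A[i]
    return i

def quick_sort(A, lo, hi):
    if lo < hi:
        p = partition(A, lo, hi)
        quick_sort(A, lo, p - 1)   # conquer the left subarray
        quick_sort(A, p + 1, hi)   # conquer the right subarray
```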

Sorting the entire array is accomplished by calling quick-sort(A, 0, length(A) - 1).

### Iterative Quicksort

The algorithm given above is recursive. Now we translate it into an iterative one. The very first idea is to unfold the tail recursions at the code level. This would not be an easy task. Another idea is to investigate the states of the recursive quick-sort algorithm and re-invent an iterative counterpart.

Initially, there is only the entire array. The first application of partition results in two subarrays. When the two subarrays are sorted separately and in place, the entire array is sorted. The second partition divides one of the two subarrays into two smaller subarrays. Continuing in this way, we obtain more and more subarrays with fewer and fewer elements to be sorted. The smallest subarrays contain a single element and are sorted by default. In this way, we obtain an iterative Quicksort algorithm.
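The idea above can be sketched in Python with an explicit stack of subarray bounds replacing the recursion (partition is repeated so the sketch runs standalone):

```python
def partition(A, lo, hi):
    # Lomuto partition: use A[hi] as the pivot; return its final index.
    pivot = A[hi]
    i = lo
    for j in range(lo, hi):
        if A[j] <= pivot:
            A[i], A[j] = A[j], A[i]
            i += 1
    A[i], A[hi] = A[hi], A[i]
    return i

def quick_sort_iter(A, lo, hi):
    # The stack holds the bounds of subarrays still to be sorted.
    stack = [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        if lo < hi:                      # single-element subarrays are done
            p = partition(A, lo, hi)
            stack.append((lo, p - 1))    # pending left subarray
            stack.append((p + 1, hi))    # pending right subarray
```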

Similarly, you can call quick-sort-iter(A, 0, length(A) - 1) to sort the entire array.

### The Partition Procedure

You can find both the recursive and non-recursive versions of Quicksort on GitHub.

## Resources on Distributed Computing

Published on:

### Courses

The lecture notes (ppt) are elegant.
Topics will include the majority (we are going to shoot for all and see what happens) of the following: Global states and event ordering; Logical clocks; Vector clocks; Consistent cuts and global property detection; Rollback-recovery and message-logging protocols; State machine approach; Agreement protocols; Failure detectors; Replication and consistency; Byzantine fault tolerance; Atomic Commit

Lecture notes: Robust Concurrent Computing
It also provides a list of papers to read.

### Videos

#### From Leslie Lamport

• What is Computation: Dr. Leslie Lamport, Microsoft
• Thinking Above the Code: Architects draw detailed blueprints before a brick is laid or a nail is hammered. Programmers and software engineers seldom do. A blueprint for software is called a specification. The need for extremely rigorous specifications before coding complex or critical systems should be obvious—especially for concurrent and distributed systems. This talk explains why some sort of specification should be written for any software.