Testing a Chinese-Language Blog Post

Published on:

诗经 (The Book of Songs) (main heading)

Main text (subheading)

伐木丁丁,鸟鸣嘤嘤 (this is bold)。出自幽谷,迁于乔木 (this is italic)。嘤其鸣矣,求其友声。相彼鸟矣,犹求友声。矧伊人矣,不求友生?神之听之,终和且平。

於粲洒扫,陈馈八簋。既有肥牡,以速诸舅。宁适不来,微我有咎。

伐木于阪,酾酒有衍。笾豆有践,兄弟无远。民之失德,乾餱以愆。有酒湑我,无酒酤我。坎坎鼓我,蹲蹲舞我。迨我暇矣,饮此湑矣。

This is an added image.

My Activities on Stack Sites (Answers)

Published on:

StackExchange: Theoretical Computer Science (cs.theory)

StackExchange: Computer Science Beta (cs.se)

StackExchange: Mathematics (math.se)

Algorithm-Zoo

Published on:
Tags: Algorithm

Search and Selection

Graph Algorithms

My Activities on Stack Sites (Questions)

Published on:

StackExchange: Computer Science Beta

StackExchange: Mathematics

StackExchange: Theoretical Computer Science (cs.theory)

StackOverflow: (stackoverflow)

StackExchange: Tex (tex.stackexchange)

StackExchange: Mathematica

CONSPRO[7] - Providing High Availability Using Lazy Replication

Published on:

@article{Ladin92,
  author     = {Ladin, Rivka and Liskov, Barbara and Shrira, Liuba and Ghemawat, Sanjay},
  title      = {Providing High Availability Using Lazy Replication},
  journal    = {ACM Transactions on Computer Systems},
  issue_date = {Nov. 1992},
  volume     = {10},
  number     = {4},
  year       = {1992},
  pages      = {360--391},
  publisher  = {ACM},
}

Ref = {Ladin@TOCS'92-Lazy replication}

Symbol table

  • log: the record, kept at each replica, of all the update requests it has received from front ends and from other replicas (via gossip)
  • rep_ts: the replica's local timestamp, summarizing its knowledge about the updates it has received
  • val: the state of the replica, generated by the collection of updates executed so far
  • val_ts: the timestamp reflecting the updates that have been executed (i.e., merged into val)
  • inval: the set of cids of updates that have already been processed at this replica

Motivation

To achieve high availability, most large-scale services are implemented as replicated state machines. Consistency of the replicated state can be guaranteed as long as service requests are performed in the same order at all replicas, but this approach is too expensive. Meanwhile, some services can tolerate a weaker, causal ordering. A typical example is conversations in social networks, which consist of multiple causal chains linked by the "Reply" relation. The paper aims to support such causal ordering using lazy replication:

A service request (update or query) is executed at just one replica; updating of other replicas happens by lazy exchange of "gossip" messages.

The Replicated Causal Operations Service

System model

The replicated service consists of a fixed number of replicas and an arbitrary number of clients. For each client, a proxy called the front end (residing at the same client node) is responsible for all communication between the client and the replicas. Thus, a client issues a request by making a local call to its front end, which in turn sends a call message to one replica (in normal operation; things can get complicated in case of node failures and network delays). The replica executes the request and sends a reply message to the front end. Replicas communicate among themselves by lazily exchanging gossip messages.

  • Question 1: What is the lifespan of the front end? The whole application, a session, or defined by users (e.g., using pairs of begin_fe and end_fe)?
  • Ans: Please refer to the client library component in [Lloyd@SOSP'11-COPS].

Specification

The service supports two kinds of requests: updates and queries. Each update request is associated with a unique id (uid), generated by the service and included in its reply message. A label is simply a collection of uids of updates. Causal relations among requests are explicitly captured by assigning a label to each request. In addition, the result value of a query may reflect updates other than those enforced in its label parameter. The interfaces of update and query requests are thus defined as follows:

  • update(prev:label, op) returns (uid)
  • query(prev:label, op) returns (new-label:label, value)

The (sequential) specification of the causal operations service is defined by the following requirements on query requests.

Let q be a query.

  1. q.new-label ⊇ q.prev
  2. If u.uid ∈ q.new-label, then for every update v such that dep(u,v), v.uid ∈ q.new-label
  3. q.value = q.op(Val(q.new-label))

The first clause formalizes the fact that a query may "see" updates other than those enforced in its prev:label parameter.
The second clause states that the returned new-label is dependency-complete: if an update u is reflected in it, then so are all the updates that u depends on. The predicate dep(u,v), meaning that update u causally depends on update v, can be evaluated from u's label (i.e., by checking whether v.uid is contained in u.prev).
The third clause lies at the heart of the correctness of the causal operations service: it specifies how to compute the value to return.
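To make clauses (1) and (2) concrete, here is a minimal sketch, assuming labels are modelled as plain Python sets of uids; the helper names (dep, is_dependency_complete) and the example updates are my own illustrations, not the paper's code.

def dep(u, v):
    """u causally depends on v: v's uid is in the label u was issued with."""
    return v["uid"] in u["prev"]

def is_dependency_complete(new_label, updates):
    """Clause (2): every uid in new_label drags in the uids it depends on."""
    by_uid = {u["uid"]: u for u in updates}
    return all(
        v["uid"] in new_label
        for uid in new_label
        for v in updates
        if dep(by_uid[uid], v)
    )

# Example: u2 was issued with a label containing u1's uid, so u2 depends on u1.
u1 = {"uid": "a", "prev": set()}
u2 = {"uid": "b", "prev": {"a"}}

assert dep(u2, u1)
assert not is_dependency_complete({"b"}, [u1, u2])    # u1 is missing
assert is_dependency_complete({"a", "b"}, [u1, u2])   # clause (2) holds
# Clause (1) is simply set inclusion: q["prev"] <= q["new_label"]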

  • Question 2: What's the use of new-label returned by a query?
  • Ans: It can be used as a prev:label parameter of another request to implement, for instance, monotonic reading.

Programming model

Implementation

Causality tracking and causality checking are two key issues in implementing a causal operations service. As mentioned above, both are achieved by the mechanisms of labels and uids. For efficiency, this paper employs a multipart timestamp as a compact representation for them. A multipart timestamp here is essentially a vector clock, with one entry per replica, supporting the standard compare and merge operations. In the following, the implementation of the service is described both in terms of multipart timestamps and in terms of labels and uids. We first consider normal operation, in which a front end always contacts the same "preferred" replica. We then handle duplication of requests due to node failures and network delays.
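For reference, here is a small sketch of the multipart timestamp itself, assuming it behaves like a standard vector clock with one integer entry per replica; the helper names (vc_leq, vc_merge) are mine, not the paper's.

N_REPLICAS = 3

def vc_leq(a, b):
    """Component-wise comparison: a <= b iff every entry of a is <= b's."""
    return all(x <= y for x, y in zip(a, b))

def vc_merge(a, b):
    """Component-wise maximum: the merge used for rep_ts and val_ts."""
    return [max(x, y) for x, y in zip(a, b)]

t1 = [1, 0, 2]
t2 = [0, 3, 2]
assert not vc_leq(t1, t2) and not vc_leq(t2, t1)   # incomparable: concurrent
assert vc_merge(t1, t2) == [1, 3, 2]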

  • Question 3: What is the [TBD] of the
  • Ans: In normal operation, #replica : #front end = 1 : ; otherwise, #replica : #front end = : .

In normal operation

Each replica keeps a record (log) of all the updates it has received from both front ends and other replicas. It also maintains a local timestamp, rep_ts, that summarizes its knowledge about these updates. The state of each replica, val, is generated by a collection of executed updates, reflected in another timestamp, val_ts. Once an update is executed, its timestamp is merged into val_ts. Each replica guarantees to execute all updates in dependency order.

Processing an update call message:
Replica i performs the following actions when it receives a call message for an update u from a front end:

  1. Update knowledge: rep_ts[i]++;
  2. Compute the uid (ts) for u: u.prev[i] := rep_ts[i]; ts := u.prev;
  3. Construct a record for u and add it to the local log: r := <u, i, ts>;
  4. Execute u if possible: if u.prev <= val_ts, then val := apply(val, u.op); val_ts := merge(val_ts, r.ts);
  5. Send a reply message with the uid (i.e., r.ts).

Implementation issue 1: These five actions should be performed atomically. Notice that we don't wait on the condition u.prev <= val_ts in step (4); instead, we defer that to the "Processing a gossip message" part.
Q1: Construct a scenario in which the local replica cannot process an update immediately (i.e., the condition u.prev <= val_ts of step (4) does not hold).
Q2: Why do we try to execute an update (if possible) here, since it will be performed anyway when gossip messages are processed?

Ans2: What if no gossip messages arrive at all?

Because the uid (r.ts) of an update is computed from its prev:label, it also captures the causal dependencies.
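Below is a hedged sketch of steps (1) through (5) in normal operation. The record layout, class name, and field names are my own guesses for illustration, not the paper's code; the vector-clock helpers are redefined inline to keep the sketch self-contained.

def vc_leq(a, b): return all(x <= y for x, y in zip(a, b))
def vc_merge(a, b): return [max(x, y) for x, y in zip(a, b)]

class Replica:
    def __init__(self, i, n):
        self.i = i                      # this replica's index
        self.rep_ts = [0] * n           # knowledge about received updates
        self.val_ts = [0] * n           # updates reflected in val
        self.val = []                   # toy state: list of applied operations
        self.log = []                   # update records kept for gossiping

    def handle_update(self, prev, op):
        self.rep_ts[self.i] += 1                         # (1) update knowledge
        ts = list(prev)
        ts[self.i] = self.rep_ts[self.i]                 # (2) compute the uid ts
        r = {"op": op, "prev": list(prev), "ts": ts, "node": self.i}
        self.log.append(r)                               # (3) log the record
        if vc_leq(prev, self.val_ts):                    # (4) execute if possible
            self.val.append(op)
            self.val_ts = vc_merge(self.val_ts, ts)
        return ts                                        # (5) reply with the uid

rep = Replica(i=0, n=3)
uid = rep.handle_update(prev=[0, 0, 0], op="post('hello')")
print(uid, rep.val_ts)    # [1, 0, 0] [1, 0, 0]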

Processing a query call message:
When a replica receives a call message for a query q from a front end, it checks whether q is ready to be performed by comparing q.prev with val_ts:
wait until q.prev <= val_ts, then return <val_ts, apply(val, q.op)>.

Implementation issue 2: Start a separate thread for each query and wait on the condition q.prev <= val_ts in parallel.
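A minimal sketch of query handling under the same toy representation: blocking on q.prev <= val_ts is simulated with a polling loop, standing in for the per-query waiting suggested by implementation issue 2. The function name handle_query and the toy op are my own.

import time
from types import SimpleNamespace

def vc_leq(a, b): return all(x <= y for x, y in zip(a, b))

def handle_query(replica, prev, op, poll_interval=0.01):
    # Wait until q.prev <= val_ts, i.e. every update the query depends on
    # is reflected in the replica's value; a real replica would block on a
    # condition variable per query rather than poll.
    while not vc_leq(prev, replica.val_ts):
        time.sleep(poll_interval)
    return list(replica.val_ts), op(replica.val)   # <new-label, value>

rep = SimpleNamespace(val_ts=[1, 0, 0], val=["post('hello')"])
new_label, value = handle_query(rep, prev=[1, 0, 0], op=len)
print(new_label, value)    # [1, 0, 0] 1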

Processing a gossip message:
A gossip message contains m.ts, the sender's timestamp (i.e., its rep_ts), and m.log, the sender's log. Since gossip messages are used to exchange information about the update requests received by the replicas, it is not necessary to include val or val_ts, which represent the update requests that have already been processed.
The replica performs the following actions when it receives a gossip message from replica j.

  1. Merge logs: add the records in m.log that are not already in the local log;
  2. Update knowledge: rep_ts := merge(rep_ts, m.ts);
  3. Asynchronously apply update requests:
    1. Identify the updates that are ready: comp := { r in log | r.prev <= val_ts };
    2. Apply them in dependency order: execute val := apply(val, r.op); val_ts := merge(val_ts, r.ts); for each update record r in comp, in topological order. (Note: r is applied only if it has not been applied yet; see the "Optimization" section.)

Implementation issue 3: The size of m.log. Please refer to Question 7.
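The following is a hedged sketch of gossip processing under the same toy representation. The applied set of record timestamps is my own bookkeeping, standing in for the "has it been applied yet" test discussed in the Optimization section, and a simple fixpoint loop replaces an explicit topological sort; this is equivalent here because a record becomes ready exactly when its prev label is covered by val_ts.

def vc_leq(a, b): return all(x <= y for x, y in zip(a, b))
def vc_merge(a, b): return [max(x, y) for x, y in zip(a, b)]

def handle_gossip(replica, m_ts, m_log):
    # (1) Merge logs, skipping records we already hold (keyed by their ts).
    have = {tuple(r["ts"]) for r in replica.log}
    replica.log += [r for r in m_log if tuple(r["ts"]) not in have]
    # (2) Update knowledge about what has been received.
    replica.rep_ts = vc_merge(replica.rep_ts, m_ts)
    # (3) Apply newly ready records; repeat until no further record becomes ready.
    progressed = True
    while progressed:
        progressed = False
        for r in replica.log:
            key = tuple(r["ts"])
            if key not in replica.applied and vc_leq(r["prev"], replica.val_ts):
                replica.val.append(r["op"])
                replica.val_ts = vc_merge(replica.val_ts, r["ts"])
                replica.applied.add(key)
                progressed = True

from types import SimpleNamespace
rep = SimpleNamespace(log=[], rep_ts=[0, 0], val_ts=[0, 0], val=[], applied=set())
r1 = {"op": "u1", "prev": [0, 0], "ts": [0, 1], "node": 1}
handle_gossip(rep, m_ts=[0, 1], m_log=[r1])
print(rep.val, rep.val_ts)    # ['u1'] [0, 1]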

The following question concerns the correctness of the implementation of the replicated service.

  • Question 4: Does the multipart timestamp mechanism introduce extra (false) causal dependencies between two independent requests sent to the same replica from different front ends, due to the way uids are computed?
  • Ans: No. Although the uids of these two requests (say, u and v) are totally ordered (in terms of multipart timestamps), the causality check of whether u depends on v is evaluated by comparing u.prev and v.uid (not u.uid). (figure: lazy replication) Going further: is it possible that v.uid <= u.prev holds while u does not actually depend on v? [TBD] How to prove this? (First, prev and uid do not change once they are generated. Secondly, a newly assigned uid is not smaller than any older one.)

With duplicates

Due to node failures and network delays, the response to a request may be too slow (detected by a timeout), so the client (front end) may resend it to other replicas. Thus, there may be multiple call messages (and uids) for a single update in the system at the same time. Despite these duplicates, each update must be performed at most once at each replica. To this end, each update is assigned a unique call identifier, cid, by its front end.

  • Question 5: Since queries can also be duplicated, why don't we assign a (system-wide) unique identifier to them as well?
  • Ans: Because query requests do not modify states of replicas and do not need to be propagated.

Each replica maintains a set (e.g., a hash table in the implementation) of cids, inval, consisting of the cids of the updates that have been processed at this replica. The inval set is updated whenever a new update is processed: val := apply(val, r.op); inval := inval + {r.cid}; val_ts := merge(val_ts, r.ts);.

Before applying a ready update u, a duplication check is done: r.cid ∉ inval. However, whether u is a duplicate or not, its timestamp r.ts is merged into val_ts, as sketched below.
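A small sketch of the apply step with the duplicate check, assuming each record carries the cid assigned by its front end; only this step changes relative to the gossip sketch above, and the names are mine.

def vc_merge(a, b): return [max(x, y) for x, y in zip(a, b)]

def apply_ready_record(replica, r):
    # Execute the update only if its cid has not been seen at this replica.
    if r["cid"] not in replica.inval:
        replica.val.append(r["op"])
        replica.inval.add(r["cid"])
    # Duplicate or not, the record's timestamp is merged into val_ts
    # (Question 6 below explains why this matters).
    replica.val_ts = vc_merge(replica.val_ts, r["ts"])

from types import SimpleNamespace
rep = SimpleNamespace(val=[], val_ts=[0, 0], inval=set())
rec = {"op": "u1", "cid": "fe1-42", "ts": [0, 1]}
apply_ready_record(rep, rec)
apply_ready_record(rep, rec)      # the duplicate is not executed twice
print(rep.val, rep.val_ts)        # ['u1'] [0, 1]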

  • Question 6: What if we don't merge the timestamps of duplicates of an update into val_ts?
  • Ans: Consider the scenario illustrated in the figure "merge the duplicates". If a replica does not merge the timestamp of a duplicate (received via gossip from another replica) into its val_ts, then another update that depends on the duplicated update will never be processed by that replica.

Optimization

The logs in the log-exchange based implementation would grow without bound. An important performance issue is to control the size of the log. An update record can be deleted from the log at some replica if (1) it is known everywhere and (2) has been reflected in val locally.

To check the first condition, a replica should know what other replicas know. For this purpose, each replica maintains a table of timestamps, ts_table, of which ts_table[p] is the latest rep_ts of replica p received via gossip messages. Thus, the predicate stating that a replica knows that an update record r is known everywhere can be defined as: for every replica p, r.ts[c] <= ts_table[p][c], where c is the replica that created r (and assigned its timestamp).

Remark: The parameter r in the predicate is an update record, not an update; an update may have multiple records (one per duplicate).

For the second condition, we require that the gossip protocol guarantee that when the receiver receives a record r from replica j, it has also received all records processed at j before r. This requirement puts constraints on when and what to gossip.

  • Question 7: When and what to gossip to meet the requirement of gossip protocol?
  • Ans: When: at any time; What: If we assume TCP reliable communication, we can gossip a record with all its nearest dependencies as those in [Lloyd@SOSP'11-COPS].

Combine the nearest dependencies of [Lloyd@SOSP'11-COPS] with the labels and uids of [Ladin@TOCS'92-Lazy replication]: at the front end, only the label and the uids of the nearest dependencies are sent to replicas. The update records themselves are stored at the replicas and propagated among them.

With this in mind, we can now discard obsolete update records by extending the "Processing a gossip message" part with two steps:

(4). Update ts_table: ts_table[j] := m.ts
(5). Discard obsolete records: remove from the log every record that is known everywhere and has been reflected in the local val (see the sketch below).
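A hedged sketch of steps (4) and (5), under the same toy representation as the earlier sketches and assuming the "known everywhere" predicate reconstructed above; the applied set again stands in for "reflected in val locally", and all names are mine.

def is_known_everywhere(r, ts_table):
    # Record r was created (and timestamped) at replica c = r["node"]; it is
    # known everywhere once every replica's latest rep_ts covers r.ts[c].
    c = r["node"]
    return all(ts[c] >= r["ts"][c] for ts in ts_table.values())

def update_ts_table(replica, j, m_ts):
    replica.ts_table[j] = list(m_ts)       # step (4): remember what j knows

def discard_obsolete(replica):
    # Step (5): drop records that are known everywhere and already applied here.
    replica.log = [
        r for r in replica.log
        if not (is_known_everywhere(r, replica.ts_table)
                and tuple(r["ts"]) in replica.applied)
    ]

from types import SimpleNamespace
rep = SimpleNamespace(log=[{"op": "u1", "ts": [0, 1], "node": 1}],
                      ts_table={0: [0, 1], 1: [0, 1]},
                      applied={(0, 1)})
discard_obsolete(rep)
print(rep.log)    # [] -- the record is known everywhere and applied locally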

  • Implementation issue 4: In this paper, the proof of the correctness of the implementation relies on the assumption that gossip is processed atomically. However, I am not sure whether this is mandatory.

  • Question 8: Comparison between ts_table and matrix clocks.

[Re-Blogger] Using TLA+ for Teaching Distributed Systems

Published on:

This is a reblogged article published on Blogger by Murat on August 12, 2014.

I am teaching the CSE 4/586 Distributed Systems class again this Fall (Fall 2014). This is the course I have the most fun teaching. (I would like to think my students also feel that way. :-)) I teach the course with an emphasis on reasoning about the correctness of distributed algorithms. Here are the topics I cover, in sequence:

  1. Introduction, Syntax and semantics for distributed programs, predicate calculus
  2. Safety and progress properties
  3. Proof of program properties
  4. Time: logical clocks, State: distributed snapshots
  5. Mutual exclusion, Dining philosophers
  6. Consensus, Paxos
  7. Fault-tolerance, replication, rollback recovery, self-stabilization
  8. Programming support for distributed systems
  9. Data center computing and cloud computing
  10. CAP theorem and NOSQL systems
  11. Distributed/WAN storage systems

I put emphasis on reasoning about distributed algorithms because concurrency is very tricky; it truly humbles the human brain. More than 3 actions in a distributed program and your intuitions will fail; you won't be able to hand-wave and apply operational reasoning to the program. You may think you could, but you would be very wrong (I know from first-hand experience).

I use invariant-based reasoning about program properties for the first 4 weeks exclusively. But this becomes less applicable when we move into more involved protocols in weeks 5 and beyond. This is where I give up being rigorous and tell the class: "We could push things down to the most rigorous invariant-based reasoning and predicate calculus level, but we don't. Instead we give arguments in English, with an appreciation of how these arguments correspond to the proof rules in previous chapters." Yes, this is not very satisfying, but I didn't have much choice.

TLA+

So for these reasons, the AWS TLA+ article got my attention recently. The article talked about how AWS successfully used invariant-based reasoning and formal methods (in particular TLA) for building robust distributed systems. TLA is a tool for specifying distributed algorithms/protocols and model checking them. AWS used TLA in many key projects: S3, DynamoDB, EBS, and a distributed lock manager. Here is the technical report by AWS. It is a very good read.

TLA+ is Leslie Lamport's brainchild. Of course you know Lamport if you are working on distributed systems. Lamport got a Turing award in 2013; he is famous for logical clocks, Paxos, and several other influential results in distributed systems. As a side project, he wrote a wrapper around Knuth's TeX, called LaTeX ("La" for Lamport?), which is still the typesetting tool for almost all math/CSE academic papers. Lamport has always been a firm proponent of invariant-based reasoning for distributed algorithms, and it seems he has been dedicating most of his effort to proselytizing TLA in recent years.

There are other successful model checkers (Spin, SMV, Promela), but TLA is more focused on supporting distributed algorithms reasoning. In addition, the PlusCal language (in the TLA+ toolkit) provides a high-level pseudo language to write distributed algorithms easily.

How I went about learning TLA

This was a straightforward and easy process. This is the main page for TLA, where the other pages can be reached. To download the toolkit, I first went to this page which forwards to this download page.

Then I downloaded the Hyperbook and started following it. The chapters were all straightforward for me, because this is very similar to the material I teach in my 486/586 class for invariant-based reasoning of distributed algorithms. The hyperbook has a lot of examples and is the best place to start learning TLA.

For the PlusCal language reference I downloaded this.

After I got the hang of it, I decided to get my hands dirty with my own toy programs. I wrote TLA+ specifications for some simple coffee bean problems. Then using PlusCal, I wrote specifications for Dijkstra's stabilizing token ring algorithm. First without using process abstraction, then with the process abstraction when I finished Chapter 7 in Hyperbook. Finally I wrote specifications for Dijkstra's 3-state and 4-state token ring algorithms, which progressed very smoothly. Next, I will use it on Paxos (here is a TLA+ specification of epaxos) and my own work.

Verdict

The guarded-command language I use for teaching 4/586 translates very easily to PlusCal, so TLA+ is a good fit for my course. I will start using it in my 4/586 class this coming semester. I think the students will enjoy having hands-on experience with reasoning about non-toy distributed protocols.

[Alg] Quicksort: Recursive and Non-recursive Versions

Published on:

The non-recursive Quicksort algorithm is due to Leslie Lamport. You can find it in this video lecture (title: Thinking for Programmers, from time: 32:35).

Introducing Quicksort

Quicksort is a well-known sorting algorithm developed by Tony Hoare in 1960. It is a typical "divide-and-conquer" algorithm with a key sub-procedure called partition for the "divide" part.

The informal specification of partition(A:array, lo:int, hi: int) is as follows:

Procedure partition(A:array, lo:int, hi:int): pivot_index:int
@param A:array 
    the array to sort
@param lo:int, hi:int  
    A[lo .. hi] to be partitioned
@return pivot_index:int
    the index of the pivot element
@postcondition
    The subarray A[lo .. hi] is reordered and partitioned into two parts:
    the left part A[lo .. pivot_index - 1], with elements less than or equal to A[pivot_index],
    and the right part A[pivot_index + 1 .. hi], with elements all greater than A[pivot_index].
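As a concrete (hedged) example, here is one possible partition implementation in Python matching the contract above. It uses the Lomuto scheme with the last element as the pivot, which is an assumption on my part rather than the scheme used in the lecture.

def partition(A, lo, hi):
    pivot = A[hi]
    i = lo - 1                        # boundary of the "<= pivot" region
    for j in range(lo, hi):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]   # grow the "<= pivot" region
    A[i + 1], A[hi] = A[hi], A[i + 1] # place the pivot at its final index
    return i + 1

A = [5, 2, 9, 1, 5, 6]
p = partition(A, 0, len(A) - 1)
print(p, A)    # 4 [5, 2, 1, 5, 6, 9]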

Recursive Quicksort

With the partition sub-procedure, the quick-sort algorithm can be written as

Algorithm quick-sort(A: array, lo: int, hi: int)
  if lo < hi:
    p := partition(A, lo, hi)
    quicksort(A, lo, p - 1)
    quicksort(A, p + 1, hi)

Sorting the entire array is accomplished by calling quick-sort(A, 0, length(A) - 1).

Iterative Quicksort

The algorithm given above is recursive. Now we translate it into an iterative one. The very first idea is to unfold the tail recursions at the code level. This would not be an easy task. Another idea is to investigate the states of the recursive quick-sort algorithm and re-invent an iterative counterpart.

Initially, there is only the entire array. The first application of partition results in two subarrays. When the two subarrays are sorted separately and in place, the entire array is sorted. The second partition divides one of the two subarrays into two smaller subarrays. Continuing in this way, we obtain more and more subarrays with fewer and fewer elements to be sorted. The smallest subarrays contain a single element and are sorted by default. In this way, we obtain an iterative Quicksort algorithm.

Algorithm quick-sort-iter(A:array, lo:int, hi:int)
    @state
    range_array is an array (we use FIFO queue) of subarrays to be sorted; 
    each subarray is a triple of `(A, lo, hi)`.
    
  range_array := {(A, lo, hi)}
  
  while (range_array is not empty)
  {
    (A, lo', hi') := pop (and remove) a subarray from range_array
    if (lo' < hi')
    {
      p := partition(A, lo', hi')
      range_array := range_array + {(A, lo', p - 1), (A, p + 1, hi')}
    }
  }

Similarly, you can call quick-sort-iter(A, 0, length(A) - 1) to sort the entire array.

The Partition Procedure

You can find both the recursive and non-recursive versions of Quicksort on GitHub.

Resources on Distributed Computing

Published on:

People

Groups

SIG

Conferences

Journals

Magazines

Prizes

Courses

The lecture notes (ppt) are elegant.
Topics will include the majority (we are going to shoot for all and see what happens) of the following: Global states and event ordering; Logical clocks; Vector clocks; Consistent cuts and global property detection; Rollback-recovery and message-logging protocols; State machine approach; Agreement protocols; Failure detectors; Replication and consistency; Byzantine fault tolerance; Atomic Commit

Lecture notes: Robust Concurrent Computing
It also provides a list of papers to read.

Computer Science Ph.D. Thesis

Tools

Blogs

English

Chinese

Other Articles

Videos

From Leslie Lamport

  • What is Computation: Dr. Leslie Lamport, Microsoft
  • Thinking Above the Code: Architects draw detailed blueprints before a brick is laid or a nail is hammered. Programmers and software engineers seldom do. A blueprint for software is called a specification. The need for extremely rigorous specifications before coding complex or critical systems should be obvious, especially for concurrent and distributed systems. This talk explains why some sort of specification should be written for any software.

General