diff --git a/text/2018-01-07-deque-proof.md b/text/2018-01-07-deque-proof.md new file mode 100644 index 0000000..2324607 --- /dev/null +++ b/text/2018-01-07-deque-proof.md @@ -0,0 +1,1041 @@ +# Summary + +This RFC proposes an implementation of concurrent work-stealing double-ended queue (deque) on +Crossbeam, and its correctness proof. this RFC has a [companion PR to `crossbeam-deque`][deque-pr]. + + + +# Motivation + +The current implementation of deque is not fully optimized yet, leaving non-negligible room for +performance improvement. However, it is unclear how to further optimize the current implementation +while guaranteeing its correctness. This RFC resolves this issue by proving an optimized +implementation. + + + +# Detailed design + +In this section, we present a pseudocode of deque, and prove that (1) it is safe in the C/C++ memory +consistency model, and (2) it is functionally correct, i.e. it acts like a deque. + + +## Semantics of C/C++ relaxed-memory concurrency + +For the semantics of C/C++ relaxed-memory concurrency on which the deque implementation is based, we +use the state-of-the-art [promising semantics][promising]. For a gentle introduction to the +semantics, we refer the reader to [this blog post][synch-patterns]. + +One caveat of the original promising semantics is its lack of support for memory allocation. In this +RFC, we use the following semantics for memory allocation: + +- In the original promising semantics, a timestamp was a non-negative quotient number: `Timestamp = + Q+`. Now we define `Timestamp = Option`, where `None` means the location is unallocated. + Suppose `None < Some(t)` for all `t`. + +- At the beginning, a thread's view is `None` for all the locations. When a thread allocates a + location, its view on the location becomes `0`. + +- If a thread is reading a location and it's view on the location is `None`, i.e. it is unallocated + in the thread's point of view, then the behavior is undefined. + +We believe this definition is compatible with all the results presented in the [the promising +semantics paper][promising]. We hope [its Coq formalization][promising-coq] be updated accordingly +soon. + + +## Proposed implementation + +A work-stealing deque is owned by the "owner" thread (`Deque`), which can push items to and pop +items from the "bottom" end of the deque. Other threads, which we call "stealers" (`Stealer`), +try to steal items from the "top" end of the deque. Due to CAS failure, a steal invocation may +return `Retry`, which basically means the invocation did nothing and you should retry. Deque's +interface is summarized as follows: + +```rust +impl Deque { // Send + !Sync + pub fn new() -> Self { ... } + pub fn stealer(&self) -> Stealer { ... } + + pub fn push(&self, value: T) { ... } + pub fn pop(&self) -> Option { ... } // None if empty +} + +impl Stealer { // Send + Sync + Clone + pub fn steal(&self) -> Steal { ... } +} + +enum Steal { + Data(T), + Empty, + Retry, // CAS failed +} +``` + +Chase and Lev proposed [an efficient implementation of deque][chase-lev] on sequentially consistent +memory model, and Lê et. al. proposed [its variant for ARM and C/C++ relaxed-memory +concurrency][chase-lev-weak]. This RFC presents an even more optimized variant for C/C++. + +A Chase-Lev deque consists of (1) the `bottom` index for the owner, (2) the `top` index for both the +owner and the stealers, and (3) the underlying `buffer` that contains items. When the buffer is too +small or too large, the owner may replace it with one with more appropriate size: + +```rust +struct Inner { + bottom: AtomicUsize, + top: AtomicUsize, + buffer: Ptr>, +} + +impl Inner { + ... +} +``` + +The `Deque` and `Stealer` structs have a counted reference to the inner implementation, and delegate +method invocations to it: + +```rust +struct Deque { + inner: Arc>, +} + +impl Deque { + fn push(...) { + self.inner.push(...) + } + + ... +} + +... +``` + +The pseudocode for `Inner`'s `fn push()`, `fn pop()`, and `fn steal()` is as follows: + +```rust +pub fn push(&self, value: T) { + 'L101: let b = self.bottom.load(Relaxed); + 'L102: let t = self.top.load(Acquire); + 'L103: let mut buffer = self.buffer.load(Relaxed, epoch::unprotected()); + + 'L104: let cap = buffer.get_capacity(); + 'L105: if b - t >= cap { + 'L106: self.resize(cap * 2); + 'L107: buffer = self.buffer.load(Relaxed, epoch::unprotected()); + 'L108: } + + 'L109: buffer.write(b % buffer.get_capacity(), value, Relaxed); + 'L110: self.bottom.store(b + 1, Release); +} + +pub fn pop(&self) -> Option { + 'L201: let b = self.bottom.load(Relaxed); + 'L202: self.bottom.store(b - 1, Relaxed); + 'L203: fence(SeqCst); + 'L204: let t = self.top.load(Relaxed); + 'L205: let len = b - t; + + 'L206: if len <= 0 { + 'L207: self.bottom.store(b, Relaxed); + 'L208: return None; + 'L209: } + + 'L210: let buffer = self.buffer.load(Relaxed, epoch::unprotected()); + 'L211: let mut value = Some(buffer.read((b - 1) % buffer.get_capacity(), Relaxed)); + + 'L212: if len == 1 { + 'L213: if self.top.compare_and_swap(t, t + 1, Release, Acquire).is_err() { + 'L214: mem::forget(value.take()); + 'L215: } + 'L216: self.bottom.store(b, Relaxed); + 'L217: return value; + 'L218: } + + 'L219: let cap = buffer.get_capacity(); + 'L220: if len < cap / 4 { + 'L221: self.resize(cap / 2); + 'L222: } + 'L223: value +} + +fn resize(&self, cap_new) { + 'L301: let b = self.bottom.load(Relaxed); + 'L302: let t = self.top.load(Relaxed); + 'L303: let old = self.buffer.load(Relaxed, epoch::unprotected()); + 'L304: let new = Buffer::new(cap_new); + + 'L305: ... // copy data from old to new + + 'L306: let guard = epoch::pin(); + 'L307: self.buffer.store(Owned::new(new).into_shared(guard), Release); + 'L308: guard.defer(move || old.into_owned()); +} + +pub fn steal(&self) -> Steal { + 'L401: let mut t = self.top.load(Relaxed); + + 'L402: let guard = epoch::pin_fence(); // epoch::pin(), but issue fence(SeqCst) even if it is re-entering + + 'L403: let b = self.bottom.load(Acquire); + 'L404: if b - t <= 0 { return Empty; } + + 'L405: let buffer = self.buffer.load(Acquire, &guard); + 'L406: let value = buffer.read(t % buffer.get_capacity(), Relaxed); + + 'L407: if self.top.compare_and_swap_weak(t, t + 1, Release, Relaxed).is_err() { + 'L408: mem::forget(value); + 'L409: return Retry; + 'L410: } + + 'L411: Data(value) +} +``` + + +## Safety + +We first prove that the proposed implementation is memory-safe, i.e. it does not invoke undefined +behavior due to illegal memory accesses. We discuss two possible causes of undefined behavior: data +race and invalid pointer dereference. + + +### Absence of Data Races + +All shared objects, except for the contents of `buffer`, are atomically accessed. For the contents +of `buffer`, the stealers only atomically reads from the contents, and the owner only reads from and +atomically writes to the contents. Thus there are no data races that invoke undefined behavior. + +In particular, the stealers should atomically read from the contents of `buffer` (`'L406`) because +`fn push(): 'L109` may concurrently writes to the same object. For example, the scheduler may +preempt a `steal()` invocation right after `'L402` so that `t` read in `'L401` may be arbitrarily +stale. Now, suppose that in a concurrent `push()` invocation, `b` equals to `t + +buffer.get_capacity()` and it is overwriting a value to `buffer`'s `(b % buffer.get_capacity())`-th +element. At the same time, the `steal()` invocation may wake up and read `buffer`'s `(t % +buffer.get_capacity())`-th element. + +What's the cost of atomic access to the buffer? It is worth noting that the deque's item may be +arbitrarily large, and [atomic accesses to large objects are probably blocking][cppatomic] in the +C/C++11 standard library. But actually this can be non-blocking, e.g. by interpreting a relaxed +store to a large object as concurrent relaxed stores to each word of the object. So we assume the +accesses to the contents of the buffer are non-blocking, and they add no additional synchronization +cost than plain accesses. + + + +### Absence of Invalid Pointer Dereference + + + +For pointer safety, it is sufficient to prove that every access to the contents of the buffer +pointed-to by `Inner`'s `self.buffer` is safe, since (1) the other shared objects, namely +`self.bottom`, `self.top`, and `self.buffer`, are just integers or pointers, and (2) accesses to +thread-local storage are straightforwardly safe. It is worth noting that while the deque is shared +among the owner and its stealers, only the owner is able to call `fn push()`, `fn pop()`, and `fn +resize()`, thereby modifying `self.bottom` and `self.buffer`. + +After `self.buffer` is initialized, it is modified only by the owner in `fn resize(): 'L307`. Since +the contents inside a buffer is always accessed modulo the buffer's capacity (`'L109`, `'L211`, +`'L406`) and the buffer's size is always nonzero, there are no buffer overruns. + +Thus it remains to prove that the buffer is not used after it has been freed. Thanks to Crossbeam, +we don't need to take care of all the details of memory reclamation; the [relaxed-memory +RFC][rfc-relaxed-memory] proves that as a user of Crossbeam, we only need to show: + +- When a buffer is deferred to be dropped (`'L308`), there are no longer references to the buffer in + the shared memory, and no concurrent mutators introduce a reference to the buffer in the memory + again. + + You can easily check these properties because the owner basically "owns" the buffer: only the + owner introduces/eliminates references to a buffer to/from the shared memory. Indeed, before a + buffer is deferred to be dropped at `'L308`, the only reference to the buffer via `self.buffer` is + removed at `'L307`. + +- With an `unprotected()` guard, we should not defer to drop an object that is concurrently accessed + by another thread, and should not access an object that is concurrently deferred to be dropped by + another thread. + + It is indeed the case because (1) a buffer is deferred to be dropped only by the owner with a + protected guard, and (2) stealers do not use an unprotected guard. + + + +## Functional Correctness + +A more interesting question is: does the implementation act like a deque? + + +### Specification: Linearizability and Synchronization + +In order to answer the question, we first define the sequential specification of a deque. A *deque +state* is a triple `(t, b, A)`, where `t` is the top index, `b` is the bottom index, and `A` is the +array of the values within the indexes `[t, b)`. A `push()` invocation increases the bottom index +`b`, and push the input value at the end of `A`. A `pop()` invocation decreases the bottom index `b` +and pop the last value from `A`, if `t < b`; otherwise, touches nothing and returns `EMPTY`. (As a +special case, if `t+1 = b`, you can increase the top index `t` instead of decreasing the bottom +index `b`.) A `steal()` invocation increases the top index `t` and pop the first value from `A`, if +`t < b`; otherwise, touches nothing and returns `EMPTY`. By `S -(I,R)-> S'` we denote the +*transition relation* described above, from a deque state `S` into `S'` by the invocation `I` +returning `R`. + +A shared object is **linearizable** as a deque if the following holds: + +> Suppose that multiple threads invoked methods (i.e. `push()`, `pop()`, `steal()`) on the shared +> object in a single execution. For those invocations that is not `steal()` returning `RETRY`, there +> exists a permutation `I_0`, ..., `I_(n-1)` of them, which we call **"linearization order"**, +> denoted by ` +> - (VIEW) For all `x` and `y`, if `I_x +> Furthermore, there exists a deque state `(t_i, b_i, A_i)` for each `i`, such that: +> +> - (SEQ) Initially, `(t_0, b_0, A_0) = (0, 0, [])` holds. Furthermore, for all `i`, `(t_i, b_i, +> A_i) -(signature(I_i),ret(I_i))-> (t_(i+1), b_(i+1), A_(i+1))` holds. In other words, the +> invocations `I_0`, ..., `I_(n-1)` return outputs *as if* the invocations are sequentially +> (i.e. non-concurrently) executed to a deque in the linearization order. + +Here, by `signature(I)` and `ret(I)` we mean the function call signature and the return value of +`I`. Also, by `I - (SYNC) If a value is `push()`ed by an invocation `I` and then `steal()`ed by an invocation `J`, +> then `view_beginning(I) <= view_end(J)` holds. + +Note that a similar property for a matching pair of `push()` and `pop()` is a corollary of +linearizability, as only the owner is able to call `push()` and `pop()` and the `push()` should be +called before the `pop()` in the "program order" (the execution order in a single thread). + + +#### View-based Definition of Linearizability + +The conditions `(VIEW)` and `(SEQ)` are a generalization of the conditions `L2` and `L1` in [the +traditional definition of linearizability][linearizability] to relaxed-memory concurrency. What is +unique in `(VIEW)` is that while the definition of ` i_1` is a legit linearization order, yet *morally* `i_1` should be ordered before `i_2` because +it is so in the program order. However, we do not regard it as a problem: if `i_1` and `i_2` do not +change the view, they are read-only, and should not change the state of the shared object. Thus it +doesn't matter whether `i_1` is considered to be before or after `i_2`. + + + + + + +### Construction of Linearization Order + +Consider the invocations to the deque in an execution. For the owner's invocations, the program +order is the only possible linearization among them. Let's assume that `O_0`, ..., `O_(o-1)` are the +owner's invocations, sorted according to the program order. We construct a full linearization order +by inserting stealers' invocations into it. + +Note that every owner's invocation writes to `bottom`, and every stealers' invocation reads from +`bottom` at `'L403`. Using `bottom`, we classify the stealers' invocations into "groups" `{G_i}` as +follows: + +- For a stealer's invocation `S`, if `S` reads the initial value from `bottom`, then `S ∈ + G_(-1)`. Otherwise, let `O_i` be such an owner's invocation that `S` read from `bottom` a value + written by` O_i`. + +- If `S` returns `EMPTY`, then `S ∈ G_i`. + +- Otherwise (i.e. `S` returns a value), let `j` be such an index that `O_(j+1)`, ..., `O_i` are + `pop()` taking the "irregular" path (i.e. not entering `'L219`) and `O_j` is not. (`j = -1` if + `O_0`, ..., `O_i` are all `pop()` taking the irregular path.) Then `S ∈ G_j`. + +We will insert the invocations in `G_i` between `O_i` and `O_(i+1)`. Inside a group `G_i`, we give +the linearization order as follows: + +- Let `STEAL^x` be the set of steal invocations that stole an element at the index `x`, and + `STEAL_EMPTY` be the set of steal invocations that returns `EMPTY`. Let `STEAL` be `Union_x + {STEAL^x}`. + +- Within `G_i`, place `STEAL` invocations first, and then `STEAL_EMPTY` invocations. + +- If there are multiple `STEAL` invocations in `G_i`, order `STEAL^x` before `STEAL^y` if `x < y`. + +- If there are multiple `STEAL_EMPTY` invocations in `G_i`, order `I` before `J` if `I (VIEW-LOC): If `view_beginning(I)[loc] < view_end(J)[loc]` for some `loc`, then it is not the + case that `J (IRREGULAR-TOP): We have [timestamp of `top = x`] <= `view_end(O_i)[top]`. + + First of all, the notation "[timestamp of `top = x`]" is well-defined, since `top` is + monotonically increasing by CAS at `'L213` or `'L407`, so that there can be only one write `top = + x` for each value `x`. For this reason, in the rest of this RFC, we intentionally confuse the + values of `top` with the timestamps of `top`. + + Let `y` be the value `O_i` read from `top` at `'L204`. Since `O_i` is taking the irregular path, + we have `x-1 <= y`. If `x <= y`, then the conclusion trivially follows. If `y = x-1`, then `O_i` + tries to compare-and-swap (CAS) `top` from `x-1` to `x` at `'L213`. Regardless of whether the CAS + succeeds or fails, `O_i` reads or writes `top >= x`. + + It is worth nothing that for this lemma to hold, it is necessary for the CAS at `'L213` to be + strong, i.e. the CAS does not spuriously fail. + +- > (IRREGULAR-STEAL): Let `S` be a successful `steal()` that reads from `bottom` a value written by + `O_i`, and `y` be the value `S` writes to `top` at `'L407`. Then `y < x` holds. + + If `S` read `bottom = x-1` written by `O_i` at `'L202`, then `y <= x-1` from the condition at + `'L404`. Otherwise, `S` read `bottom = x` written by `O_i` at `'L207` or `'L216`. We already know + `y <= x` from the condition at `'L404`, so suppose `y = x` and let's derive a contradiction. + + + Case `S` read `bottom = x` written at `'L207`. + + Then `O_i` read `top >= x` at `'L204`. In order for `S` to read `bottom = x` at `'L403` and + `O_i` to read `top >= x` at `'L204` at the same time, either `S`'s write to `top` at `'L407` + should be promised before reading `bottom` at `'L403`, or `O_i`'s write to `bottom` at `'L207` + should be promised before reading `top` at `'L204`. But this is impossible. For the former, + `S`'s write to `top` at `'L407` is a release-store. For the latter, in order for `O_i` to + promise to write `bottom = x` at `'L207`, `O_i` should have read `top = x-1` at `'L204` and then + execute the CAS at `'L213` in certification. But a certain future memory mandates the CAS to + fail, acquiring an arbitrarily high view, after which it is impossible to fulfill the + promise. In short, there is a semantic dependency from `O_i`'s read from `top` at `'L204` to + `O_i`'s write to `bottom` at `'L207`. + + It is worth noting that in architectures, such as ARM and Power, usually it is enough to show + that there is syntactic dependency from `O_i`'s read from `top` at `'L204` to `O_i`'s write to + `bottom` at `'L207`. We have to reason semantic dependency because we are targeting C-like + languages, where compilers may perform various optimizations. + + + Case `S` read `bottom = x` written at `'L216`. + + Then `O_i` executes the CAS at `'L213`. Since `S` writes `top = x`, it is impossible for the CAS + to succeed. Since the CAS is strong, `O_i` should have read `top >= x`. Then there is a + release-acquire synchronization from `S`'s write to `top` at `'L407` to `O_i`'s read from `top` + at `'L213`. Thus it is impossible for `S` to read from `bottom` at `'L403` the value `x` written + by `O_i` at `'L216`. + + +### Proof of `(VIEW)` + +For `(VIEW)`, it is sufficient to prove that: + +1. `(VIEW-OWNER-STEAL)`: for all `i <= j` and `S ∈ G_j`, it is not the case that `S i`, and for all + `k ∈ (i, j]`, `O_k` is `pop()` taking the irregular path. By `(IRREGULAR-STEAL)`, we have `y+1 < + b`. By the definition of `push()` and `pop()`, we have `b-1 <= x'`, and thus `y <= b-2 < b-1 <= x' + <= y'`. In either case, we have `y < y'`, and by `(VIEW-LOC)`, the conclusion follows. + + +## Proof of `(SEQ)` and `(SYNC)` + +Let `I_0`, ..., `I_(n-1)` be the invocations sorted according to the constructed linearization +order. In addition to `(SEQ)` and `(SYNC)`, we will simultaneously prove the following conditions +with the same existentially quantified values of `t_i`, `b_i`, `A_i`: + +> `(BOTTOM)`: If `I_i` is an owner invocation, then `b_i` equals to the value `I_i` read from +> `bottom` at `'L101` or `'L201`. +> +> `(TOP)`: In `I_0`, ..., `I_(i-1)`, there are `t_i` invocations that succeed in updating `top`. Let +> `T_0`, ..., `T_((t_i)-1)` be such invocations, sorted according to the order in `I`. Then `T_x` +> updates `top` from `x` to `x+1`. +> +> `(CONTENTS)`: for all `x ∈ [t_i, b_i)`, there exists a `push()` invocation into `x` in `I_0, ..., +> I_(i-1)`; and `A_i[x]` is the value inserted by the last such invocation. + +We prove that `{I_i}` satisfies `(SEQ)`, `(SYNC)`, `(BOTTOM)`, `(TOP)`, and `(CONTENTS)` by +induction on `n`: suppose `I_0`, ..., `I_(i-1)` satisfies those conditions, and let's prove that +there exists `b_(i+1)`, `t_(i+1)`, and `A_(i+1)` such that `I_i` also satisfies those conditions. We +prove for each case of `I_i`. + +- Case 1: `I_i` is `push()`. + + Quite obvious. + +- Case 2: `I_i` is `pop()` taking the regular path. + + Let `j` be such an index that `I_i = O_j`. Let `x` and `y` be the values `I_i` read from `bottom` + at `'L201` and `top` at `'L204`, respectively. By `(BOTTOM)`, we have `x = b_i`. Since `I_i` is + taking the regular path, we have `y+2 <= x`. We also prove `t_i <= y+1`. Consider the invocation + `I` that writes `y+2` to `top`, if exists. (Otherwise, by `(TOP)`, `t_i <= y+1` should hold.) If + `I` is `pop()`, then `I` should be linearized after `I_i` thanks to the coherence of `top`; if `I` + is `steal()`, then by the synchronization of the seqcst-fences from `I_i` to `I` via `top`, the + value `I` read from `bottom` at `'L403` should be coherence-after-or `WL_j`. Thus `I` is + linearized after `I_i`, and by `(TOP)`, `t_i <= y+1` holds. + + Then we have `t_i <= y+1 < y+2 <= x = b_i`, and it is legit to pop a value from the `bottom` end + of the deque and decrease `bottom`. + + Now we choose `b_(i+1) = b_i - 1`, `t_(i+1) = t_i`, and `A_(i+1) = A_i`, i.e. `I_i` pops a value + from the `bottom` end of the deque. It remains to prove that `I_i` returns the right value. Let + `O_k` be the last `push()` operation in `I_0`, ..., `I_(i-1)` that pushed to the index `x-1` and + writes `bottom = x`, and `v` be the value `O_k` pushed. Thanks to `(CONTENTS)`, it is sufficient + to prove that `I_i` returns `v`. + + For all `l`, let `WB_l` be the value of `buffer` at the beginning of the invocation `O_l`. Also, + for all `z`, let `WC_(l, z)` be the `z`-th contents of the buffer `WB_l` at the beginning of the + invocation `O_l`. These are well-defined since the pointer `buffer` and the contents of the buffer + are modified only by the owner. + + Let's prove by induction that for all `l ∈ (k, j]`, `WC_(l, (x-1) % size(WB_l)) = v`. Since `O_k` + just pushed a value to the index `x-1`, it trivially holds for the base case `l = k+1`. Now + suppose that it holds for `l = m` for some `m ∈ (k, j-1]` and prove that it holds for `l = + m+1`. Since `O_k` is the last operation that writes `bottom = x` before `O_j` writes `bottom = + x-1`, `O_m` is not a regular `pop()` that writes `bottom = x-1`. If `O_m` is resizing, then for + the values `z` and `w` that `O_m` read from `bottom` at `'L301` and `top` at `'L302`, + respectively, we have `z >= x` by the choice of `O_k` and `w <= y` by the coherence on + `top`. Since `y < x`, `WC_(m, (x-1) % size(WB_m)) = v` is copied to `WC_(m+1, (x-1) % + size(WB_(m+1)))`. + + Thus `I_i = O_j` returns `WC_(j, (x-1) % size(WB_j))`, which equals to `v`. + +- Case 3: `I_i` is `pop()` taking the irregular path. + + Let `j` be such an index that `I_i = O_j`. Let `x` and `y` be the values `I_i` read from `bottom` + at `'L201` and `top` at `'L204`, respectively. By `(BOTTOM)`, we have `x = b_i`. Since `I_i` takes + the irregular path, we have `y >= x-1`. + + + Case `y >= x`, or [`y = x-1` and the CAS at `'L213` fails]. + + If the former is the case, then `I_i` writes `bottom = x` at `'L208`. If the latter is the case, + then `I_i` read from `top` a value `>= x` at `'L213`, and then writes `bottom = x` at + `'L216`. In either case, it reads `top >= x` and then writes `bottom = x`. + + We prove `x <= t_i` as follows. Consider the invocation `I` that writes `x` to `top`. By + `(TOP)`, it is sufficient to prove that `I` is linearized before `I_i`. If `I` is `pop()`, then + it is so thanks to the coherence of `top`. Now suppose `I` is `steal()`. Let `O_k` be the first + `push()` invocation in `O_(j+1)`, ..., `O_(o-1)`. (If no such invocation exists, let `k = + o`. Also recall that `o` is the number of owner's method invocations.) By the coherence of + `top`, all of `O_j`, ..., `O_(k-1)` should read `top >= x` at `'L204`, and take the irregular + path. Thus it is sufficient to prove that the value `I` read from `bottom` at `'L403` should be + coherence-before `WF_k`. Suppose otherwise. Then there is a release-acquire synchronization from + `WF_k` to `I`'s read from `bottom` at `'L403`, so `O_j`'s read from `top` at `'L204` or `'L213` + is coherence-before `I`'s write to `top` at `'L407`. But this is impossible due to the coherence + of `top`. + + Thus we have `b_i = x <= t_i`, and `I_i` goes to `'L207`, restores the original value of + `bottom`, and returns `EMPTY`. We choose `b_(i+1) = b_i`, `t_(i+1) = t_i`, and `A_(i+1) = A_i`, + i.e. the deque is empty. + + + + + + + + + + + + + + + + Case `y = x-1` and the CAS at `'L213` succeeds. + + `I_i` updates `top` from `x-1` to `x` at `'L213` and writes `bottom = x` at `'L216`. Let's prove + that `x-1 <= t_i`. Consider the invocation `I` that writes `x-1` to `top`. By `(TOP)`, it is + sufficient to prove that `I` is linearized before `I_i`. If `I` is `pop()`, then it is so thanks + to the coherence of `top`. Now suppose `I` is `steal()`. Let `O_k` be the first `push()` + invocation in `O_(j+1)`, ..., `O_(o-1)`. (If no such invocation exists, let `k = o`.) By the + coherence of `top`, all of `O_j`, ..., `O_(k-1)` should be `pop()` taking the irregular + path. Thus it is sufficient to prove that the value `I` read from `bottom` at `'L403` should be + coherence-before `WF_k`. Suppose otherwise. Then there is a release-acquire synchronization from + `WF_k` to `I`'s read from `bottom` at `'L403`, so `O_j`'s update of `top` from `x-1` to `x` at + `'L213` is coherence-before `I`'s write to `top` at `'L407`. But this is impossible due to the + coherence of `top`. + + From `x-1 <= t_i` and the fact that `I_i` writes `x` to `top`, we have `t_i = x-1`, and thus + `t_i = x-1 = (b_i)-1`. We choose `b_(i+1) = b_i`, `t_(i+1) = t_i + 1`, and `A_(i+1) = A_i`, + i.e. `I_i` pops the value from the `bottom` end of the deque and increases `top`. It remains to + prove that `I_i` returns the right value, which is so for roughly the same reason above. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +- Case 4: `I_i` is `steal()` and it returns a value. + + Let `x` and `y` be the values `I_i` read from `bottom` at `'L403` and `top` at `'L401`, + respectively. Then either `x = b_i` or `x = (b_i)-1` holds by the construction of linearization + order and the definition of `push()` and `pop()`. Since `I_i` returns a value, we have `y < x`. + let `k` be such an index that the value `I_i` reads from `bottom` at `'L403` is written by `O_k`. + + Let's prove that `y <= t_i`. Consider the invocation `I` that writes `y` to `top`. By `(TOP)`, it + is sufficient to prove that `I` is linearized before `I_i`. Suppose `I` is `pop()`. Let `j` be + such an index that `I = O_j`. From the fact that `I` writes `y` to `top`, we know that `I` is + `pop()` taking the irregular path. Since there is a release-acquire synchronization from `I`'s + write to `top` at `'L213` to `I_i`'s read from `top` at `'L401-'L402`, we have `j <= k`. Since `y + < x`, it is not the case that `O_j`, ..., `O_k` are all `pop()` taking the irregular path. Thus `I + = O_j` should be linearized before `I_i`. If `I` is `steal()`, then there is a release-acquire + synchronization from `I`'s write to `top` at `'L407` to `I_i`'s read from `top` at + `'L401-'L402`. Thus the value `I_i` read from `bottom` at `'L403` is coherence-after-or the value + `I` read from `bottom` at `'L403`. Thus `I` is linearized before `I_i`. + + From `y <= t_i` and the fact that `I_i` writes `y+1` to `top`, we have `t_i = y`. Thus `t_i = y <= + x-1 = (b_i)-1`, and it is legit to steal the value from the `top` end of the deque and increase + `top`. + + + Now we choose `b_(i+1) = b_i`, `t_(i+1) = t_i + 1`, and `A_(i+1) = A_i`, i.e. `I_i` steals a value + from the `top` end of the deque. It remains to prove that `I_i` returns the right value. + + Since `x > y`, there exists a `push()` invocation that pushed to the index `y` and writes `bottom + = y+1`. Let `O_k` be the last such an invocation in `I_0`, ..., `I_(n-1)`, and `v` be the value + `O_k` pushed. By `(CONTENTS)`, it is sufficient to prove that `O_k` is linearized before `I_i`, + and `I_i` returns `v`. + + Let's first prove that for all regular `pop()` invocation `O_j` that writes `bottom = y` at + `'L202`, `O_j` should be linearized before `I_i`. In order for `O_j` to enter the regular path, + `O_j` should have read from `top` a value `<= y-1` at `'L204`. Then by the synchronization of the + seqcst-fences from `O_j` to `I_i` via `top`, the value `I_i` read from `bottom` at `'L403` is + coherence-after-or `WF_j`. Thus `J` is linearized before `I_i`. + + Now let's prove that `O_k` is linearized before `I_i`. By the construction of the linearization + order, it is sufficient to prove that the value `I_i` read from `bottom` at `'L403` is + coherence-after-or `WF_k`. It is obvious if `O_k` is the only such an invocation that writes + `bottom = y+1` at `'L110`. Otherwise, let `O_l` be the last such another invocation. Then there + should be a regular `pop()` invocation `O_m` such that `l < m < k`, and `O_m` writes `bottom = y` + at `'L202`. Thus `O_m` is linearized before `I_i`, and the value `I_i` read from `bottom` is + coherence-after-or `WL_m`. Since `I_i` should have read `bottom >= y+1`, the value `I_i` read from + `bottom` at `'L403` should be coherence-after-or `WF_k`. + + Also, for all `l > k`, `b_l > y` should hold. This is because there are no regular `pop()` + invocations that writes `bottom = y` after `O_k`. + + + + + + + + + + + + + + + + + + + Let `O_l` be the owner invocation that writes `WB_(l+1)` to `buffer` that is read by `I_i`. Let + `z` and `w` be the values `O_k` read from `bottom` at `'L101` and `top` at `'L102`, respectively. + Since `I_i` is linearized after `O_k`, there is a release-acquire synchronization from `O_k`'s + write to `bottom` at `'L110` to `I_i`'s read from `bottom` at `'L403`. Thus we have `w <= y` + because `I_i` reads `top = y` once more at `'L407`, `WB_(l+1)` is coherence-after-or `WB_(k+1)`, + and the value `v` that `O_k` wrote to the buffer at `'L109` should be acknowledged by `I_i` at + `'L406`. Also, we have `view_beginning(O_k) <= view_end(I_i)`, as required by `(SYNC)`. + + + Case `l <= k`. + + Then `WB_(l+1) = WB_(k+1)`, and `I_i` read the return value from `WB_(l+1)[y % + size(WB_(l+1))]`. This value should be coherence-after-or the value `v` written by `O_k` at + `'L109`. Let's think of `O_m` that overwrites to `WB_(l+1)[y % size(WB_(l+1))]` after `O_k`. (If + such `O_m` doesn't exist, then `I_i` should have read `v` and the conclusion follows.) Since + `O_m` is after `O_k` and it read `bottom > y` at `'L101`, `O_m` is a `push()` operation that + read `top > y` at `'L102`. Then there is a release-acquire synchronization from `I_i`'s write to + `top` at `'L407` to `O_m`'s read from `top` at `'L102` so that `I_i`'s read from `WB_(l+1)[y % + size(WB_(l+1))]` at `'L406` happens before `O_m`'s write to the same location. Thus `I_i` should + have read `v` at `'L406`. + + + Case `k < l`. + + Let's prove by induction that for all `m ∈ [k+1, l+1]`, `WC_(m, y % size(WB_m)) = v`. By + assumption, it holds for `m = k+1`. Now suppose that it holds for `m = n` for some `n ∈ [k+1, + l+1)` and let's prove that it holds for `m = n+1`. Let `f` and `g` be the values `O_n` read from + `bottom` and `top`. Then we have `g <= w <= y < b_n = f`. Since `k < n`, `O_n` is not a regular + `pop()` that writes `bottom = y`. If `O_n` is resizing, then since `g <= y < f`, `WC_(n, y % + size(WB_n))` is copied to `WC_(n+1, y % size(WB_(n+1)))`. Thus we have `WC_(n+1, y % + size(WB_(n+1))) = v`. + + By the release-acquire synchronization from `O_l`'s write to `buffer` at `'L307` to `I_i`'s read + from `buffer` at `'L405`, the value `I_i` read from the buffer's content at `'L406` should be + coherence-after-or the value `WC_(l+1, y % size(WB_(l+1))) = v` that `O_l` wrote at + `'L305`. Similarly to the above case, `I_i`'s read happens before any overwrites to the same + location. Thus `I_i` should have read `v` at `'L406`. + + + + + + + + + + + + +- Case 5: `I_i` is `steal()` and it returns `EMPTY`. + + Let `x` and `y` be the values `I_i` read from `bottom` at `'L403` and `top` at `'L401`, + respectively. Since `I_i` returns `EMPTY`, `x <= y` holds. Also, either `x = b_i` or `x = (b_i)-1` + holds by the construction of linearization order and the definition of `push()` and `pop()`. + + It is sufficient to prove that `b_i <= t_i`, since then it will be legit to return `EMPTY` by + choosing `b_(i+1) = b_i`, `t_(i+1) = t_i`, and `A_(i+1) = A_i`. + + + Case `x = b_i`. + + We prove `y <= t_i` as follows. Consider the invocation `I` that writes `y` to `top`. By + `(TOP)`, it is sufficient to prove that `I` is linearized before `I_i`. Suppose `I` is + `pop()`. Let `j` be such an index that `I = O_j`. Then there is a release-acquire + synchronization from `I`'s write to `top` at `'L213` and `I_i` read from `top` at `'L401-'L402`, + and `x` should be coherence-after-or `WF_j`. Thus `I` should be linearized before `I_i`. If `I` + is `steal()`, then by a release-acquire synchronization from `I`'s write to `top` at `'L407` to + `I_i`'s read from `top` at `'L401-'L402`, `x` should be coherence-after-or the value `I` read + from `bottom` at `'L403`. Thus `I` should be linearized before `I_i`. + + Then we have `b_i = x <= y <= t_i`. + + + Case `x = (b_i)-1`. + + The value `x` should be read from `WF_j` for some `O_j` that is `pop()` taking the irregular + path. Let `k` be such an index that `O_j = I_k`. Since `I_i` returns `EMPTY`, we have `k < + i`. Let `w` be the value `O_j` read from `top` at `'L204`. Since `O_j` takes the irregular path, + we have `w >= x`. By `(BOTTOM)`, we have `b_k = x+1`. + + If `O_j` returns `EMPTY`, then we have `x+1 = b_k <= t_k <= t_i`. If `O_j` returns a value, then + by `(TOP)`, we have `x+1 = t_(k+1) <= t_i`. In either case, we have `b_i = x+1 <= t_i`. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +# Discussion + +## Informal Proof + +The proof described in this RFC is informal. We leave its formal verification in a proof assistant +as a future work. + + +## Comparison to the Current Implementation + +A reader may notice that the pseudocode differs from [the current implementation in +Crossbeam][deque-current], which is faithfully implementing the deque described in [this +paper][chase-lev-weak]. More specifically, in the current implementation: + +- `push()` issues a release fence and a relaxed write instead of a release write (`'L110`); +- `pop()` loads the buffer earlier; +- `pop()`'s CAS uses seqcst/relaxed orderings for success/failure cases instead of release/acquire + orderings; +- `resize()` issues a release-swap to the old buffer with a new one, instead of just a release-write + (`'L307`); +- `steal()` issues a acquire-load from `top` instead of a relaxed-load (`'L401`); +- `steal()`'s CAS uses seqcst/relaxed orderings for success/failure cases instead of release/relaxed + orderings; +- `steal()`'s CAS is strong rather than (possibly) weak (`'L407`). + +The proposed implementation is more optimized in terms of orderings, except that the CAS in `pop()` +has a stronger ordering for the failure case. But in x86, Power, and ARM architectures, +release/acquire orderings for success/failure cases for a CAS is most likely more efficient than +seqcst/relaxed orderings. + + +## Reasoning Out-of-thin-air Behavior + +The current implementation of Chase-Lev deque is not linearizable in C11 as-is, because of so-called +"out-of-thin-air behaviors". Here is an example that is allowed in C11, but is not +linearizable. Suppose that the deque consists of a single value `42`, with `bottom = 1` and `top = +0`; two concurrent threads access the deque; and the owner thread calls `pop()`, and a stealer +thread calls `steal()` twice. Consider the following execution: + +```rust +// owner calls pop() +'L201: read(bottom, 1, Relaxed) +'L202: write(bottom, 0, Relaxed) +'L203: fence(SeqCst) +'L204: read(top, 1, Relaxed) // value from `'L407` +'L207: write(bottom, 1, Relaxed) +// return `Empty` + +|| + +// stealer calls steal() +'L401: read(top, 0, Relaxed) +'L402: fence(SeqCst) +'L404: read(bottom, 0, Acquire) // value from `'L202` +// return `Empty` + +// stealer calls steal() +'L401: read(top, 0, Relaxed) +'L402: fence(SeqCst) +'L404: read(bottom, 1, Acquire) // value from `'L207` +'L407: CAS(top, 0, 1, Release) // success +// return `42` +``` + +While this execution is allowed in C11, it is called out-of-thin-air because there is a genuine +dependency cycle among `'L204`, `'L207`, `'L404` in the second call to `steal()`, and `'L407`. There +is a [proposal to ban out-of-thin-air behaviors][n3710] in the C11 standards, but the proposal only +vaguely described what does "out-of-thin-air" mean. + +You can roughly regard the promising semantics as a precise definition of out-of-thin-air +behaviors. Indeed, the promising semantics forbids this execution. See the proof of +`(IRREGULAR-STEAL)` for more details. + + +## Optimal Orderings + +The orderings in the proposed implementation are Pareto-optimal: any lowering of any single ordering +breaks the linearizability. + +In particular, the failure ordering of the CAS at `'L213` should be acquire. Without it, the +following execution is possible, but is not linearizable: + +```rust +// owner calls pop() +'L201: read(bottom, 1, Relaxed) +'L202: write(bottom, 0, Relaxed) +'L203: fence(SeqCst) +'L204: read(top, 0, Relaxed) +'L213: CAS(top, 0, 1, Release) // fail; value from `'L407` +'L216: write(bottom, 1, Relaxed) +// return `Empty` + +|| + +// stealer calls steal() +'L401: read(top, 0, Relaxed) +'L402: fence(SeqCst) +'L404: read(bottom, 0, Relaxed) // value from `'L202` +// return `Empty` + +// stealer calls steal() +'L401: read(top, 0, Relaxed) +'L402: fence(SeqCst) +'L404: read(bottom, 1, Relaxed) // value from `'L216` +'L407: CAS(top, 0, 1, Release) // success +// return `42` +``` + +Note that this example is similar to the one above, but there is no dependency from `'L213` to +`'L216`. So in order to forbid this execution, we have to ensure the ordering of `'L213` and +`'L216`, e.g. by either (1) release-write at `'L216`, (2) requiring seqcst ordering for the success +case for the CAS at `'L213` (as done in [this paper][chase-lev-weak]), or (3) requiring acquire +ordering for the failure case for the CAS at `'L213`, or (4) issuing an acquire fence after the CAS +at `'L213` fails. We chose (3) in this RFC, because we believe it incurs the least performance +overhead. + +Even though release/acquire orderings are enough for the success/failure cases of the CAS at +`'L213`, C11 (and effectively also LLVM and Rust) requires that "The failure argument shall be no +stronger than the success argument." ([C11][c11] 7.17.7.4 paragraph 2). So instead, we chose (4) in +the companion implementation. This C11 requirement may be fail-safe for most use cases, but can +actually be slightly inefficient in this case. + +It is worth noting that the CAS at `'L213` should be strong. Otherwise, a similar execution to the +one above is possible, where the CAS at `'L213` reads `top = 0` and then spuriously fails. + + +## Comparison to Target-dependent Implementations + +Alternatively, we can write a deque for each target architecture in order to achieve better +performance. + +We believe the proposed implementation is the most efficient in the x86-TSO model. Though [this +paper][deque-bounded-tso] presents a variant of various deques in the "bounded x86-TSO" model, where +you don't need to issue the expensive `mfence` barrier (think: seqcst-fence) in `pop()`. + +For ARM/POWER, you can further optimize the compilation result of the proposed implementation as +follows: + +- `'L102` can be just plain load: `'L109` is the only synchronization target, and they have RW ctrl + dependency. + +- `'L405` can be just plain load: `'L406` is the only synchronization target, and they have RR addr + dependency. In an ideal world, this synchronizing dependency should be expressible in C11 using + the `Consume` ordering. + +- `'L404` can be just plain load, but `isync/isb` should be inserted right before `'L405`: `'L405`'s + read, `'L406`'s read, `'L407`'s read/write, and the end view of `steal()` in the successful case + are the synchronization targets, and they have RR/RW ctrl+`isync/isb` dependency. + + We believe [this paper][chase-lev-weak] has a bug in their ARMv7 implementation of Chase-Lev + deque. Roughly speaking, they used a plain load for `'L404`, and put ctrl+`isync/isb` right after + `'L406`. But in that case, the reads at `'L405-'L406` can be reordered before `'L404`. See + the [this tutorial][arm-power] §4.2 on [the MP+dmb+ctrl litmus test][mp+dmb+ctrl] for more + details. + + + +[chase-lev]: https://pdfs.semanticscholar.org/3771/77bb82105c35e6e26ebad1698a20688473bd.pdf +[chase-lev-weak]: http://www.di.ens.fr/~zappa/readings/ppopp13.pdf +[fence-free-stealing]: http://www.cs.tau.ac.il/~mad/publications/asplos2014-ffwsq.pdf +[deque-current]: https://github.com/crossbeam-rs/crossbeam-deque +[deque-bounded-tso]: http://www.cs.tau.ac.il/~mad/publications/asplos2014-ffwsq.pdf +[deque-pr]: https://github.com/jeehoonkang/crossbeam-deque/tree/deque-proof +[rfc-relaxed-memory]: https://github.com/crossbeam-rs/rfcs/blob/master/text/2017-07-23-relaxed-memory.md +[promising]: http://sf.snu.ac.kr/promise-concurrency/ +[promising-coq]: https://github.com/snu-sf/promising-coq +[synch-patterns]: https://jeehoonkang.github.io/2017/08/23/synchronization-patterns.html +[linearizability]: https://dl.acm.org/citation.cfm?id=78972 +[cppatomic]: http://en.cppreference.com/w/cpp/atomic/atomic +[n3710]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html +[c11]: www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf +[mp+dmb+ctrl]: https://www.cl.cam.ac.uk/~pes20/arm-supplemental/arm033.html +[arm-power]: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf