Atomics

Rust pretty blatantly just inherits C11’s memory model for atomics. This is notdue to this model being particularly excellent or easy to understand. Indeed,this model is quite complex and known to have several flaws.Rather, it is a pragmatic concession to the fact that everyone is pretty badat modeling atomics. At very least, we can benefit from existing tooling andresearch around C.

Trying to fully explain the model in this book is fairly hopeless. It’s definedin terms of madness-inducing causality graphs that require a full book toproperly understand in a practical way. If you want all the nitty-grittydetails, you should check out C’s specification (Section 7.17).Still, we’ll try to cover the basics and some of the problems Rust developersface.

The C11 memory model is fundamentally about trying to bridge the gap between thesemantics we want, the optimizations compilers want, and the inconsistent chaosour hardware wants. We would like to just write programs and have them doexactly what we said but, you know, fast. Wouldn’t that be great?

Compiler Reordering

Compilers fundamentally want to be able to do all sorts of complicatedtransformations to reduce data dependencies and eliminate dead code. Inparticular, they may radically change the actual order of events, or make eventsnever occur! If we write something like

x = 1;
y = 3;
x = 2;

The compiler may conclude that it would be best if your program did

x = 2;
y = 3;

This has inverted the order of events and completely eliminated one event.From a single-threaded perspective this is completely unobservable: after allthe statements have executed we are in exactly the same state. But if ourprogram is multi-threaded, we may have been relying on x to actually beassigned to 1 before y was assigned. We would like the compiler to beable to make these kinds of optimizations, because they can seriously improveperformance. On the other hand, we’d also like to be able to depend on ourprogram doing the thing we said.

Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted andrespected our wishes, our hardware might instead get us in trouble. Troublecomes from CPUs in the form of memory hierarchies. There is indeed a globalshared memory space somewhere in your hardware, but from the perspective of eachCPU core it is so very far away and so very slow. Each CPU would rather workwith its local cache of the data and only go through all the anguish oftalking to shared memory only when it doesn’t actually have that memory incache.

After all, that’s the whole point of the cache, right? If every read from thecache had to run back to shared memory to double check that it hadn’t changed,what would the point be? The end result is that the hardware doesn’t guaranteethat events that occur in the same order on one thread, occur in the sameorder on another thread. To guarantee this, we must issue special instructionsto the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

initial state: x = 0, y = 1
THREAD 1        THREAD2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }

Ideally this program has 2 possible final states:

y = 3: (thread 2 did the check before thread 1 completed)
y = 6: (thread 2 did the check after thread 1 completed)

However there’s a third potential state that the hardware enables:

y = 2: (thread 2 saw x = 1, but not y = 3, and then overwrote y = 3)

It’s worth noting that different kinds of CPU provide different guarantees. Itis common to separate hardware into two categories: strongly-ordered and weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while ARMprovides weak ordering guarantees. This has two consequences for concurrentprogramming:

Asking for stronger guarantees on strongly-ordered hardware may be cheap oreven free because they already provide strong guarantees unconditionally.Weaker guarantees may only yield performance wins on weakly-ordered hardware.
Asking for guarantees that are too weak on strongly-ordered hardware ismore likely to happen to work, even though your program is strictlyincorrect. If possible, concurrent algorithms should be tested onweakly-ordered hardware.

Data Accesses

The C11 memory model attempts to bridge the gap by allowing us to talk about thecausality of our program. Generally, this is by establishing a happensbefore relationship between parts of the program and the threads that arerunning them. This gives the hardware and compiler room to optimize the programmore aggressively where a strict happens-before relationship isn’t established,but forces them to be more careful where one is established. The way wecommunicate these relationships are through data accesses and atomicaccesses.

Data accesses are the bread-and-butter of the programming world. They arefundamentally unsynchronized and compilers are free to aggressively optimizethem. In particular, data accesses are free to be reordered by the compiler onthe assumption that the program is single-threaded. The hardware is also free topropagate the changes made in data accesses to other threads as lazily andinconsistently as it wants. Most critically, data accesses are how data raceshappen. Data accesses are very friendly to the hardware and compiler, but aswe’ve seen they offer awful semantics to try to write synchronized code with.Actually, that’s too weak.

It is literally impossible to write correct synchronized code using only dataaccesses.

Atomic accesses are how we tell the hardware and compiler that our program ismulti-threaded. Each atomic access can be marked with an ordering thatspecifies what kind of relationship it establishes with other accesses. Inpractice, this boils down to telling the compiler and hardware certain thingsthey can’t do. For the compiler, this largely revolves around re-ordering ofinstructions. For the hardware, this largely revolves around how writes arepropagated to other threads. The set of orderings Rust exposes are:

Sequentially Consistent (SeqCst)
Release
Acquire
Relaxed

(Note: We explicitly do not expose the C11 consume ordering)

TODO: negative reasoning vs positive reasoning? TODO: “can’t forget tosynchronize”

Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictionsof all other orderings. Intuitively, a sequentially consistent operationcannot be reordered: all accesses on one thread that happen before and after aSeqCst access stay before and after it. A data-race-free program that usesonly sequentially consistent atomics and data accesses has the very niceproperty that there is a single global execution of the program’s instructionsthat all threads agree on. This execution is also particularly nice to reasonabout: it’s just an interleaving of each thread’s individual executions. Thisdoes not hold if you start using the weaker atomic orderings.

The relative developer-friendliness of sequential consistency doesn’t come forfree. Even on strongly-ordered platforms sequential consistency involvesemitting memory fences.

In practice, sequential consistency is rarely necessary for program correctness.However sequential consistency is definitely the right choice if you’re notconfident about the other memory orders. Having your program run a bit slowerthan it needs to is certainly better than it running incorrectly! It’s alsomechanically trivial to downgrade atomic operations to have a weakerconsistency later on. Just change SeqCst to Relaxed and you’re done! Ofcourse, proving that this transformation is correct is a whole other matter.

Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at theiruse case: they’re perfectly suited for acquiring and releasing locks, andensuring that critical sections don’t overlap.

Intuitively, an acquire access ensures that every access after it stays afterit. However operations that occur before an acquire are free to be reordered tooccur after it. Similarly, a release access ensures that every access before itstays before it. However operations that occur after a release are free to bereordered to occur before it.

When thread A releases a location in memory and then thread B subsequentlyacquires the same location in memory, causality is established. Every writethat happened before A’s release will be observed by B after its acquisition.However no causality is established with any other threads. Similarly, nocausality is established if A and B access different locations in memory.

Basic use of release-acquire is therefore simple: you acquire a location ofmemory to begin the critical section, and then release that location to end it.For instance, a simple spinlock might look like:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"
    // ... distribute lock to threads somehow ...
    // Try to acquire the lock by setting it to true
    while lock.compare_and_swap(false, true, Ordering::Acquire) { }
    // broke out of the loop, so we successfully acquired the lock!
    // ... scary data accesses ...
    // ok we're done, release the lock
    lock.store(false, Ordering::Release);
}

On strongly-ordered platforms most accesses have release or acquire semantics,making release and acquire often totally free. This is not the case onweakly-ordered platforms.

Relaxed

Relaxed accesses are the absolute weakest. They can be freely re-ordered andprovide no happens-before relationship. Still, relaxed operations are stillatomic. That is, they don’t count as data accesses and any read-modify-writeoperations done to them occur atomically. Relaxed operations are appropriate forthings that you definitely want to happen, but don’t particularly otherwise careabout. For instance, incrementing a counter can be safely done by multiplethreads using a relaxed fetch_add if you’re not using the counter tosynchronize any other accesses.

There’s rarely a benefit in making an operation relaxed on strongly-orderedplatforms, since they usually provide release-acquire semantics anyway. Howeverrelaxed operations can be cheaper on weakly-ordered platforms.