LLVM Atomic Instructions and Concurrency Guide

Introduction

LLVM supports instructions which are well-defined in the presence of threads and asynchronous signals.

The atomic instructions are designed specifically to provide readable IR and optimized code generation for use cases such as the C++ <atomic> header, Java-style volatile and regular shared variables, and the gcc-compatible __sync_* builtins.

Atomic and volatile in the IR are orthogonal; “volatile” is the C/C++ volatile, which ensures that every volatile load and store happens and is performed in the stated order. A couple examples: if a SequentiallyConsistent store is immediately followed by another SequentiallyConsistent store to the same address, the first store can be erased. This transformation is not allowed for a pair of volatile stores. On the other hand, a non-volatile non-atomic load can be moved across a volatile load freely, but not an Acquire load.
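
A small C sketch of the store-store case above (illustrative only; the global and function names are hypothetical, and the snippet can be run through clang -O2 -S -emit-llvm like the example further below):

  #include <stdatomic.h>

  atomic_int a;
  volatile int v;

  void atomic_pair(void) {
    /* The first SequentiallyConsistent store may be erased: it is immediately
       followed by another SequentiallyConsistent store to the same address. */
    atomic_store(&a, 1);
    atomic_store(&a, 2);
  }

  void volatile_pair(void) {
    /* Both volatile stores must be performed, in this order. */
    v = 1;
    v = 2;
  }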

This document is intended as a guide for anyone either writing a frontend for LLVM or working on optimization passes for LLVM, describing how to deal with instructions with special semantics in the presence of concurrency. This is not intended to be a precise guide to the semantics; the details can get extremely complicated and unreadable, and are not usually necessary.

Optimization outside atomic

The basic 'load' and 'store' allow a variety of optimizations, but can lead to undefined results in a concurrent environment; see NotAtomic. This section specifically goes into the one optimizer restriction which applies in concurrent environments, which gets a bit more of an extended description because any optimization dealing with stores needs to be aware of it.

From the optimizer’s point of view, the rule is that if there are not any instructions with atomic ordering involved, concurrency does not matter, with one exception: if a variable might be visible to another thread or signal handler, a store cannot be inserted along a path where it might not execute otherwise. Take the following example:

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could indirectly introduce undefined behavior if another thread can access x at the same time. That thread would read undef instead of the value it was expecting, which can lead to undefined behavior down the line. (This example is particularly of interest because before the concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns undef, but does not have undefined behavior.
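
For example (an illustrative sketch; the names are hypothetical), the optimizer may hoist the load of x above the branch even though the original program only reads x when the condition holds; if another thread races with the hoisted load, the load yields undef, but executing it is not itself undefined behavior:

  int x;

  int g(int c) {
    /* The load of x may be speculated above the branch.  A racing load
       returns undef; it does not make the program's behavior undefined. */
    if (c)
      return x;
    return 0;
  }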

Atomic instructions

For cases where simple loads and stores are not sufficient, LLVM provides various atomic instructions. The exact guarantees provided depend on the ordering; see Atomic orderings.

load atomic and store atomic provide the same basic functionality as non-atomic loads and stores, but provide additional guarantees in situations where threads and signals are involved.

cmpxchg and atomicrmw are essentially like an atomic load followed by an atomic store (where the store is conditional for cmpxchg), but no other memory operation can happen on any thread between the load and store.
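
In C11 terms (a rough sketch; the function names are hypothetical), atomic_fetch_add maps to an atomicrmw add and atomic_compare_exchange_strong maps to a cmpxchg:

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Maps to an IR 'atomicrmw add': a read-modify-write with no other memory
     operation on any thread allowed between the load and the store. */
  int increment(atomic_int *counter) {
    return atomic_fetch_add(counter, 1);
  }

  /* Maps to an IR 'cmpxchg': the store happens only if the value seen equals
     *expected; otherwise *expected is updated to the value that was seen. */
  bool try_update(atomic_int *p, int *expected, int desired) {
    return atomic_compare_exchange_strong(p, expected, desired);
  }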

A fence provides Acquire and/or Release ordering which is not part of another operation; it is normally used along with Monotonic memory operations. A Monotonic load followed by an Acquire fence is roughly equivalent to an Acquire load, and a Monotonic store following a Release fence is roughly equivalent to a Release store. SequentiallyConsistent fences behave as both an Acquire and a Release fence, and offer some additional complicated guarantees; see the C++11 standard for details.
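
In C11 terms (a rough sketch; memory_order_relaxed corresponds to Monotonic in the IR, and the function names are hypothetical), the two equivalences read as:

  #include <stdatomic.h>

  /* Roughly an Acquire load: Monotonic (relaxed) load, then an Acquire fence. */
  int load_acquire_via_fence(atomic_int *p) {
    int v = atomic_load_explicit(p, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return v;
  }

  /* Roughly a Release store: Release fence, then a Monotonic (relaxed) store. */
  void store_release_via_fence(atomic_int *p, int v) {
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(p, v, memory_order_relaxed);
  }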

Frontends generating atomic instructions generally need to be aware of the target to some degree; atomic instructions are guaranteed to be lock-free, and therefore an instruction which is wider than the target natively supports can be impossible to generate.

Atomic orderings

In order to achieve a balance between performance and necessary guarantees, there are six levels of atomicity. They are listed in order of strength; each level includes all the guarantees of the previous level except for Acquire/Release. (See also LangRef Ordering.)

NotAtomic

NotAtomic is the obvious, a load or store which is not atomic. (This isn’t really a level of atomicity, but is listed here for comparison.) This is essentially a regular load or store. If there is a race on a given memory location, loads from that location return undef.

  • Relevant standard
  • This is intended to match shared variables in C/C++, and to be used in any other context where memory access is necessary, and a race is impossible. (The precise definition is in LangRef Memory Model.)
  • Notes for frontends
  • The rule is essentially that all memory accessed with basic loads and stores by multiple threads should be protected by a lock or other synchronization; otherwise, you are likely to run into undefined behavior. If your frontend is for a “safe” language like Java, use Unordered to load and store any shared variable. Note that NotAtomic volatile loads and stores are not properly atomic; do not try to use them as a substitute. (Per the C/C++ standards, volatile does provide some limited guarantees around asynchronous signals, but atomics are generally a better solution.)
  • Notes for optimizers
  • Introducing loads to shared variables along a codepath where they would not otherwise exist is allowed; introducing stores to shared variables is not. See Optimization outside atomic.
  • Notes for code generation
  • The one interesting restriction here is that it is not allowed to write to bytes outside of the bytes relevant to a store. This is mostly relevant to unaligned stores: it is not allowed in general to convert an unaligned store into two aligned stores of the same width as the unaligned store. Backends are also expected to generate an i8 store as an i8 store, and not an instruction which writes to surrounding bytes. (If you are writing a backend for an architecture which cannot satisfy these restrictions and cares about concurrency, please send an email to llvm-dev.)

Unordered

Unordered is the lowest level of atomicity. It essentially guarantees that races produce somewhat sane results instead of having undefined behavior. It also guarantees the operation to be lock-free, so it does not depend on the data being part of a special atomic structure or depend on a separate per-process global lock. Note that code generation will fail for unsupported atomic operations; if you need such an operation, use explicit locking.

  • Relevant standard
  • This is intended to match the Java memory model for shared variables.
  • Notes for frontends
  • This cannot be used for synchronization, but is useful for Java and other “safe” languages which need to guarantee that the generated code never exhibits undefined behavior. Note that this guarantee is cheap on common platforms for loads of a native width, but can be expensive or unavailable for wider loads, like a 64-bit store on ARM. (A frontend for Java or other “safe” languages would normally split a 64-bit store on ARM into two 32-bit unordered stores.)
  • Notes for optimizers
  • In terms of the optimizer, this prohibits any transformation that transforms a single load into multiple loads, transforms a store into multiple stores, narrows a store, or stores a value which would not be stored otherwise. Some examples of unsafe optimizations are narrowing an assignment into a bitfield, rematerializing a load, and turning loads and stores into a memcpy call. Reordering unordered operations is safe, though, and optimizers should take advantage of that because unordered operations are common in languages that need them.
  • Notes for code generation
  • These operations are required to be atomic in the sense that if you use unordered loads and unordered stores, a load cannot see a value which was never stored. A normal load or store instruction is usually sufficient, but note that an unordered load or store cannot be split into multiple instructions (or an instruction which does multiple memory operations, like LDRD on ARM without LPAE, or not naturally-aligned LDRD on LPAE ARM).

Monotonic

Monotonic is the weakest level of atomicity that can be used in synchronization primitives, although it does not provide any general synchronization. It essentially guarantees that if you take all the operations affecting a specific address, a consistent ordering exists.

  • Relevant standard
  • This corresponds to the C++11/C11 memory_order_relaxed; see those standards for the exact definition.
  • Notes for frontends
  • If you are writing a frontend which uses this directly, use with caution. The guarantees in terms of synchronization are very weak, so make sure these are only used in a pattern which you know is correct. Generally, these would either be used for atomic operations which do not protect other memory (like an atomic counter; see the sketch after this list), or along with a fence.
  • Notes for optimizers
  • In terms of the optimizer, this can be treated as a read+write on the relevant memory location (and alias analysis will take advantage of that). In addition, it is legal to reorder non-atomic and Unordered loads around Monotonic loads. CSE/DSE and a few other optimizations are allowed, but Monotonic operations are unlikely to be used in ways which would make those optimizations useful.
  • Notes for code generation
  • Code generation is essentially the same as that for unordered for loads and stores. No fences are required. cmpxchg and atomicrmw are required to appear as a single operation.
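
As an example of the atomic-counter pattern mentioned above (a sketch; the names are hypothetical), a relaxed C11 counter maps to Monotonic operations in the IR:

  #include <stdatomic.h>

  atomic_long hits;

  /* A Monotonic (memory_order_relaxed) counter: the count itself is always
     consistent, but it does not order or protect any other memory. */
  void count_hit(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
  }

  long read_hits(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
  }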

Acquire

Acquire provides a barrier of the sort necessary to acquire a lock to access other memory with normal loads and stores.

  • Relevant standard
  • This corresponds to the C++11/C11 memory_order_acquire. It should also be used for C++11/C11 memory_order_consume.
  • Notes for frontends
  • If you are writing a frontend which uses this directly, use with caution. Acquire only provides a semantic guarantee when paired with a Release operation.
  • Notes for optimizers
  • Optimizers not aware of atomics can treat this like a nothrow call. It is also possible to move stores from before an Acquire load or read-modify-write operation to after it, and move non-Acquire loads from before an Acquire operation to after it.
  • Notes for code generation
  • Architectures with weak memory ordering (essentially everything relevant today except x86 and SPARC) require some sort of fence to maintain the Acquire semantics. The precise fences required vary widely by architecture, but for a simple implementation, most architectures provide a barrier which is strong enough for everything (dmb on ARM, sync on PowerPC, etc.). Putting such a fence after the equivalent Monotonic operation is sufficient to maintain Acquire semantics for a memory operation.

Release

Release is similar to Acquire, but with a barrier of the sort necessary to release a lock.

  • Relevant standard
  • This corresponds to the C++11/C11 memory_order_release.
  • Notes for frontends
  • If you are writing a frontend which uses this directly, use with caution. Release only provides a semantic guarantee when paired with an Acquire operation (see the pairing example after this list).
  • Notes for optimizers
  • Optimizers not aware of atomics can treat this like a nothrow call. It is also possible to move loads from after a Release store or read-modify-write operation to before it, and move non-Release stores from after a Release operation to before it.
  • Notes for code generation
  • See the section on Acquire; a fence before the relevant operation is usually sufficient for Release. Note that a store-store fence is not sufficient to implement Release semantics; store-store fences are generally not exposed to IR because they are extremely difficult to use correctly.
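
A minimal C11 sketch of the Release/Acquire pairing (the names are hypothetical): the Release store publishes data, and a reader that observes the flag with an Acquire load is guaranteed to also see the data written before that store:

  #include <stdatomic.h>

  int data;               /* plain (NotAtomic) payload */
  atomic_int ready;       /* flag used for the handoff */

  void publisher(void) {
    data = 42;                                               /* ordinary store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* Release store  */
  }

  int consumer(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))  /* Acquire load   */
      return data;        /* guaranteed to read 42 once the flag is observed   */
    return -1;
  }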

AcquireRelease

AcquireRelease (acq_rel in IR) provides both an Acquire and a Release barrier (for fences and operations which both read and write memory).

  • Relevant standard
  • This corresponds to the C++11/C11 memory_order_acq_rel.
  • Notes for frontends
  • If you are writing a frontend which uses this directly, use with caution. Acquire only provides a semantic guarantee when paired with a Release operation, and vice versa.
  • Notes for optimizers
  • In general, optimizers should treat this like a nothrow call; the possible optimizations are usually not interesting.
  • Notes for code generation
  • This operation has Acquire and Release semantics; see the sections on Acquire and Release.

SequentiallyConsistent

SequentiallyConsistent (seq_cst in IR) provides Acquire semantics for loads and Release semantics for stores. Additionally, it guarantees that a total ordering exists between all SequentiallyConsistent operations.

  • Relevant standard
  • This corresponds to the C++11/C11 memory_order_seq_cst, Java volatile, and the gcc-compatible __sync_* builtins which do not specify otherwise.
  • Notes for frontends
  • If a frontend is exposing atomic operations, these are much easier to reason about for the programmer than other kinds of operations, and using them is generally a practical performance tradeoff.
  • Notes for optimizers
  • Optimizers not aware of atomics can treat this like a nothrow call. For SequentiallyConsistent loads and stores, the same reorderings are allowed as for Acquire loads and Release stores, except that SequentiallyConsistent operations may not be reordered.
  • Notes for code generation
  • SequentiallyConsistent loads minimally require the same barriers as Acquire operations and SequentiallyConsistent stores require Release barriers. Additionally, the code generator must enforce ordering between SequentiallyConsistent stores followed by SequentiallyConsistent loads (see the example after this list). This is usually done by emitting either a full fence before the loads or a full fence after the stores; which is preferred varies by architecture.
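
The store-followed-by-load ordering is exactly what Dekker-style patterns rely on. A C11 sketch (hypothetical names): the total order over seq_cst operations guarantees that the two loads cannot both observe 0:

  #include <stdatomic.h>

  atomic_int flag0, flag1;
  int saw1, saw0;

  void thread0(void) {
    atomic_store_explicit(&flag0, 1, memory_order_seq_cst);
    saw1 = atomic_load_explicit(&flag1, memory_order_seq_cst);
  }

  void thread1(void) {
    atomic_store_explicit(&flag1, 1, memory_order_seq_cst);
    saw0 = atomic_load_explicit(&flag0, memory_order_seq_cst);
  }

  /* If thread0 and thread1 each run once, saw1 and saw0 cannot both end up 0.
     Without the store->load ordering described above (e.g. with only
     acquire/release), both loads could observe 0. */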

Atomics and IR optimization

Predicates for optimizer writers to query:

  • isSimple(): A load or store which is not volatile or atomic. This is what, for example, memcpyopt would check for operations it might transform.
  • isUnordered(): A load or store which is not volatile and at most Unordered. This would be checked, for example, by LICM before hoisting an operation.
  • mayReadFromMemory()/mayWriteToMemory(): Existing predicates, but note that they return true for any operation which is volatile or at least Monotonic.
  • isStrongerThan / isAtLeastOrStrongerThan: These are predicates on orderings. They can be useful for passes that are aware of atomics, for example to do DSE across a single atomic access, but not across a release-acquire pair (see MemoryDependencyAnalysis for an example of this).
  • Alias analysis: Note that AA will return ModRef for anything Acquire or Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the right predicates; everything should work if that is done. If your pass should optimize some atomic operations (Unordered operations in particular), make sure it doesn’t replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic operations:

  • memcpyopt: An atomic operation cannot be optimized into part of a memcpy/memset, including unordered loads/stores. It can pull operations across some atomic operations.
  • LICM: Unordered loads/stores can be moved out of a loop. It just treats monotonic operations like a read+write to a memory location, and anything stricter than that like a nothrow call.
  • DSE: Unordered stores can be DSE’ed like normal stores. Monotonic stores can be DSE’ed in some cases, but it’s tricky to reason about, and not especially important. It is possible in some cases for DSE to operate across a stronger atomic operation, but it is fairly tricky. DSE delegates this reasoning to MemoryDependencyAnalysis (which is also used by other passes like GVN).
  • Folding a load: Any atomic load from a constant global can be constant-folded, because it cannot be observed. Similar reasoning allows sroa with atomic loads and stores.

Atomics and Codegen

Atomic operations are represented in the SelectionDAG with ATOMIC_* opcodes. On architectures which use barrier instructions for all atomic ordering (like ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if setInsertFencesForAtomic() was used.

The MachineMemOperand for all atomic operations is currently marked as volatile; this is not correct in the IR sense of volatile, but CodeGen handles anything marked volatile very conservatively. This should get fixed at some point.

One very important property of the atomic operations is that if your backend supports any inline lock-free atomic operations of a given size, you should support ALL operations of that size in a lock-free manner.

When the target implements atomic cmpxchg or LL/SC instructions (as most do) this is trivial: all the other operations can be implemented on top of those primitives. However, on many older CPUs (e.g. ARMv5, SparcV8, Intel 80386) there are atomic load and store instructions, but no cmpxchg or LL/SC. As it is invalid to implement atomic load using the native instruction, but cmpxchg using a library call to a function that uses a mutex, atomic load must also expand to a library call on such architectures, so that it can remain atomic with regards to a simultaneous cmpxchg, by using the same mutex.
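
To illustrate the "built on top of those primitives" point (a sketch only; the function name is hypothetical), a fetch-and-add can be implemented as a compare-and-swap loop, here written with the GCC-style __sync builtin:

  /* Fetch-and-add built from a compare-and-swap primitive. */
  int fetch_and_add_via_cas(int *ptr, int val) {
    int old = *ptr;
    for (;;) {
      int seen = __sync_val_compare_and_swap(ptr, old, old + val);
      if (seen == old)
        return old;   /* CAS succeeded: we installed old + val */
      old = seen;     /* lost a race: retry with the value we actually saw */
    }
  }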

AtomicExpandPass can help with that: it will expand all atomic operations to the proper __atomic_* libcalls for any size above the maximum set by setMaxAtomicSizeInBitsSupported (which defaults to 0).

On x86, all atomic loads generate a MOV. SequentiallyConsistent stores generate an XCHG, other stores generate a MOV. SequentiallyConsistent fences generate an MFENCE, other fences do not cause any code to be generated. cmpxchg uses the LOCK CMPXCHG instruction. atomicrmw xchg uses XCHG, atomicrmw add and atomicrmw sub use XADD, and all other atomicrmw operations generate a loop with LOCK CMPXCHG. Depending on the users of the result, some atomicrmw operations can be translated into operations like LOCK AND, but that does not work in general.
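
A small C snippet for experimenting with this mapping (the function names are hypothetical; compile for x86 with clang -O2 -S and compare the output against the description above):

  #include <stdatomic.h>

  atomic_int g;

  int  load_sc(void)    { return atomic_load(&g); }            /* expected: MOV (per the text above)    */
  void store_sc(int v)  { atomic_store(&g, v); }               /* expected: XCHG                        */
  int  add_sc(int v)    { return atomic_fetch_add(&g, v); }    /* expected: XADD with a LOCK prefix     */
  void full_fence(void) { atomic_thread_fence(memory_order_seq_cst); } /* expected: MFENCE              */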

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release, and SequentiallyConsistent semantics require barrier instructions for every such operation. Loads and stores generate normal instructions. cmpxchg and atomicrmw can be represented using a loop with LL/SC-style instructions which take some sort of exclusive lock on a cache line (LDREX and STREX on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the atomic constructs. Here are some lowerings it can do:

  • cmpxchg -> loop with load-linked/store-conditional by overriding shouldExpandAtomicCmpXchgInIR(), emitLoadLinked(), emitStoreConditional()
  • large loads/stores -> ll-sc/cmpxchg by overriding shouldExpandAtomicStoreInIR()/shouldExpandAtomicLoadInIR()
  • strong atomic accesses -> monotonic accesses + fences by overriding shouldInsertFencesForAtomic(), emitLeadingFence(), and emitTrailingFence()
  • atomic rmw -> loop with cmpxchg or load-linked/store-conditional by overriding expandAtomicRMWInIR()
  • expansion to __atomic_* libcalls for unsupported sizes.
  • part-word atomicrmw/cmpxchg -> target-specific intrinsic by overriding shouldExpandAtomicRMWInIR, emitMaskedAtomicRMWIntrinsic, shouldExpandAtomicCmpXchgInIR, and emitMaskedAtomicCmpXchgIntrinsic.

For an example of these, look at the ARM (first five lowerings) or RISC-V (last lowering) backend.

AtomicExpandPass supports two strategies for lowering atomicrmw/cmpxchg to load-linked/store-conditional (LL/SC) loops. The first expands the LL/SC loop in IR, calling target lowering hooks to emit intrinsics for the LL and SC operations. However, many architectures have strict requirements for LL/SC loops to ensure forward progress, such as restrictions on the number and type of instructions in the loop. It isn’t possible to enforce these restrictions when the loop is expanded in LLVM IR, and so affected targets may prefer to expand to LL/SC loops at a very late stage (i.e. after register allocation). AtomicExpandPass can help support lowering of part-word atomicrmw or cmpxchg using this strategy by producing IR for any shifting and masking that can be performed outside of the LL/SC loop.

Libcalls: __atomic_*

There are two kinds of atomic library calls that are generated by LLVM. Please note that both sets of library functions somewhat confusingly share the names of builtin functions defined by clang. Despite this, the library functions are not directly related to the builtins: it is not the case that __atomic_* builtins lower to __atomic_* library calls and __sync_* builtins lower to __sync_* library calls.

The first set of library functions are named __atomic_*. This set has been “standardized” by GCC, and is described below. (See also GCC’s documentation.)

LLVM’s AtomicExpandPass will translate atomic operations on data sizes above MaxAtomicSizeInBitsSupported into calls to these functions.

There are four generic functions, which can be called with data of any size or alignment:

  void __atomic_load(size_t size, void *ptr, void *ret, int ordering)
  void __atomic_store(size_t size, void *ptr, void *val, int ordering)
  void __atomic_exchange(size_t size, void *ptr, void *val, void *ret, int ordering)
  bool __atomic_compare_exchange(size_t size, void *ptr, void *expected, void *desired, int success_order, int failure_order)

There are also size-specialized versions of the above functions, which can only be used with naturally-aligned pointers of the appropriate size. In the signatures below, “N” is one of 1, 2, 4, 8, and 16, and “iN” is the appropriate integer type of that size; if no such integer type exists, the specialization cannot be used:

  iN __atomic_load_N(iN *ptr, int ordering)
  void __atomic_store_N(iN *ptr, iN val, int ordering)
  iN __atomic_exchange_N(iN *ptr, iN val, int ordering)
  bool __atomic_compare_exchange_N(iN *ptr, iN *expected, iN desired, int success_order, int failure_order)

Finally there are some read-modify-write functions, which are only available in the size-specific variants (any other sizes use a __atomic_compare_exchange loop):

  iN __atomic_fetch_add_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_sub_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_and_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_or_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_xor_N(iN *ptr, iN val, int ordering)
  iN __atomic_fetch_nand_N(iN *ptr, iN val, int ordering)

This set of library functions has some interesting implementation requirements to take note of:

  • They support all sizes and alignments, including those which cannot be implemented natively on any existing hardware. Therefore, they will certainly use mutexes for some sizes/alignments.
  • As a consequence, they cannot be shipped in a statically linked compiler-support library, as they have state which must be shared amongst all DSOs loaded in the program. They must be provided in a shared library used by all objects.
  • The set of atomic sizes supported lock-free must be a superset of the sizes any compiler can emit. That is: if a new compiler introduces support for inline-lock-free atomics of size N, the __atomic_* functions must also have a lock-free implementation for size N. This is a requirement so that code produced by an old compiler (which will have called the __atomic_* function) interoperates with code produced by the new compiler (which will use the native atomic instruction).

Note that it’s possible to write an entirely target-independent implementation of these library functions by using the compiler atomic builtins themselves to implement the operations on naturally-aligned pointers of supported sizes, and a generic mutex implementation otherwise.
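
A minimal sketch of that approach (illustrative only; the names are hypothetical and deliberately avoid the reserved __atomic_* names, and a real implementation such as compiler-rt’s or libatomic’s typically hashes the address into an array of locks rather than using one global mutex):

  #include <pthread.h>
  #include <stddef.h>
  #include <string.h>

  /* Slow path: one process-wide lock shared by all locked operations. */
  static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Generic version: any size or alignment, but takes the lock. */
  void generic_atomic_load(size_t size, void *ptr, void *ret, int ordering) {
    (void)ordering;               /* the mutex already gives the strongest ordering */
    pthread_mutex_lock(&atomic_lock);
    memcpy(ret, ptr, size);
    pthread_mutex_unlock(&atomic_lock);
  }

  /* Size-specialized version: lock-free wherever the target has a native
     4-byte atomic load, implemented with the compiler builtin itself. */
  unsigned generic_atomic_load_4(unsigned *ptr, int ordering) {
    return __atomic_load_n(ptr, ordering);
  }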

Libcalls: __sync_*

Some targets or OS/target combinations can support lock-free atomics, but for various reasons, it is not practical to emit the instructions inline.

There are two typical examples of this.

Some CPUs support multiple instruction sets which can be switched back and forth on function-call boundaries. For example, MIPS supports the MIPS16 ISA, which has a smaller instruction encoding than the usual MIPS32 ISA. ARM, similarly, has the Thumb ISA. In MIPS16 and earlier versions of Thumb, the atomic instructions are not encodable. However, those instructions are available via a function call to a function with the longer encoding.

Additionally, a few OS/target pairs provide kernel-supported lock-free atomics. ARM/Linux is an example of this: the kernel provides a function which on older CPUs contains a “magically-restartable” atomic sequence (which looks atomic so long as there’s only one CPU), and contains actual atomic instructions on newer multicore models. This sort of functionality can typically be provided on any architecture, if all CPUs which are missing atomic compare-and-swap support are uniprocessor (no SMP). This is almost always the case. The only common architecture without that property is SPARC: SPARCV8 SMP systems were common, yet it doesn’t support any sort of compare-and-swap operation.
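
For concreteness, the ARM/Linux helper referred to above is the kernel’s __kuser_cmpxchg user helper, reachable at a fixed address (documented in the kernel’s Documentation/arm/kernel_user_helpers.txt; the address and calling convention below are taken from that documentation and should be verified against it). A hedged sketch of building a __sync_val_compare_and_swap-style routine on top of it, with hypothetical names:

  /* The kuser_cmpxchg helper stores newval to *ptr only if *ptr == oldval,
     and returns 0 on success, non-zero otherwise (assumed convention). */
  typedef int (*kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
  #define KUSER_CMPXCHG ((kuser_cmpxchg_t)0xffff0fc0)

  int cas_val_4(volatile int *ptr, int expected, int desired) {
    for (;;) {
      int seen = *ptr;
      if (seen != expected)
        return seen;                       /* no exchange; report what we saw */
      if (KUSER_CMPXCHG(expected, desired, ptr) == 0)
        return expected;                   /* exchange happened */
      /* The helper failed (value changed or the sequence was restarted);
         re-read and try again. */
    }
  }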

In either of these cases, the Target in LLVM can claim support for atomics of an appropriate size, and then implement some subset of the operations via libcalls to a __sync_* function. Such functions must not use locks in their implementation, because unlike the __atomic_* routines used by AtomicExpandPass, these may be mixed-and-matched with native instructions by the target lowering.

Further, these routines do not need to be shared, as they are stateless. So, there is no issue with having multiple copies included in one binary. Thus, typically these routines are implemented by the statically-linked compiler runtime support library.

LLVM will emit a call to an appropriate __sync_* routine if the target ISelLowering code has set the corresponding ATOMIC_CMP_SWAP, ATOMIC_SWAP, or ATOMIC_LOAD_* operation to “Expand”, and if it has opted-into the availability of those library functions via a call to initSyncLibcalls().

The full set of functions that may be called by LLVM is (for N being 1, 2, 4, 8, or 16):

  iN __sync_val_compare_and_swap_N(iN *ptr, iN expected, iN desired)
  iN __sync_lock_test_and_set_N(iN *ptr, iN val)
  iN __sync_fetch_and_add_N(iN *ptr, iN val)
  iN __sync_fetch_and_sub_N(iN *ptr, iN val)
  iN __sync_fetch_and_and_N(iN *ptr, iN val)
  iN __sync_fetch_and_or_N(iN *ptr, iN val)
  iN __sync_fetch_and_xor_N(iN *ptr, iN val)
  iN __sync_fetch_and_nand_N(iN *ptr, iN val)
  iN __sync_fetch_and_max_N(iN *ptr, iN val)
  iN __sync_fetch_and_umax_N(iN *ptr, iN val)
  iN __sync_fetch_and_min_N(iN *ptr, iN val)
  iN __sync_fetch_and_umin_N(iN *ptr, iN val)

This list doesn’t include any function for atomic load or store; all known architectures support atomic loads and stores directly (possibly by emitting a fence on either side of a normal load or store).

There’s also, somewhat separately, the possibility to lower ATOMIC_FENCE to __sync_synchronize(). This may happen or not happen independent of all the above, controlled purely by setOperationAction(ISD::ATOMIC_FENCE, …).