Software Transactional Memory

This page is about pypy-stm, a special in-development version of PyPy which can run multiple independent CPU-hungry threads in the same process in parallel. It is a solution to what is known in the Python world as the “global interpreter lock (GIL)” problem — it is an implementation of Python without the GIL.

“STM” stands for Software Transactional Memory, the technique used internally. This page describes pypy-stm from the perspective of a user, describes work in progress, and finally gives references to more implementation details.

This work was done by Remi Meier and Armin Rigo. Thanks to all donors for crowd-funding the work so far! Please have a look at the 2nd call for donation.

What pypy-stm is for

pypy-stm is a variant of the regular PyPy interpreter. (This version supports Python 2.7; see below for Python 3, CPython, and others.) With caveats listed below, it should in theory be no more than 20%-50% slower than a regular PyPy, comparing the JIT version in both cases (but see below!). It is called STM for Software Transactional Memory, which is the internal technique used (see Reference to implementation details).

The benefit is that the resulting pypy-stm can execute multiple threads of Python code in parallel. Programs running two threads or more in parallel should ideally run faster than in a regular PyPy (either now, or soon as bugs are fixed).

  • pypy-stm is fully compatible with a GIL-based PyPy; you can use it as a drop-in replacement and multithreaded programs will run on multiple cores.
  • pypy-stm provides (but does not impose) a special API to the user in the pure Python module transaction. This module is based on the lower-level module pypystm, but also provides some compatibility with non-STM PyPy or CPython.
  • Building on top of the way the GIL is removed, we will talk about How to write multithreaded programs: the 10‘000-feet view and transaction.TransactionQueue.

…and what pypy-stm is not for

pypy-stm gives a Python without the GIL. This means that it is useful in situations where the GIL is the problem in the first place. (This includes cases where the program can easily be modified to run in multiple threads; often, we don’t consider doing that precisely because of the GIL.)

However, there are plenty of cases where the GIL is not the problem. Do not expect pypy-stm to be helpful in these cases! This includes all programs that use multiple threads but don’t actually spend a lot of time running Python code. For example, a program may spend all its time waiting for I/O to occur, or performing some long computation on a huge matrix. These are cases where the CPU is either idle, or in some C/Fortran library anyway; in both cases, the interpreter (either CPython or the regular PyPy) should release the GIL around the external calls. The threads will thus not end up fighting for the GIL.

Getting Started

pypy-stm requires 64-bit Linux for now.

Development is done in the branch stmgc-c8. If you are only interested in trying it out, please pester us until we upload a recent prebuilt binary. The current version supports four “segments”, which means that it will run up to four threads in parallel.

To build a version from sources, you first need to compile a custom version of gcc(!). See the instructions here: https://bitbucket.org/pypy/stmgc/src/default/gcc-seg-gs/ (Note that these patches are being incorporated into gcc. It is likely that future versions of gcc will not need to be patched any more.)

Then get the branch stmgc-c8 of PyPy and run:

  cd pypy/goal
  ../../rpython/bin/rpython -Ojit --stm

At the end, this will try to compile the generated C code by calling gcc-seg-gs, which must be the script you installed in the instructions above.

Current status (stmgc-c7)

Warning

THIS PAGE IS OLD, THE REST IS ABOUT STMGC-C7 WHEREAS THE CURRENT DEVELOPMENT WORK IS DONE ON STMGC-C8

  • NEW: It seems to work fine, without crashing any more. Please report any crash you find (or other bugs).
  • It runs with an overhead as low as 20% on examples like “richards”. There are also other examples with higher overheads – currently up to 2x for “translate.py” – which we are still trying to understand. One suspect is our partial GC implementation, see below.
  • NEW: the PYPYSTM environment variable and the pypy/stm/print_stm_log.py script let you know exactly which “conflicts” occurred. This is described in the section transaction.TransactionQueue below.
  • NEW: special transaction-friendly APIs (like stmdict), described in the section transaction.TransactionQueue below. The old API changed again, mostly moving to different modules. Sorry about that. I feel it’s a better idea to change the API early instead of being stuck with a bad one later…
  • Currently limited to 1.5 GB of RAM (this is just a parameter in core.h – theoretically. In practice, increase it too much and clang crashes again). Memory overflows are not correctly handled; they cause segfaults.
  • NEW: The JIT warm-up time improved again, but is still relatively large. In order to produce machine code, the JIT needs to enter “inevitable” mode. This means that you will get bad performance results if your program doesn’t run for several seconds, where several can mean many. When trying benchmarks, be sure to check that you have reached the warmed state, i.e. the performance is not improving any more.
  • The GC is new; although clearly inspired by PyPy’s regular GC, it misses a number of optimizations for now. Programs allocating large numbers of small objects that don’t immediately die (surely a common situation) suffer from these missing optimizations. (The bleeding edge stmgc-c8 is better at that.)
  • Weakrefs might appear to work a bit strangely for now, sometimes staying alive through gc.collect(), or even dying but then un-dying for a short time before dying again. A similar problem can show up occasionally elsewhere with accesses to some external resources, where the (apparent) serialized order doesn’t match the underlying (multithreading) order. These are bugs (partially fixed already in stmgc-c8). Also, debugging helpers like weakref.getweakrefcount() might give wrong answers.
  • The STM system is based on very efficient read/write barriers, which are mostly done (their placement could be improved a bit in JIT-generated machine code).
  • Forking the process is slow because the complete memory needs to be copied manually. A warning is printed to this effect.
  • Very long-running processes (on the order of days) will eventually crash on an assertion error because of a non-implemented overflow of an internal 28-bit counter.
  • The recursion detection code was not reimplemented. Infinite recursion just segfaults for now.

Python 3, CPython, and others

In this document I describe “pypy-stm”, which is based on PyPy’s Python 2.7 interpreter. Supporting Python 3 should take about half an afternoon of work. Obviously, what I don’t mean is that by tomorrow you can have a finished and polished “pypy3-stm” product. General py3k work is still missing; and general stm work is also still missing. But they are rather independent from each other, as usual in PyPy. The required afternoon of work will certainly be done one of these days now that the internal interfaces seem to stabilize.

The same is true for other languages implemented in the RPython framework, although the amount of work to put there might vary, because the STM framework within RPython is currently targeting the PyPy interpreter and other ones might have slightly different needs. But in general, all the tedious transformations are done by RPython and you’re only left with the (hopefully few) hard and interesting bits.

The core of STM works as a library written in C (see reference to implementation details below). It means that it can be used on other interpreters than the ones produced by RPython. Duhton is an early example of that. At this point, you might think about adapting this library for CPython. You’re warned, though: as far as I can tell, it is a doomed idea. I had a hard time debugging Duhton, and that’s infinitely simpler than CPython. Even ignoring that, you can see in the C sources of Duhton that many core design decisions are different than in CPython: no refcounting; limited support for prebuilt “static” objects; stm_read() and stm_write() macro calls everywhere (and getting very rare and very obscure bugs if you forget one); and so on. You could imagine some custom special-purpose extension of the C language, which you would preprocess to regular C. In my opinion that’s starting to look a lot like RPython itself, but maybe you’d prefer this approach. Of course you still have to worry about each and every C extension module you need, but maybe you’d have a way forward.

User Guide

How to write multithreaded programs: the 10‘000-feet view

PyPy-STM offers two ways to write multithreaded programs:

  • the traditional way, using the thread or threading modules, described first.
  • using TransactionQueue, described next, as a way to hide the low-level notion of threads.

The issues with low-level threads are well known (particularly in other languages that don’t have GIL-based interpreters): memory corruption, deadlocks, livelocks, and so on. There are alternative approaches to dealing directly with threads, like OpenMP. These approaches typically enforce some structure on your code. TransactionQueue is in part similar: your program needs to have “some chances” of parallelization before you can apply it. But I believe that the scope of applicability is much larger with TransactionQueue than with other approaches. It usually works without forcing a complete reorganization of your existing code, and it works on any Python program which has got latent and imperfect parallelism. Ideally, it only requires that the end programmer identifies where this parallelism is likely to be found, and communicates it to the system using a simple API.

Drop-in replacement

Multithreaded, CPU-intensive Python programs should work unchanged on pypy-stm. They will run using multiple CPU cores in parallel.

The existing semantics of the GIL (Global Interpreter Lock) are unchanged: although running on multiple cores in parallel, pypy-stm gives the illusion that threads are run serially, with switches only occurring between bytecodes, not in the middle of them. Programs can rely on this: using shared_list.append()/pop() or shared_dict.setdefault() as synchronization mechanisms continues to work as expected.

This works by internally considering the points where a standard PyPy or CPython would release the GIL, and replacing them with the boundaries of “transactions”. Like their database equivalent, multiple transactions can execute in parallel, but will commit in some serial order. They appear to behave as if they were completely run in this serialization order.
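For example, the following sketch relies only on the atomicity of list.pop() and list.append() between bytecode boundaries; it runs unchanged on CPython, regular PyPy, and pypy-stm (the names work and results are of course just for illustration):

```python
# Sketch: a shared list used as a synchronization mechanism between
# threads.  list.pop() and list.append() execute atomically (no switch
# happens in the middle of them), so no explicit lock is needed --
# under the GIL or under pypy-stm's transactions alike.
import threading

work = list(range(100))    # shared work queue
results = []               # shared result list

def worker():
    while True:
        try:
            item = work.pop()      # atomic: safe without a lock
        except IndexError:
            break                  # queue drained
        results.append(item * 2)   # atomic as well

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(results) == [i * 2 for i in range(100)]
```

On pypy-stm the four workers can actually run in parallel; the observable behavior stays the same as under a GIL.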

transaction.TransactionQueue

In CPU-hungry programs, we can often easily identify outermost loops over some data structure, or other repetitive algorithm, where each “block” consists of processing a non-trivial amount of data, and where the blocks “have a good chance” to be independent from each other. We don’t need to prove that they are actually independent: it is enough if they are often independent — or, more precisely, if we think they should be often independent.

One typical example would look like this, where the function func() typically invokes a large amount of code:

  for key, value in bigdict.items():
      func(key, value)

Then you simply replace the loop with:

  from transaction import TransactionQueue

  tr = TransactionQueue()
  for key, value in bigdict.items():
      tr.add(func, key, value)
  tr.run()

This code’s behavior is equivalent. Internally, the TransactionQueue object will start N threads and try to run the func(key, value) calls on all threads in parallel. But note the difference with a regular thread-pooling library, as found in many lower-level languages than Python: the function calls are not randomly interleaved with each other just because they run in parallel. The behavior did not change because we are using TransactionQueue. All the calls still appear to execute in some serial order.
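To illustrate that “some serial order” is always a legal outcome, here is a hypothetical sketch of a queue that runs the calls serially — roughly what a non-STM fall-back could do (the real module is lib_pypy/transaction.py; the class name here is invented):

```python
# Hypothetical sketch: a TransactionQueue stand-in that runs the queued
# calls one after another.  The real TransactionQueue runs them on N
# threads as parallel transactions, but the observable result is always
# equivalent to *some* serial order, such as this one.
class SerialTransactionQueue(object):
    def __init__(self):
        self._pending = []

    def add(self, f, *args, **kwds):
        # Record the call; nothing runs yet.
        self._pending.append((f, args, kwds))

    def run(self):
        # Execute every queued call in order.
        while self._pending:
            f, args, kwds = self._pending.pop(0)
            f(*args, **kwds)

# Usage, mirroring the loop above:
bigdict = {'a': 1, 'b': 2, 'c': 3}
out = {}

def func(key, value):
    out[key] = value * 10

tr = SerialTransactionQueue()
for key, value in bigdict.items():
    tr.add(func, key, value)
tr.run()
assert out == {'a': 10, 'b': 20, 'c': 30}
```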

A typical usage of TransactionQueue goes like that: at first, the performance does not increase. In fact, it is likely to be worse. Typically, this is indicated by the total CPU usage, which remains low (closer to 1 than to N cores). First note that it is expected that the CPU usage should not go much higher than 1 in the JIT warm-up phase: you must run a program for several seconds, or for larger programs at least one minute, to give the JIT a chance to warm up enough. But if CPU usage remains low even afterwards, then the PYPYSTM environment variable can be used to track what is going on.

Run your program with PYPYSTM=logfile to produce a log file called logfile. Afterwards, use the pypy/stm/print_stm_log.py utility to inspect the content of this log file. It produces output like this (sorted by amount of time lost, largest first):

  10.5s lost in aborts, 1.25s paused (12412x STM_CONTENTION_WRITE_WRITE)
  File "foo.py", line 10, in f
    someobj.stuff = 5
  File "bar.py", line 20, in g
    someobj.other = 10

This means that 10.5 seconds were lost running transactions that were aborted (which caused another 1.25 seconds of lost time by pausing), because of the reason shown in the two independent single-entry tracebacks: one thread ran the line someobj.stuff = 5, whereas another thread concurrently ran the line someobj.other = 10 on the same object. These two writes are done to the same object. This causes a conflict, which aborts one of the two transactions. In the example above this occurred 12412 times.

The two other conflict sources are STM_CONTENTION_INEVITABLE, which means that two transactions both tried to do an external operation, like printing or reading from a socket or accessing an external array of raw data; and STM_CONTENTION_WRITE_READ, which means that one transaction wrote to an object but the other one merely read it (in that case only the writing transaction is reported; the location for the reads is not recorded because doing so is not possible without a very large performance impact).

Common causes of conflicts:

  • First of all, any I/O or raw manipulation of memory turns the transaction inevitable (“must not abort”). There can be only one inevitable transaction running at any time. A common case is if each transaction starts with sending data to a log file. You should refactor this case so that it occurs either near the end of the transaction (which can then mostly run in non-inevitable mode), or delegate it to a separate transaction or even a separate thread.

  • Writing to a list or a dictionary conflicts with any read from the same list or dictionary, even one done with a different key. For dictionaries and sets, you can try the types transaction.stmdict and transaction.stmset, which behave mostly like dict and set but allow concurrent access to different keys. (What is missing from them so far is lazy iteration: for example, stmdict.iterkeys() is implemented as iter(stmdict.keys()); and, unlike PyPy’s dictionaries and sets, the STM versions are not ordered.) There are also experimental stmiddict and stmidset classes using the identity of the key.

  • time.time() and time.clock() turn the transaction inevitable in order to guarantee that a call that appears to be later will really return a higher number. If getting slightly unordered results is fine, use transaction.time() or transaction.clock(). The latter operations guarantee to return increasing results only if you can “prove” that two calls occurred in a specific order (for example because they are both called by the same thread). In cases where no such proof is possible, you might get randomly interleaved values. (If you have two independent transactions, they normally behave as if one of them was fully executed before the other; but using transaction.time() you might see the “hidden truth” that they are actually interleaved.)

  • transaction.threadlocalproperty can be used at class-level:

  class Foo(object):     # must be a new-style class!
      x = transaction.threadlocalproperty()
      y = transaction.threadlocalproperty(dict)

This declares that instances of Foo have two attributes x and y that are thread-local: reading or writing them from concurrently-running transactions will return independent results. (Any other attributes of Foo instances will be globally visible from all threads, as usual.) This is useful together with TransactionQueue for these two cases:

  • For attributes of long-lived objects that change during one transaction, but should always be reset to some initial value around each transaction (for example, initialized to 0 at the start of a transaction; or, if used for a list of pending things to do within this transaction, it will always be empty at the end of one transaction).
  • For general caches across transactions. With TransactionQueue you get a pool of a fixed number N of threads, each running the transactions serially. A thread-local property will have the value last stored in it by the same thread, which may come from a random previous transaction. Basically, you get N copies of the property’s value, and each transaction accesses a random copy. It works fine for caches. In more detail, the optional argument to threadlocalproperty() is the default value factory: in case no value was assigned in the current thread yet, the factory is called and its result becomes the value in that thread (like collections.defaultdict). If no default value factory is specified, uninitialized reads raise AttributeError.
  • In addition to all of the above, there are cases where write-write conflicts are caused by writing the same value to an attribute again and again. See for example ea2e519614ab: this fixes two such issues where we write an object field without first checking if we already did it. The dont_change_any_more field is a flag set to True in that part of the code, but usually this rtyper_makekey() method will be called many times for the same object; the code used to repeatedly set the flag to True, but now it first checks and only does the write if it is False. Similarly, in the second half of the checkin, the method setup_block_entry() used to both assign the concretetype fields and return a list, but its two callers were different: one would really need the concretetype fields initialized, whereas the other would only need to get its result list — the concretetype field in that case might already be set or not, but that would not matter.
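The thread-local semantics of threadlocalproperty can be emulated in plain Python with a descriptor over threading.local — a hypothetical sketch only, to make the semantics concrete (the real implementation lives in lib_pypy/transaction.py and differs in details, e.g. this version keys storage by id(obj) and so keeps entries alive):

```python
import threading

# Hypothetical emulation of transaction.threadlocalproperty: each
# thread gets its own storage dict, so reads and writes from different
# threads (i.e. different transactions) are independent.
class threadlocalproperty(object):
    def __init__(self, default_factory=None):
        self._local = threading.local()     # one storage dict per thread
        self._factory = default_factory

    def _storage(self):
        try:
            return self._local.values
        except AttributeError:
            d = self._local.values = {}
            return d

    def __get__(self, obj, cls=None):
        if obj is None:
            return self
        d = self._storage()
        if id(obj) not in d:
            if self._factory is None:
                raise AttributeError("uninitialized thread-local attribute")
            d[id(obj)] = self._factory()    # like collections.defaultdict
        return d[id(obj)]

    def __set__(self, obj, value):
        self._storage()[id(obj)] = value

class Foo(object):
    x = threadlocalproperty()
    y = threadlocalproperty(dict)

foo = Foo()
foo.x = 42                      # visible only in the main thread
seen = []
def other_thread():
    foo.y['k'] = 'other'        # an independent dict in this thread
    seen.append(foo.y)
t = threading.Thread(target=other_thread)
t.start()
t.join()
assert foo.x == 42              # main thread's value is untouched
assert foo.y == {}              # fresh default dict in the main thread
assert seen[0] == {'k': 'other'}
```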

Note that Python is a complicated language; there are a number of less common cases that may cause conflict (of any kind) where we might not expect it a priori. In many of these cases it could be fixed; please report any case that you don’t understand.

Atomic sections

The TransactionQueue class described above is based on atomic sections, which are blocks of code which you want to execute without “releasing the GIL”. In STM terms, this means blocks of code that are executed while guaranteeing that the transaction is not interrupted in the middle. This is experimental and may be removed in the future if Software lock elision is ever implemented.

Here is a direct usage example:

  with transaction.atomic:
      assert len(lst1) == 10
      x = lst1.pop(0)
      lst1.append(x)

In this example, we are sure that the item popped off one end of the list is appended again at the other end atomically. It means that another thread can run len(lst1) or x in lst1 without any particular synchronization, and always see the same results, respectively 10 and True. It will never see the intermediate state where lst1 only contains 9 elements. Atomic sections are similar to re-entrant locks (they can be nested), but additionally they protect against the concurrent execution of any code instead of just code that happens to be protected by the same lock in other threads.
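As noted under Miscellaneous functions below, the non-STM fall-back of the transaction module implements atomic as just a regular lock; a minimal sketch of that weaker behaviour, assuming a re-entrant lock stands in for the STM machinery:

```python
import threading

# Sketch of the non-STM fall-back: `atomic` degrades to a re-entrant
# lock used as a context manager.  Unlike real STM atomic sections,
# this only excludes other threads that also use `with atomic:`;
# threads running non-atomic code are not held back.
atomic = threading.RLock()

lst1 = list(range(10))

def rotate():
    with atomic:
        assert len(lst1) == 10
        x = lst1.pop(0)
        lst1.append(x)      # other `with atomic:` code never sees 9 items

with atomic:                # nesting works, like a re-entrant lock
    rotate()

assert len(lst1) == 10 and lst1[-1] == 0
```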

Note that the notion of atomic sections is very strong. If you write code like this:

  with __pypy__.thread.atomic:
      time.sleep(10)

then, if you think about it as if we had a GIL, you are executing a 10-seconds-long atomic transaction without releasing the GIL at all. This prevents all other threads from progressing at all. While it is not strictly true in pypy-stm, the exact rules for when other threads can progress or not are rather complicated; you have to consider it likely that such a piece of code will eventually block all other threads anyway.

Note that if you want to experiment with atomic, you may have to manually add a transaction break just before the atomic block. This is because the boundaries of the block are not guaranteed to be the boundaries of the transaction: the latter is at least as big as the block, but may be bigger. Therefore, if you run a big atomic block, it is a good idea to break the transaction just before. This can be done by calling transaction.hint_commit_soon(). (This may be fixed at some point.)

There are also issues with the interaction of regular locks and atomic blocks. This can be seen if you write to files (which have locks), including with a print to standard output. If one thread tries to acquire a lock while running in an atomic block, and another thread has got the same lock at that point, then the former may fail with a thread.error. (Don’t rely on it; it may also deadlock.) The reason is that “waiting” for some condition to become true – while running in an atomic block – does not really make sense. For now you can work around it by making sure that, say, all your prints are either in an atomic block or none of them are. (This kind of issue is theoretically hard to solve and may be the reason for atomic block support to eventually be removed.)

Locks

Not Implemented Yet

The thread module’s locks have their basic semantic unchanged. However, using them (e.g. in with my_lock: blocks) starts an alternative running mode, called Software lock elision. This means that PyPy will try to make sure that the transaction extends until the point where the lock is released, and if it succeeds, then the acquiring and releasing of the lock will be “elided”. This means that in this case, the whole transaction will technically not cause any write into the lock object — it was unacquired before, and is still unacquired after the transaction.

This is specially useful if two threads run with my_lock: blocks with the same lock. If they each run a transaction that is long enough to contain the whole block, then all writes into the lock will be elided and the two transactions will not conflict with each other. As usual, they will be serialized in some order: one of the two will appear to run before the other. Simply, each of them executes an “acquire” followed by a “release” in the same transaction. As explained above, the lock state goes from “unacquired” to “unacquired” and can thus be left unchanged.

This approach can gracefully fail: unlike atomic sections, there is no guarantee that the transaction runs until the end of the block. If you perform any input/output while you hold the lock, the transaction will end as usual just before the input/output operation. If this occurs, then the lock elision mode is cancelled and the lock’s “acquired” state is really written.

Even if the lock is really acquired already, a transaction doesn’t have to wait for it to become free again. It can enter the elision-mode anyway and tentatively execute the content of the block. It is only at the end, when trying to commit, that the thread will pause. As soon as the real value stored in the lock is switched back to “unacquired”, it can then proceed and attempt to commit its already-executed transaction (which can fail and abort and restart from scratch, as usual).

Note that this is all not implemented yet, but we expect it to work even if you acquire and release several locks. The elision-mode transaction will extend until the first lock you acquired is released, or until the code performs an input/output or a wait operation (for example, waiting for another lock that is currently not free). In the common case of acquiring several locks in nested order, they will all be elided by the same transaction.

Miscellaneous functions

  • First, note that the transaction module is found in the file lib_pypy/transaction.py. This file can be copied around to execute the same programs on CPython or on non-STM PyPy, with fall-back behavior. (One case where the behavior differs is atomic, which is in this fall-back case just a regular lock; so with atomic only prevents other threads from entering other with atomic sections, but won’t prevent other threads from running non-atomic code.)
  • transaction.getsegmentlimit(): return the number of “segments” in this pypy-stm. This is the limit above which more threads will not be able to execute on more cores. (Right now it is limited to 4 due to inter-segment overhead, but should be increased in the future. It should also be settable, and the default value should depend on the number of actual CPUs.) If STM is not available, this returns 1.
  • pypy.thread.signals_enabled: a context manager that runs its block of code with signals enabled. By default, signals are only enabled in the main thread; a non-main thread will not receive signals (this is like CPython). Enabling signals in non-main threads is useful for libraries where threads are hidden and the end user is not expecting his code to run elsewhere than in the main thread.
  • pypystm.exclusive_atomic: a context manager similar to transaction.atomic but which complains if it is nested.
  • transaction.is_atomic(): return True if called from an atomic context.
  • pypystm.count(): return a different positive integer every time it is called. This works without generating conflicts. The returned integers are only roughly in increasing order; this should not be relied upon.
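On interpreters without pypystm, a plain-Python stand-in for count() could look like the following hypothetical sketch. Note the difference in guarantees: this version relies on the atomicity of next() under the GIL and happens to return strictly increasing integers, whereas the real pypystm.count() only promises distinct, roughly increasing values — precisely so that it never generates conflicts:

```python
import itertools

# Hypothetical non-STM stand-in for pypystm.count(): a shared counter.
# On pypy-stm, every caller writing to one shared counter object would
# be a conflict hotspot; the real implementation avoids that.
_counter = itertools.count(1)

def count():
    return next(_counter)   # atomic under the GIL

assert count() == 1
assert count() == 2
```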

More details about conflicts

Based on Software Transactional Memory, the pypy-stm solution is prone to “conflicts”. To repeat the basic idea, threads execute their code speculatively, and at known points (e.g. between bytecodes) they coordinate with each other to agree on which order their respective actions should be “committed”, i.e. become globally visible. Each duration of time between two commit-points is called a transaction.

A conflict occurs when there is no consistent ordering. The classical example is if two threads both tried to change the value of the same global variable. In that case, only one of them can be allowed to proceed, and the other one must be either paused or aborted (restarting the transaction). If this occurs too often, parallelization fails.

How much actual parallelization a multithreaded program can see is a bit subtle. Basically, a program not using transaction.atomic or eliding locks, or doing so for very short amounts of time, will parallelize almost freely (as long as it’s not some artificial example where, say, all threads try to increase the same global counter and do nothing else).

However, if the program requires longer transactions, it comes with less obvious rules. The exact details may vary from version to version, too, until they are a bit more stabilized. Here is an overview.

Parallelization works as long as two principles are respected. The first one is that the transactions must not conflict with each other. The most obvious sources of conflicts are threads that all increment a global shared counter, or that all store the result of their computations into the same list — or, more subtly, that all pop() the work to do from the same list, because that is also a mutation of the list. (You can work around it with transaction.stmdict, but for that specific example, some STM-aware queue should eventually be designed.)
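One conflict-avoiding pattern is to pre-partition the work instead of having every transaction pop() from one shared list — a sketch under the assumption that the work items are known up front (the names here are just for illustration; plain threads are used so the example also runs on non-STM interpreters):

```python
# Sketch: avoid the shared-pop() conflict by giving each worker its
# own slice of the work and its own private result list, so no
# transaction mutates a list that the others also read.
import threading

work = list(range(100))
nworkers = 4
results = [[] for _ in range(nworkers)]   # one private list per worker

def worker(i):
    for item in work[i::nworkers]:        # private slice; `work` is never mutated
        results[i].append(item * item)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(nworkers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

flat = sorted(x for sub in results for x in sub)
assert flat == [i * i for i in range(100)]
```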

A conflict occurs as follows: when a transaction commits (i.e. finishes successfully) it may cause other transactions that are still in progress to abort and retry. This is a waste of CPU time, but even in the worst case scenario it is not worse than a GIL, because at least one transaction succeeds (so we get at worst N-1 CPUs doing useless jobs and 1 CPU doing a job that commits successfully).

Conflicts do occur, of course, and it is pointless to try to avoid them all. For example they can be abundant during some warm-up phase. What is important is to keep them rare enough in total.

Another issue is that of avoiding long-running so-called “inevitable” transactions (“inevitable” is taken in the sense of “which cannot be avoided”, i.e. transactions which cannot abort any more). Transactions like that should only occur if you use atomic, generally because of I/O in atomic blocks. They work, but the transaction is turned inevitable before the I/O is performed. For all the remaining execution time of the atomic block, they will impede parallel work. The best is to organize the code so that such operations are done completely outside atomic.

(This is not unrelated to the fact that blocking I/O operations are discouraged with Twisted, and if you really need them, you should do them on their own separate thread.)

In case lock elision eventually replaces atomic sections, we wouldn’t get long-running inevitable transactions, but the same problem occurs in a different way: doing I/O cancels lock elision, and the lock turns into a real lock. This prevents other threads from committing if they also need this lock. (More about it when lock elision is implemented and tested.)

Implementation

XXX this section mostly empty for now

Technical reports

STMGC-C7 is described in detail in a technical report.

A separate position paper gives an overview of our position about STM in general.

Reference to implementation details

The core of the implementation is in a separate C library called stmgc, in the c7 subdirectory (current version of pypy-stm) and in the c8 subdirectory (bleeding edge version). Please see the README.txt for more information. In particular, the notion of segment is discussed there.

PyPy itself adds on top of it the automatic placement of read and write barriers and of “becomes-inevitable-now” barriers, the logic to start/stop transactions as an RPython transformation and as supporting C code, and the support in the JIT (mostly as a transformation step on the trace and generation of custom assembler in assembler.py).

See also

See also https://bitbucket.org/pypy/pypy/raw/default/pypy/doc/project-ideas.rst (section about STM).