Serializing accept(), AKA Thundering Herd, AKA the Zeeg Problem

One of the historical problems in the UNIX world is the “thundering herd”.

What is it?

Take a process binding to a network address (it could be AF_INET, AF_UNIX or whatever you want) and then forking itself:

  int s = socket(...)
  bind(s, ...)
  listen(s, ...)
  fork()

After having forked itself a bunch of times, each process will generally start blocking on accept():

  for(;;) {
      int client = accept(...);
      if (client < 0) continue;
      ...
  }

The funny problem is that on older/classic UNIX, accept() is woken up in each process blocked on it whenever a connection is attempted on the socket.

Only one of those processes will be able to truly accept the connection; the others will get a boring EAGAIN.

This results in a vast number of wasted CPU cycles (the kernel scheduler has to give control to all of the sleeping processes waiting on that socket).

This behaviour (for various reasons) is amplified when instead of processes you use threads (so, you have multiple threads blocked on accept()).

The de facto solution was placing a lock before the accept() call to serialize its usage:

  for(;;) {
      lock();
      int client = accept(...);
      unlock();
      if (client < 0) continue;
      ...
  }

For threads, dealing with locks is generally easier, but for processes you have to fight with system-specific solutions or fall back to the venerable SysV IPC subsystem (more on this later).
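
One of those system-specific solutions, sketched here under the assumption of a modern Linux/Solaris libc, is a PROCESS_SHARED pthread mutex placed in shared memory before forking. This is only an illustration of the idea, not uWSGI's actual code:

  /* Sketch: a pthread mutex shared between forked processes via an
   * anonymous MAP_SHARED mapping. Error handling is mostly omitted. */
  #include <pthread.h>
  #include <sys/mman.h>

  static pthread_mutex_t *accept_lock;

  static void setup_accept_lock(void) {
      pthread_mutexattr_t attr;

      /* the mutex must live in memory visible to every child */
      accept_lock = mmap(NULL, sizeof(pthread_mutex_t),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);

      pthread_mutexattr_init(&attr);
      /* allow the mutex to be used across processes, not just threads */
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutex_init(accept_lock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

  /* in each worker, after fork():
   *   pthread_mutex_lock(accept_lock);
   *   int client = accept(s, ...);
   *   pthread_mutex_unlock(accept_lock);
   */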

In modern times, the vast majority of UNIX systems have evolved, and now the kernel ensures (more or less) only one process/thread is woken up on a connection event.

OK, problem solved. So what are we talking about?

select()/poll()/kqueue()/epoll()/…

In the pre-1.0 era, uWSGI was a lot simpler (and less interesting) than the current form. It did not have the signal framework and it was not able to listen on multiple addresses; for this reason its loop engine was only calling accept() in each process/thread, and the thundering herd (thanks to modern kernels) was not a problem.

Evolution has a price, so after a while the standard loop engine of a uWSGI process/thread moved from:

  for(;;) {
      int client = accept(s, ...);
      if (client < 0) continue;
      ...
  }

to a more complex:

  for(;;) {
      int interesting_fd = wait_for_fds();
      if (fd_need_accept(interesting_fd)) {
          int client = accept(interesting_fd, ...);
          if (client < 0) continue;
      }
      else if (fd_is_a_signal(interesting_fd)) {
          manage_uwsgi_signal(interesting_fd);
      }
      ...
  }

The problem is now the wait_for_fds() example function: it will call something like select(), poll() or the more modern epoll() and kqueue().

These kinds of system calls are “monitors” for file descriptors, and they are woken up in all of the processes/threads waiting for the same file descriptor.

Before you start blaming your kernel developers, this is the right approach, as the kernel cannot know if you are waiting for those file descriptors to call accept() or to do something funnier.
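
To make the wait_for_fds() pseudocode concrete, here is a minimal, hypothetical epoll() version of the loop (names and structure are illustrative, not uWSGI internals). Every process running it against the same listening socket is woken up on each connection event:

  /* Minimal sketch of the epoll()-based loop; not uWSGI internals. */
  #include <fcntl.h>
  #include <sys/epoll.h>
  #include <sys/socket.h>
  #include <unistd.h>

  void worker_loop(int listen_fd) {
      /* non-blocking, so the losers of the accept() race get EAGAIN
       * instead of blocking */
      fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);

      int efd = epoll_create1(0);
      struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
      epoll_ctl(efd, EPOLL_CTL_ADD, listen_fd, &ev);

      for (;;) {
          struct epoll_event events[8];
          /* every waiting process is woken up here... */
          int n = epoll_wait(efd, events, 8, -1);
          for (int i = 0; i < n; i++) {
              if (events[i].data.fd == listen_fd) {
                  /* ...but only one accept() wins; the others see EAGAIN */
                  int client = accept(listen_fd, NULL, NULL);
                  if (client < 0) continue;
                  /* ... manage the request ... */
                  close(client);
              }
          }
      }
  }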

So, welcome again to the thundering herd.

Application Servers VS WebServers

The popular, battle-tested, solid, multiprocess reference webserver is Apache HTTPD.

It has survived decades of IT evolution and it’s still one of the most important technologies powering the whole Internet.

Born as multiprocess-only, Apache has always had to deal with the thundering herd problem, and they solved it using SysV IPC semaphores.

(Note: Apache is really smart about that; when it only needs to wait on a single file descriptor, it only calls accept(), taking advantage of modern kernels' anti-thundering-herd policies.)

(Update: Apache 2.x even allows you to choose which lock technique to use, including flock/fcntl for very ancient systems, but on the vast majority of systems, when in multiprocess mode, it will use the SysV semaphores.)

Even on modern Apache releases, stracing one of its processes (bound to multiple interfaces) you will see something like this (on a Linux system):

  semop(...); // lock
  epoll_wait(...);
  accept(...);
  semop(...); // unlock
  ... // manage the request

The SysV semaphore protects your epoll_wait() from the thundering herd.
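
For reference, here is a sketch of what that semop()-based lock/unlock pair can look like in C; this is the generic SysV pattern, not Apache's actual code:

  /* Sketch of the classic SysV semaphore lock/unlock pattern; not
   * Apache's actual implementation. */
  #include <sys/ipc.h>
  #include <sys/sem.h>

  union semun { int val; struct semid_ds *buf; unsigned short *array; };

  int sem_create(void) {
      int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
      union semun arg = { .val = 1 };          /* 1 = unlocked */
      semctl(semid, 0, SETVAL, arg);
      return semid;
  }

  void sem_lock(int semid) {
      /* SEM_UNDO asks the kernel to undo the operation if the holder dies */
      struct sembuf op = { 0, -1, SEM_UNDO };
      semop(semid, &op, 1);
  }

  void sem_unlock(int semid) {
      struct sembuf op = { 0, 1, SEM_UNDO };
      semop(semid, &op, 1);
  }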

So, another problem solved, the world is such a beautiful place… but…

SysV IPC is not good for application servers :(

The definition of “application server” is pretty generic; in this case we refer to one or more processes generated by an unprivileged (non-root) user, binding to one or more network addresses and running custom, highly non-deterministic code.

Even if you have only minimal/basic knowledge of how SysV IPC works, you will know that each of its components is a limited resource in the system (and in modern BSDs these limits are set to ridiculously low values; PostgreSQL FreeBSD users know this problem very well).

Just run ‘ipcs’ in your terminal to get a list of the allocated objects in your kernel. Yes, in your kernel. SysV IPC objects are persistent resources: they need to be removed manually by the user. The same user who could allocate hundreds of those objects and fill your limited SysV IPC memory.

One of the most common problems in the Apache world caused by the SysV IPC usage is leakage when you brutally kill Apache instances (yes, you should never do it, but you don’t have a choice if you are brave/foolish enough to host unreliable PHP apps in your webserver process).

To better understand it, spawn Apache and killall -9 apache2. Respawn it and run ‘ipcs’; you will get a new semaphore object every time. Do you see the problem? (To Apache gurus: yes, I know there are hacky tricks to avoid that, but this is the default behaviour.)

Apache is generally a system service, managed by a conscientious sysadmin, so except in a few cases you can continue trusting it for more decades, even if it decides to use more SysV IPC objects :)

Your application server, sadly, is managed by different kinds of users, from the most skilled one, to the one who should change jobs as soon as possible, to the one whose site was cracked by a moron wanting to take control of your server.

Application servers are not dangerous, users are. And application servers are run by users. The world is an ugly place.

How application server developers solved it

Fast answer: they generally do not solve it, or do not care about it.

Note: we are talking about multiprocessing; we have already seen that multithreading is easy to solve.

Serving static files or proxying (the main activities of a webserver) is generally a fast, non-blocking (and, from various points of view, very deterministic) activity. A web application, instead, is way slower and heavier, so even on moderately loaded sites the number of sleeping processes is generally low.

On highly loaded sites you will pray for a free process, and on non-loaded sites the thundering herd problem is completely irrelevant (unless you are running your site on a 386).

Given the relatively low number of processes you generally allocate for an application server, we can say the thundering herd is a non-problem.

Another approach is dynamic process spawning. If you ensure your application server always has the minimum required number of processes running, you will greatly reduce the thundering herd problem. (Check the family of --cheaper uWSGI options.)

Non-problem??? So, again, what are we talking about?

We are talking about “common cases”, and for common cases there is a plethora of valid choices (other than uWSGI, obviously), and the vast majority of the problems we are talking about are non-existent.

Since the beginning of the uWSGI project, being developed by a hosting company where “common cases” do not exist, we have cared a lot about corner-case problems, bizarre setups, and those problems the vast majority of users never need to care about.

In addition to this, uWSGI supports operational modes only common/available in general-purpose webservers like Apache (I have to say Apache is probably the only general-purpose webserver, as it allows basically anything in its process space in a relatively safe and solid way), so lots of new problems, combined with user bad behaviour, arise.

One of the most challenging development phases of uWSGI was adding multithreading. Threads are powerful, but they are really hard to manage in the right way.

Threads are way cheaper than processes, so you generally allocate dozens of them for your app (remember, unused memory is wasted memory).

Dozens (or hundreds) of threads waiting for the same set of file descriptors bring us back to a thundering herd problem (unless all of your threads are constantly used).

For such a reason, when you enable multiple threads in uWSGI a pthread mutex is allocated, serializing epoll()/kqueue()/poll()/select()… usage in each thread.
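
In sketch form (illustrative only, not the real uWSGI loop, and assuming the epoll fd only watches the listening socket), the per-process serialization looks like this:

  /* Sketch: thread-level serialization of the event wait; only one
   * thread per process sits in epoll_wait() at any time. */
  #include <pthread.h>
  #include <sys/epoll.h>
  #include <sys/socket.h>

  static pthread_mutex_t thunder_mutex = PTHREAD_MUTEX_INITIALIZER;

  /* arg: pointer to an epoll fd with the listening socket registered */
  void *thread_loop(void *arg) {
      int efd = *(int *) arg;

      for (;;) {
          struct epoll_event ev;
          pthread_mutex_lock(&thunder_mutex);
          int n = epoll_wait(efd, &ev, 1, -1);
          int client = (n > 0) ? accept(ev.data.fd, NULL, NULL) : -1;
          pthread_mutex_unlock(&thunder_mutex);
          if (client < 0) continue;
          /* ... manage the request, then close(client) ... */
      }
      return NULL;
  }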

Another problem solved (and, strangely for uWSGI, without the need for an option ;)

But…

The Zeeg problem: Multiple processes with multiple threads

On June 27, 2013, David Cramer wrote an interesting blog post (you may not agree with its conclusions, but that does not matter now; you can continue safely hating uWSGI or making funny jokes about its naming choices or the number of options).

http://cramer.io/2013/06/27/serving-python-web-applications

The problem David faced was a thundering herd so strong that his response times were damaged by it (non-constant performance was the main result of his tests).

Why did it happen? Wasn’t the mutex allocated by uWSGI solving it?

David is (was) running uWSGI with 10 processes, each of them with 10 threads:

  uwsgi --processes 10 --threads 10 ...

While the mutex prevents threads in a single process from calling accept() on the same request, there is no such mechanism (or better, it is not enabled by default, see below) to stop multiple processes from doing it. Given the number of threads (100) available for managing requests, it is unlikely that a single process is completely blocked (read: with all of its 10 threads busy on a request), so welcome back to the thundering herd.

How did David solve it?

uWSGI is a controversial piece of software, no shame in that. There are users fiercely hating it and others morbidly loving it, but all agree that the docs could be way better. ([OT] It is good when all the people agree on something, but pull requests on uwsgi-docs are embarrassingly few and all from the same people… come on, help us!!!)

David used an empirical approach: he spotted his problem and decided to solve it by running independent uWSGI processes bound to different sockets, and configured nginx to round-robin between them.

It is a very elegant approach, but it has a problem: nginx cannot know if the process to which it is sending the request has all of its threads busy. It is a working, but suboptimal, solution.

The best way would be to have inter-process locking (like Apache), serializing all of the accept() calls across both threads and processes.

uWSGI docs suck: --thunder-lock

Michael Hood (you will find his name in the comments of David’s post, too) signalled the problem on the uWSGI mailing list/issue tracker some time ago; he even came up with an initial patch that ended up as the --thunder-lock option (this is why open source is better ;)

--thunder-lock has been available since uWSGI 1.4.6 but never got documentation (of any kind).

Only the people following the mailing-list (or facing the specific problem)know about it.

SysV IPC semaphores are bad, so how did you solve it?

Interprocess locking has been an issue since uWSGI 0.0.0.0.0.1, but we solved it in the first public release of the project (in 2009).

We basically checked each operating system's capabilities and chose the best/fastest IPC locking it could offer, filling our code with dozens of #ifdefs.

When you start uWSGI you should see in its logs which “lock engine” has been chosen.

There is support for a lot of them:

  • pthread mutexes with _PROCESS_SHARED and _ROBUST attributes (modern Linux and Solaris)
  • pthread mutexes with _PROCESS_SHARED (older Linux)
  • OSX Spinlocks (MacOSX, Darwin)
  • POSIX semaphores (FreeBSD >= 9)
  • Windows mutexes (Windows/Cygwin)
  • SysV IPC semaphores (fallback for all the other systems)

Their usage is required for uWSGI-specific features like caching, rpc, and all of those features requiring changes to shared memory structures (allocated with mmap() + _SHARED).

Each of these engines is different from the others, and dealing with them has been a pain, and (more importantly) some of them are not “ROBUST”.

The “ROBUST” term is borrowed from pthreads. If a lock is “robust”, it means that if the process holding it dies, the lock is released.

You would expect that from all of the lock engines, but sadly only a few of them work reliably.
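
In pthread terms, a robust, process-shared mutex looks roughly like the following sketch (generic pattern, not uWSGI's exact code): when the previous owner died while holding the lock, the next pthread_mutex_lock() returns EOWNERDEAD and the caller marks the mutex consistent again.

  /* Sketch of a robust, process-shared pthread mutex; the EOWNERDEAD
   * branch is what makes the lock survive a dead owner. The mutex must
   * live in shared memory (e.g. the mmap() shown earlier). */
  #include <pthread.h>
  #include <errno.h>

  void init_robust(pthread_mutex_t *m) {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
      pthread_mutex_init(m, &attr);
      pthread_mutexattr_destroy(&attr);
  }

  int robust_lock(pthread_mutex_t *m) {
      int ret = pthread_mutex_lock(m);
      if (ret == EOWNERDEAD) {
          /* the previous owner died while holding the lock:
           * mark the mutex usable again and carry on */
          pthread_mutex_consistent(m);
          ret = 0;
      }
      return ret;
  }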

For this reason the uWSGI master process has to allocate an additional thread (the ‘deadlock’ detector) constantly checking for non-robust unreleased locks mapped to dead processes.

It is a pain; however, anyone telling you IPC locking is easy should be accepted into a JEDI school…

uWSGI developers are fu*!ing cowards

Both David Cramer and Graham Dumpleton (yes, he is the mod_wsgi author but has heavily contributed to uWSGI development, as well as to other WSGI servers; this is another reason why open source is better) asked why --thunder-lock is not the default when multiprocess + multithread is requested.

This is a good question with a simple answer: we are cowards who only careabout money.

uWSGI is completely open source, but its development is sponsored (in various ways) by the companies using it and by Unbit.it customers.

Enabling “risky” features by default for a “common” usage (like multiprocess + multithread) is too much for us, and in addition to this, the situation (especially on Linux) of library/kernel incompatibilities is a real pain.

As an example, to have ROBUST pthread mutexes you need a modern kernel with a modern glibc, but commonly used distros (like the CentOS family) mix older kernels with newer glibc, and the opposite too. This leads to the inability to correctly detect the best locking engine for a platform, and so, when the uwsgiconfig.py script is in doubt, it falls back to the safest approach (like non-robust pthread mutexes on Linux).

The deadlock detector should save you from most of the problems, but the word “should” is the key. Writing a test suite (or even a single unit test) for this kind of code is basically impossible (well, at least for me), so we cannot be sure everything is in the right place (and reporting threading bugs is hard for users as well as for skilled developers, unless you work on PyPy ;)

Linux pthread robust mutexes are solid, we are “pretty” sure about that, so you should be able to enable --thunder-lock on modern Linux systems with a 99.999999% success rate, but we prefer (for now) that users consciously enable it.

When SysV IPC semaphores are a better choice

Yes, there are cases in which SysV IPC semaphores give you better results than system-specific features.

Marcin Deranek of Booking.com has been battle-testing uWSGI for months andhelped us with fixing corner-case situations even in the locking area.

He noted that system-specific lock engines tend to favour the kernel scheduler (when choosing which process wins the next lock after an unlock) instead of a round-robin distribution.

As, for their specific needs, an equal distribution of requests among processes is better (they use uWSGI with Perl, so no threading is in place, but they spawn lots of processes), they (currently) choose to use the “ipcsem” lock engine with:

  uwsgi --lock-engine ipcsem --thunder-lock --processes 100 --psgi ....

The funny thing (this time) is that you can easily test whether the lock is working well. Just start blasting the server and you will see in the request logs how the reported pid is different each time, while with system-specific locking the pids are pretty random, with a pretty heavy tendency to favour the last used process.

Funnily enough, the first problem they faced was ipcsem leakage (when you are in an emergency, graceful reload/stop is your enemy and kill -9 will be your silver bullet).

To fix it, the --ftok option is available, allowing you to give a unique id to the semaphore object and to reuse it if it is available from a previous run:

  uwsgi --lock-engine ipcsem --thunder-lock --processes 100 --ftok /tmp/foobar --psgi ....

--ftok takes a file as an argument and will use it to build the unique id. A common pattern is using the pidfile for it.
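
Under the hood this maps to the standard ftok() call; here is a hedged sketch (hypothetical helper, not uWSGI's code) of how a file path becomes a reusable semaphore key:

  /* Sketch of the ftok() pattern behind --ftok: derive a stable SysV key
   * from a file (e.g. the pidfile) so a restarted server reuses the same
   * semaphore instead of leaking a new one. */
  #include <sys/ipc.h>
  #include <sys/sem.h>

  union semun { int val; struct semid_ds *buf; unsigned short *array; };

  int get_or_reuse_sem(const char *path) {
      key_t key = ftok(path, 1);               /* same file -> same key */
      int semid = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
      if (semid >= 0) {
          /* freshly created: initialize to 1 (unlocked) */
          union semun arg = { .val = 1 };
          semctl(semid, 0, SETVAL, arg);
      } else {
          /* it already exists from a previous run: just reattach to it */
          semid = semget(key, 1, 0600);
      }
      return semid;
  }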

What about other portable lock engines?

In addition to “ipcsem”, uWSGI (where available) adds “posixsem” too.

They are used by default only on FreeBSD >= 9, but are available on Linux too.

They are not “ROBUST”, but they do not need shared kernel resources, so if you trust our deadlock detector they are a pretty good approach. (Note: Graham Dumpleton pointed out to me that they can be enabled on Apache 2.x too.)
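
For reference, a sketch of the idea behind “posixsem” (assuming an unnamed, process-shared semaphore placed in shared memory; sem_open()-based named semaphores are the other variant):

  /* Sketch: an unnamed, process-shared POSIX semaphore in shared memory.
   * Not uWSGI's actual code. Note: unlike SysV's SEM_UNDO, nothing
   * releases this if the holder dies (hence the deadlock detector). */
  #include <semaphore.h>
  #include <sys/mman.h>

  static sem_t *accept_sem;

  static void setup_posixsem(void) {
      accept_sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      /* pshared=1: shared between processes; initial value 1 = unlocked */
      sem_init(accept_sem, 1, 1);
  }

  /* in each worker:
   *   sem_wait(accept_sem);   // lock
   *   ... epoll/kqueue wait + accept() ...
   *   sem_post(accept_sem);   // unlock
   */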

Conclusions

You can have the best (or the worst) software in the whole universe, but without docs it does not exist.

The Apache team still slams the faces of the vast majority of us trying to touch their market share :)

Bonus chapter: using the Zeeg approach in a uWSGI friendly way

I have to admit, I am not a big fan of supervisord. It is good software, without a doubt, but I consider the Emperor and the --attach-daemon facilities a better approach to deployment problems. In addition to this, if you want to have a “scriptable”/“extendable” process supervisor, I think Circus (https://circus.readthedocs.io/) is a lot more fun and capable (the first thing I did after implementing socket activation in the uWSGI Emperor was making a pull request [merged, if you care] for the same feature in Circus).

Obviously supervisord works and is used by lots of people, but as a heavy uWSGI user I tend to abuse its features to accomplish a result.

The first approach I would use is binding to 5 different ports and mapping each of them to a specific process:

  [uwsgi]
  processes = 5
  threads = 5

  ; create 5 sockets
  socket = :9091
  socket = :9092
  socket = :9093
  socket = :9094
  socket = :9095

  ; map each socket (zero-indexed) to the specific worker
  map-socket = 0:1
  map-socket = 1:2
  map-socket = 2:3
  map-socket = 3:4
  map-socket = 4:5

Now you have a master monitoring 5 processes, each one bound to a different address (no --thunder-lock needed).

For the Emperor fanboys, you can make a template like this (call it foo.template):

  [uwsgi]
  processes = 1
  threads = 10
  socket = :%n

Now make a symbolic link for each instance+port you want to spawn:

  ln -s foo.template 9091.ini
  ln -s foo.template 9092.ini
  ln -s foo.template 9093.ini
  ln -s foo.template 9094.ini
  ln -s foo.template 9095.ini
  ln -s foo.template 9096.ini

Bonus chapter 2: securing SysV IPC semaphores

My company's hosting platform is heavily based on Linux cgroups and namespaces.

The first (cgroups) are used to limit/account for resource usage, while the second (namespaces) are used to give an “isolated” system view to users (like seeing a dedicated hostname or root filesystem).

As we allow users to spawn PostgreSQL instances in their accounts, we need to limit SysV objects.

Luckily, modern Linux kernels have a namespace for IPC, so calling unshare(CLONE_NEWIPC) will create a whole new set of IPC objects (detached from the others).
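
A minimal sketch of that namespace trick (Linux-specific, and it requires CAP_SYS_ADMIN):

  /* Sketch: detach from the host's SysV IPC namespace before spawning
   * the customer's processes. Linux-only; needs CAP_SYS_ADMIN. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void) {
      if (unshare(CLONE_NEWIPC) < 0) {
          perror("unshare");
          return 1;
      }
      /* from here on, any semget()/shmget() creates objects in a private
       * IPC namespace; 'ipcs' run inside it will show an empty table */
      return 0;
  }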

Calling --unshare ipc in customer-dedicated Emperors is a common approach. When combined with the memory cgroup, you will end up with a pretty secure setup.

Credits:

Author: Roberto De Ioris

Fixed by: Honza Pokorny