7. CPU usage

HAProxy normally spends most of its time in the system and a smaller part in
userland. A finely tuned 3.5 GHz CPU can sustain a rate of about 80000
end-to-end connection setups and closes per second at 100% CPU on a single
core. When one core is saturated, typical figures are :
  - 95% system, 5% user for long TCP connections or large HTTP objects
  - 85% system and 15% user for short TCP connections or small HTTP objects in
    close mode
  - 70% system and 30% user for small HTTP objects in keep-alive mode

The amount of rules processing and regular expressions will increase the
userland part. The presence of firewall rules, connection tracking and complex
routing tables in the system will instead increase the system part.

On most systems, the CPU time observed during network transfers can be cut in
4 parts :
  - the interrupt part, which concerns all the processing performed upon I/O
    receipt, before the target process is even known. Typically Rx packets are
    accounted for in interrupt. On some systems such as Linux where interrupt
    processing may be deferred to a dedicated thread, it can appear as softirq,
    and the thread is called ksoftirqd/0 (for CPU 0). The CPU taking care of
    this load is generally defined by the hardware settings, though in the case
    of softirq it is often possible to remap the processing to another CPU.
    This interrupt part will often be perceived as parasitic since it's not
    associated with any process, but it actually is some processing being done
    to prepare the work for the process.

  - the system part, which concerns all the processing done using kernel code
    called from userland. System calls are accounted as system for example. All
    synchronously delivered Tx packets will be accounted for as system time. If
    some packets have to be deferred due to queues filling up, they may then be
    processed in interrupt context later (eg: upon receipt of an ACK opening a
    TCP window).

  - the user part, which exclusively runs application code in userland. HAProxy
    runs exclusively in this part, though it makes heavy use of system calls.
    Rules processing, regular expressions, compression, encryption all add to
    the user portion of CPU consumption.

  - the idle part, which is what the CPU does when there is nothing to do. For
    example HAProxy waits for an incoming connection, or waits for some data to
    leave, meaning the system is waiting for an ACK from the client to push
    these data.
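
These parts can be observed per CPU with the "mpstat" utility from the sysstat
package, assuming it is installed : the %usr column maps to the user part,
%sys to the system part, %irq and %soft to the interrupt part, and %idle to
the idle part :

    $ mpstat -P ALL 1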

In practice regarding HAProxy's activity, it is in general reasonably accurate
(though never strictly exact) to consider that interrupt/softirq time is
caused by Rx processing in kernel drivers, that userland time is caused by
layer 7 processing in HAProxy, and that system time is caused by network
processing on the Tx path.

Since HAProxy runs around an event loop, it waits for new events using poll()
(or any alternative) and processes all these events as fast as possible before
going back to poll() waiting for new events. It measures the time spent
waiting in poll() compared to the time spent processing events. The ratio of
polling time vs total time is called the "idle" time : it is the amount of
time spent waiting for something to happen. This ratio is reported on the
stats page on the "idle" line, or as "Idle_pct" on the CLI. When it is close
to 100%, it means the load is extremely low. When it is close to 0%, it means
that there is constantly some activity. While it cannot be very accurate on an
overloaded system, due to other processes possibly preempting the CPU from the
haproxy process, it still provides a good estimate of how HAProxy considers it
is working : if the load is low and the idle ratio is low as well, it may
indicate that HAProxy has a lot of work to do, possibly due to very expensive
rules that have to be processed. Conversely, if HAProxy indicates the idle
ratio is close to 100% while things are slow, it means that it cannot do
anything to speed things up because it is already waiting for incoming data to
process. In the example below, haproxy is completely idle :

    $ echo "show info" | socat - /var/run/haproxy.sock | grep ^Idle
    Idle_pct: 100

When the idle ratio starts to become very low, it is important to tune the
system and to place processes and interrupts correctly so as to save as much
CPU as possible for all tasks. If a firewall is present, it may be worth
trying to disable it, or to tune it so as to ensure it is not responsible for
a large part of the performance limitation. It is worth noting that unloading
a stateful firewall generally reduces both the interrupt/softirq and the
system usage, since such firewalls act on both the Rx and the Tx paths. On
Linux, unloading the nf_conntrack and ip_conntrack modules will show whether
there is anything to gain. If so, then the module runs with default settings
and you will have to figure out how to tune it for better performance. In
general this consists of considerably increasing the hash table size. On
FreeBSD, "pfctl -d" will disable the "pf" firewall and its stateful engine at
the same time.
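
For example, on Linux the following commands may be used to check whether
connection tracking is loaded, to unload it for a quick test, and to enlarge
its tables if it has to stay. The exact values are only indicative and depend
on the traffic :

    $ lsmod | grep conntrack                  # is conntrack loaded at all ?
    $ rmmod nf_conntrack                      # unload it for a test (dependent
                                              # modules must be removed first)
    $ sysctl -w net.netfilter.nf_conntrack_max=1048576
    $ echo 262144 > /sys/module/nf_conntrack/parameters/hashsize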

If it is observed that a lot of time is spent in interrupt/softirq, it is
important to ensure that they do not run on the same CPU. Most systems tend to
pin the tasks on the CPU where they receive the network traffic, because for
certain workloads it improves things. But with heavily network-bound workloads
it is the opposite, as the haproxy process will have to fight against its
kernel counterpart. Pinning haproxy to one CPU core and the interrupts to
another one, all sharing the same L3 cache, tends to noticeably increase
network performance, because in practice the amounts of work for haproxy and
for the network stack are quite close, so each can almost fill an entire CPU.
On Linux this is done using taskset (for haproxy) or using cpu-map (from the
haproxy config), and the interrupts are assigned under /proc/irq. Many network
interfaces support multiple queues and multiple interrupts. In general it
helps to spread them across a small number of CPU cores provided they all
share the same L3 cache. Always stop irqbalance, which always does the worst
possible thing on such workloads.
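
For example, haproxy may be pinned to core 1 and a NIC's interrupt to core 0
as follows. The IRQ number (123 here) is purely illustrative and must be
looked up in /proc/interrupts :

    $ systemctl stop irqbalance              # keep it from undoing the tuning
    $ taskset -pc 1 $(pidof haproxy)         # pin the haproxy process to CPU 1
    $ echo 1 > /proc/irq/123/smp_affinity    # hex mask 1 = CPU 0 for this IRQ

The same pinning may be performed from the configuration instead, using the
"cpu-map" directive in the global section :

    global
        cpu-map 1 1    # bind process 1 to CPU core 1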

For CPU-bound workloads consisting of a lot of SSL traffic or a lot of
compression, it may be worth using multiple processes dedicated to certain
tasks, though there is no universal rule here and some experimentation will
have to be performed.

In order to increase the CPU capacity, it is possible to make HAProxy run as
several processes, using the "nbproc" directive in the global section (a
minimal example follows the list below). There are some limitations though :
  - health checks are run per process, so the target servers will get as many
    checks as there are running processes ;
  - maxconn values and queues are per-process, so the correct value must be
    set to avoid overloading the servers ;
  - outgoing connections should avoid using port ranges to avoid conflicts ;
  - stick-tables are per process and are not shared between processes ;
  - each peers section may only run on a single process at a time ;
  - the CLI operations will only act on a single process at a time.
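
With the following global section (the values are only illustrative), each of
the 4 processes accepts up to 10000 concurrent connections, so the servers
must be sized for up to 4 times that amount and will receive 4 times the
health check rate :

    global
        nbproc 4
        maxconn 10000    # per process : up to 4 x 10000 connections in total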

With this in mind, it appears that the easiest setup often consists in having
a first layer running on multiple processes and in charge of the heavy
processing, passing the traffic to a second layer running in a single process.
This mechanism is well suited to SSL and compression, which are the two
CPU-heavy features. Instances can easily be chained over UNIX sockets (which
are cheaper than TCP sockets and which do not waste ports), and the PROXY
protocol is useful to pass client information to the next stage. When doing
so, it is generally a good idea to bind all the single-process tasks to
process number 1 and extra tasks to the next processes, as this will make it
easier to generate similar configurations for different machines.
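
Below is a minimal sketch of such a two-layer setup ; the socket path, the
certificate and the process count are only illustrative, and suitable defaults
(mode, timeouts) are assumed. Processes 2 to 4 offload SSL and forward the
decrypted traffic over a UNIX socket with the PROXY protocol to process 1,
which runs the single-process tasks :

    global
        nbproc 4

    frontend ssl-offload
        bind-process 2-4
        bind :443 ssl crt /etc/haproxy/site.pem
        default_backend to-clear

    backend to-clear
        server clear unix@/var/run/haproxy-clear.sock send-proxy

    frontend clear
        bind-process 1
        bind unix@/var/run/haproxy-clear.sock accept-proxy
        default_backend servers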

On Linux versions 3.9 and above, running HAProxy in multi-process mode is much
more efficient when each process uses a distinct listening socket on the same
IP:port ; this will make the kernel evenly distribute the load across all
processes instead of waking them all up. Please check the "process" option of
the "bind" keyword lines in the configuration manual for more information.
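
For example, the following bind lines create one listening socket per process
on the same port, letting the kernel spread incoming connections across them
(via SO_REUSEPORT on Linux 3.9 and above) :

    frontend web
        bind :80 process 1
        bind :80 process 2
        bind :80 process 3
        bind :80 process 4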