3.5. Sizing

Typical CPU usage figures show 15% of the processing time spent in HAProxy
versus 85% in the kernel in TCP or HTTP close mode, and about 30% for HAProxy
versus 70% for the kernel in HTTP keep-alive mode. This means that the
operating system and its tuning have a strong impact on the global
performance.

Usage varies a lot between users: some focus on bandwidth, others on request
rate, others on connection concurrency, and others on SSL performance. This
section aims at providing a few elements to help with this task.
It is important to keep in mind that every operation comes with a cost, so
each individual operation adds its overhead on top of the other ones, which
may be negligible in certain circumstances, and which may dominate in other
cases.

When processing the requests from a connection, we can say that :

  - forwarding data costs less than parsing request or response headers;

  - parsing request or response headers costs less than establishing then
    closing a connection to a server;

  - establishing and closing a connection costs less than a TLS resume
    operation;

  - a TLS resume operation costs less than a full TLS handshake with a key
    computation;

  - an idle connection costs less CPU than a connection whose buffers hold
    data;

  - a TLS context costs even more memory than a connection with data;
So in practice, it is cheaper to process payload bytes than header bytes,
thus it is easier to achieve high network bandwidth with large objects (few
requests per volume unit) than with small objects (many requests per volume
unit). This explains why maximum bandwidth is always measured with large
objects, while request rate or connection rates are measured with small
objects.

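To put rough numbers on this relationship, the small sketch below converts a
target bandwidth into the request rate needed to sustain it. This is only an
illustration; the object sizes are arbitrary examples, not measurements :

    # Illustrative sketch: request rate required to saturate a given
    # bandwidth as a function of object size. The object sizes are
    # arbitrary examples, not figures from this document.
    BYTES_PER_GBPS = 1e9 / 8  # bytes per second in one Gbps

    def requests_per_second(bandwidth_gbps, object_size_bytes):
        """Requests per second needed to fill the link, ignoring
        header and protocol overhead."""
        return bandwidth_gbps * BYTES_PER_GBPS / object_size_bytes

    for size in (1_000_000, 100_000, 1_000):  # 1 MB, 100 kB, 1 kB
        rate = requests_per_second(10, size)  # 10 Gbps link
        print(f"{size:>9} B objects -> {rate:>12,.0f} req/s")

Filling a 10 Gbps link with 1 MB objects takes about 1250 requests per
second, while doing the same with 1 kB objects takes about 1.25 million,
which is why small objects are bound by request processing rather than by
bandwidth.
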
Some operations scale well on multiple processes spread over multiple CPUs,
and others don't scale as well. Network bandwidth doesn't scale very far
because the CPU is rarely the bottleneck for large objects; the limits are
mostly the network bandwidth and the data buses to reach the network
interfaces. The connection rate doesn't scale well over multiple processors
due to a few locks in the system when dealing with the local ports table. The
request rate over persistent connections scales very well as it doesn't
involve much memory nor network bandwidth and doesn't require accessing
locked structures. TLS key computation scales very well as it's totally
CPU-bound. TLS resume scales moderately well, but reaches its limits around 4
processes, where the overhead of accessing the shared table offsets the small
gains expected from more power.

The performance numbers one can expect from a very well tuned system are in
the following range. It is important to take them as orders of magnitude and
to expect significant variations in any direction based on the processor, IRQ
settings, memory type, network interface type, operating system tuning and so
on.

The following numbers were found on a Core i7 running at 3.7 GHz equipped
with a dual-port 10 Gbps NIC, running Linux kernel 3.10, HAProxy 1.6 and
OpenSSL 1.0.2. HAProxy was running as a single process on a single dedicated
CPU core, and two extra cores were dedicated to network interrupts :

  - 20 Gbps of maximum network bandwidth in clear text for objects 256 kB or
    higher, 10 Gbps for 41 kB or higher;

  - 4.6 Gbps of TLS traffic using AES256-GCM cipher with large objects;

  - 83000 TCP connections per second from client to server;

  - 82000 HTTP connections per second from client to server;

  - 97000 HTTP requests per second in server-close mode (keep-alive with the
    client, close with the server);

  - 243000 HTTP requests per second in end-to-end keep-alive mode;

  - 300000 filtered TCP connections per second (anti-DDoS);

  - 160000 HTTPS requests per second in keep-alive mode over persistent TLS
    connections;

  - 13100 HTTPS requests per second using TLS resumed connections;

  - 1300 HTTPS connections per second using TLS connections renegotiated with
    RSA2048;

  - 20000 concurrent saturated connections per GB of RAM, including the
    memory required for system buffers; it is possible to do better with
    careful tuning, but this result is easy to achieve;

  - about 8000 concurrent TLS connections (client-side only) per GB of RAM,
    including the memory required for system buffers;

  - about 5000 concurrent end-to-end TLS connections (both sides) per GB of
    RAM, including the memory required for system buffers.

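Applied to memory sizing, the last three figures translate directly into a
RAM estimate. The sketch below is only a hedged illustration; the target
workload numbers are made up for the example :

    # Rough memory estimate from the per-GB concurrency figures quoted
    # above. The target workload below is a made-up example.
    CONN_PER_GB = {
        "clear":       20000,   # saturated clear-text connections
        "tls_client":   8000,   # TLS on the client side only
        "tls_end2end":  5000,   # TLS on both sides
    }

    def ram_gb(concurrent, kind):
        """GB of RAM, including system buffers, for a given number of
        concurrent connections of one kind."""
        return concurrent / CONN_PER_GB[kind]

    # Example: 100000 clear-text plus 20000 client-side TLS connections
    total = ram_gb(100000, "clear") + ram_gb(20000, "tls_client")
    print(f"approximately {total:.1f} GB of RAM")  # ~7.5 GB
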
Thus a good rule of thumb to keep in mind is that the request rate is divided
by 10 between TLS keep-alive and TLS resume, and between TLS resume and TLS
renegotiation, while it's only divided by 3 between HTTP keep-alive and HTTP
close. Another good rule of thumb is to remember that a high frequency core
with AES instructions can do around 5 Gbps of AES-GCM per core.

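As a sanity check, applying these ratios to the measured keep-alive rates
quoted above lands close to the measured resume and renegotiation rates; the
sketch below merely restates that arithmetic :

    # Rule-of-thumb ratios applied to the measured rates quoted above.
    https_keepalive = 160000              # req/s over persistent TLS
    tls_resume = https_keepalive / 10     # ~16000, measured: 13100
    tls_renego = tls_resume / 10          # ~1600,  measured: 1300

    http_keepalive = 243000               # req/s, end-to-end keep-alive
    http_close = http_keepalive / 3       # ~81000, measured: 82000

    # One core with AES instructions at ~5 Gbps of AES-GCM means four
    # such cores give on the order of 20 Gbps of TLS payload throughput.
    print(tls_resume, tls_renego, http_close)
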
Having more cores rarely helps (except for TLS) and is even
counter-productive due to the lower frequency. In general a small number of
high frequency cores is better.

Another good rule of thumb is to consider that on the same server, HAProxy
will be able to saturate :

  - about 5-10 static file servers or caching proxies;

  - about 100 anti-virus proxies;

  - and about 100-1000 application servers depending on the technology in
    use.