12. Debugging and performance issues

When HAProxy is started with the "-d" option, it will stay in the foreground
and will print one line per event, such as an incoming connection, the end of a
connection, and for each request or response header line seen. This debug
output is emitted before the contents are processed, so it does not reflect the
local modifications. The main use is to show the request and response without
having to run a network sniffer. The output is less readable when multiple
connections are handled in parallel, though the "debug2ansi" and "debug2html"
scripts found in the examples/ directory definitely help here by coloring the
output.
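
For example, HAProxy may be started in debug mode roughly like this (the
configuration file path is only an illustration):

  $ haproxy -d -f /etc/haproxy/haproxy.cfg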

If a request or response is rejected because HAProxy finds it is malformed, the
best thing to do is to connect to the CLI and issue "show errors", which will
report the last captured faulty request and response for each frontend and
backend, with all the necessary information to indicate precisely the first
character of the input stream that was rejected. This is sometimes needed to
prove to customers or to developers that a bug is present in their code. In
this case it is often possible to relax the checks (but still keep the
captures) using "option accept-invalid-http-request" or its equivalent for
responses coming from the server, "option accept-invalid-http-response". Please
see the configuration manual for more details.

Example :

  > show errors
  Total events captured on [13/Oct/2015:13:43:47.169] : 1

  [13/Oct/2015:13:43:40.918] frontend HAProxyLocalStats (#2): invalid request
    backend <NONE> (#-1), server <NONE> (#-1), event #0
    src 127.0.0.1:51981, session #0, session flags 0x00000080
    HTTP msg state 26, msg flags 0x00000000, tx flags 0x00000000
    HTTP chunk len 0 bytes, HTTP body len 0 bytes
    buffer flags 0x00808002, out 0 bytes, total 31 bytes
    pending 31 bytes, wrapping at 8040, error at position 13:

    00000  GET /invalid request HTTP/1.1\r\n
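
When the capture shows that an otherwise working client or server simply sends
technically invalid messages, the checks may be relaxed as described above
while keeping the captures. A minimal sketch (the proxy names and addresses are
only examples):

  frontend fe_main
      bind :80
      option accept-invalid-http-request
      default_backend be_app

  backend be_app
      option accept-invalid-http-response
      server app1 192.0.2.10:8080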

The output of "show info" on the CLI provides a number of useful metrics
regarding the maximum connection rate ever reached, maximum SSL key rate ever
reached, and in general all information which can help to explain temporary
issues regarding CPU or memory usage. Example :

  > show info
  Name: HAProxy
  Version: 1.6-dev7-e32d18-17
  Release_date: 2015/10/12
  Nbproc: 1
  Process_num: 1
  Pid: 7949
  Uptime: 0d 0h02m39s
  Uptime_sec: 159
  Memmax_MB: 0
  Ulimit-n: 120032
  Maxsock: 120032
  Maxconn: 60000
  Hard_maxconn: 60000
  CurrConns: 0
  CumConns: 3
  CumReq: 3
  MaxSslConns: 0
  CurrSslConns: 0
  CumSslConns: 0
  Maxpipes: 0
  PipesUsed: 0
  PipesFree: 0
  ConnRate: 0
  ConnRateLimit: 0
  MaxConnRate: 1
  SessRate: 0
  SessRateLimit: 0
  MaxSessRate: 1
  SslRate: 0
  SslRateLimit: 0
  MaxSslRate: 0
  SslFrontendKeyRate: 0
  SslFrontendMaxKeyRate: 0
  SslFrontendSessionReuse_pct: 0
  SslBackendKeyRate: 0
  SslBackendMaxKeyRate: 0
  SslCacheLookups: 0
  SslCacheMisses: 0
  CompressBpsIn: 0
  CompressBpsOut: 0
  CompressBpsRateLim: 0
  ZlibMemUsage: 0
  MaxZlibMemUsage: 0
  Tasks: 5
  Run_queue: 1
  Idle_pct: 100
  node: wtap
  description:
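
These values can also be retrieved from a script through the stats socket,
assuming one was configured in the "global" section (the socket path below is
only an example):

  $ echo "show info" | socat stdio /var/run/haproxy.sock | grep MaxConnRate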

When an issue seems to randomly appear on a new version of HAProxy (eg: every
second request is aborted, occasional crash, etc), it is worth trying to enable
memory poisoning so that each call to malloc() is immediately followed by the
filling of the memory area with a configurable byte. By default this byte is
0x50 (ASCII for 'P'), but any other byte can be used, including zero (which
will have the same effect as a calloc() and which may make issues disappear).
Memory poisoning is enabled on the command line using the "-dM" option. It
slightly hurts performance and is not recommended for use in production. If
an issue happens all the time with it or never happens when poisoning uses
byte zero, it clearly means you've found a bug and you definitely need to
report it. Otherwise, if there is no clear change, the problem is probably not
related to memory.
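
For example (the byte passed to "-dM" is given as two hexadecimal digits; check
"haproxy -h" for the exact syntax supported by your version, and note that the
configuration file path is only an illustration):

  $ haproxy -d -dM -f /etc/haproxy/haproxy.cfg      # default byte 0x50
  $ haproxy -d -dM00 -f /etc/haproxy/haproxy.cfg    # poison with byte zero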

When debugging some latency issues, it is important to use both strace and
tcpdump on the local machine, and another tcpdump on the remote system. The
reason for this is that there are delays everywhere in the processing chain and
it is important to know which one is causing latency to know where to act. In
practice, the local tcpdump will indicate when the input data come in. Strace
will indicate when haproxy receives these data (using recv/recvfrom). Warning,
openssl uses read()/write() syscalls instead of recv()/send(). Strace will also
show when haproxy sends the data, and tcpdump will show when the system sends
these data to the interface. Then the external tcpdump will show when the data
sent are really received (since the local one only shows when the packets are
queued). The benefit of sniffing on the local system is that strace and tcpdump
will use the same reference clock. Strace should be used with "-tts200" to get
complete timestamps and report large enough chunks of data to read them.
Tcpdump should be used with "-nvvttSs0" to report full packets, real sequence
numbers and complete timestamps.
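
For example (assuming a single haproxy process; the interface name and port
below are only illustrations and should be adjusted to the traffic being
observed):

  $ strace -tts200 -p $(pidof haproxy) -o haproxy.strace
  $ tcpdump -nvvttSs0 -i eth0 port 80 -w local.pcap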

In practice, received data are almost always immediately received by haproxy
(unless the machine has a saturated CPU or these data are invalid and not
delivered). If these data are received but not sent, it generally is because
the output buffer is saturated (ie: recipient doesn't consume the data fast
enough). This can be confirmed by seeing that the polling doesn't notify of
the ability to write on the output file descriptor for some time (it's often
easier to spot in the strace output when the data finally leave and then roll
back to see when the write event was notified). It generally matches an ACK
received from the recipient, and detected by tcpdump. Once the data are sent,
they may spend some time in the system doing nothing. Here again, the TCP
congestion window may be limited and not allow these data to leave, waiting for
an ACK to open the window. If the traffic is idle and the data take 40 ms or
200 ms to leave, it's a different phenomenon (and not actually a problem): the
Nagle algorithm prevents incomplete packets from leaving immediately, in the
hope that they will be merged with subsequent data. HAProxy automatically
disables Nagle in pure TCP mode and in tunnels. However it definitely remains
enabled when forwarding an HTTP body (and this contributes to the performance
improvement there by reducing the number of packets). Some non-compliant HTTP
applications may be sensitive to the latency when delivering incomplete HTTP
response messages. In this case you will have to enable "option http-no-delay"
to disable Nagle in order to work around their design, keeping in mind that any
other proxy in the chain may similarly be impacted. If tcpdump reports that data
leave immediately but the other end doesn't see them quickly, it can mean there
is a congested WAN link, a congested LAN with flow control enabled and
preventing the data from leaving, or more commonly that HAProxy is in fact
running in a virtual machine and that for whatever reason the hypervisor has
decided that the data didn't need to be sent immediately. In virtualized
environments, latency issues are almost always caused by the virtualization
layer, so in order to save time, it's worth first comparing tcpdump in the VM
and on the external components. Any difference has to be credited to the
hypervisor and its accompanying drivers.
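
When such a non-compliant application has been identified, "option
http-no-delay" can be set on the proxy that faces it. A minimal sketch (the
backend name and server address are only examples):

  backend be_legacy_app
      option http-no-delay
      server app1 192.0.2.20:8080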

When some TCP SACK segments are seen in tcpdump traces (using -vv), it always
means that the side sending them has proof of a lost packet. While not seeing
them doesn't mean there are no losses, seeing them definitely means the
network is lossy. Losses are normal on a network, but at a rate where SACKs are
not noticeable to the naked eye. If they appear a lot in the traces, it is
worth investigating exactly what happens and where the packets are lost. HTTP
doesn't cope well with TCP losses, which introduce huge latencies.
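
A quick way to spot SACK blocks in a capture is to let tcpdump decode the TCP
options and count the matching packets (the file name is only an example; the
"sackOK" option seen in SYN packets merely advertises the capability and is
deliberately not matched here):

  $ tcpdump -nvvtt -r local.pcap | grep -c 'sack [0-9]'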

The "netstat -i" command will report statistics per interface. An interface
where the Rx-Ovr counter grows indicates that the system doesn't have enough
resources to receive all incoming packets and that they're lost before being
processed by the network driver. Rx-Drp indicates that some received packets
were lost in the network stack because the application doesn't process them
fast enough. This can happen during some attacks as well. Tx-Drp means that
the output queues were full and packets had to be dropped. When using TCP it
should be very rare, but will possibly indicate a saturated outgoing link.
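
For example, to watch these counters evolve in near real time (the one-second
interval and interface name are only illustrations; "ip -s link" reports
similar per-interface statistics on systems without net-tools):

  $ watch -n1 'netstat -i'
  $ ip -s link show dev eth0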