Retries and Timeouts

Automatic retries are one of the most powerful and useful mechanisms a service mesh has for gracefully handling partial or transient application failures. If implemented incorrectly, retries can amplify small errors into system-wide outages. For that reason, we made sure they were implemented in a way that would increase the reliability of the system while limiting the risk.

Timeouts work hand in hand with retries. Once requests are retried a certain number of times, it becomes important to limit the total amount of time a client waits before giving up entirely. Imagine a number of retries forcing a client to wait for 10 seconds.
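To make the interaction between retries and an overall timeout concrete, here is a minimal Python sketch of a retry loop bounded by a total deadline. It is an illustration only, not how the Linkerd proxy is implemented; the function name `call_with_deadline` and its parameters are hypothetical. Note that this sketch only checks the deadline between attempts and does not interrupt an attempt that is already in flight.

```python
import time
from typing import Callable, Optional


def call_with_deadline(
    attempt: Callable[[], bool],
    max_attempts: int,
    total_timeout: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Retry `attempt` until it succeeds, attempts run out, or time runs out.

    Returns the 1-based number of the attempt that succeeded, or None if the
    caller gave up. `clock` is injectable so the behavior can be tested
    without real waiting (a hypothetical convenience, not a Linkerd setting).
    """
    deadline = clock() + total_timeout
    for n in range(1, max_attempts + 1):
        if clock() >= deadline:
            # Out of time: give up even though attempts remain.
            return None
        if attempt():
            return n
    return None
```

With a 10 second total timeout and attempts that each take 2.5 seconds to fail, this loop gives up after four attempts, no matter how many attempts were nominally allowed.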

A service profile may define certain routes as retryable or specify timeouts for routes. This will cause the Linkerd proxy to perform the appropriate retries or timeouts when calling that service. Retries and timeouts are always performed on the outbound (client) side.

These can be set up by following the guides:

How Retries Can Go Wrong

Traditionally, when performing retries, you must specify a maximum number of retry attempts before giving up. Unfortunately, there are two major problems with configuring retries this way.

Choosing a maximum number of retry attempts is a guessing game

You need to pick a number that’s high enough to make a difference; allowing more than one retry attempt is usually prudent and, if your service is less reliable, you’ll probably want to allow several retry attempts. On the other hand, allowing too many retry attempts can generate a lot of extra requests and extra load on the system. Performing a lot of retries can also seriously increase the latency of requests that need to be retried. In practice, you usually pick a maximum retry attempts number out of a hat (3?) and then tweak it through trial and error until the system behaves roughly how you want it to.

Systems configured this way are vulnerable to retry storms

A retry storm begins when one service starts (for any reason) to experience a larger than normal failure rate. This causes its clients to retry those failed requests. The extra load from the retries causes the service to slow down further and fail more requests, triggering more retries. If each client is configured to retry up to 3 times, this can quadruple the number of requests being sent! To make matters even worse, if any of the clients’ clients are configured with retries, the number of retries compounds multiplicatively and can turn a small number of errors into a self-inflicted denial-of-service attack.
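The multiplication described above can be worked through with a few lines of Python. This is a worst-case arithmetic sketch that assumes every attempt fails at every layer; the function name `worst_case_requests` is hypothetical.

```python
def worst_case_requests(max_retries: int, layers: int = 1) -> int:
    """Worst-case requests reaching the failing service per original request.

    Each retrying layer sends 1 original attempt plus up to `max_retries`
    retries, so every layer multiplies the traffic by (1 + max_retries).
    """
    if max_retries < 0 or layers < 1:
        raise ValueError("max_retries must be >= 0 and layers must be >= 1")
    return (1 + max_retries) ** layers


# One layer of clients retrying up to 3 times: 4x the traffic.
print(worst_case_requests(3))      # 4
# Two stacked layers, each retrying up to 3 times: 16x the traffic.
print(worst_case_requests(3, 2))   # 16
```

Under these assumptions, a chain of three retrying layers would send up to 64 requests to the struggling service for each request that entered the system.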

Retry Budgets to the Rescue

To avoid the problems of retry storms and arbitrary numbers of retry attempts, retries are configured using retry budgets. Rather than specifying a fixed maximum number of retry attempts per request, Linkerd keeps track of the ratio between regular requests and retries and keeps this number below a configurable limit. For example, you may specify that you want retries to add at most 20% more requests. Linkerd will then retry as much as it can while maintaining that ratio.
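The idea of a retry budget can be sketched in Python as a pair of counters. This is a deliberately simplified illustration, not Linkerd's implementation: the class `RetryBudget` and its methods are hypothetical names, and the sketch counts requests forever rather than over a recent time window, as a production implementation would likely need to do.

```python
class RetryBudget:
    """Allow retries only while they stay within a fixed ratio of requests."""

    def __init__(self, retry_ratio: float) -> None:
        if retry_ratio < 0:
            raise ValueError("retry_ratio must be non-negative")
        self.retry_ratio = retry_ratio  # e.g. 0.2 means at most 20% extra load
        self.requests = 0               # regular (non-retry) requests seen
        self.retries = 0                # retries permitted so far

    def record_request(self) -> None:
        """Count one regular request; each one earns a fraction of a retry."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Return True and spend budget if one more retry keeps us in ratio."""
        if self.retries + 1 <= self.retry_ratio * self.requests:
            self.retries += 1
            return True
        return False


budget = RetryBudget(retry_ratio=0.2)
for _ in range(10):
    budget.record_request()

# With 10 regular requests and a 20% budget, two retries are permitted.
print([budget.try_retry() for _ in range(3)])  # [True, True, False]
```

Because the budget is shared across all requests rather than fixed per request, a burst of failures cannot generate more than the configured share of extra load, which is what prevents a retry storm.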

Configuring retries is always a trade-off between improving success rate and not adding too much extra load to the system. Retry budgets make that trade-off explicit by letting you specify exactly how much extra load your system is willing to accept from retries.