Configuring Retries

For Linkerd to retry failed requests automatically, two questions need to be answered:

  • Which requests should be retried?
  • How many times should the requests be retried?

Both of these questions can be answered by specifying a bit of extra information in the service profile for the service you're sending requests to.

These pieces of configuration are required because retries can be dangerous. Automatically retrying a request that changes state (e.g. a request that submits a financial transaction) could negatively impact your users' experience. In addition, retries increase the load on your system. A set of services whose requests are constantly being retried could be taken down by the retries instead of being allowed time to recover.

Check out the retries section of the books demo for a tutorial on how to configure retries.

Retries

For routes that are idempotent and don't have bodies, you can edit the service profile and add isRetryable to the retryable route:

  spec:
    routes:
    - name: GET /api/annotations
      condition:
        method: GET
        pathRegex: /api/annotations
      isRetryable: true ### ADD THIS LINE ###

Retry Budgets

A retry budget is a mechanism that limits the number of retries that can be performed against a service as a percentage of original requests. This prevents retries from overwhelming your system. By default, retries may add at most an additional 20% to the request load (plus an additional 10 “free” retries per second). These settings can be adjusted by setting a retryBudget on your service profile.

  spec:
    retryBudget:
      retryRatio: 0.2
      minRetriesPerSecond: 10
      ttl: 10s
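To make the budget arithmetic concrete, here is a minimal Python sketch of how the three retryBudget fields shown above could combine into an upper bound on retries. It is a simplified model under stated assumptions, not Linkerd's actual implementation: the RetryBudget class and max_retries_in_window function are hypothetical names, and the sketch assumes that ttl is the window over which original requests are counted and that the "free" retries accumulate once per second over that window.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Hypothetical model of the retryBudget fields in a service profile."""
    retry_ratio: float            # retryRatio: extra load allowed, as a fraction
    min_retries_per_second: int   # minRetriesPerSecond: "free" retries per second
    ttl_seconds: float            # ttl: assumed window for counting requests


def max_retries_in_window(budget: RetryBudget, original_requests: int) -> float:
    """Upper bound on retries for one ttl window (simplified assumption)."""
    # Retries earned as a fraction of the original requests in the window.
    from_ratio = budget.retry_ratio * original_requests
    # "Free" retries, assumed to accrue once per second across the window.
    free = budget.min_retries_per_second * budget.ttl_seconds
    return from_ratio + free


# Values taken from the retryBudget example: 20% ratio, 10 free retries/s, 10s ttl.
budget = RetryBudget(retry_ratio=0.2, min_retries_per_second=10, ttl_seconds=10)

# Hypothetical traffic: 1000 original requests seen during the last 10 seconds.
print(max_retries_in_window(budget, 1000))
```

Under these assumptions, 1000 original requests would allow 200 retries from the ratio plus 100 free retries, for roughly 300 retries in the window. With no traffic at all, only the 100 free retries would remain available.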

Monitoring Retries

Retries can be monitored by using the linkerd routes command with the --to flag and the -o wide flag. Since retries are performed on the client side, we need to use the --to flag to see metrics for requests that one resource is sending to another (from the server's point of view, retries are just regular requests). When both of these flags are specified, the linkerd routes command will differentiate between “effective” and “actual” traffic.

  ROUTE                    SERVICE  EFFECTIVE_SUCCESS  EFFECTIVE_RPS  ACTUAL_SUCCESS  ACTUAL_RPS  LATENCY_P50  LATENCY_P95  LATENCY_P99
  HEAD /authors/{id}.json  authors  100.00%            2.8rps         58.45%          4.7rps      7ms          25ms         37ms
  [DEFAULT]                authors  0.00%              0.0rps         0.00%           0.0rps      0ms          0ms          0ms

Actual requests represent all requests that the client actually sends, including original requests and retries. Effective requests only count the original requests. Since an original request may trigger one or more retries, the actual request volume is usually higher than the effective request volume when retries are enabled. Since an original request may fail the first time, but a retry of that request might succeed, the effective success rate is usually (but not always) higher than the actual success rate.
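The relationship between the two views can be illustrated with a small Python sketch. The success_rates function and the request counts are hypothetical, chosen only for illustration; the sketch assumes that each original request is counted as successful at most once and that every attempt (original or retry) counts toward the actual volume. It does not model the edge cases in which the effective rate ends up lower than the actual rate.

```python
def success_rates(original_requests: int, retries: int, successes: int):
    """Return (effective, actual) success rates under a simplified model.

    original_requests: requests as the caller issued them (effective volume)
    retries:           extra attempts sent by the client on failure
    successes:         original requests that eventually succeeded
    """
    # Actual volume includes every attempt the client put on the wire.
    actual_requests = original_requests + retries
    effective = successes / original_requests
    actual = successes / actual_requests
    return effective, actual


# Hypothetical numbers: 100 original requests, 70 retries, all eventually succeed.
effective, actual = success_rates(original_requests=100, retries=70, successes=100)
print(f"effective success: {effective:.2%}")
print(f"actual success:    {actual:.2%}")
```

In this illustration the caller sees every request succeed, while a large share of the individual attempts failed and had to be retried, which is the same pattern as the HEAD /authors/{id}.json row in the sample output.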