Demo: Books

This demo is of a Ruby application that helps you manage your bookshelf. It consists of multiple microservices and uses JSON over HTTP to communicate with the other services. There are three services:

  • webapp: the frontend

  • authors: an API to manage the authors in the system

  • books: an API to manage the books in the system

For demo purposes, the app comes with a simple traffic generator. The overall topology looks like this:

[Image: Topology]

Prerequisites

To use this guide, you’ll need to have Linkerd installed on your cluster. Follow the Installing Linkerd Guide if you haven’t already done this.
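If you’re not sure whether your installation is healthy, the Linkerd CLI can verify it for you. A quick sanity check (run against your own cluster) looks like:

```shell
# Verify that the Linkerd CLI can reach the cluster and that the
# control plane is installed and healthy.
linkerd check
```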

Install the app

To get started, let’s install the books app onto your cluster. In your local terminal, run:

  kubectl create ns booksapp && \
    curl -sL https://run.linkerd.io/booksapp.yml \
    | kubectl -n booksapp apply -f -

This command creates a namespace for the demo, downloads its Kubernetes resource manifest, and uses kubectl to apply it to your cluster. The app comprises the Kubernetes deployments and services that run in the booksapp namespace.

Downloading a bunch of containers for the first time takes a little while. Kubernetes can tell you when all the services are running and ready for traffic. Wait for that to happen by running:

  kubectl -n booksapp rollout status deploy webapp

You can also take a quick look at all the components that were added to your cluster by running:

  kubectl -n booksapp get all

Once the rollout has completed successfully, you can access the app itself by port-forwarding webapp locally:

  kubectl -n booksapp port-forward svc/webapp 7000 &

Open http://localhost:7000/ in your browser to see the frontend.

[Image: Frontend]

Unfortunately, there is an error in the app: if you click Add Book, it will fail 50% of the time. This is a classic case of non-obvious, intermittent failure, the type that drives service owners mad because it is so difficult to debug. Kubernetes itself cannot detect or surface this error. From Kubernetes’s perspective, it looks like everything’s fine, but you know the application is returning errors.

[Image: Failure]

Add Linkerd to the service

Now we need to add the Linkerd data plane proxies to the service. The easiest option is to do something like this:

  kubectl get -n booksapp deploy -o yaml \
    | linkerd inject - \
    | kubectl apply -f -

This command retrieves the manifest of all deployments in the booksapp namespace, runs them through linkerd inject, and then re-applies it with kubectl apply. The linkerd inject command annotates each resource to specify that it should have the Linkerd data plane proxies added, and Kubernetes does this when the manifest is reapplied to the cluster. Best of all, since Kubernetes does a rolling deploy, the application stays running the entire time. (See Automatic Proxy Injection for more details on how this works.)
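Concretely, the annotation that linkerd inject adds to each workload’s pod template is linkerd.io/inject: enabled. A minimal sketch of an annotated deployment (all unrelated fields elided) might look like:

```yaml
# Sketch: the annotation linkerd inject adds to a deployment's pod
# template. Fields other than the annotation are elided here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: booksapp
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled   # proxy injector adds the sidecar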

Debugging

Let’s use Linkerd to discover the root cause of this app’s failures. To check out the Linkerd dashboard, run:

  linkerd dashboard &

[Image: Dashboard]

Select booksapp from the namespace dropdown and click on the Deployments workload. You should see all the deployments in the booksapp namespace show up. There will be success rate, requests per second, and latency percentiles.

That’s cool, but you’ll notice that the success rate for webapp is not 100%. This is because the traffic generator is submitting new books. You can do the same thing yourself and push that success rate even lower. Click on webapp in the Linkerd dashboard for a live debugging session.

You should now be looking at the detail view for the webapp service. You’ll see that webapp is taking traffic from traffic (the load generator), and it has two outgoing dependencies: authors and books. One is the service for pulling in author information and the other is the service for pulling in book information.

[Image: Detail]

A failure in a dependent service may be exactly what’s causing the errors that webapp is returning (and the errors you as a user can see when you click). We can see that the books service is also failing. If we scroll a little further down the page, we’ll see a live list of all traffic endpoints that webapp is receiving. This is interesting:

[Image: Top]

Aha! We can see that inbound traffic coming from the webapp service going to the books service is failing a significant percentage of the time. That could explain why webapp was throwing intermittent failures. Let’s click on the 🔬 icon to look at the actual request and response stream.

[Image: Tap]

Indeed, many of these requests are returning 500s.

It was surprisingly easy to diagnose an intermittent issue that affected only a single route. You now have everything you need to open a detailed bug report explaining exactly what the root cause is. If the books service were your own, you’d know exactly where to look in the code.

Service Profiles

To understand the root cause, we used live traffic. For some issues this is great, but what happens if the issue is intermittent and happens in the middle of the night? Service profiles provide Linkerd with some additional information about your services. These define the routes that you’re serving and, among other things, allow for the collection of metrics on a per-route basis. With Prometheus storing these metrics, you’ll be able to sleep soundly and look up intermittent issues in the morning.

One of the easiest ways to get service profiles set up is by using existing OpenAPI (Swagger) specs. This demo has published specs for each of its services. You can create a service profile for webapp by running:

  curl -sL https://run.linkerd.io/booksapp/webapp.swagger \
    | linkerd -n booksapp profile --open-api - webapp \
    | kubectl -n booksapp apply -f -

This command will do three things:

  • Fetch the swagger specification for webapp.
  • Take the spec and convert it into a service profile by using the profile command.
  • Apply this configuration to the cluster.

Alongside install and inject, profile is also a pure text operation. Check out the profile that is generated:
  apiVersion: linkerd.io/v1alpha2
  kind: ServiceProfile
  metadata:
    creationTimestamp: null
    name: webapp.booksapp.svc.cluster.local
    namespace: booksapp
  spec:
    routes:
    - condition:
        method: GET
        pathRegex: /
      name: GET /
    - condition:
        method: POST
        pathRegex: /authors
      name: POST /authors
    - condition:
        method: GET
        pathRegex: /authors/[^/]*
      name: GET /authors/{id}
    - condition:
        method: POST
        pathRegex: /authors/[^/]*/delete
      name: POST /authors/{id}/delete
    - condition:
        method: POST
        pathRegex: /authors/[^/]*/edit
      name: POST /authors/{id}/edit
    - condition:
        method: POST
        pathRegex: /books
      name: POST /books
    - condition:
        method: GET
        pathRegex: /books/[^/]*
      name: GET /books/{id}
    - condition:
        method: POST
        pathRegex: /books/[^/]*/delete
      name: POST /books/{id}/delete
    - condition:
        method: POST
        pathRegex: /books/[^/]*/edit
      name: POST /books/{id}/edit

The name refers to the FQDN of your Kubernetes service, webapp.booksapp.svc.cluster.local in this instance. Linkerd uses the Host header of requests to associate service profiles with requests. When the proxy sees a Host header of webapp.booksapp.svc.cluster.local, it will use that to look up the service profile’s configuration.
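The FQDN follows the standard Kubernetes service DNS form, service.namespace.svc.cluster.local. As a small illustrative sketch in Ruby (the demo app’s language; the helper name is hypothetical):

```ruby
# Hypothetical helper: build the FQDN that Linkerd matches against
# the Host header, following the Kubernetes service DNS form.
def service_fqdn(service, namespace)
  "#{service}.#{namespace}.svc.cluster.local"
end

service_fqdn("webapp", "booksapp")
# => "webapp.booksapp.svc.cluster.local"
```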

Routes are simple conditions that contain the method (GET, for example) and a regex to match the path. This allows you to group REST-style resources together instead of seeing a huge list. The names for routes can be whatever you’d like. For this demo, the method is appended to the route regex.
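To see how a condition’s pathRegex collapses many concrete paths into one named route, here is a small illustrative Ruby sketch (the route table and matching logic are a simplified model of what the proxy does, not Linkerd’s actual implementation):

```ruby
# Simplified model: group concrete request paths under named routes,
# the way a ServiceProfile's pathRegex conditions do. The regexes are
# anchored because the whole path must match.
routes = [
  { name: "GET /books/{id}",   regex: %r{\A/books/[^/]*\z} },
  { name: "GET /authors/{id}", regex: %r{\A/authors/[^/]*\z} },
]

def route_for(routes, path)
  hit = routes.find { |r| path.match?(r[:regex]) }
  hit ? hit[:name] : "[DEFAULT]"
end

route_for(routes, "/books/2878")       # => "GET /books/{id}"
route_for(routes, "/authors/7")        # => "GET /authors/{id}"
route_for(routes, "/books/2878/edit")  # => "[DEFAULT]" (needs its own route)
```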

To get profiles for authors and books, you can run:

  curl -sL https://run.linkerd.io/booksapp/authors.swagger \
    | linkerd -n booksapp profile --open-api - authors \
    | kubectl -n booksapp apply -f -

  curl -sL https://run.linkerd.io/booksapp/books.swagger \
    | linkerd -n booksapp profile --open-api - books \
    | kubectl -n booksapp apply -f -

Verifying that this all works is easy when you use linkerd tap. Each live request will show up with what :authority or Host header is being seen as well as the :path and rt_route being used. Run:

  linkerd -n booksapp tap deploy/webapp -o wide | grep req

This will watch all the live requests flowing through webapp and look something like:

  req id=0:1 proxy=in src=10.1.3.76:57152 dst=10.1.3.74:7000 tls=disabled :method=POST :authority=webapp.default:7000 :path=/books/2878/edit src_res=deploy/traffic src_ns=foobar dst_res=deploy/webapp dst_ns=default rt_route=POST /books/{id}/edit

As you can see:

  • :authority is the correct host
  • :path correctly matches
  • rt_route contains the name of the route

These metrics are part of the linkerd routes command instead of linkerd stat. To see the metrics that have accumulated so far, run:
  linkerd -n booksapp routes svc/webapp

This will output a table of all the routes observed and their golden metrics. The [DEFAULT] route is a catch-all for anything that does not match the service profile.

Profiles can be used to observe outgoing requests as well as incoming requests. To do that, run:

  linkerd -n booksapp routes deploy/webapp --to svc/books

This will show all requests and routes that originate in the webapp deployment and are destined for the books service. Similar to the tap and top views in the debugging section, the root cause of errors in this demo is immediately apparent:

  ROUTE                     SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
  DELETE /books/{id}.json   books     100.00%   0.5rps          18ms          29ms          30ms
  GET /books.json           books     100.00%   1.1rps           7ms          12ms          18ms
  GET /books/{id}.json      books     100.00%   2.5rps           6ms          10ms          10ms
  POST /books.json          books      52.24%   2.2rps          23ms          34ms          39ms
  PUT /books/{id}.json      books      41.98%   1.4rps          73ms          97ms          99ms
  [DEFAULT]                 books       0.00%   0.0rps           0ms           0ms           0ms

Retries

As it can take a while to update code and roll out a new version, let’s tell Linkerd that it can retry requests to the failing endpoint. This will increase request latencies, as requests will be retried multiple times, but it will not require rolling out a new version.

In this application, the success rate of requests from the books deployment to the authors service is poor. To see these metrics, run:

  linkerd -n booksapp routes deploy/books --to svc/authors

The output should look like:

  ROUTE                       SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
  DELETE /authors/{id}.json   authors     0.00%   0.0rps           0ms           0ms           0ms
  GET /authors.json           authors     0.00%   0.0rps           0ms           0ms           0ms
  GET /authors/{id}.json      authors     0.00%   0.0rps           0ms           0ms           0ms
  HEAD /authors/{id}.json     authors    50.85%   3.9rps           5ms          10ms          17ms
  POST /authors.json          authors     0.00%   0.0rps           0ms           0ms           0ms
  [DEFAULT]                   authors     0.00%   0.0rps           0ms           0ms           0ms

One thing that’s clear is that all requests from books to authors are to the HEAD /authors/{id}.json route, and those requests are failing about 50% of the time.

To correct this, let’s edit the authors service profile and make those requests retryable by running:

  kubectl -n booksapp edit sp/authors.booksapp.svc.cluster.local

You’ll want to add isRetryable to a specific route. It should look like:

  spec:
    routes:
    - condition:
        method: HEAD
        pathRegex: /authors/[^/]*\.json
      name: HEAD /authors/{id}.json
      isRetryable: true ### ADD THIS LINE ###

After editing the service profile, Linkerd will begin to retry requests to this route automatically. We see a nearly immediate improvement in success rate by running:

  linkerd -n booksapp routes deploy/books --to svc/authors -o wide

This should look like:

  ROUTE                       SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
  DELETE /authors/{id}.json   authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms
  GET /authors.json           authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms
  GET /authors/{id}.json      authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms
  HEAD /authors/{id}.json     authors             100.00%          2.8rps           58.45%       4.7rps           7ms          25ms          37ms
  POST /authors.json          authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms
  [DEFAULT]                   authors               0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms

You’ll notice that the -o wide flag has added some columns to the routes view. These show the difference between EFFECTIVE_SUCCESS and ACTUAL_SUCCESS; the gap between the two shows how well retries are working. EFFECTIVE_RPS and ACTUAL_RPS show how many requests are being sent to the destination service and how many are being received by the client’s Linkerd proxy.

With retries automatically happening now, the success rate looks great, but the p95 and p99 latencies have increased. This is to be expected, because retries take time.
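If retry amplification becomes a concern, Linkerd also lets you bound how much extra load retries may add via a retryBudget on the service profile spec. A sketch, using values that mirror Linkerd’s documented defaults:

```yaml
# Illustrative retry budget on a service profile: allow retries to
# add at most 20% extra load, with a floor of 10 retries per second,
# counted over a 10-second window.
spec:
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```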

Timeouts

Linkerd can limit how long to wait before failing outgoing requests to another service. These timeouts work by adding another key to a service profile’s routes configuration.

To get started, let’s take a look at the current latency for requests from webapp to the books service:

  linkerd -n booksapp routes deploy/webapp --to svc/books

This should look something like:

  ROUTE                     SERVICE   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
  DELETE /books/{id}.json   books     100.00%   0.7rps          10ms          27ms          29ms
  GET /books.json           books     100.00%   1.3rps           9ms          34ms          39ms
  GET /books/{id}.json      books     100.00%   2.0rps           9ms          52ms          91ms
  POST /books.json          books     100.00%   1.3rps          45ms         140ms         188ms
  PUT /books/{id}.json      books     100.00%   0.7rps          80ms         170ms         194ms
  [DEFAULT]                 books       0.00%   0.0rps           0ms           0ms           0ms

Requests to the books service’s PUT /books/{id}.json route include retries for when that service calls the authors service as part of serving those requests, as described in the previous section. This improves the success rate, at the cost of additional latency. For the purposes of this demo, let’s set a 25ms timeout for calls to that route. Your latency numbers will vary depending on the characteristics of your cluster. To edit the books service profile, run:

  kubectl -n booksapp edit sp/books.booksapp.svc.cluster.local

Update the PUT /books/{id}.json route to have a timeout:

  spec:
    routes:
    - condition:
        method: PUT
        pathRegex: /books/[^/]*\.json
      name: PUT /books/{id}.json
      timeout: 25ms ### ADD THIS LINE ###

Linkerd will now return errors to the webapp REST client when the timeout is reached. This timeout includes retried requests and is the maximum amount of time a REST client would wait for a response.

Run routes to see what has changed:

  linkerd -n booksapp routes deploy/webapp --to svc/books -o wide

With timeouts happening now, the metrics will change:

  ROUTE                     SERVICE   EFFECTIVE_SUCCESS   EFFECTIVE_RPS   ACTUAL_SUCCESS   ACTUAL_RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99
  DELETE /books/{id}.json   books               100.00%          0.7rps          100.00%       0.7rps           8ms          46ms          49ms
  GET /books.json           books               100.00%          1.3rps          100.00%       1.3rps           9ms          33ms          39ms
  GET /books/{id}.json      books               100.00%          2.2rps          100.00%       2.2rps           8ms          19ms          28ms
  POST /books.json          books               100.00%          1.3rps          100.00%       1.3rps          27ms          81ms          96ms
  PUT /books/{id}.json      books                86.96%          0.8rps          100.00%       0.7rps          75ms          98ms         100ms
  [DEFAULT]                 books                 0.00%          0.0rps            0.00%       0.0rps           0ms           0ms           0ms

The latency numbers include time spent in the webapp application itself, so it’s expected that they exceed the 25ms timeout that we set for requests from webapp to books. We can see that the timeouts are working by observing that the effective success rate for our route has dropped below 100%.

Clean Up

To remove the books app and the booksapp namespace from your cluster, run:

  curl -sL https://run.linkerd.io/booksapp.yml \
    | kubectl -n booksapp delete -f - \
    && kubectl delete ns booksapp