Telemetry and Monitoring

One of Linkerd’s most powerful features is its extensive set of tooling around observability, the measuring and reporting of observed behavior in meshed applications. While Linkerd doesn’t have insight directly into the internals of service code, it has tremendous insight into the external behavior of service code.

Linkerd’s telemetry and monitoring features function automatically, without requiring any work on the part of the developer. These features include:

  • Recording of top-line (“golden”) metrics (request volume, success rate, and latency distributions) for HTTP, HTTP/2, and gRPC traffic.
  • Recording of TCP-level metrics (bytes in/out, etc.) for other TCP traffic.
  • Reporting metrics per service, per caller/callee pair, or per route/path (with Service Profiles).
  • Generating topology graphs that display the runtime relationship between services.
  • Live, on-demand request sampling.

This data can be consumed in several ways:

  • Through the Linkerd CLI, e.g. with linkerd stat and linkerd routes.
  • Through the Linkerd dashboard, and pre-built Grafana dashboards.
  • Directly from Linkerd’s built-in Prometheus instance.
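
As an illustration of the third option, the sketch below builds a PromQL expression for a success rate and the URL for Prometheus's instant-query endpoint (/api/v1/query). The address, the namespace and deployment names, and the metric name response_total with its labels are assumptions made for this example; check them against what your installation actually exports.

```python
from urllib.parse import urlencode

# Hypothetical address: assumes the Prometheus instance has been made
# reachable locally, for example through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def success_rate_query(namespace: str, deployment: str, window: str = "1m") -> str:
    """Build a PromQL expression for the inbound success rate of one deployment.

    The metric name (response_total) and its labels are assumptions about
    what the proxies export, not something stated in this document.
    """
    selector = f'namespace="{namespace}", deployment="{deployment}", direction="inbound"'
    successes = f'sum(rate(response_total{{classification="success", {selector}}}[{window}]))'
    total = f"sum(rate(response_total{{{selector}}}[{window}]))"
    return f"{successes} / {total}"


def instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL of Prometheus's instant-query endpoint for an expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # "my-ns" and "web" are placeholder names.
    print(instant_query_url(PROMETHEUS_URL, success_rate_query("my-ns", "web")))
```

The printed URL can be fetched with any HTTP client; the response is JSON whose layout is described in the Prometheus HTTP API documentation.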

Golden metrics

Success Rate

This is the percentage of successful requests during a time window (1 minute by default).

In the output of the command linkerd routes -o wide, this metric is split into EFFECTIVE_SUCCESS and ACTUAL_SUCCESS. For routes configured with retries, the former calculates the percentage of success after retries (as perceived by the client-side), and the latter before retries (which can expose potential problems with the service).
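
The difference between the two figures can be seen with a small worked example. This is a simplified model, not Linkerd's implementation: it assumes the effective rate is computed over the requests the client issued, and the actual rate over every attempt sent to the service, retries included. The function and parameter names are hypothetical.

```python
def success_rates(
    client_requests: int,
    client_successes: int,
    attempts: int,
    attempt_successes: int,
) -> tuple[float, float]:
    """Return (effective, actual) success rates as fractions between 0 and 1.

    effective: share of client requests that eventually succeeded.
    actual:    share of individual attempts (including retries) that succeeded.
    """
    effective = client_successes / client_requests
    actual = attempt_successes / attempts
    return effective, actual


# 100 client requests; 20 fail on the first attempt and succeed when retried.
# The service therefore sees 120 attempts, of which 100 succeed.
effective, actual = success_rates(
    client_requests=100,
    client_successes=100,
    attempts=120,
    attempt_successes=100,
)
print(f"EFFECTIVE_SUCCESS: {effective:.2%}")  # 100.00%
print(f"ACTUAL_SUCCESS:    {actual:.2%}")     # 83.33%
```

In this scenario the client never notices a failure, yet one attempt in six fails at the service, which is the kind of problem the actual figure is meant to surface.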

Traffic (Requests Per Second)

This gives an overview of how much demand is placed on the service/route. As with success rates, linkerd routes -o wide splits this metric into EFFECTIVE_RPS and ACTUAL_RPS, corresponding to rates after and before retries respectively.
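
A minimal sketch of how the two rates diverge, assuming the effective rate counts only the requests issued by the client while the actual rate also counts the retry attempts made on its behalf. The function name and the numbers are illustrative only.

```python
def request_rates(
    client_requests: int,
    retry_attempts: int,
    window_seconds: float = 60.0,
) -> tuple[float, float]:
    """Return (effective_rps, actual_rps) over a time window."""
    effective_rps = client_requests / window_seconds
    actual_rps = (client_requests + retry_attempts) / window_seconds
    return effective_rps, actual_rps


# 600 client requests and 120 retries within a one-minute window.
effective_rps, actual_rps = request_rates(client_requests=600, retry_attempts=120)
print(effective_rps, actual_rps)  # 10.0 12.0
```

A widening gap between the two values indicates that retries are adding load to the service beyond what clients are asking for.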

Latencies

Times taken to service requests per service/route are split into 50th, 95th and 99th percentiles. Lower percentiles give you an overview of the average performance of the system, while tail percentiles help catch outlier behavior.
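
To see why tail percentiles catch what the median hides, the sketch below computes nearest-rank percentiles over a list of raw latency samples. It is a generic illustration with made-up data, not a description of how Linkerd derives its percentiles.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [10.0] * 98 + [500.0] * 2

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# p50: 10.0 ms
# p95: 10.0 ms
# p99: 500.0 ms
```

Here the 50th and 95th percentiles look healthy, and only the 99th reveals that a small share of requests takes fifty times longer.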

Lifespan of Linkerd metrics

Linkerd is not designed as a long-term historical metrics store. While Linkerd’s control plane does include a Prometheus instance, this instance expires metrics at a short, fixed interval (currently 6 hours).

Rather, Linkerd is designed to supplement your existing metrics store. If Linkerd’s metrics are valuable, you should export them into your existing historical metrics store.

See Exporting Metrics for more.
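
One common way to pull metrics out of a Prometheus instance is its /federate endpoint, which returns the current value of every series matching one or more match[] selectors. The sketch below only builds such a URL. The host name and the job label value are assumptions for illustration; the selectors that apply to your installation may differ.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a Prometheus /federate URL with one match[] parameter per selector."""
    params = [("match[]", selector) for selector in selectors]
    return f"{base_url}/federate?{urlencode(params)}"


# Hypothetical address and job name.
url = federate_url(
    "http://prometheus.example:9090",
    ['{job="linkerd-proxy"}'],
)
print(url)
# http://prometheus.example:9090/federate?match%5B%5D=%7Bjob%3D%22linkerd-proxy%22%7D
```

A long-term metrics store would typically scrape this URL on a schedule rather than fetch it by hand.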