Alerting

We recommend that you read My Philosophy on Alerting based on Rob Ewaschuk's observations at Google.

To summarize: keep alerting simple, alert on symptoms, have good consoles to allow pinpointing causes, and avoid having pages where there is nothing to do.

What to alert on

Aim to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused. Alerts should link to relevant consoles and make it easy to figure out which component is at fault.

Allow for slack in alerting to accommodate small blips.

Online serving systems

Typically alert on high latency and error rates as high up in the stack as possible.

Only page on latency at one point in a stack. If a lower-level component is slower than it should be, but the overall user latency is fine, then there is no need to page.
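As a rough sketch, a latency page at the user-facing edge could be expressed as a rule like the one below. The metric name frontend_request_duration_seconds_bucket, the 500ms threshold, and the severity label are illustrative assumptions to be adapted to your service, not part of this guide.

```yaml
groups:
  - name: online-serving
    rules:
      - alert: HighUserFacingLatency
        # 99th percentile latency at the edge of the stack over the last 5 minutes.
        # Paging only here avoids duplicate pages from slower internal components.
        expr: histogram_quantile(0.99, sum by (le) (rate(frontend_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "99th percentile user-facing latency above 500ms"
```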

For error rates, page on user-visible errors. If there are errors further down the stack that will cause such a failure, there is no need to page on them separately. However, if some failures are not user-visible, but are otherwise severe enough to require human involvement (for example, you are losing a lot of money), add pages to be sent on those.
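A user-visible error-rate rule in the same group might look like the following; again the metric name and the 1% threshold are placeholders.

```yaml
      - alert: HighUserVisibleErrorRate
        # Ratio of failed requests to all requests, measured where users see it.
        expr: |
          sum(rate(frontend_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(frontend_http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of user requests are failing"
```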

You may need alerts for different types of request if they have different characteristics, or problems in a low-traffic type of request would be drowned out by high-traffic requests.
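One way to do this is to keep a request-type label in the aggregation, so that each type is compared against its own traffic. The label name type is an assumption about how your metrics are labelled.

```yaml
      - alert: HighErrorRatePerRequestType
        # Keeping the "type" label means failures in a rare request type
        # are not drowned out by high-traffic types.
        expr: |
          sum by (type) (rate(frontend_http_requests_total{status=~"5.."}[5m]))
            /
          sum by (type) (rate(frontend_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
```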

Offline processing

For offline processing systems, the key metric is how long data takes to get through the system, so page if that gets high enough to cause user impact.
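A common way to express this is to have the pipeline export the timestamp of the most recently processed record and page when it becomes too old. The metric name and the one-hour threshold below are assumptions; the rule goes in a rule group like the earlier examples.

```yaml
      - alert: ProcessingLagTooHigh
        # Age of the newest processed record; pages once data is user-visibly stale.
        expr: time() - my_pipeline_last_processed_timestamp_seconds > 3600
        for: 5m
        labels:
          severity: page
```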

Batch jobs

For batch jobs it makes sense to page if the batch job has not succeeded recently enough, and this will cause user-visible problems.

This should generally be at least enough time for 2 full runs of the batch job. For a job that runs every 4 hours and takes an hour, 10 hours would be a reasonable threshold. If you cannot withstand a single run failing, run the job more frequently, as a single failure should not require human intervention.
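For the 4-hourly job described above, this could look like the sketch below, assuming the job records a my_batch_job_last_success_timestamp_seconds metric (for example via the Pushgateway) on each successful run.

```yaml
      - alert: BatchJobNotSucceeding
        # 10 hours covers two full runs of a job that runs every 4 hours and
        # takes about an hour, so a single failed run does not page anyone.
        expr: time() - my_batch_job_last_success_timestamp_seconds > 10 * 3600
        labels:
          severity: page
```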

Capacity

While not a problem causing immediate user impact, being close to capacity often requires human intervention to avoid an outage in the near future.
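A typical capacity rule predicts when a resource will run out rather than waiting for it to happen. The sketch below assumes the Node Exporter's node_filesystem_avail_bytes metric and pages if a filesystem is predicted to fill within four hours.

```yaml
      - alert: FilesystemWillFillSoon
        # Linear extrapolation of the last 6 hours of free space, 4 hours ahead.
        expr: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: page
```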

Metamonitoring

It is important to have confidence that monitoring is working. Accordingly, have alerts to ensure that Prometheus servers, Alertmanagers, PushGateways, and other monitoring infrastructure are available and running correctly.
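A minimal metamonitoring rule checks the up metric that Prometheus records for every scrape target; the job names in the selector are assumptions about how the monitoring components are labelled in your setup.

```yaml
      - alert: MonitoringTargetDown
        # `up` is 0 whenever Prometheus fails to scrape a target.
        expr: up{job=~"prometheus|alertmanager|pushgateway"} == 0
        for: 5m
        labels:
          severity: page
```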

As always, if it is possible to alert on symptoms rather than causes, this helps to reduce noise. For example, a blackbox test that alerts are getting from PushGateway to Prometheus to Alertmanager to email is better than individual alerts on each.
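One common pattern for such an end-to-end check is an always-firing alert whose absence is detected by an external system: if the notification stops arriving, something in the path from Prometheus through Alertmanager to the receiver is broken. This is a sketch of that pattern, not a feature of Prometheus itself.

```yaml
      - alert: AlertingPipelineWatchdog
        # Fires constantly; an external system pages if the notification
        # ever stops arriving, exercising the whole alert delivery path.
        expr: vector(1)
        labels:
          severity: none
```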

Supplementing the whitebox monitoring of Prometheus with external blackbox monitoring can catch problems that are otherwise invisible, and also serves as a fallback in case internal systems completely fail.