- Prometheus Alertmanager docs
- GitHub: Alertmanager
- Alertmanager docs
- Routing tree editor
- My Philosophy on Alerting (Google SRE)
- Prometheus best practice alerting guide
- Alerting on SLOs like Pros
Example alert rule:
groups:
- name: demo-service-alerts # Name of the group of rules.
  rules: # A list of alerting rules in this group.
  - alert: HighErrorRate # The name of the alert.
    expr: | # A PromQL expression whose output series become alerts.
      (
        sum by(path, instance, job) (
          rate(demo_api_request_duration_seconds_count{status=~"5..",job="demo"}[1m])
        )
      /
        sum by(path, instance, job) (
          rate(demo_api_request_duration_seconds_count{job="demo"}[1m])
        ) * 100 > 0.5
      )
    for: 5m # How long each result time series needs to be present to become a firing alert.
    labels: # Extra labels to attach for routing.
      severity: critical
    annotations: # Non-identifying annotations that can be used in Alertmanager notifications.
      title: "{{$labels.instance}} high 5xx rate on {{$labels.path}}"
      description: "The 5xx error rate for path {{$labels.path}} on {{$labels.instance}} is {{$value}}%."
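The severity label in the rule above is there for routing. A minimal sketch of an Alertmanager routing tree that acts on it could look like the following. The receiver names (default, oncall-pager), the group_by choice, and the webhook URL are assumptions for illustration, not part of these notes.

```yaml
# alertmanager.yml (sketch; receiver names and endpoint are hypothetical)
route:
  receiver: default            # Fallback receiver for alerts no child route matches.
  group_by: [alertname, job]   # Batch alerts sharing these labels into one notification.
  routes:
  - matchers:
    - severity="critical"      # Matches the label attached by the HighErrorRate rule.
    receiver: oncall-pager
receivers:
- name: default                # No integrations configured: alerts routed here notify nobody.
- name: oncall-pager
  webhook_configs:
  - url: http://example.invalid/hook   # Placeholder endpoint, replace with a real one.
```

Alerts without severity="critical" fall through to the default receiver, so adding further severities later only requires another entry under routes.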
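Don’t lose labels (labels dropped by aggregation are not available for routing or in notification templates):
decent (no aggregation, so every label is kept and each series alerts separately):
rate(errors_total{job="my-job"}[5m]) > 10
bad (by(job) drops every label except job):
sum by(job) (rate(errors_total{job="my-job"}[5m])) > 10
best (without drops only the labels you name and keeps the rest):
sum without(instance, type) (rate(errors_total{job="my-job"}[5m])) > 10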
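To show where the preserved labels pay off, here is a sketch of the last expression wrapped in a full alerting rule. The group name, alert name, for duration, severity, and annotation text are all assumptions. The only part taken from the notes is the expression itself.

```yaml
groups:
- name: label-preserving-alerts   # Hypothetical group name.
  rules:
  - alert: HighErrorCount         # Hypothetical alert name.
    # Only instance and type are aggregated away; job and any other label
    # carried by errors_total stay on the alert for routing and templating.
    expr: sum without(instance, type) (rate(errors_total{job="my-job"}[5m])) > 10
    for: 5m                       # Assumed duration, not taken from the notes.
    labels:
      severity: warning           # Assumed severity.
    annotations:
      summary: "Error rate for job {{ $labels.job }} is {{ $value }} per second."
```

With the by(job) variant the same rule would still fire, but any label other than job would be missing from the alert, and a template referencing it would render an empty string.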