Prometheus Alerting Rules

Example alert rule:

groups:
- name: demo-service-alerts # Name of the group of rules.
  rules:                    # A list of alerting rules in this group.
  - alert: HighErrorRate    # The name of the alert.
    expr: |                 # A PromQL expression whose output series become alerts.
      (
        sum by(path, instance, job) (
          rate(demo_api_request_duration_seconds_count{status=~"5..",job="demo"}[1m])
        )
      /
        sum by(path, instance, job) (
          rate(demo_api_request_duration_seconds_count{job="demo"}[1m])
        ) * 100 > 0.5
      )
    for: 5m                 # How long each result time series needs to be present to become a firing alert.
    labels:                 # Extra labels to attach for routing.
      severity: critical
    annotations:            # Non-identifying annotations that can be used in Alertmanager notifications.
      title: "{{$labels.instance}} high 5xx rate on {{$labels.path}}"
      description: "The 5xx error rate for path {{$labels.path}} on {{$labels.instance}} is {{$value}}%."
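
The severity label above is attached for routing, so a matching Alertmanager route might look like the following. This is only a minimal sketch: the receiver names (default-receiver, oncall-pager) and the group_by choice are assumptions for illustration, not part of the rule above, and real receivers would need notification integrations configured.

```yaml
route:
  receiver: default-receiver      # hypothetical fallback receiver
  group_by: [alertname, job]      # assumed grouping, adjust to taste
  routes:
  - matchers:
    - severity="critical"         # matches the label set by HighErrorRate
    receiver: oncall-pager        # hypothetical receiver name
receivers:
- name: default-receiver          # integrations omitted in this sketch
- name: oncall-pager
```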

Don’t lose labels:

decent:
rate(errors_total{job="my-job"}[5m]) > 10
bad:
sum by(job) (rate(errors_total{job="my-job"}[5m])) > 10
best:
sum without(instance, type) (rate(errors_total{job="my-job"}[5m])) > 10
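
As a rough sketch, the "best" expression could be placed in a rule like the one below. The group name, alert name, severity, and for duration are hypothetical. Because without() keeps every label except instance and type, the job label is still available to the annotation template.

```yaml
groups:
- name: label-preserving-alerts   # hypothetical group name
  rules:
  - alert: ManyErrors             # hypothetical alert name
    expr: sum without(instance, type) (rate(errors_total{job="my-job"}[5m])) > 10
    for: 5m                       # assumed duration
    labels:
      severity: warning           # assumed severity
    annotations:
      # job survives the aggregation, so it can be used here.
      description: "Job {{ $labels.job }} is seeing {{ $value }} errors per second."
```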