What is Prometheus?

Prometheus is an open-source monitoring system with a powerful query language and service discovery integrations. This has made it one of the monitoring systems best suited to this day and age of microservices, containers and cloud computing.

Alerts in Prometheus are configured and generated on the Prometheus server instance, and then sent to Alertmanager for routing to appropriate targets.

Our initial experience

When we first started using Prometheus, we did not have much experience with it, so, among other things, we ended up with a rather rigid way of defining alerts.

In general, to send an alert in Prometheus, at least 2 things are needed:

  1. Alert rule on the Prometheus side to generate an alert
  2. Alertmanager configuration to receive and process the generated alert

When a new alert was to be added, we had to define an alert rule:

- alert: SomeWorkflowProcessFailed
  expr: changes(some_wf_process_total{result="failed"}[1m]) > 0
  labels:
    severity: critical
    app: some-wf
  annotations:
    description: Some Workflow {{ $labels.action }} ran out of retries
    summary: Some Workflow process issue

The Alertmanager configuration also had to be changed, so that it would receive and route the alert based on the app label from the alert rule above:

routes:
- match_re:
    app: ^(something|some-wf)$
  receiver: team-something-slack

Lastly, an appropriate receiver had to be defined to handle this alert, so that it would be sent to a specific Slack channel monitored by the team responsible for that particular service.
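As a sketch, such a receiver could look like the following; the webhook URL and channel name here are placeholders, not our actual values:

```yaml
receivers:
- name: team-something-slack
  slack_configs:
  # Placeholder incoming-webhook URL for the team's Slack workspace
  - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
    channel: '#team-something-alerts'
    send_resolved: true
```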

All these steps had to be repeated whenever someone wanted to add a new alert for another service maintained by a different team, which meant a different target for the alert notification.

This could (and did) work while there were only a few alerts and not that many teams using them, but as Prometheus was slowly adopted by more and more teams and services, it became apparent that we needed an easier way to add new alerts.

How we improved

The alert rules are rather simple, at least in how and where they are defined. The Alertmanager configuration, on the other hand, is trickier, especially the matching and routing part in our initial approach. One mistake could break alerts for someone, or even for everyone else.

The approach we devised takes advantage of the fact that variables can be used in various parts of the Alertmanager configuration.

An example: suppose someone needs to add a new alert for some metric. The first step, as before, is to add the alert rule file, which would contain the actual alert definition:

- alert: SomeWorkflowProcessFailed
  expr: changes(some_wf_process_total{result="failed"}[1m]) > 0
  labels:
    severity: critical
    alert_route_to: slack
    alert_slack_channel: team-something-alerts
  annotations:
    description: Some Workflow {{ $labels.action }} ran out of retries
    summary: Some Workflow process issue

Everything required to route the alert is defined within the alert rule itself: it specifies where and how the notification should be sent. The Alertmanager configuration fills in the required details from the information sent with the alert, so we only need to change the Alertmanager configuration once, not with every new alert.

To enable that, we define the possible alert routes first in the Alertmanager configuration:

route:
...
  receiver: slack

  routes:
  - match:
      alert_route_to: email
    group_by:
    - alert_email_address
    receiver: email
  - match:
      alert_route_to: slack
    receiver: slack

We match the alert_route_to label from the alert against email or slack (there could be other routes as well) and point it to the respective receiver below. If none of the routes match, the alert falls through to the default receiver, slack.

Finally, the Alertmanager receiver configuration part:

receivers:
- name: email
  email_configs:
  - to: '{{ if .CommonLabels.alert_email_address }}{{ .CommonLabels.alert_email_address }}{{ else }}default@somecorp.lv{{ end }}'
    send_resolved: true

- name: slack
  slack_configs:
  - send_resolved: true
    username: Prometheus
    channel: '{{ if .CommonLabels.alert_slack_channel }}#{{ .CommonLabels.alert_slack_channel }}{{ else }}#alerts-default{{ end }}'
    pretext: '{{ .CommonAnnotations.summary }}'
    text: '{{ template "slack.custom.text" . }}'

We have defined two routes, for email and slack. The recipients are filled in from the labels sent with the alert: the to field for email, or channel for Slack. A receiver can have multiple *_configs blocks, in case a particular alert needs to be sent to more than one destination. More on that in a blog post from Robust Perception.

The notification content itself (text, links, etc.) is filled in by templates, which are pretty much the same as the examples provided with Prometheus.
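For reference, the slack.custom.text template referenced in the receiver above could look roughly like this. This is only a sketch of the general shape of such a template; the exact fields and layout in our version differ:

```
{{ define "slack.custom.text" }}
{{ range .Alerts }}
*Description:* {{ .Annotations.description }}
*Labels:*
{{ range .Labels.SortedPairs }}  - {{ .Name }}: {{ .Value }}
{{ end }}{{ end }}{{ end }}
```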

Using this approach, we only need to set up the alert rule on the Prometheus side, not forgetting to set the appropriate labels, of course. Provided that is done, no Alertmanager configuration changes are necessary after the initial setup outlined above. In practice, every alert could be sent to a different destination, even if the alerts came from the same service or source.
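To illustrate, an email-routed alert under this scheme differs only in its labels; the alert name, metric and address below are hypothetical:

```yaml
- alert: SomeBatchJobNotRunning
  expr: time() - some_batch_last_success_timestamp_seconds > 3600
  labels:
    severity: warning
    # These two labels alone steer routing: no Alertmanager change needed
    alert_route_to: email
    alert_email_address: team-batch@somecorp.lv
  annotations:
    description: Some batch job has not succeeded for over an hour
    summary: Some batch job issue
```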

Summary

Using custom labels defined within the alert rule and sent with the alert, we simplified the configuration of Prometheus alerts and made it a bit more flexible and easier to set up for anyone using Prometheus within the organization.