Suppose you have a distributed monitoring system with alerting and mail servers. A common problem is monitoring the monitoring system itself. In case you are not familiar with the problem, imagine your mail server is down, and you rely on email to receive alerts about services that are unexpectedly offline. You might want to set up some monitoring through an external provider and use a different communication channel such as SMS.

Alternatively, I thought of programming an API call to my alerting service and executing it every day at exactly noon. This would be a pager-level alert which does not require acknowledgement and self-resolves quickly. The procedure for the human receiving the alert is simply to habituate and come to expect the daily chime at noon.

This idea builds on two existing concepts: the canary in the coal mine and symptom-based monitoring.

The advantages are fewer components to manage and a low cost in time and money to implement. It also provides positive confirmation, instead of only adding another “silence means good” layer, and it has the side benefit of indirectly monitoring crond on my tools server.

The disadvantage is latency: if the alerting service fails at a random time, you only notice when the next noon chime fails to arrive, so awareness takes 12 hours on average and up to 24 hours in the worst case. And of course there is the risk of a critical outage coinciding with an alerting system outage.

Example with Prometheus and Alertmanager

This script will create an alert in Alertmanager with the label severity: test.

#!/bin/bash
# Post a self-resolving canary alert to Alertmanager.
# Two identical alerts with different "instance" labels are sent so that
# grouping behaves the same way as it does for real alerts.
template='[
    {
        "labels": {
            "alertname": "Canary alarm — Test of the alert system",
            "instance": "1",
            "severity": "test",
            "job": "alertmanager"
        },
        "annotations": {
            "info": "This is a test",
            "summary": "This should chime daily at noon."
        },
        "endsAt": "ENDSAT",
        "generatorURL": "https://prometheus.sc.iweb.com/graph?g0.range_input=1h&g0.expr=alertmanager_alerts&g0.tab=0"
    },
    {
        "labels": {
            "alertname": "Canary alarm — Test of the alert system",
            "instance": "2",
            "severity": "test",
            "job": "alertmanager"
        },
        "annotations": {
            "info": "This is a test",
            "summary": "This should chime daily at noon."
        },
        "endsAt": "ENDSAT",
        "generatorURL": "https://prometheus.sc.iweb.com/graph?g0.range_input=1h&g0.expr=alertmanager_alerts&g0.tab=0"
    }
]'
# The alert resolves itself 45 seconds from now (GNU date syntax).
endsAt="$(date --date="45 seconds" +%Y-%m-%dT%H:%M:%S.%N%:z)"
alerts="$(sed "s/ENDSAT/${endsAt}/" <<<"${template}")"
# Post the payload twice, two seconds apart; the grep -v hides the
# response unless the API returned something other than a success.
curl --data "${alerts}" http://prometheus02.private.deverteuil.net:9093/api/v1/alerts -s | grep -v success
sleep 2
curl --data "${alerts}" http://prometheus02.private.deverteuil.net:9093/api/v1/alerts -s | grep -v success
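To confirm by hand that the alert made it in, you can query the same v1 API and pick out the canary. This is a quick sketch; it assumes jq is installed:

curl -s http://prometheus02.private.deverteuil.net:9093/api/v1/alerts \
    | jq '.data[] | select(.labels.alertname | test("Canary")) | .labels'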

This is the cron job:

0 12 * * * ~/src/dotfiles/scripts/send-test-alert.sh
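Cron emails any output, and the grep -v success filter means the script only prints when the API returns something other than a success response, so a working local MTA already gets you failure notifications. If the machine has no MTA, redirecting to a log file is a simple alternative (the log path here is just illustrative):

0 12 * * * ~/src/dotfiles/scripts/send-test-alert.sh >> ~/send-test-alert.log 2>&1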

This trick also requires its own route in alertmanager.yml:

route:
  receiver: notify_email
  group_by: [job]
  routes:
    - match:
        severity: alert
      receiver: alert_email
      group_by: [job]
    - match:
        severity: test
      group_wait: 5s
      group_interval: 1m
      repeat_interval: 1m
      receiver: alert_email

The severity: test match sends alerts to the same receiver as severity: alert, but with a much shorter group_wait, so duplicates are held only briefly before the notification goes out. The repeat_interval is also shorter, should you need to test repeatedly.
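After editing alertmanager.yml, you can validate the file and trigger a live reload. This is a sketch assuming amtool is available and Alertmanager listens on its default port:

amtool check-config alertmanager.yml
curl -X POST http://prometheus02.private.deverteuil.net:9093/-/reload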
