Best practices for avoiding race conditions in inhibition rules
24th June, 2023
On the surface of it, inhibition rules in Prometheus seem incredibly simple. You have a rule that, when it fires, inhibits the alerts of one or more other rules. What more could there be to it?
Well, it may surprise you to hear that there are a number of subtle cases where your inhibition rules might not work as you would expect, often due to a race condition between the inhibiting rules and the rules they inhibit. Today we will look at some best practices for avoiding race conditions in inhibition rules that, when followed, will help your inhibition rules work reliably.
1. Inhibition rules should go at the start of a rule group
When Prometheus evaluates the rules in a rule group it does so in the same order as the rules appear in your configuration file. This means you should put your inhibition rules at the start of the group, so that they are not only evaluated first but their alerts are also sent to Alertmanager before those of all other rules in the group. This matters because alerts from an inhibition rule must reach Alertmanager before alerts from the rules it inhibits. Depending on your chosen Group wait and Group interval, and on whether Alertmanager has crashed or been restarted, this ordering can be the difference between your inhibition rules working or not.
For example, suppose you have the following rules in your Prometheus configuration:
groups:
  - name: Example group 1
    rules:
      - alert: Inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"
      - alert: Inhibiting rule
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
and the following route and inhibit_rules in your Alertmanager configuration:
route:
  receiver: test
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 5m
inhibit_rules:
  - target_matchers:
      - inhibited="true"
    source_matchers:
      - inhibit="true"
Suppose Prometheus sends "Inhibited rule" to Alertmanager at time t=0, but does not send "Inhibiting rule" to Alertmanager until time t=15. This can happen, for example, because the inhibition rule queries a lot of data and is slow, or because the notifier is still busy sending an earlier batch of alerts to Alertmanager such that alerts from "Inhibiting rule" are queued. In that case Alertmanager will send a notification for "Inhibited rule" after 15 seconds (Group wait), as "Inhibiting rule" did not arrive in time.
You might think that all you need to do is increase Group wait from 15s to a larger interval such as 1m (the default evaluation interval in Prometheus). However, as you will see in a second, this does not work if Alertmanager crashes or is restarted.
Following an Alertmanager restart, a notification was sent for "Inhibited rule", despite the inhibit_rules in the Alertmanager configuration. The issue is that Alertmanager keeps alerts in memory only, so after a restart it does not remember which alerts it has seen before. Instead Prometheus must resend all firing alerts to it at the next evaluation interval (default 1m). Alertmanager then sees an incoming alert that has been firing for longer than Group wait (15s) and flushes it immediately, instead of waiting Group wait for any other alerts to arrive (such as the inhibiting rule):
ts=2023-06-25T14:44:30.539Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.001976208s
ts=2023-06-25T14:45:29.598Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
ts=2023-06-25T14:45:29.598Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Inhibited rule[f33ff29][active]]"
ts=2023-06-25T14:45:29.598Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T14:45:29.759Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
ts=2023-06-25T14:46:29.597Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Inhibited rule[f33ff29][active] Inhibiting rule[93cf445][active]]"
ts=2023-06-25T14:46:29.720Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
If the inhibiting rule had been put at the start of the rule group it would have been evaluated and sent to Alertmanager before the other rules in the group and the inhibition would have worked:
ts=2023-06-25T14:57:20.075Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002022458s
ts=2023-06-25T14:57:44.598Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T14:57:44.598Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Inhibiting rule[93cf445][active]]"
ts=2023-06-25T14:57:44.598Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
ts=2023-06-25T14:58:44.597Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Inhibited rule[f33ff29][active] Inhibiting rule[93cf445][active]]"
ts=2023-06-25T14:58:44.760Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
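For reference, a reordered version of the rule group from this example might look like the following sketch. It contains the same two rules as before with only their order swapped, so that "Inhibiting rule" is evaluated and sent first:

```yaml
groups:
  - name: Example group 1
    rules:
      # The inhibiting rule comes first, so it is evaluated and its
      # alerts are sent to Alertmanager before the rule it inhibits.
      - alert: Inhibiting rule
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      # The inhibited rule follows it in the same group.
      - alert: Inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"
```

The Alertmanager route and inhibit_rules stay exactly as they were.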
2. Inhibition rules should not inhibit across rule groups
While Prometheus evaluates rules in a rule group in the same order as they appear in your configuration file, groups themselves are not only evaluated concurrently but also offset from one another in order to smooth out the load on Prometheus. This means you cannot rely on rules in one rule group to inhibit rules in another rule group. Instead inhibiting rules should be duplicated for each rule group in which they are needed.
For example, let's suppose you have the following rules in your Prometheus configuration:
groups:
  - name: Example group 1
    rules:
      - alert: Inhibiting rule
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      - alert: Inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"
  - name: Example group 2
    rules:
      - alert: Another inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is another inhibited rule"
Here a rule in Example group 1 is inhibiting a rule from Example group 2. However, Prometheus makes no guarantees that "Inhibiting rule" in Example group 1 will be evaluated and sent to Alertmanager before "Another inhibited rule" in Example group 2.
An example of the offset that Prometheus uses between rule groups can be seen here:
ts=2023-06-25T15:08:44.918Z caller=manager.go:363 level=debug component="rule manager" file=rules.yml group="Example group 2" msg="Evaluation offset" time=2023-06-25T15:08:48.192661849Z
ts=2023-06-25T15:08:44.919Z caller=manager.go:363 level=debug component="rule manager" file=rules.yml group="Example group 1" msg="Evaluation offset" time=2023-06-25T15:08:59.61331345Z
In this Prometheus process Example group 2 is evaluated once per minute, starting at 2023-06-25T15:08:48Z, 11 seconds before Example group 1 at 2023-06-25T15:08:59Z. This is not a large enough difference to present a problem for a Group wait of 15s and a Group interval of 1m, but it shows how it can be an issue for larger gaps between rule groups, or for a Group wait less than 15s and a Group interval less than 1m.
That said, if Alertmanager crashes or is restarted then we still have a problem. Here, following a restart of Alertmanager, a notification is sent for "Another inhibited rule" as it was delivered to Alertmanager 11 seconds before the inhibiting rule:
ts=2023-06-25T15:18:40.241Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.001935375s
ts=2023-06-25T15:19:03.171Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Another inhibited rule[3320a36][active]"
ts=2023-06-25T15:19:03.172Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Another inhibited rule[3320a36][active]]"
ts=2023-06-25T15:19:03.338Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
ts=2023-06-25T15:19:14.591Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T15:19:14.592Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
ts=2023-06-25T15:20:03.170Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Another inhibited rule[3320a36][active] Inhibited rule[f33ff29][active] Inhibiting rule[93cf445][active]]"
ts=2023-06-25T15:20:03.299Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
To prevent this situation from happening, "Inhibiting rule" should be duplicated in each rule group in which it is needed, as in the following Prometheus configuration:
groups:
  - name: Example group 1
    rules:
      - alert: Inhibiting rule
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      - alert: Inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is an inhibited rule"
  - name: Example group 2
    rules:
      - alert: Inhibiting rule
        expr: 1
        for: 5m
        labels:
          inhibit: "true"
        annotations:
          summary: "This is an inhibiting rule"
      - alert: Another inhibited rule
        expr: 1
        for: 5m
        labels:
          inhibited: "true"
        annotations:
          summary: "This is another inhibited rule"
Here "Inhibiting rule" is defined twice with the same name and labels. This is actually fine as Alertmanager will see them as the same alert and de-duplicate them.
You can see in the following example that both received "Inhibiting rule" alerts have the same fingerprint 93cf445, which in Alertmanager means they are the same alert, even if they came from different rule groups in Prometheus:
ts=2023-06-25T15:29:48.185Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T15:29:48.186Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Another inhibited rule[3320a36][active]"
ts=2023-06-25T15:29:59.602Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T15:29:59.602Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Another inhibited rule[3320a36][active] Inhibiting rule[93cf445][active]]"
ts=2023-06-25T15:29:59.603Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
You'll also see that at least one of the inhibiting rules is always sent to Alertmanager before any other rules:
ts=2023-06-25T15:35:49.991Z caller=cluster.go:700 level=info component=cluster msg="gossip settled; proceeding" elapsed=10.002213625s
ts=2023-06-25T15:36:03.161Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T15:36:03.161Z caller=dispatch.go:515 level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts="[Inhibiting rule[93cf445][active]]"
ts=2023-06-25T15:36:03.161Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Another inhibited rule[3320a36][active]"
ts=2023-06-25T15:36:03.322Z caller=notify.go:752 level=debug component=dispatcher receiver=test integration=email[0] msg="Notify success" attempts=1
ts=2023-06-25T15:36:14.579Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibiting rule[93cf445][active]"
ts=2023-06-25T15:36:14.580Z caller=dispatch.go:163 level=debug component=dispatcher msg="Received alert" alert="Inhibited rule[f33ff29][active]"
This is how you can ensure your inhibition rules are sent to Alertmanager before any of the rules that they inhibit.
3. Check logs and metrics for dropped notifications
When Prometheus sends alerts to Alertmanager it buffers alerts in a ring buffer. If the buffer is full then Prometheus will drop the oldest alerts to make space for newer alerts, which could result in an alert from an inhibition rule at the start of the buffer being dropped. You can tell if this is happening as Prometheus will emit logs containing "dropping alerts" and increment the prometheus_notifications_dropped_total metric. The default ring buffer size is 10,000 alerts, but it can be increased if necessary by passing the --alertmanager.notification-queue-capacity command line flag when starting Prometheus.
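As a rough sketch of how you might watch for this, the rule group below alerts when any notifications have been dropped, and gives an earlier warning when the queue is filling up. The group name, alert names, severity label, time windows and the 80% threshold are all illustrative assumptions rather than recommended values. The second rule assumes your Prometheus version exposes the prometheus_notifications_queue_length and prometheus_notifications_queue_capacity metrics.

```yaml
groups:
  - name: Notification queue   # hypothetical group name
    rules:
      # Fires if Prometheus dropped any alerts in the last 5 minutes
      # instead of sending them to Alertmanager.
      - alert: Notifications dropped   # hypothetical alert name
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels:
          severity: critical   # assumed label, adjust to your routing
        annotations:
          summary: "Prometheus is dropping alerts before they reach Alertmanager"
      # Early warning: the queue is more than 80% full (assumed threshold).
      - alert: Notification queue nearly full   # hypothetical alert name
        expr: prometheus_notifications_queue_length / prometheus_notifications_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning   # assumed label
        annotations:
          summary: "Prometheus notification queue is close to capacity"
```

If either of these fires regularly, that is a signal to consider raising --alertmanager.notification-queue-capacity, or to look at why alerts are not being delivered to Alertmanager quickly enough.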
Summary
In summary inhibition rules should go at the start of a rule group, and inhibition rules should not inhibit across rule groups. Instead inhibition rules should be duplicated for each rule group in which they are needed. You'll want to have alerts on the metric prometheus_notifications_dropped_total to make sure alerts, and in particular inhibition alerts, are not being dropped before they are sent to Alertmanager.