Tracking down data corruption in Alertmanager notifications

16th September, 2023

Back in March of this year, a Grafana user opened an issue on GitHub reporting data corruption in their Alertmanager notifications. Sometimes, although not always, text such as "Waltz, bad nymph, for quick jigs vex" would come out as "Waltz, BaD", "Nym", "Wph, For icc Jigs Vex" or some variant thereof.

The user in question happened to be using their own notification templates instead of the default templates. This is not uncommon, in fact quite the opposite. However, in this case data corruption seemed to be occuring somewhere in the template, and the user had suspected that the cause was a single line that looked similar to this:

{{ reReplaceAll "Waltz" "Fox" .Message | title }}

The template itself is quite simple. It replaces all occurences of "Waltz" with "Fox" in the message and then transforms the result into title case. The issue could be in one of the functions, the piping of data from one function to another, or even in text/template.

Reproducing the issue

I needed to establish whether the issue was reproducible, and if so document how to reproduce it. However, as data corruption in memory-safe languages like Go is usually caused by race conditions rather than due to memory-safety violations like buffer overflows, my main worry was whether I would be able to reproduce the issue at all.

Luckily, as it turned out reproducing the issue was rather straightforward. With the following configuration (grouping was disabled so I would receive one notification per alert) and just a couple of firing alerts I was immediately receiving emails with corrupted subjects:

receivers:
- name: email
  email_configs:
  - subject: '{{ with (index .Alerts 0) }}{{ reReplaceAll "Waltz" "Fox" (index .Labels "alertname") | title }}{{ end }}'
route:
  receiver: email
  group_by: ['...']

The issue was not reproducible though if there was just one firing alert.

Identifying the issue

Since the issue was not reproducible with just a single alert, but instead required there to be at least two alerts firing at the same time, I suspected the cause to be either a race condition in ExecuteTextString or in text/template.

To see if I could verify this hypothesis I added a mutex that would prevent two (or more) goroutines from running ExecuteTextString at the same time:

diff --git a/template/template.go b/template/template.go
index a42cf03d..9e56bf1f 100644
--- a/template/template.go
+++ b/template/template.go
@@ -23,6 +23,7 @@ import (
        "regexp"
        "sort"
        "strings"
+       "sync"
        tmpltext "text/template"
        "time"

@@ -38,6 +39,7 @@ import (
 type Template struct {
        text *tmpltext.Template
        html *tmplhtml.Template
+       mu   sync.Mutex

        ExternalURL *url.URL
 }
@@ -131,6 +133,8 @@ func (t *Template) FromGlob(path string) error {

 // ExecuteTextString needs a meaningful doc comment (TODO(fabxc)).
 func (t *Template) ExecuteTextString(text string, data interface{}) (string, error) {
+       t.mu.Lock()
+       defer t.mu.Unlock()
        if text == "" {
                return "", nil
        }

The issue was fixed! However, I still didn't understand where the issue was or what was causing it. I knew that text/template was safe to be used concurrently, so why did I need to use a mutex?

Still suspecting the issue to be in either ExecuteTextString or in text/template, I then decided to isolate the template from the rest of the Alertmanager code. I copied the minimum amount of code into a new file and ran it just like a normal program:

package main

import (
    "bytes"
    "log"
    "regexp"
    "sync"
    "text/template"

    "golang.org/x/text/cases"
    "golang.org/x/text/language"
)

func main() {
    data := map[string]string{"Message": "Waltz, bad nymph, for quick jigs vex"}
    tmplText := `{{ reReplaceAll "Waltz" "Fox" .Message | title }}`

    tmpl, err := template.New("").Parse("")
    if err != nil {
        log.Fatal(err)
    }
    funcs := map[string]any{
        "title": cases.Title(language.AmericanEnglish).String,
        "reReplaceAll": func(pattern, repl, text string) string {
            re := regexp.MustCompile(pattern)
            return re.ReplaceAllString(text, repl)
        },
    }
    tmpl = tmpl.Funcs(funcs)

    wg := sync.WaitGroup{}
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            newTmpl, err := tmpl.Clone()
            if err != nil {
                log.Fatal(err)
            }
            newTmpl, err = newTmpl.New("").Parse(tmplText)
            if err != nil {
                log.Fatal(err)
            }
            buf := bytes.Buffer{}
            newTmpl.Execute(&buf, data)
            log.Println(buf.String())
        }()
    }
    wg.Wait()
}

Voila! I saw the same data corruption as reported in the GitHub issue:

2023/09/16 10:28:25 fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:28:25 oFoO, BA nymh, FFo  quick JIis vex
2023/09/16 10:28:25 F
2023/09/16 10:28:25 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:28:25 Fox, Bad Nymph, For Quick Jigs VeX
2023/09/16 10:28:25 fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:28:25 fOx BA Nmh, For Quick Jiis VeX
2023/09/16 10:28:25 Fox, Bad Nymph, For Quick JiG
2023/09/16 10:28:25 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:28:25 Fox, Bad Nymph, For Quick Jigs Vex

This meant the issue could not be in ExecuteTextString because it was happening when I used just text/template. Perhaps the data corruption was occuring in one of the functions instead. What happened if I removed reReplaceAll from the template?

2023/09/16 10:30:11 WaLtz, 
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Wwwaltz, BBd NYmph,,, FOr qUuick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:30:11 Waltz, Bad Nymph, For Quick Jigs Vex

The issue was not in reReplaceAll. Could it have been in title?

2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex
2023/09/16 10:31:15 Fox, bad nymph, for quick jigs vex

Having removed title from the template I could no longer reproduce the data corruption I had seen earlier. I then went go read the docs for golang.org/x/text/cases in case it was not-thread safe and to my surprise it said:

A Caser transforms given input to a certain case. It implements transform.Transformer. A Caser may be stateful and should therefore not be shared between goroutines.

But that's exactly what the code was doing as .String is a reference to a method on a shared caser:

"title": cases.Title(language.AmericanEnglish).String,

Fixing the issue

The fix here was quite simple. If a caser can be stateful, all we needed to do was to create a new caser for each call to title, so a caser will never be shared between goroutines and can instead be garbage collected once having executed the template:

"title": func(text string) string {
    return cases.Title(language.AmericanEnglish).String(text)
},

Here we can see the output of the isolated program with the fix:

2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex
2023/09/16 10:40:47 Fox, Bad Nymph, For Quick Jigs Vex

Summary

There are a number of lessons that I think can be taken away from this issue:

  1. Never assume a function is thread-safe unless it tells you so.
  2. Be cautious of functions that operate on immutable types like strings. Even if the type is immutable, the function may keep state which isn't.
  3. Write concurrent tests that test thread-safe code. This can be achieved at some basic level with unit tests, however fuzz tests are more suitable for longer running test cases.

Last it should be mentioned that the fix for this issue is available in Alertmanager versions 0.26 and newer. You can find the pull request here.