How to flatten JSON with pattern matching in Rust

Published on Jul 1, 2019

JSON is a text-based format that is human-readable, straightforward to parse and even easier to generate. It is most popular on the web where it is used to send data between client and server.

However despite being human-readable, one frequent complaint with JSON is that it can be quite difficult to grep. To help people work with the structured nature of JSON a number of programs exist, perhaps the most popular being a program called jq. But programs such as jq tend to have their own embedded languages with built-in functions such as select, map and filter which although very powerful presents a steep learning curve if the intention is solely to match patterns against JSON objects. Can we not just use grep instead?

Lets take a JSON object, such as the example below:

{
    "name": "John Doe",
    "age": 43,
    "email": "[email protected]",
    "phones": [
        "+44 1234567",
        "+44 2345678"
     ],
     "addresses": [{
         "number": 1,
         "street": "Borough High Street",
         "city": "London",
         "zip": "SE1 1"
     }]
}

and for each field take the keys and indexes that could be considered its parent(s) and concatenate them together. If we did that for each field in the example above then we would end up with something like this:

.name = "John Doe"
.age = 43
.email = "[email protected]"
.phones.0 = "+44 1234567"
.phones.1 = "+44 2345678"
.addresses.0.number = 1
.addresses.0.street = "Borough High Street"
.addresses.0.city = "London"
.addresses.0.zip = "SE1 1"

This is what we mean when we talk about flattened JSON and as you can see it is much easier to grep than structured JSON:

$ flatjson example.json | grep ".phones.[0-9+]"
.phones.0 = "+44 1234567"
.phones.1 = "+44 2345678"

So how do we flatten JSON? The answer to that question is with an algorithm, of course! So let's now look at how we can design an algorithm to achieve just that.

The pseudocode

A value in JSON can be either a string, a number, an object, an array, a boolean or null, as shown in this railroad diagram:

An array in JSON can contain zero or more values:

And an object in JSON can contain one or more string/value pairs:

This means that if a value can be either a string, a number, an object, an array, a boolean or null; and that arrays and objects can contain zero or more values, then arrays and objects can also contain other arrays and objects. We call this a Recursive Type, and such types lend themselves quite nicely to what we call recursive algorithms.

What is a recursive algorithm? A recursive algorithm, quite simply, is an algorithm which calls itself with smaller or simpler inputs until it has a value for the smallest or most simple input, known as the base case. It obtains the result for the current case from the result of each recursive call until the final result is known.

All recursive algorithms can be defined by the following two properties:

  1. A base case, or inductive step. This is the terminating case in a recursive function where no more recursive calls are required to produce a result
  2. A set of rules which reduce all other cases to the base case

To understand how this works let's write some pseudocode to see how we could flatten JSON with a recursive algorithm. First we need to define the recursive function and its arguments. Our recursive function accepts the following arguments:

  • A vector result to store the flattened fields
  • A value which is the JSON value from the railroad diagram above
  • Afield which is the concatenated keys and indexes that could be considered the value's parent(s)

In psuedocode it looks like this:

function flatten_value(result, value, field):

Next we need a base case, or inductive step as it is also known. The base case is the terminating case in recursion where no more recursive calls are required to produce a result. In JSON this is when we have a string, a number, a boolean or null:

    if value is null then
        result.push(field + " = " + "null")
    elif value is bool then
        result.push(field + " = " + bool_to_string(value))
    elif value is number then
        result.push(field + " = " + number_to_string(value))
    elif value is string then
        result.push(field + " = " + value)

With the base case complete we now need to define the set of rules that reduce all other cases to the base case. In JSON this is when we have an array or an object and is where we have our recursive calls to flatten_value:

    elif value is array then
        for i in value do
            result.push(flatten_value(
                result,
                value[i],
                field + "." + number_to_string(i)))
        end
    elif value is object then
        for name, value in object do
            result.push(flatten_value(
                result,
                value,
                field + "." + name))
        end
    end
end

The Rust implementation

Let's start to write a program in Rust that can take a JSON object and flatten it, just like the psuedocode we wrote earlier.

First we need to create a new project, and to do this we use cargo:

$ cargo new flatjson

Next we need to edit Cargo.toml and add the serde_json crate to our project. We need serde_json to be able to deserialize JSON without writing our own deserializer. Note that we also require the preserve_order feature for serde_json. This is so we can preserve the order of name/value pairs to ensure that objects in the flattened JSON are in the same order as the input JSON. This feature is not required for the correctness of the program, but it does make testing much easier later on.

[dependencies]
failure = "0.1.1"

[dependencies.serde_json]
version = "1.0"
features = ["preserve_order"]

It's now time to write some Rust code. To start create a file called flatjson.rs in src and import the Value enum from the serde_json package:

use serde_json::Value;

Below the import for serde_json create the signature for a public function called flatten. This function will accept as an argument a reference to a Value enum and return a Vec<String>. Note that this is a vector of heap-allocated strings, not the immutable string slice &str.

pub fn flatten(v: &Value) -> Vec<String> {

In the body of function allocate a new vector to store the flattened fields from the input JSON. Next we call a recursive function flatten_value, just like the pseudcode we wrote earlier, passing a mutable reference to the vector before returning it to the caller. Note that this does not violate Rust's ownership rules because the mutable reference to the vector &mut items is dropped before the vector is moved out of function and returned to the caller.

    let mut items = Vec::new();
    flatten_value(&mut items, v, "");
    items
}

Now we can write the recursive flatten_value function, like we did in pseudocode earlier. Note that unlike flatten, flatten_value is internal to the flatjson module and so does not start with pub. You should also be able to see that the function signature is the same as our pseudocode too!

fn flatten_value(items: &mut Vec<String>, value: &Value, field: &str) {

Just like when we wrote our pseudocode we will first start by implementing the base case, or the inductive step as it is also known.

Unlike some languages which require the programmer to use a large number of if and else if statements, in Rust we can use a neat feature called pattern matching to see if a value is null, a boolean, a number or a string. That same pseudocode written in Rust looks like this:

    match value {
        Value::Null => items.push(format!("{} = {}", field, "null")),
        Value::Bool(b) => items.push(format!("{} = {}", field, b)),
        Value::Number(n) => items.push(format!("{} = {}", field, n)),
        Value::String(s) => items.push(format!("{} = \"{}\"", field, s)),

All that we have left to do now is implement the reduction case for arrays and objects.

When the value is an array we want to call flatten_value for each of its elements. To do this we create an iterator over the array arr by calling the iter() method. We additonally call the enumerate() method on the iterator to get the index of each element. Then for each element in the array we call flatten_value, passing a mutable reference to items each time. The name of the field for each element is the concatenation of the current value of field and the index of the element in the array.

        Value::Array(arr) => {
            for (i, v) in arr.iter().enumerate() {
                flatten_value(items, v, format!("{}.{}", field, i).as_str())
            }
        }

When the value is an object we want to call flatten_value for each name/value pair. To do this we also create an iterator by calling the iter() method. Then for each name/value pair in the object we call flatten_value, again passing a mutable reference to items each time. The name of the field for each name/pair is the concatenation of the current value of field and the current key from the iterator.

        Value::Object(obj) => {
            for (k, v) in obj.iter() {
                flatten_value(items, v, format!("{}.{}", field, k).as_str())
            }
        }
    }
}

With the flatten and flatten_value functions now complete we should write some unit tests to make sure that the code is correct. Recall earlier that we required the preserve_order feature for serde_json. This is so we can preserve the order of name/value pairs to ensure that objects in the flattened JSON are in the same order as the input JSON. You will see now why this is useful.

At the end of the file create a module called test and import all packages from the parent module, aliased as super. Write a test function called test_flatten and decode a JSON object from a string literal with serde_json. Take the result from serde_json, a reference to Value, and call flatten. Your test should assert the flattened JSON is as expected.

#[cfg(test)]
mod test {
    use super::*;
    use serde_json;

    #[test]
    fn test_flatten() {
        let v: Value = serde_json::from_str(
            r#"
            {
                "name": "John Doe",
                "age": 43,
                "email": null,
                "phones": [
                    "+44 1234567",
                    "+44 2345678"
                ],
                "addresses": [{
                    "number": 1,
                    "street": "Borough High Street",
                    "city": "London",
                    "zip": "SE1 1"
                }]
            }"#,
        )
        .unwrap();
        assert_eq!(
            vec![
                ".name = \"John Doe\"",
                ".age = 43",
                ".email = null",
                ".phones.0 = \"+44 1234567\"",
                ".phones.1 = \"+44 2345678\"",
                ".addresses.0.number = 1",
                ".addresses.0.street = \"Borough High Street\"",
                ".addresses.0.city = \"London\"",
                ".addresses.0.zip = \"SE1 1\"",
            ],
            flatten(&v)
        );
    }
}

And that's it! To see the complete code, and a working CLI, please refer to flatjson on GitHub.