May 21, 2013

Aggregating Hierarchies With Postgres

Last time we talked about one approach for modeling hierarchical data with a materialized path. Now, we’ll look at how to report on this data.

We will use the same tiny dataset of employees as before:

 id |   name       |   path    
----+--------------+-----------
  1 | Harold       | {1}
  2 |   Arthur     | {1,2}
  3 |     Alice    | {1,2,3}
  4 |       Rose   | {1,2,3,4}
  5 |     Bob      | {1,2,5}
  6 |   Sally      | {1,6}
  7 |     Mortimer | {1,6,7}
  8 |     Penny    | {1,6,8}

Aggregating Parents

Using a materialized path, we can aggregate information about each record’s parents in a single query. For example, let’s show the names of all the people above each employee. Alice should show ['Harold','Arthur','Alice'], while Sally should show ['Harold','Sally']. We will want get all the parent records for each employee, and then group them by the original employee id.

I’m going to show you the final query, then we will figure out how it works:

SELECT 
  employees.id, 
  employees.name, 
  array_agg(parents.name ORDER BY parents.depth) AS superiors 
FROM employees
LEFT JOIN (
  SELECT *, array_length(path,1) as depth FROM employees
) AS parents ON employees.path && ARRAY[parents.id]
GROUP BY 
  employees.id;

This gives us back:

 id |   name       |         superiors          
----+--------------+----------------------------
  1 | Harold       | {Harold}
  2 |   Arthur     | {Harold,Arthur}
  3 |     Alice    | {Harold,Arthur,Alice}
  4 |       Rose   | {Harold,Arthur,Alice,Rose}
  5 |     Bob      | {Harold,Arthur,Bob}
  6 |   Sally      | {Harold,Sally}
  7 |     Mortimer | {Harold,Sally,Mortimer}
  8 |     Penny    | {Harold,Sally,Penny}

Great! So how did that work?

We started with a table of employees (SELECT employees.id, employees.name, ... FROM employees). We include the id so that we can group the results again, and we include the name so we don’t have to remember that id:3 is Alice.

Further down, we select from employees again, (SELECT ... FROM employees) as parents. This subquery lets us reference the employees a second time. Now we have two sets of employees that can be joined together.

The LEFT JOIN is where all the magic is. Notice that instead of joining on something like employees.id = parents.id, we joined with: employees.path && ARRAY[parents.id]. If you remember how found the parents of a record from the previous article, you’ll recognize this. Every time the employee’s path contains the parent’s id, a new row will get generated. So if a record has three parents, we’ll get three rows back.

To see this in action, let’s isolate just the join between employees and their parents:

SELECT 
  employees.id   AS employee_id, 
  employees.name AS employee_name, 
  parents.id     AS parent_id, 
  parents.name   AS parent_name
FROM employees 
LEFT JOIN (
  SELECT id, name FROM employees
) AS parents ON employees.path && ARRAY[parents.id]

Now you can see each of the rows we are generating:

 employee_id | employee_name | parent_id | parent_name 
-------------+---------------+-----------+-------------
           1 | Harold        |         1 | Harold
           2 | Arthur        |         1 | Harold
           2 | Arthur        |         2 | Arthur
           5 | Bob           |         1 | Harold
           5 | Bob           |         2 | Arthur
           5 | Bob           |         5 | Bob
           ...

See how Harold got one row, while Arthur got two, and Bob three? Now we have something to aggregate.

Lets look at the aggregation itself. We group the data with GROUP BY employees.id, that brings it back down to a single row per employee.

To actually generate aggregated column, we use array_agg(parents.name ORDER BY parents.depth) AS superiors. The array_agg function returns an array of everything we grouped by, parents.name in this case. The ORDER BY is a little unusual, most aggregate functions are not ordered, but in this case, we want to make sure that results represent the chain of command. Ordering the aggregate by the parents’ depth will make sure that each person is in the correct order. The depth is provided by array_length(path,1) as depth, which we defined in the parents subquery. Again, we covered finding the depth in the previous article.

Putting it all together, we can now report on the parents in our hierarchy. Here’s a challenge: how would you determine whether a record has some parent that matches a certain condition? For instance what if we wanted a column that returned true if any of an employee’s superiors were on vacation (assuming there was an on_vacation column)? Go ahead and fork the sample if you want to try this.

Aggregating Children

We can aggregate up the hierarchy, but how about down the hierarchy?

Behind the scenes I have added a salary column using a highly technical compensation formula. Again, all the details are available on GitHub.

 id |   name       |   path    | salary 
----+--------------+-----------+--------
  1 | Harold       | {1}       | 101000
  2 |   Arthur     | {1,2}     |  82000
  3 |     Alice    | {1,2,3}   |  63000
  4 |       Rose   | {1,2,3,4} |  44000
  5 |     Bob      | {1,2,5}   |  65000
  6 |   Sally      | {1,6}     |  86000
  7 |     Mortimer | {1,6,7}   |  67000
  8 |     Penny    | {1,6,8}   |  68000

How can we report on the total pay that each person is responsible for? For instance since Alice is in charge of Rose we would like to see the sum of their salaries: $63,000 + $44,000 = $107,000.

There are actually two ways to achieve this. The first is to use what we just learned about aggregating parents, but flip that join around so that instead of returning the parents of each row, we return all the children of that row:

SELECT 
  employees.id   AS employee_id, 
  employees.name AS employee_name, 
  children.id    AS child_id, 
  children.name  AS child_name
FROM employees 
LEFT JOIN (
  SELECT id, name, path FROM employees
) AS children ON children.path && ARRAY[employees.id]

The join condition children.path && ARRAY[employees.id], will now return the children for each employee_id:

 employee_id | employee_name | child_id | child_name 
-------------+---------------+----------+------------
           1 | Harold        |        1 | Harold
           1 | Harold        |        2 | Arthur
           1 | Harold        |        5 | Bob
           1 | Harold        |        6 | Sally
           1 | Harold        |        7 | Mortimer
           1 | Harold        |        8 | Penny
           1 | Harold        |        3 | Alice
           1 | Harold        |        4 | Rose
           2 | Arthur        |        2 | Arthur
           2 | Arthur        |        5 | Bob
           ...

If you like to learn by doing, go ahead and try this approach out, it should be similar to our parent aggregation.

Although this works, there is a very handy way to generate child records that is often faster. You can use unnest(path) to generate a row for every employee’s child.

Let’s just dive into this approach, and then we will look at how it works:

SELECT 
  employees.id, 
  employees.name, 
  total_salary
FROM (
  SELECT unnest(path) AS parent_id,
         sum(salary)  AS total_salary
  FROM employees
  GROUP BY parent_id
) AS children INNER JOIN employees ON children.parent_id = employees.id
ORDER BY id DESC;

Now we can see how many people work for each employee:

 id |   name       | total_salary 
----+--------------+--------------
  1 | Harold       |       576000
  2 |   Arthur     |       254000
  3 |     Alice    |       107000
  4 |       Rose   |        44000
  5 |     Bob      |        65000
  6 |   Sally      |       221000
  7 |     Mortimer |        67000
  8 |     Penny    |        68000

Neat. Why does it work? Take a careful look at the children subquery and join. The difference is that we unnest the path, and then aggregate on that directly in our children subquery.

What does calling unnest(path) actually do?

SELECT id, name, path, salary, unnest(path) AS parent_id
FROM employees;

As you can see, we get one row for every item in our path array:

 id |   name   |   path    | salary | parent_id 
----+----------+-----------+--------+----------
  1 | Harold   | {1}       | 101000 |        1
  2 | Arthur   | {1,2}     |  82000 |        1
  2 | Arthur   | {1,2}     |  82000 |        2
  5 | Bob      | {1,2,5}   |  65000 |        1
  5 | Bob      | {1,2,5}   |  65000 |        2
  5 | Bob      | {1,2,5}   |  65000 |        5
  ...

Each of these generated rows has the array element that got unnested in the parent_id column, and then the other values are all duplicated. Harold, who has the path [1], just gets one row. Meanwhile, Bob’s path is [1,2,5], and returns three rows that are identical except for parent_id.

The children subquery uses unnest(path) AS employee_id to generate a row for every employee’s child record. The GROUP BY employee_id collapses all those child records back into a single row, and the sum(salary) AS total_salary generates our aggregate.

The join, INNER JOIN employees ON children.employee_id = employees.id, just ties that aggregated data back to the original rows so that we can show the employees’ names.

Play around with unnest, its behavior can be confusing at first, but once grasp what it is doing, it can be very powerful. As a challenge, see if you can figure out how to return the aggregated total salary along with the id and name without using the join and a subquery.

Once you have this technique under your belt, you will be able to generate a wide variety of useful reports on hierarchies.

Aggregating Hierarchies With Postgres

Aggregating Parents

Aggregating Children

Further Reading