Aggregating Hierarchies With Postgres
Last time we talked about one approach for modeling hierarchical data with a materialized path. Now, we’ll look at how to report on this data.
We will use the same tiny dataset of employees as before:
id | name | path
----+--------------+-----------
1 | Harold | {1}
2 | Arthur | {1,2}
3 | Alice | {1,2,3}
4 | Rose | {1,2,3,4}
5 | Bob | {1,2,5}
6 | Sally | {1,6}
7 | Mortimer | {1,6,7}
8 | Penny | {1,6,8}
Aggregating Parents
Using a materialized path, we can aggregate information about each record’s parents in a single query. For example, let’s show the names of all the people above each employee. Alice should show ['Harold','Arthur','Alice']
, while Sally should show ['Harold','Sally']
. We will want get all the parent records for each employee, and then group them by the original employee id.
I’m going to show you the final query, then we will figure out how it works:
SELECT
employees.id,
employees.name,
array_agg(parents.name ORDER BY parents.depth) AS superiors
FROM employees
LEFT JOIN (
SELECT *, array_length(path,1) as depth FROM employees
) AS parents ON employees.path && ARRAY[parents.id]
GROUP BY
employees.id;
This gives us back:
id | name | superiors
----+--------------+----------------------------
1 | Harold | {Harold}
2 | Arthur | {Harold,Arthur}
3 | Alice | {Harold,Arthur,Alice}
4 | Rose | {Harold,Arthur,Alice,Rose}
5 | Bob | {Harold,Arthur,Bob}
6 | Sally | {Harold,Sally}
7 | Mortimer | {Harold,Sally,Mortimer}
8 | Penny | {Harold,Sally,Penny}
Great! So how did that work?
We started with a table of employees
(SELECT employees.id, employees.name, ... FROM employees
). We include the id
so that we can group the results again, and we include the name
so we don’t have to remember that id:3
is Alice.
Further down, we select from employees
again, (SELECT ... FROM employees) as parents
. This subquery lets us reference the employees
a second time. Now we have two sets of employees that can be joined together.
The LEFT JOIN
is where all the magic is. Notice that instead of joining on something like employees.id = parents.id
, we joined with: employees.path && ARRAY[parents.id]
. If you remember how found the parents of a record from the previous article, you’ll recognize this. Every time the employee’s path
contains the parent’s id
, a new row will get generated. So if a record has three parents, we’ll get three rows back.
To see this in action, let’s isolate just the join between employees and their parents:
SELECT
employees.id AS employee_id,
employees.name AS employee_name,
parents.id AS parent_id,
parents.name AS parent_name
FROM employees
LEFT JOIN (
SELECT id, name FROM employees
) AS parents ON employees.path && ARRAY[parents.id]
Now you can see each of the rows we are generating:
employee_id | employee_name | parent_id | parent_name
-------------+---------------+-----------+-------------
1 | Harold | 1 | Harold
2 | Arthur | 1 | Harold
2 | Arthur | 2 | Arthur
5 | Bob | 1 | Harold
5 | Bob | 2 | Arthur
5 | Bob | 5 | Bob
...
See how Harold got one row, while Arthur got two, and Bob three? Now we have something to aggregate.
Lets look at the aggregation itself. We group the data with GROUP BY employees.id
, that brings it back down to a single row per employee.
To actually generate aggregated column, we use array_agg(parents.name ORDER BY parents.depth) AS superiors
. The array_agg
function returns an array of everything we grouped by, parents.name
in this case. The ORDER BY
is a little unusual, most aggregate functions are not ordered, but in this case, we want to make sure that results represent the chain of command. Ordering the aggregate by the parents’ depth
will make sure that each person is in the correct order. The depth is provided by array_length(path,1) as depth
, which we defined in the parents subquery. Again, we covered finding the depth in the previous article.
Putting it all together, we can now report on the parents in our hierarchy. Here’s a challenge: how would you determine whether a record has some parent that matches a certain condition? For instance what if we wanted a column that returned true
if any of an employee’s superiors were on vacation (assuming there was an on_vacation
column)? Go ahead and fork the sample if you want to try this.
Aggregating Children
We can aggregate up the hierarchy, but how about down the hierarchy?
Behind the scenes I have added a salary
column using a highly technical compensation formula. Again, all the details are available on GitHub.
id | name | path | salary
----+--------------+-----------+--------
1 | Harold | {1} | 101000
2 | Arthur | {1,2} | 82000
3 | Alice | {1,2,3} | 63000
4 | Rose | {1,2,3,4} | 44000
5 | Bob | {1,2,5} | 65000
6 | Sally | {1,6} | 86000
7 | Mortimer | {1,6,7} | 67000
8 | Penny | {1,6,8} | 68000
How can we report on the total pay that each person is responsible for? For instance since Alice is in charge of Rose we would like to see the sum of their salaries: $63,000 + $44,000 = $107,000.
There are actually two ways to achieve this. The first is to use what we just learned about aggregating parents, but flip that join around so that instead of returning the parents of each row, we return all the children of that row:
SELECT
employees.id AS employee_id,
employees.name AS employee_name,
children.id AS child_id,
children.name AS child_name
FROM employees
LEFT JOIN (
SELECT id, name, path FROM employees
) AS children ON children.path && ARRAY[employees.id]
The join condition children.path && ARRAY[employees.id]
, will now return the children for each employee_id
:
employee_id | employee_name | child_id | child_name
-------------+---------------+----------+------------
1 | Harold | 1 | Harold
1 | Harold | 2 | Arthur
1 | Harold | 5 | Bob
1 | Harold | 6 | Sally
1 | Harold | 7 | Mortimer
1 | Harold | 8 | Penny
1 | Harold | 3 | Alice
1 | Harold | 4 | Rose
2 | Arthur | 2 | Arthur
2 | Arthur | 5 | Bob
...
If you like to learn by doing, go ahead and try this approach out, it should be similar to our parent aggregation.
Although this works, there is a very handy way to generate child records that is often faster. You can use unnest(path)
to generate a row for every employee’s child.
Let’s just dive into this approach, and then we will look at how it works:
SELECT
employees.id,
employees.name,
total_salary
FROM (
SELECT unnest(path) AS parent_id,
sum(salary) AS total_salary
FROM employees
GROUP BY parent_id
) AS children INNER JOIN employees ON children.parent_id = employees.id
ORDER BY id DESC;
Now we can see how many people work for each employee:
id | name | total_salary
----+--------------+--------------
1 | Harold | 576000
2 | Arthur | 254000
3 | Alice | 107000
4 | Rose | 44000
5 | Bob | 65000
6 | Sally | 221000
7 | Mortimer | 67000
8 | Penny | 68000
Neat. Why does it work? Take a careful look at the children subquery and join. The difference is that we unnest the path
, and then aggregate on that directly in our children subquery.
What does calling unnest(path)
actually do?
SELECT id, name, path, salary, unnest(path) AS parent_id
FROM employees;
As you can see, we get one row for every item in our path
array:
id | name | path | salary | parent_id
----+----------+-----------+--------+----------
1 | Harold | {1} | 101000 | 1
2 | Arthur | {1,2} | 82000 | 1
2 | Arthur | {1,2} | 82000 | 2
5 | Bob | {1,2,5} | 65000 | 1
5 | Bob | {1,2,5} | 65000 | 2
5 | Bob | {1,2,5} | 65000 | 5
...
Each of these generated rows has the array element that got unnested in the parent_id
column, and then the other values are all duplicated. Harold, who has the path [1]
, just gets one row. Meanwhile, Bob’s path is [1,2,5]
, and returns three rows that are identical except for parent_id
.
The children subquery uses unnest(path) AS employee_id
to generate a row for every employee’s child record. The GROUP BY employee_id
collapses all those child records back into a single row, and the sum(salary) AS total_salary
generates our aggregate.
The join, INNER JOIN employees ON children.employee_id = employees.id
, just ties that aggregated data back to the original rows so that we can show the employees’ names.
Play around with unnest
, its behavior can be confusing at first, but once grasp what it is doing, it can be very powerful. As a challenge, see if you can figure out how to return the aggregated total salary along with the id and name without using the join and a subquery.
Once you have this technique under your belt, you will be able to generate a wide variety of useful reports on hierarchies.
Further Reading
- Check out the other Set Returning Functions.
- Curious about the
ORDER BY
inarray_agg
? Read up on the Aggregate Syntax. - Are your sub queries getting out of hand? Look into Common Table Expressions.
- Want to try this at home? Fork the examples on GitHub.
If you have any questions, or would like me to cover something else, just let me know.