May 11, 2013

Hierarchies With Postgres

Relational databases do not make modeling hierarchies easy. Luckily there is a fairly straightforward way to encode trees in Postgres.

We will look at how arrays can used for materialized path encoding, and touch on some common hierarchical operations such as finding the depth of nodes, children, parents, and finally moving records.

If you are not familiar with arrays in relational databases yet, you should read Tagging in Postgres and ActiveRecord first, it covers some basic array operators we will be using.

Modeling Hierarchies

There are a number of ways you can represent a tree in a relational database. The most obvious approach is for each node to have a reference to its parent. For instance if we might model a giant eight person company like this:

id | name
---+------------
1  | Harold
2  |   Arthur
3  |     Alice
4  |       Rose
5  |     Bob
6  |   Sally
7  |     Mortimer
8  |     Penny

One common approach is to simply store the id of each person’s boss. Although this works for simple cases, it requires multiple queries to get all the parents or children of any node. For instance to find all the people who work under Arthur, you need at least one query per level beneath him.

An alternative is to encode the hierarchy as a materialized path. A materialized path is an array that contains all the ids of the record’s parents.

CREATE TABLE employees (
    id        integer primary key,
    path      integer[],
    name      text
);

For example, Alice would have a materialized path of [1,2,3] since the people above her are Harold (id:1), Arthur (id:2), and Alice herself is (id:3).

In these examples, we also include the current record’s id in its path, this is optional, but is often helpful if you want to get a record with its parents or children. If you do not encode the current record’s id in its path, you may need to cast your empty arrays with ARRAY[]::integer[].

Depth

The depth of a record indicates how deeply nested it is. With a materialized path, this is just the length of the array. Postgres provides array_length(array, dim), which will return the length of an array in dimension dim. Since we are using one dimensional arrays, dim will always be 1.

We can now find all the people in the top two levels of our massive eight person company:

SELECT id, name, path, array_length(path,1) as depth
FROM employees 
WHERE array_length(path,1) <= 2

Doing this will give us back just 3 employees:

 id |  name    | path  | depth 
----+----------+-------+-------
  1 | Harold   | {1}   |     1
  2 |   Arthur | {1,2} |     2
  6 |   Sally  | {1,6} |     2

Children

Finding a record’s children is easy with a materialized path. We can apply what we learned about the overlap operator from Tagging in Postgres and ActiveRecord to this problem. Since each record contains the ids of all the parent records, we just look for all the records containing the id we are interested in.

For example, if we want to find all the people who work under Arthur (id:2), we would look for all the paths that overlap him (path && ARRAY[2]):

SELECT id, name, path FROM employees
WHERE path && ARRAY[2]

As you can see, each of the employees returned have Arthur’s id in their materialized path:

 id |  name      |   path    
----+------------+-----------
  2 | Arthur     | {1,2}
  3 |   Alice    | {1,2,3}
  4 |     Rose   | {1,2,3,4}
  5 |   Bob      | {1,2,5}

Parents

Finding the parents of a record is just as easy as finding their children. Instead of looking for a path that overlaps a given id, we can look for any id that overlaps a given path.

So if we wanted to find all of Alice’s superiors, we would look for all the records whose id overlaps her path ([1,2,3]).

SELECT id, name, path FROM employees
WHERE ARRAY[id] && ARRAY[1,2,3]

This will give you the three people listed in Alice’s path:

 id |  name      |  path   
----+------------+---------
  1 | Harold     | {1}
  2 |   Arthur   | {1,2}
  3 |     Alice  | {1,2,3}

Alternatively, since you already know the ids you are looking for you could also query for those records directly:

SELECT id, name, path
FROM employees 
WHERE id in (1,2,3)

Although both approaches yield the same results, Postgres will use different indices, so if this is a performance critical query, benchmark both approaches in your application.

Moving Records

Up to this point, we have focused on querying a hierarchy. Updating a hierarchy can be a little tricky since child records need to be updated as well.

Let’s imagine that Alice and everyone working for her are moved under Mortimer (path: [1,6,7]):

id | name         | path
---+--------------+----------
1  | Harold       | {1}
2  |   Arthur     | {1,2}
5  |     Bob      | {1,2,5}
6  |   Sally      | {1,6}
7  |     Mortimer | {1,6,7}
3  |       Alice  | {1,6,7,3}
4  |         Rose | {1,6,7,3,4}
8  |     Penny    | {1,6,8}

Alice’s and her subordinates’s path will now need to start with Mortimer’s path instead of Arthur’s. To perform this operation, we will slice off Arthur’s path, and replace it with Mortimer’s path:

UPDATE employees
SET path = ARRAY[1,6,7] || path[3:array_length(path,1)]
WHERE path && ARRAY[3]

There are a few things going on here, so let’s work through this from the bottom up.

We limit the update to the employees who work under Alice by selecting her, and her child records. To slice off Arthur’s path, we get the path from Alice’s depth (3) to the end of the path. Now we can tack on Mortimer’s path ([1,6,7]).

Now you can move people around your organization.