PostgreSQL Preloading

Fetching complex data while avoiding large numbers of queries

Avoiding n+1 queries

If you're using an ORM, or any kind of abstraction over your database, you've likely run into the n+1 selects issue. This happens when a property access in your programming language triggers a query to fetch an associated object through a relation. When looping over a set of objects, a new query is triggered each time the field is accessed. The result is 1 query to fetch the initial set of items, then n additional queries, one for each item whose association is accessed. Hence the name: n+1.
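As a sketch of what this looks like at the database level, with a hypothetical Posts table whose rows reference a Users table, the query log for an n+1 access pattern looks something like this:

-- 1 query for the initial set of items
select * from Posts limit 10;

-- then 1 additional query each time post.user is accessed in the loop
select * from Users where id = 12;
select * from Users where id = 14;
-- ... and so on, n times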

n+1 queries should be avoided because the round-trip latency of a query can add up, even if the queries themselves are fast.

Many web frameworks have ways to mitigate this using preloading (or eager loading). The idea is that you load all of your data in as few queries as possible before looping over it.

There are two common approaches to solving this:

  • Use a join
  • Use a second query

Using a join

A join enables a query to include data from another relation.

  • Single query — a join can pull data from many different tables, all in a single query
  • Column renaming — the columns of the related tables must be renamed to avoid conflicts with the original table's columns. After fetching the data you must then map everything back into the shape you expect (see the sketch after this list)
  • Duplicate data — if many of your results are related to the same object, that object's data will be duplicated many times in your result set
  • Same database — all the tables involved must live in the same database
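To make the first two points concrete, here's a minimal sketch of a join-based preload using the Posts and Users tables that appear later in this post. The users_ prefix is just an illustrative renaming convention:

select
  Posts.*,
  Users.id as users_id,
  Users.username as users_username
from Posts
  join Users on Users.id = Posts.user_id
order by Posts.id desc limit 20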

Using a second query

In this approach you take the ids you received from the initial query, then issue another query using the where id in (a,b,c...) syntax, as sketched after the list below.

  • Multiple queries — each relation you want to include requires a new query. Instead of n+1 queries there are k additional queries, where k is the number of relations (generally much smaller than the number of rows)
  • No duplication — each associated object is fetched only once
  • General purpose — you can preload objects even if you've received a list of ids from somewhere other than the database you're querying
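Here's a minimal sketch of the two-query approach, again using the Posts and Users tables from later in this post. The ids in the in clause stand in for the user_id values collected from the first result:

-- first query: fetch the initial set of rows
select * from Posts order by id desc limit 20;

-- second query: preload the associated users by id
select * from Users where id in (12, 14, 31);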

More preloading strategies

The two approaches above will get you through the majority of preloading you'll need to do. In the rest of this post I'll cover some techniques I've used for handling more complex requirements.

Composite types

When selecting fields for a result, PostgreSQL expects each field to be a single value, but sometimes you want to return multiple values in its place. The most common example is a field returned from a select query:

select a from SomeTable

If you want to replace a with a subquery that returns multiple rows you'll see this error:

ERROR:  more than one row returned by a subquery used as an expression
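For example, assuming a hypothetical OtherTable containing more than one row, this query triggers the error above:

-- fails: the subquery produces multiple rows where a single value is expected
select (select b from OtherTable) as a from SomeTable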

We can use a composite type in order to return many values in the place of one. The most primitive composite type is the result type created by the row constructor (or with plain ()):

select id, row(1, now(), 'hello', NULL, name) from SomeTable

The result type is convenient for when queries need to pass composite values to one another, but not ideal when returning data to your host language.

When implementing my own PostgreSQL client, pgmoon, I realized that the server does not send type information for the values nested inside a composite result; the client would have to parse a string to get the contained values. For this reason I decided to focus on the json type instead, since the JSON format is well defined and parsers are fast and ubiquitous. If the PostgreSQL client you're using doesn't already parse JSON results, it would be trivial to handle in your language of choice.

You can read more about composite types in the PostgreSQL documentation.

Using JSON as a join alternative

PostgreSQL 9.2 added the row_to_json function, which lets you transform a row to a JSON object.

Modern PostgreSQL clients are aware of the JSON type and will automatically deserialize them when loading the results of the query into your host language.

Consider the following tables: Posts and Users. Each post has a user_id column that acts as a foreign key to the users table. Given a query to select a collection of posts, we'd like to also preload the users.
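For reference, a minimal version of that schema might look like this (the column types are assumptions):

create table Users (
  id serial primary key,
  username text not null
);

create table Posts (
  id serial primary key,
  user_id integer references Users (id),
  title text not null
);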

You could write the query like this:

select *,
  (select row_to_json(Users) from Users where Users.id = user_id) as user
from posts order by id desc limit 20

This will get all the columns from the Posts table, along with all the columns from the Users table namespaced inside the JSON object called user:

id, user_id, title, user
2, 14, "Another thing", {"id": 14, "username": "lee"}
1, 12, "My first post", {"id": 12, "username": "leafo"}

Selecting a subset of columns

If you need to select only a few columns from the associated object then you'll need to modify the query a bit. The easiest way is to use an additional subquery to create a new anonymous row type with names on each field:

select *,
  (select row_to_json(t) from (
    select Users.name, Users.id
    from Users
    where Users.id = user_id
  ) as t) as user
from posts order by id desc limit 20

You might be tempted to use a result type here, but there is no way to give names to its columns, so you'd be stuck with autogenerated names that aren't descriptive.

Getting a subset of a one to many relationship

Consider the following tables: Collections and CollectionItems. Each collection item has a column, collection_id, which serves as a foreign key pointing back to the collection it's in, along with an added_at column holding the date when the item was added. Each collection has many collection items in it; sometimes there can be thousands.
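A sketch of the schema being described (the column types, and the name column on Collections, are assumptions):

create table Collections (
  id serial primary key,
  name text not null
);

create table CollectionItems (
  id serial primary key,
  collection_id integer references Collections (id),
  added_at timestamptz not null default now()
);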

You're coding a page that shows a summary of collections in your database. On this page you'd like to show the 10 most recent collection items for each collection.

Getting the list of collections we want to show is trivial:

select * from Collections order by id desc limit 20

For each collection, getting the list of items is also a simple query:

select * from CollectionItems where collection_id = ? order by added_at desc limit 10

Sadly this suffers from the n+1 query problem if we try to fetch the set of items for each collection in our host application.

One technique I've seen to solve this is a union. You generate the SQL for selecting the subset of items for each collection, then concatenate them all together with a union clause. Although you've still written n queries, you're sending them all together in a single statement, so you avoid the round-trip overhead that makes n+1 queries slow.

Here’s what that query might look like for a few collections:

(select * from CollectionItems where collection_id = 1 order by added_at desc limit 10)
union
(select * from CollectionItems where collection_id = 2 order by added_at desc limit 10)
union
(select * from CollectionItems where collection_id = 3 order by added_at desc limit 10)

Parentheses are required around each query; otherwise the limit clause would apply to the union of all the rows.

With a lot of collections this query becomes quite unwieldy. We can write a cleaner query using PostgreSQL arrays along with a subquery.

Because we can't use a select statement to return multiple rows, we'll use the array composite type to bundle the ids together:

select *,
  array(
    select id from CollectionItems where collection_id = collections.id
    order by added_at desc limit 10
  )
from Collections order by id desc limit 20

With this query you'll get back an array of foreign keys for each row returned from the database. Using the two-query approach, you can select all the collection items needed by id, then map them back to the collection objects with a hash table.

select * from CollectionItems where id in (1,2,3,4,...)

Your PostgreSQL library should be able to deserialize an array type. If not, you might get back a string representation of the array which you'll need to parse. Alternatively, you can use the json type instead.
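For example, wrapping the array in array_to_json (available since PostgreSQL 9.2) returns the ids as a JSON array instead. The item_ids alias is just an illustrative name:

select *,
  array_to_json(array(
    select id from CollectionItems where collection_id = Collections.id
    order by added_at desc limit 10
  )) as item_ids
from Collections order by id desc limit 20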

Subquery and row_to_json

We can reduce the above example to a single query by combining arrays and row_to_json. The final query looks something like this:

select *,
array(
  select row_to_json(CollectionItems)
  from CollectionItems where collection_id = collections.id
  order by added_at desc limit 10
)
from Collections order by id desc limit 20

There are two great things about this query: it’s easy to read and it’s easy to process back in the application layer. We've solved a problem that’s typically annoying in a simple way.

The only change from the previous example is rewriting id as row_to_json(CollectionItems). In any query you can use the table name to refer to the current row, and the row_to_json function takes a row as its argument.

Querying hierarchical tables with recursive queries

Mapping a tree-like hierarchy in a relational database is a common problem. The easiest schema is to use a parent_id foreign key that points to a row in the same table.

This schema is problematic when querying an entire hierarchy. For each row you fetch you'll get ids you need to fetch the next level in the hierarchy, giving you n queries where n is the depth of the tree.

Because of this, there are a few clever schema tricks you can use to enable more efficient queries. Sadly, they make your schema more complicated and require managing more columns when making updates to the hierarchy.

Using PostgreSQL's recursive query syntax, we can keep the simple parent_id schema while still having efficient queries.

There are two common queries we'll want to do on hierarchical data:

  1. Given a row, find all the ancestors in a single query. Additionally, we might want to preload the ancestors for many categories in a single query
  2. Given a row, find all the children in a single query. Additionally, we should be able to limit by either depth, or number of children returned for a single depth

Finding ancestors

Here’s a simple schema that you would use to represent a hierarchical structure:

Categories
--------
id
parent_category_id
name
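In SQL that schema might be declared like this (the column types are assumptions):

create table Categories (
  id serial primary key,
  parent_category_id integer references Categories (id),
  name text not null
);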

The data in this table might look like this:

id, parent_category_id, name
1, NULL, "Top"
2, 1, "Child 1"
3, 1, "Child 2"
4, 2, "Deep child"

If we have a reference to category id 4, a query to access the ancestors should return:

id, parent_category_id, name
4, 2, "Deep child"
2, 1, "Child 1"
1, NULL, "Top"

We can do this using a recursive query like this:

with recursive nested as (
  (select * from Categories where id = 4)
  union
  select pr.* from Categories pr, nested
    where pr.id = nested.parent_category_id
)
select * from nested

Recursive queries are composed of a base case and a recursive case.

Our base case returns the row identified by id 4, our starting point.

select * from Categories where id = 4

The base case is unioned with the recursive case, which operates on rows already fetched by the query, available under the alias nested.

select pr.* from Categories pr, nested
  where pr.id = nested.parent_category_id

Every time a new row is added to the results, it is fed back into the recursive case to see if any more rows should be added.
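Tracing this on the sample data above: the base case emits row 4; the first recursive step matches row 4's parent_category_id of 2 and emits row 2; the next step emits row 1; row 1 has a NULL parent_category_id, so nothing new matches and the recursion stops.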

Finding ancestors for many objects at once

To preload the ancestors of many categories at once we can make a slight modification to the query:

with recursive nested as (
  (select * from Categories where id in (4,3))
  union
  select pr.* from Categories pr, nested
    where pr.id = nested.parent_category_id
)
select * from nested

All we've done is update the base case to start with multiple categories using the in syntax.

The resulting rows will be returned breadth first. You'll need to use some code in your application to organize all the returned values into arrays for each starting category.
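One way to make that grouping easier (a sketch, not part of the original query) is to carry the starting id through the recursion as an extra column:

with recursive nested as (
  -- start_id records which starting category each row was reached from
  select id as start_id, Categories.* from Categories where id in (4,3)
  union
  select nested.start_id, pr.* from Categories pr, nested
    where pr.id = nested.parent_category_id
)
select * from nested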

Finding children

Finding child nodes in a hierarchical structure works very similarly. We'll use the same schema from above, but we'll introduce a position field that controls the ordering of the children in the parent.

Categories
--------
id
parent_category_id
position
name

If we wanted to select all the nested children of the category with id 0, we could write a query like this:

with recursive nested as (
  select * from Categories where id = 0
  union
  select pr.* from Categories pr, nested
    where pr.parent_category_id = nested.id
)
select * from nested

PostgreSQL will not allow us to add an order or limit clause in the recursive query, giving us the errors:

ERROR:  ORDER BY in a recursive query is not implemented
ERROR:  LIMIT in a recursive query is not implemented

This is not an issue in practice, since it's your application's job to rebuild the nested structure after fetching all the rows, and that can include sorting what's returned. The real advantage of having a position field (with a natural ordering) is the ability to limit the number of rows returned. If we wanted to preview only the first N children of each category, we can filter on the position field:

with recursive nested as (
  select * from Categories where id = 0
  union
  select pr.* from Categories pr, nested
    where pr.parent_category_id = nested.id and pr.position < 10
)
select * from nested

Indexes

The same index rules apply to recursive queries as regular queries. If you want your recursive query to be efficient then you'll need an appropriate index. Although recursive queries may trigger many queries internally in the database, they're fast because they don't impose any round-trip overhead for each recursive step.

In the ancestors example, no additional indexes are needed other than the primary key id. Accessing a field on nested does not need an index since that’s the current row, but fields referenced in pr will need an index.

In the children example, an index on parent_category_id is needed:

create index on Categories (parent_category_id)

In the example fetching children with a limit, we should have a composite index on parent_category_id and position. Column order is important for this index: since we're using the less-than range operator, position needs to come last.

create index on Categories (parent_category_id, position)

With the composite index, a separate index on parent_category_id is redundant.