Avoiding n+1 queries
If you're using an ORM, or any kind of abstraction over your database, you've likely run into the n+1 selects issue. This is when a property access in your programming language triggers a query to fetch an associated object through a relation. When looping over a set of objects, a new query is triggered each time the field is accessed. This results in 1 query to fetch the initial set of items, then n additional queries, one for each item's association. Hence the name: n+1.
n+1 queries should be avoided because the round-trip latency of a query can add up, even if the queries themselves are fast.
Many web frameworks have ways to mitigate this using preloading (or eager loading). The idea is that you load all of your data in as few queries as possible before looping over it.
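To make the pattern concrete, here's a toy sketch of the lazy-loading behavior that causes it; the Post class and run_query helper are made up for illustration:

def run_query(sql):
    print("query:", sql)  # stand-in for a real round trip to the database
    return {"username": "example"}

class Post:
    def __init__(self, user_id):
        self.user_id = user_id

    @property
    def user(self):
        # accessing post.user lazily runs a query, like many ORMs do
        return run_query(f"select * from Users where id = {self.user_id}")

posts = [Post(1), Post(2), Post(3)]  # imagine: 1 query fetched these rows
for post in posts:
    post.user  # n more queries, one per loop iteration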
There are two common approaches to solving this:
- Use a join
- Use a second query
Using a join
A join enables a query to include data from another relation.
- Single query — A join can pull data from many different tables, all in a single query
- Managing column names — The columns of the related tables must be renamed to avoid conflicts with the original table; after fetching the data you must then map everything back to the shape you expect (see the sketch after this list)
- Duplicate data — If many of your results are related to the same object, then you'll duplicate that object's data many times in your result
- Same database — All tables used in the join must be in the same database
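As a sketch, here's what the join approach might look like for hypothetical Posts and Users tables (the same shape as the example later in this post); the aliases are the manual column renaming mentioned above:

select Posts.id, Posts.title,
  Users.id as user_id, Users.username as user_username
from Posts
  inner join Users on Users.id = Posts.user_id
order by Posts.id desc limit 20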
Using a second query
In this approach you take the ids you received from the initial query, and issue another query using the where id in (a,b,c...) syntax. A short sketch follows the list below.
- Multiple queries — Each relation you want to include requires a new query. Instead of n+1 queries, there are k queries, where k is the number of relations (generally much smaller than the number of rows)
- No duplication — Each related object is fetched only once, no matter how many results reference it
- General purpose — You can preload objects even if you've received a list of ids from someplace other than the database you're querying
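Here's the sketch, again with hypothetical Posts and Users tables; the ids in the in clause are the user_id values collected from the first result set:

-- 1st query: fetch the initial set of items
select * from Posts order by id desc limit 20
-- 2nd query: preload every associated user in one shot
select * from Users where id in (12, 14, 31)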
More preloading strategies
The two approaches above will get you through the majority of preloading you'll need to do. In the rest of this post I'll cover some techniques I've used for handling more complex requirements.
Composite types
When selecting fields for a result, PostgreSQL expects each one to be a single value, but you might want to return multiple values in its place. The most common example is a field returned from a select query:
select a from SomeTable
If you want to replace a with a subquery that returns multiple rows, you'll see this error:
ERROR: more than one row returned by a subquery used as an expression
We can use a composite type in order to return many values in the place of one. The most primitive composite type is the record type created by the row constructor (or with plain ()):
select id, row(1, now(), 'hello', NULL, name) from SomeTable
The record type is convenient when queries need to pass composite values to one another, but not ideal when returning data to your host language.
When implementing my own PostgreSQL client, pgmoon, I realized that the server did not return all the sub-types when returning a record value. The client would have to parse a string to get the contained values. For this reason I decided to focus on the json composite type instead, since the JSON format is well defined and parsers are fast and ubiquitous. If the PostgreSQL client you're using doesn't already parse JSON results, then it would be trivial to handle it in your language of choice.
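As a minimal sketch, assuming a client that hands the json column back as a raw string (the literal below stands in for that value):

import json

raw_value = '{"id": 12, "username": "leafo"}'  # what a minimal client might return
user = json.loads(raw_value)  # one ubiquitous call recovers the composite value
print(user["username"])  # leafo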
You can read more about composite types in the PostgreSQL documentation.
Using JSON as a join alternative
PostgreSQL 9.2 added the row_to_json function, which lets you transform a row into a JSON object.
Modern PostgreSQL clients are aware of the JSON type and will automatically deserialize them when loading the results of the query into your host language.
Consider the following tables: Posts and Users. Each post has a user_id column that acts as a foreign key to the users table. Given a query to select a collection of posts, we'd like to also preload the users.
You could write the query like this:
select *,
(select row_to_json(Users) from Users where Users.id = user_id) as user
from posts order by id desc limit 20
This will get all the columns from the Posts table, along with all the columns from the Users table namespaced inside a JSON object called user:
id, user_id, title, user
1, 12, "My first post", {"id": 12, "username": "leafo"}
2, 14, "Another thing", {"id": 14, "username": "lee"}
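Assuming your client deserializes the json column (the rows literal below is a hypothetical decoded result), each post arrives with its user ready to use:

rows = [
    {"id": 1, "user_id": 12, "title": "My first post",
     "user": {"id": 12, "username": "leafo"}},
    {"id": 2, "user_id": 14, "title": "Another thing",
     "user": {"id": 14, "username": "lee"}},
]
for post in rows:
    print(post["title"], "by", post["user"]["username"])  # no extra queries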
Selecting a subset of columns
If you need to select only a few columns from the associated object then you'll need to modify the query a bit. The easiest way is to use an additional subquery to create a new anonymous row type with names on each field:
select *,
  (select row_to_json(t) from (select Users.name, Users.id
    from Users
    where Users.id = user_id
  ) as t) as user
from posts order by id desc limit 20
You might be tempted to use the row constructor here, but there is no way to give names to the columns, so you'll be stuck with autogenerated names that aren't descriptive.
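For example:

select row_to_json(row(1, 'hello'))
-- returns: {"f1":1,"f2":"hello"}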
Getting a subset of a one-to-many relationship
Consider the following tables: Collections and CollectionItems. Each collection item has a column, collection_id, which serves as a foreign key pointing back to the collection it's in, along with an added_at column holding the date when the item was added. Each collection has many collection items in it. Sometimes there can be thousands of items.
You're coding a page that shows a summary of collections in your database. On this page you'd like to show the 10 most recent collection items for each collection.
Getting the list of collections we want to show is trivial:
select * from Collections order by id desc limit 20
For each collection, getting the list of items is also a simple query:
select * from CollectionItems where collection_id = ? order by added_at desc limit 10
Sadly this suffers from the n+1 query problem if we try to fetch the set of items for each collection in our host application.
One technique I've seen to solve this is a union. You generate the SQL for selecting the subset of items for each collection, then concatenate them all together with a union clause. Although you've still written n queries, you're sending them all together in a single statement, so you avoid the round-trip overhead that makes n+1 queries slow.
Here’s what that query might look like for a few collections:
(select * from CollectionItems where collection_id = 1 order by added_at desc limit 10)
union
(select * from CollectionItems where collection_id = 2 order by added_at desc limit 10)
union
(select * from CollectionItems where collection_id = 3 order by added_at desc limit 10)
Parentheses are required around each query, otherwise the limit clause would be applied to the union of all the rows.
With a lot of collections this query becomes quite unwieldy. We can write a cleaner query using PostgreSQL arrays along with a subquery.
Because a subquery used as an expression can't return multiple rows, we'll use the array composite type to bundle the ids together:
select *,
array(
select id from CollectionItems where collection_id = collections.id
order by added_at desc limit 10
)
from Collections order by id desc limit 20
With this query you'll get back an array of foreign keys for each row returned from the database. Using the two-query approach from earlier, you can select all the collection items needed by id, then map them back to the collection objects with a hash table.
select * from CollectionItems where id in (1,2,3,4,...)
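A minimal sketch of that mapping step, with literal values standing in for the two query results:

# hypothetical results of the two queries above
collections = [{"id": 1, "item_ids": [10, 11]}, {"id": 2, "item_ids": [12]}]
items = [{"id": 10}, {"id": 11}, {"id": 12}]

items_by_id = {item["id"]: item for item in items}  # hash table keyed by id
for collection in collections:
    # swap each array of foreign keys for the preloaded objects
    collection["items"] = [items_by_id[i] for i in collection["item_ids"]]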
Your PostgreSQL library should be able to deserialize an array type. If not, you might get back a string representation of the array which you'll need to parse. Alternatively, you can use the json type instead.
Subquery and row_to_json
We can reduce the above example to a single query by combining arrays and row_to_json. The final query looks something like this:
select *,
array(
select row_to_json(CollectionItems)
from CollectionItems where collection_id = collections.id
order by added_at desc limit 10
)
from Collections order by id desc limit 20
There are two great things about this query: it’s easy to read and it’s easy to process back in the application layer. We've solved a problem that’s typically annoying in a simple way.
The only change from the previous example is rewriting id as row_to_json(CollectionItems). In any query you can use the table name as an alias for the current row, and the row_to_json function takes a row as its argument.
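Assuming the client deserializes both the array and the json values (and the array column is aliased as items), a decoded collection row might look like this hypothetical value:

collection = {
    "id": 1,
    "items": [  # fully preloaded objects, newest first
        {"id": 11, "collection_id": 1, "added_at": "2024-01-02"},
        {"id": 10, "collection_id": 1, "added_at": "2024-01-01"},
    ],
}
for item in collection["items"]:
    print(item["id"], item["added_at"])  # no second query or mapping step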
Querying hierarchical tables with recursive queries
Mapping a tree-like hierarchy in a relational database is a common problem. The easiest schema is to use a parent_id foreign key that points to a row in the same table.
This schema is problematic when querying an entire hierarchy. For each row you fetch you'll get ids you need to fetch the next level in the hierarchy, giving you n queries where n is the depth of the tree.
Because of this, there are a few clever schema tricks you can use to enable more efficient queries. Sadly, they make your schema more complicated and require managing more columns when making updates to the hierarchy.
Using PostgreSQL's recursive query syntax, we can keep the simple parent_id schema while still having efficient queries.
There are two common queries we'll want to do on hierarchical data:
- Given a row, find all the ancestors in a single query. Additionally, we might want to preload the ancestors for many categories in a single query
- Given a row, find all the children in a single query. Additionally, we should be able to limit by either depth or the number of children returned at a single depth
Finding ancestors
Here’s a simple schema that you would use to represent a hierarchical structure:
Categories
--------
id
parent_category_id
name
The data in this table might look like this:
id, parent_category_id, name
1, NULL, "Top"
2, 1, "Child 1"
3, 1, "Child 2"
4, 2, "Deep child"
If we have a reference to category id 4, a query to access the ancestors should return:
id, parent_category_id, name
4, 2, "Deep child"
2, 1, "Child 1"
1, NULL, "Top"
We can do this using a recursive query like this:
with recursive nested as (
(select * from Categories where id = 4)
union
select pr.* from Categories pr, nested
where pr.id = nested.parent_category_id
)
select * from nested
Recursive queries are composed of a base case and a recursive case.
Our base case returns the row identified by id 4, our starting point.
select * from Categories where id = 4
The base case is unioned with the recursive case. The recursive case operates on rows already fetched by the query, referenced through the nested alias.
select pr.* from Categories pr, nested
where pr.id = nested.parent_category_id
Every time a new row is added to the results, it's run through the recursive case to see if any additional rows should be added.
Finding ancestors for many objects at once
To preload the ancestors of many categories at once we can make a slight modification to the query:
with recursive nested as (
(select * from Categories where id in (4,3))
union
select pr.* from Categories pr, nested
where pr.id = nested.parent_category_id
)
select * from nested
All we've done is update the base case to start with multiple categories using the in syntax.
The resulting rows will be returned breadth first. You'll need to use some code in your application to organize all the returned values into arrays for each starting category.
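A sketch of that organizing step, with literal rows standing in for the query result; it walks the parent pointers through a hash table to build each chain:

rows = [  # hypothetical breadth-first result of the query above
    {"id": 4, "parent_category_id": 2},
    {"id": 3, "parent_category_id": 1},
    {"id": 2, "parent_category_id": 1},
    {"id": 1, "parent_category_id": None},
]
by_id = {row["id"]: row for row in rows}

def ancestors(start_id):
    # follow parent pointers in memory, no additional queries needed
    chain, current = [], by_id[start_id]
    while current:
        chain.append(current)
        current = by_id.get(current["parent_category_id"])
    return chain

for start in (4, 3):
    print([c["id"] for c in ancestors(start)])  # [4, 2, 1] then [3, 1]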
Finding children
Finding child nodes in a hierarchical structure works very similarly. We'll use the same schema from above, but we'll introduce a position field that controls the ordering of the children in the parent.
Categories
--------
id
parent_category_id
position
name
If we wanted to select all the nested children of the category with id 0, we could write a query like this:
with recursive nested as (
select * from Categories where id = 0
union
select pr.* from Categories pr, nested
where pr.parent_category_id = nested.id
)
select * from nested
PostgreSQL will not allow us to add an order by or limit clause inside the recursive query, giving us these errors:
ERROR: ORDER BY in a recursive query is not implemented
ERROR: LIMIT in a recursive query is not implemented
This is not an issue in practice, since it's your application's job to rebuild the nested structure after fetching all the rows, and that can include sorting what's returned. The real advantage of having a position field (ordered naturally) is to limit the number of rows returned. If we wanted to preview only the first N children of each category, we can filter by the position field:
with recursive nested as (
select * from Categories where id = 0
union
select pr.* from Categories pr, nested
where pr.parent_category_id = nested.id and pr.position < 10
)
select * from nested
Indexes
The same index rules apply to recursive queries as regular queries. If you want your recursive query to be efficient then you'll need an appropriate index. Although recursive queries may trigger many queries internally in the database, they're fast because they don’t impose any round trip overhead for each recursive step.
In the ancestors example, no additional indexes are needed other than the primary key id. Accessing a field on nested does not need an index since that's the current row, but fields referenced in pr will need an index.
In the children example, an index on parent_category_id is needed:
create index on Categories (parent_category_id)
In the example fetching children with a limit, we should have a composite index on parent_category_id and position. Column order is important for this index: since we're using the less-than range operator, position needs to be last.
create index on Categories (parent_category_id, position)
With the composite index, a separate index on parent_category_id is redundant.