Calculating Percentile (and Median) in PostgreSQL
Modern versions of PostgreSQL (≥ 9.4) make it easy to calculate the percentile
for a list of values at any percentage using the
percentile_disc ordered-set aggregation functions.
The two functions work very similarly, but differ in how they merge the final result:
percentile_discwill return a value from the input set closest to the percentile you request
percentile_contwill return an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but can return a fractional value between the two values from the input
In all these examples we'll imagine we have a table called
things with a column called
Calculating The Median
The median is generally the 50th percentile. You can calculate it with the following query:
select percentile_disc(0.5) within group (order by things.value) from things
within group syntax is called an ordered-set aggregate
For a percentile that lies between values from the input, use
to get the interpolated result:
select percentile_const(0.5) within group (order by things.value) from things
What does interpolated mean? The easiest way to illustrate the difference is to consider a list of 2 numbers: [1, 2].
percentile_disc(0.5)(the discete median) of this list will be
1, the cloest value to the center that is in the list. On the other hand,
1.5, the average of the two numbers since there’s no individual item in the list that represents the median.
Calculating A Perentile
Any percentile can be calculated by providing a fractional amount from 0 to 1. The percentile function can be used multiple times within the query. In this example, 3 quartiles are calculated.
select percentile_disc(0.25) within group (order by things.value), percentile_disc(0.5) within group (order by things.value), percentile_disc(0.75) within group (order by things.value) from things
within group can be used with other claues like
group by. Here’s an example
where different percentiles are calculated for numbers less than 100, and
select things.value < 100 as less_than_100, percentile_disc(0.75) within group (order by things.value) from things group by 1
group by 1groups by the first expression in the select statement
Calculating All Percentiles
It might be useful to calculate what the percentile value is for each percent, 1 to 100. This can be used to get an idea of the shape of your data, or you can cache it into a materialized view for fast percentile estimation for new values.
One approach is to combine the
percentile_ functions discussed above with
generate_series. In this example, every percentile from 1 to 100 is returned
as individual rows:
select k, percentile_disc(k) within group (order by things.value) from things, generate_series(0.01, 1, 0.01) as k group by k
generate_series generates temporary table with values between a starting and
ending value, with an optional step. In this example,
generate_series(0.01, 1, 0.01) returns
The arguments of
generate_seriescan be modified to specify what percentiles we want to calculate. For example, quartiles can be calculated with
generate_series(0.25, 1, 0.25).
This method can be slow because it performs a percentile lookup over the entire dataset for each of the percentiles it calculates. For example, if you are calculating 100 percentiles then it will scan your data 100 times.
Ideally we want to scan the data only once while keeping track of what
percentile a particular value is in. PostgreSQL has a window
can be used to accomplish this:
ntile will assign each value in your dataset into bucket, and you specify how
many buckets there are.
The completed example requires a subquery, so we'll look at the inner query
first. In this example we'll calculate the 100 percentiles, so we'll use
select things.value, ntile(100) over (order by things.value) from things
This query returns all of our values, along with what bucket each one belongs
to. The next step is to interpret these buckets into percentile values. Because
each value has been evenly distributed over 50 buckets then we can look at the
max value of the 50th bucket to see where the 50th percentile ends.
Generalizing this, we can use an aggregate query to calculate the
for each distinct bucket number to find all the percentiles!
Here’s the final query calculating all the percentiles from 1 to 100.
select max(buckets.value), ntile as percentile from (select things.value, ntile(100) over (order by things.value) from things) as buckets group by 2 order by 2
This should generally run about 100 times faster than the example using
percentile_disc with 100 percentiles, since it will only scan the data once
instead of 100 times.
Note: The percentile is returned as an integer, instead of a fraction. You may need to adjust your query appropriately before consuming the results elsewhere.
Keep in mind that if you want to calculate a number of percentiles other than
100 then some math will be needed to convert the
ntile bucket to the
percentile. For example, when calculating quartiles you would use
Buckets 1,2,3,4 are assigned. To convert bucket number to percentile:
ntile / 4.0.