Calculating Percentile (and Median) in PostgreSQL
Modern versions of PostgreSQL (≥ 9.4) make it easy to calculate the percentile
for a list of values at any percentage using the percentile_cont
and
percentile_disc
ordered-set aggregation functions.
The two functions work very similarly, but differ in how they merge the final result:
percentile_disc
will return a value from the input set closest to the percentile you requestpercentile_cont
will return an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but can return a fractional value between the two values from the input
In all these examples we'll imagine we have a table called things
with a column called value
.
Calculating The Median
The median is generally the 50th percentile. You can calculate it with the following query:
select percentile_disc(0.5) within group (order by things.value)
from things
The within group
syntax is called an ordered-set aggregate
For a percentile that lies between values from the input, use percentile_cont
to get the interpolated result:
select percentile_const(0.5) within group (order by things.value)
from things
What does interpolated mean? The easiest way to illustrate the difference is to consider a list of 2 numbers: [1, 2].
percentile_disc(0.5)
(the discete median) of this list will be1
, the cloest value to the center that is in the list. On the other hand,percentile_const(0.5)
will return1.5
, the average of the two numbers since there’s no individual item in the list that represents the median.
Calculating A Perentile
Any percentile can be calculated by providing a fractional amount from 0 to 1. The percentile function can be used multiple times within the query. In this example, 3 quartiles are calculated.
select
percentile_disc(0.25) within group (order by things.value),
percentile_disc(0.5) within group (order by things.value),
percentile_disc(0.75) within group (order by things.value)
from things
within group
can be used with other claues like group by
. Here’s an example
where different percentiles are calculated for numbers less than 100, and
everything else.
select
things.value < 100 as less_than_100,
percentile_disc(0.75) within group (order by things.value)
from things
group by 1
group by 1
groups by the first expression in the select statement
Calculating All Percentiles
It might be useful to calculate what the percentile value is for each percent, 1 to 100. This can be used to get an idea of the shape of your data, or you can cache it into a materialized view for fast percentile estimation for new values.
One approach is to combine the percentile_
functions discussed above with
generate_series
. In this example, every percentile from 1 to 100 is returned
as individual rows:
select k, percentile_disc(k) within group (order by things.value)
from things, generate_series(0.01, 1, 0.01) as k
group by k
generate_series
generates temporary table with values between a starting and
ending value, with an optional step. In this example, generate_series(0.01, 1, 0.01)
returns 0.01
,
0.02
, etc..
The arguments of
generate_series
can be modified to specify what percentiles we want to calculate. For example, quartiles can be calculated withgenerate_series(0.25, 1, 0.25)
.
This method can be slow because it performs a percentile lookup over the entire dataset for each of the percentiles it calculates. For example, if you are calculating 100 percentiles then it will scan your data 100 times.
Ideally we want to scan the data only once while keeping track of what
percentile a particular value is in. PostgreSQL has a window
function that
can be used to accomplish this: ntile
.
ntile
will assign each value in your dataset into bucket, and you specify how
many buckets there are.
The completed example requires a subquery, so we'll look at the inner query
first. In this example we'll calculate the 100 percentiles, so we'll use ntile(100)
:
select things.value, ntile(100) over (order by things.value) from things
This query returns all of our values, along with what bucket each one belongs
to. The next step is to interpret these buckets into percentile values. Because
each value has been evenly distributed over 50 buckets then we can look at the
max value of the 50th bucket to see where the 50th percentile ends.
Generalizing this, we can use an aggregate query to calculate the max(value)
for each distinct bucket number to find all the percentiles!
Here’s the final query calculating all the percentiles from 1 to 100.
select max(buckets.value), ntile as percentile
from
(select things.value, ntile(100) over (order by things.value) from things) as buckets
group by 2 order by 2
This should generally run about 100 times faster than the example using
percentile_disc
with 100 percentiles, since it will only scan the data once
instead of 100 times.
The percentile is returned as an integer, instead of a fraction. You may need to adjust your query appropriately before consuming the results elsewhere.
Keep in mind that if you want to calculate a number of percentiles other than
100 then some math will be needed to convert the ntile
bucket to the
percentile. For example, when calculating quartiles you would use ntile(4)
.
Buckets 1,2,3,4 are assigned. To convert bucket number to percentile: ntile / 4.0
.