Calculating Percentile (and Median) in PostgreSQL

Modern versions of PostgreSQL (≥ 9.4) make it easy to calculate the percentile for a list of values at any percentage using the percentile_cont and percentile_disc ordered-set aggregation functions.

The two functions work very similarly, but differ in how they merge the final result:

  • percentile_disc will return a value from the input set closest to the percentile you request
  • percentile_cont will return an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but can return a fractional value between the two values from the input

In all these examples we'll imagine we have a table called things with a column called value.

Calculating The Median

The median is generally the 50th percentile. You can calculate it with the following query:

select percentile_disc(0.5) within group (order by things.value)
from things

The within group syntax is called an ordered-set aggregate

For a percentile that lies between values from the input, use percentile_cont to get the interpolated result:

select percentile_const(0.5) within group (order by things.value)
from things

What does interpolated mean? The easiest way to illustrate the difference is to consider a list of 2 numbers: [1, 2]. percentile_disc(0.5) (the discete median) of this list will be 1, the cloest value to the center that is in the list. On the other hand, percentile_const(0.5) will return 1.5, the average of the two numbers since there’s no individual item in the list that represents the median.

Calculating A Perentile

Any percentile can be calculated by providing a fractional amount from 0 to 1. The percentile function can be used multiple times within the query. In this example, 3 quartiles are calculated.

  percentile_disc(0.25) within group (order by things.value),
  percentile_disc(0.5) within group (order by things.value),
  percentile_disc(0.75) within group (order by things.value)
from things

within group can be used with other claues like group by. Here’s an example where different percentiles are calculated for numbers less than 100, and everything else.

  things.value < 100 as less_than_100,
  percentile_disc(0.75) within group (order by things.value)
from things
group by 1

group by 1 groups by the first expression in the select statement

Calculating All Percentiles

It might be useful to calculate what the percentile value is for each percent, 1 to 100. This can be used to get an idea of the shape of your data, or you can cache it into a materialized view for fast percentile estimation for new values.

One approach is to combine the percentile_ functions discussed above with generate_series. In this example, every percentile from 1 to 100 is returned as individual rows:

select k, percentile_disc(k) within group (order by things.value)
from things, generate_series(0.01, 1, 0.01) as k
group by k

generate_series generates temporary table with values between a starting and ending value, with an optional step. In this example, generate_series(0.01, 1, 0.01) returns 0.01, 0.02, etc..

The arguments of generate_series can be modified to specify what percentiles we want to calculate. For example, quartiles can be calculated with generate_series(0.25, 1, 0.25).

This method can be slow because it performs a percentile lookup over the entire dataset for each of the percentiles it calculates. For example, if you are calculating 100 percentiles then it will scan your data 100 times.

Ideally we want to scan the data only once while keeping track of what percentile a particular value is in. PostgreSQL has a window function that can be used to accomplish this: ntile.

ntile will assign each value in your dataset into bucket, and you specify how many buckets there are.

The completed example requires a subquery, so we'll look at the inner query first. In this example we'll calculate the 100 percentiles, so we'll use ntile(100):

select things.value, ntile(100) over (order by things.value) from things

This query returns all of our values, along with what bucket each one belongs to. The next step is to interpret these buckets into percentile values. Because each value has been evenly distributed over 50 buckets then we can look at the max value of the 50th bucket to see where the 50th percentile ends. Generalizing this, we can use an aggregate query to calculate the max(value) for each distinct bucket number to find all the percentiles!

Here’s the final query calculating all the percentiles from 1 to 100.

select max(buckets.value), ntile as percentile
  (select things.value, ntile(100) over (order by things.value) from things) as buckets
group by 2 order by 2

This should generally run about 100 times faster than the example using percentile_disc with 100 percentiles, since it will only scan the data once instead of 100 times.

The percentile is returned as an integer, instead of a fraction. You may need to adjust your query appropriately before consuming the results elsewhere.

Keep in mind that if you want to calculate a number of percentiles other than 100 then some math will be needed to convert the ntile bucket to the percentile. For example, when calculating quartiles you would use ntile(4). Buckets 1,2,3,4 are assigned. To convert bucket number to percentile: ntile / 4.0.