# Calculating Percentile (and Median) in PostgreSQL

Modern versions of PostgreSQL (≥ 9.4) make it easy to calculate the percentile
for a list of values at any percentage using the `percentile_cont`

and
`percentile_disc`

*ordered-set aggregation functions*.

The two functions work very similarly, but differ in how they merge the final result:

`percentile_disc`

will return a value from the input set closest to the percentile you request`percentile_cont`

will return an interpolated value between multiple values based on the distribution. You can think of this as being more accurate, but can return a fractional value between the two values from the input

In all these examples we'll imagine we have a table called `things`

with a column called `value`

.

## Calculating The Median

The median is generally the 50th percentile. You can calculate it with the following query:

```
select percentile_disc(0.5) within group (order by things.value)
from things
```

The `within group`

syntax is called an ordered-set aggregate

For a percentile that lies between values from the input, use `percentile_cont`

to get the interpolated result:

```
select percentile_const(0.5) within group (order by things.value)
from things
```

What does interpolated mean?The easiest way to illustrate the difference is to consider a list of 2 numbers: [1, 2].`percentile_disc(0.5)`

(the discete median) of this list will be`1`

, the cloest value to the center that is in the list. On the other hand,`percentile_const(0.5)`

will return`1.5`

, the average of the two numbers since there’s no individual item in the list that represents the median.

## Calculating A Perentile

Any percentile can be calculated by providing a fractional amount from 0 to 1. The percentile function can be used multiple times within the query. In this example, 3 quartiles are calculated.

```
select
percentile_disc(0.25) within group (order by things.value),
percentile_disc(0.5) within group (order by things.value),
percentile_disc(0.75) within group (order by things.value)
from things
```

`within group`

can be used with other claues like `group by`

. Here’s an example
where different percentiles are calculated for numbers less than 100, and
everything else.

```
select
things.value < 100 as less_than_100,
percentile_disc(0.75) within group (order by things.value)
from things
group by 1
```

`group by 1`

groups by the first expression in the select statement

## Calculating All Percentiles

It might be useful to calculate what the percentile value is for each percent, 1 to 100. This can be used to get an idea of the shape of your data, or you can cache it into a materialized view for fast percentile estimation for new values.

One approach is to combine the `percentile_`

functions discussed above with
`generate_series`

. In this example, every percentile from 1 to 100 is returned
as individual rows:

```
select k, percentile_disc(k) within group (order by things.value)
from things, generate_series(0.01, 1, 0.01) as k
group by k
```

`generate_series`

generates temporary table with values between a starting and
ending value, with an optional step. In this example, `generate_series(0.01, 1, 0.01)`

returns `0.01`

,
`0.02`

, etc..

The arguments of

`generate_series`

can be modified to specify what percentiles we want to calculate. For example, quartiles can be calculated with`generate_series(0.25, 1, 0.25)`

.

This method can be slow because it performs a percentile lookup over the entire dataset for each of the percentiles it calculates. For example, if you are calculating 100 percentiles then it will scan your data 100 times.

Ideally we want to scan the data only once while keeping track of what
percentile a particular value is in. PostgreSQL has a window
function that
can be used to accomplish this: `ntile`

.

`ntile`

will assign each value in your dataset into bucket, and you specify how
many buckets there are.

The completed example requires a subquery, so we'll look at the inner query
first. In this example we'll calculate the 100 percentiles, so we'll use `ntile(100)`

:

`select things.value, ntile(100) over (order by things.value) from things`

This query returns all of our values, along with what bucket each one belongs
to. The next step is to interpret these buckets into percentile values. Because
each value has been evenly distributed over 50 buckets then we can look at the
max value of the 50th bucket to see where the 50th percentile ends.
Generalizing this, we can use an aggregate query to calculate the `max(value)`

for each distinct bucket number to find all the percentiles!

Here’s the final query calculating all the percentiles from 1 to 100.

```
select max(buckets.value), ntile as percentile
from
(select things.value, ntile(100) over (order by things.value) from things) as buckets
group by 2 order by 2
```

This should generally run about 100 times faster than the example using
`percentile_disc`

with 100 percentiles, since it will only scan the data once
instead of 100 times.

The percentile is returned as an integer, instead of a fraction. You may need to adjust your query appropriately before consuming the results elsewhere.

Keep in mind that if you want to calculate a number of percentiles other than
100 then some math will be needed to convert the `ntile`

bucket to the
percentile. For example, when calculating quartiles you would use `ntile(4)`

.
Buckets 1,2,3,4 are assigned. To convert bucket number to percentile: `ntile / 4.0`

.