# Detecting outliers in water quality data

This post proposes a simple method to detect outliers in water quality data. It is based on a proposal by John Tukey, as discussed in the Wikipedia article on outliers.

Tukey proposed a method for flagging outliers based on the interquartile range. An outlier is any observation that falls outside the range:

$[\, Q_1 - k(Q_3 - Q_1),\ Q_3 + k(Q_3 - Q_1) \,]$

where $Q_1$ and $Q_3$ are the lower and upper quartiles and $k$ is a non-negative constant.

Tukey proposed that k = 1.5 could be used to flag outliers, while k = 3 suggests observations that are “far out”.
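As a minimal sketch of the rule above, the fences for a given k can be computed directly from the quartiles. The helper name `tukey_fences` and the toy data are my own assumptions for illustration; they are not part of Tukey's proposal.

```r
# Hypothetical helper: Tukey fences for a numeric vector x and multiplier k
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy data (made up): nine ordinary values and one suspicious one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

inner <- tukey_fences(x, k = 1.5)  # fences for flagging outliers
outer <- tukey_fences(x, k = 3)    # fences for "far out" values

# Observations outside each pair of fences
x[x < inner["lower"] | x > inner["upper"]]
x[x < outer["lower"] | x > outer["upper"]]
```

With R's default quantile method the quartiles of this toy vector are 3.25 and 7.75, so the value 100 lies outside both pairs of fences.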

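A guide to whether the maximum or minimum value in a dataset is an outlier is to calculate its k value, that is, the value of k at which the observation would sit exactly on the fence. For a set of observations $O$, if $k_{max}$ exceeds 1.5 (or 3), the maximum lies beyond the corresponding upper fence, and likewise for $k_{min}$ and the lower fence.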

$k_{max} = \frac{\max(O) - Q_3}{Q_3 - Q_1}$

$k_{min} = \frac{Q_1 - \min(O)}{Q_3 - Q_1}$

The appealing feature of the interquartile range is that it is resistant to outliers, so there is a reduced chance that the statistic used to flag outliers is itself influenced by them, an issue known as masking.
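To illustrate masking, here is a small sketch comparing a mean and standard deviation rule with the quartile-based k value. The data are made up, and the three-sigma threshold is just a common convention I'm assuming for the comparison.

```r
# Toy data (made up): twenty ordinary values plus three identical extreme ones
x <- c(1:20, 1000, 1000, 1000)

# Mean/sd approach: z-score of the maximum
z_max <- (max(x) - mean(x)) / sd(x)

# Quartile approach: k value of the maximum
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
k_max <- (max(x) - q[2]) / (q[2] - q[1])

z_max  # below 3: a three-sigma rule would not flag the maximum
k_max  # far above 3: Tukey's rule flags it as "far out"
```

The extreme values inflate the standard deviation enough to hide themselves, while the quartiles barely move.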

A function to calculate the k value for the maximum and minimum of a dataset is as follows:

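    Tukey_k <- function(x) {
      # Lower and upper quartiles, ignoring missing values
      my.quantile <- quantile(x, na.rm = TRUE)
      Q_25 <- my.quantile[2]
      Q_75 <- my.quantile[4]

      # Distance of the maximum and minimum beyond the nearest quartile, in units of the IQR
      k_max <- as.vector((max(x, na.rm = TRUE) - Q_75) / (Q_75 - Q_25))
      k_min <- as.vector((Q_25 - min(x, na.rm = TRUE)) / (Q_75 - Q_25))

      data.frame(k_max = k_max, k_min = k_min)
    }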

Let’s run some tests.

If we calculate the k values for a large normally distributed dataset (say 100,000 observations), they are less than 3. For normally distributed data, getting a k value larger than 3 is rare.

    set.seed(2016)
    Tukey_k(rnorm(100000))
    #      k_max    k_min
    # 1 2.530309 2.531518
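To put a rough number on how rare that is, the "far out" fence for a standard normal distribution can be worked out from the theoretical quartiles. This is only a sketch under the assumption of exact normality; the variable names are mine.

```r
# Theoretical quartiles of the standard normal distribution
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)

# Upper "far out" fence (k = 3); the lower fence is its mirror image
upper_far <- q3 + 3 * (q3 - q1)

# Probability that a single observation falls beyond either fence
p_far <- 2 * pnorm(upper_far, lower.tail = FALSE)

# Expected number of such observations in a sample of 100,000
expected <- 100000 * p_far

upper_far
p_far
expected
```

The fence sits well over four standard deviations from the mean, so the expected count in 100,000 observations is less than one.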

For water quality observations, a review by Duncan (1999) suggested that the log-normal distribution was a good approximation for all the data he looked at, so we need to take logs of the observations before calculating the k values.
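A quick sketch of why the log transform matters, using simulated log-normal values as a stand-in for real concentrations. The data and the parameters `meanlog = 0` and `sdlog = 1` are assumptions for illustration only, and the function is a compact restatement of `Tukey_k` so the block runs on its own.

```r
# Compact restatement of the Tukey_k function defined earlier in this post
Tukey_k <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  data.frame(k_max = (max(x, na.rm = TRUE) - q[2]) / iqr,
             k_min = (q[1] - min(x, na.rm = TRUE)) / iqr)
}

# Simulated, hypothetical concentrations (no real outliers planted)
set.seed(1)
conc <- rlnorm(10000, meanlog = 0, sdlog = 1)

k_raw <- Tukey_k(conc)       # raw scale: the right skew inflates k_max
k_log <- Tukey_k(log(conc))  # log scale: roughly symmetric data

k_raw
k_log
```

On the raw scale the long right tail makes ordinary large values look like outliers; after taking logs the k values are comparable to those for normal data.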

For some data on nickel concentration that I'm working on, the k for the maximum value is 4.3. This is clearly a ‘far out’ value and one that deserves checking via a histogram or QQ plot (see below). Of the 12,000 values in this dataset, there are clearly 15 or so that are very large relative to the bulk of the data.
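The kind of check described above might look like the following sketch. The data here are simulated, not the nickel dataset: I've planted 15 large values in 1,000 log-normal ones purely for illustration.

```r
# Simulated stand-in for real data (hypothetical): 1,000 log-normal values
# plus 15 planted large ones
set.seed(42)
conc <- c(rlnorm(1000, meanlog = 0, sdlog = 1), rep(exp(10), 15))
log_conc <- log(conc)

# Upper "far out" fence on the log scale (k = 3)
q <- quantile(log_conc, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
upper_fence <- q[2] + 3 * (q[2] - q[1])
far_out <- log_conc > upper_fence
sum(far_out)  # number of values beyond the fence

# Visual checks: the flagged values sit apart from the bulk of the data
hist(log_conc, breaks = 50, main = "log(concentration)", xlab = "log(conc)")
abline(v = upper_fence, lty = 2)
qqnorm(log_conc)
qqline(log_conc)
```

On the QQ plot the planted values show up as a cluster well above the reference line, which is the pattern to look for in real data.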

Code is available as a gist.