Confidence interval for the mean of non-normal data

Most flow and rainfall data are non-normal, so it is important to use appropriate methods when calculating the mean and its confidence interval.

Here I’m following the guidance provided by Olsson (2005) and Wang (2001), and discussion on Stack Overflow. There are two main approaches:

  • Bootstrapping
  • The ‘modified Cox method’, which is mainly applicable to log-normal data.

Modified Cox

Consider the modified Cox method first (Olsson, 2005). If X is log-normally distributed with expected value \mathrm{E}(X) = \theta, then the confidence interval for \log(\theta) can be approximated as:

\bar{Y} + \frac{S^2}{2} \pm t_{df}\sqrt{\frac{S^2}{n} + \frac{S^4}{2(n-1)}}

Where Y = \log(X), the sample mean of Y is \bar{Y} and the sample variance of Y is S^2. To get the confidence limits we use the t distribution with degrees of freedom equal to n-1 (one less than the number of data points). Exponentiating these limits gives the confidence interval for \theta, the mean on the original scale.

An R function is below:

 
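ModifiedCox <- function(x){
  n <- length(x)
  y <- log(x)          # work on the log scale (x must be positive)
  y.m <- mean(y)       # sample mean of log(x)
  y.var <- var(y)      # sample variance of log(x)

  my.t <- qt(0.975, df = n-1) # 95% Confidence interval

  my.mean <- mean(x)   # sample mean on the original scale
  upper <- y.m + y.var/2 + my.t*sqrt(y.var/n + y.var^2/(2*(n - 1)))
  lower <- y.m + y.var/2 - my.t*sqrt(y.var/n + y.var^2/(2*(n - 1)))

  # back-transform the limits to the original scale
  return(list(upper = exp(upper), mean = my.mean, lower = exp(lower)))
}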
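
One way to build some trust in the formula is a small coverage simulation: draw many log-normal samples with a known mean and count how often the interval contains it. The sketch below is illustrative only. The helper name cox_limits, the parameters meanlog and sdlog, the sample size and the seed are all assumptions chosen for the example, not values taken from Olsson (2005).

```r
# Hypothetical helper: modified Cox limits for the mean of log-normal data,
# returned on the original scale (same formula as above).
cox_limits <- function(x, level = 0.95) {
  n <- length(x)
  y <- log(x)
  s2 <- var(y)
  centre <- mean(y) + s2 / 2
  half <- qt(1 - (1 - level) / 2, df = n - 1) *
    sqrt(s2 / n + s2^2 / (2 * (n - 1)))
  exp(c(lower = centre - half, upper = centre + half))
}

set.seed(1)      # arbitrary seed, for repeatability only
meanlog <- 6     # assumed log-scale mean
sdlog <- 1       # assumed log-scale standard deviation
n <- 31          # assumed sample size

# True mean of a log-normal distribution
true_mean <- exp(meanlog + sdlog^2 / 2)

# Does each simulated interval contain the true mean?
covered <- replicate(2000, {
  ci <- cox_limits(rlnorm(n, meanlog, sdlog))
  ci[["lower"]] <= true_mean && true_mean <= ci[["upper"]]
})

coverage <- mean(covered)  # proportion of intervals that contain the true mean
coverage
```

If the method is behaving, the proportion should be close to the nominal 95%. Changing sdlog and n gives a feel for how the approximation holds up for more skewed data or shorter records.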

As an example, we’ll use the annual flood series for the Hunter River at Singleton from 1938 to 1968. This dataset is also used in examples in Australian Rainfall and Runoff, particularly Book 3 – Peak Flow Estimation. The data are available in the file Singleton.csv on Dropbox. The annual series is plotted below, clearly showing the large peak of 1955, which led to extreme flooding.

 

[Figure: annual flood series for the Hunter River at Singleton]

The data are approximately log-normal (see QQ plots below), so the modified Cox approach will give a reasonable estimate of the confidence interval around the mean.

[Figure: QQ plots for the Singleton annual flood series]
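
QQ plots of this kind can be produced along the following lines. This is a minimal sketch rather than the code used for the figure. The function name qq_raw_vs_log is hypothetical, and the example runs on simulated log-normal data (x_sim, with assumed parameters) so that it stands alone. The flood series can be passed in its place.

```r
# Hypothetical helper: normal QQ plots of the raw and log-transformed data,
# side by side. Returns the plotted points invisibly.
qq_raw_vs_log <- function(x) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))   # restore the plotting layout afterwards

  raw <- qqnorm(x, main = "Raw data")
  qqline(x)

  logged <- qqnorm(log(x), main = "Log-transformed data")
  qqline(log(x))

  invisible(list(raw = raw, logged = logged))
}

set.seed(42)  # arbitrary seed
x_sim <- rlnorm(31, meanlog = 6.5, sdlog = 1.3)  # assumed parameters
qq <- qq_raw_vs_log(x_sim)

# Correlation between theoretical and sample quantiles on the log scale;
# values near 1 indicate the points lie close to a straight line
qq_cor <- cor(qq$logged$x, qq$logged$y)
qq_cor
```

If the data are roughly log-normal, the points in the right-hand panel should fall close to the reference line, while the raw data curve away from it.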

The 95% confidence interval for the mean flood can be calculated as follows:

Singleton <- c(76.26, 171.87, 218.21, 668.79, 1374.42, 124.12,
276.30, 895.50, 1374.42, 280.18, 202.62, 4052.42,
2323.77, 2536.31, 3315.62, 1232.73,1391.43,
12525.66, 1099.54, 447.75, 478.92, 180.52, 164.36,
229.54, 2125.4, 966.35, 2751.68, 49.03, 79.51, 912.5, 926.67)

ModifiedCox(Singleton)

$upper
[1] 2973.871

$mean
[1] 1401.69

$lower
[1] 763.9697

So the 95% confidence limits are 764 and 2974 cumecs, with the mean estimated as 1402 cumecs.

Bootstrapping

An alternative approach is to use bootstrapping. This is straightforward in R using the boot package. The code is as follows (borrowing heavily from this post).

library(boot)

# function to obtain the mean
Bmean <- function(data, i) {
  d <- data[i] # allows boot to select sample
  return(mean(d))
}

# bootstrapping with 1000 replications
results <- boot(data=Singleton, statistic=Bmean, R=1000)

# view results
results
plot(results)

# get 95% confidence interval
boot.ci(results, type=c("norm", "basic", "perc", "bca"))

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = results, type = c("norm", "basic", "perc", 
    "bca"))

Intervals : 
Level      Normal              Basic         
95%   ( 592, 2216 )   ( 461, 2049 )  

Level     Percentile            BCa          
95%   ( 754, 2343 )   ( 883, 2952 )  
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

There are several different bootstrap confidence intervals, as discussed by Wang (2001). The code above has calculated four types: normal, basic, percentile and BCa. Wang (2001) recommends the BCa method so, in this case, the 95% confidence interval for the mean of the annual flood series has confidence limits of 883 and 2952 cumecs, which is similar to the modified Cox result (764 and 2974 cumecs).
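
To make the idea less of a black box, here is a sketch of the simplest of these intervals, the percentile interval, written in base R. It is an illustration of the technique, not a replacement for boot.ci. The function name percentile_boot_ci and the simulated data x_sim are assumptions for the example, and the limits will differ slightly from those reported by boot.ci because the resamples and quantile interpolation are not the same.

```r
# Hypothetical helper: percentile bootstrap confidence interval for the mean
percentile_boot_ci <- function(x, R = 1000, level = 0.95) {
  # Resample the data with replacement R times and take the mean each time
  boot_means <- replicate(R, mean(sample(x, replace = TRUE)))

  # The interval is given by the empirical quantiles of the bootstrap means
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(123)  # arbitrary seed
x_sim <- rlnorm(31, meanlog = 6.5, sdlog = 1.3)  # assumed parameters
ci <- percentile_boot_ci(x_sim)
ci
```

The BCa interval starts from the same bootstrap means but adjusts the quantiles for bias and skewness, which is why it tends to sit higher than the percentile interval for right-skewed data such as flood peaks.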

This Cross Validated post points out some limitations of bootstrap confidence intervals for skewed data.

There are also challenges when calculating confidence intervals for the mean when data are correlated (see, for example, Massah and Kantz, 2016). The effect of correlation is also discussed in Chapter 19 of the Handbook of Hydrology.
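
As a rough first check for this problem, the lag-1 autocorrelation of a series can be converted to an approximate effective sample size. The sketch below uses the common AR(1) rule of thumb n_eff = n(1 - r1)/(1 + r1). This is an assumption on my part and is not taken from the references above. The function name effective_n and the simulated series x_ar are hypothetical.

```r
# Hypothetical helper: lag-1 autocorrelation and the approximate effective
# sample size for the mean, assuming the series behaves like an AR(1) process
effective_n <- function(x) {
  n <- length(x)
  r1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]  # element 2 is lag 1
  n_eff <- n * (1 - r1) / (1 + r1)
  c(n = n, lag1 = r1, n_eff = n_eff)
}

set.seed(7)  # arbitrary seed
# Simulated AR(1) series with an assumed coefficient of 0.6
x_ar <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))
out <- effective_n(x_ar)
out
```

When n_eff is much smaller than n, intervals that assume independent data (including both methods in this post) will be too narrow.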

Code, from this blog, is available as a gist.

 

 
