Spurious correlation between skewed independent variables

There is an interesting article in the latest edition of the Journal of Hydrology (New Zealand):

Tendency toward negative correlations for positively skewed independent random variables. Bardsley, E. (2014) Journal of Hydrology (New Zealand) 53 (2):175-177.

The main message is that if independent variables are skewed they may appear to be correlated when in fact they are not.  As Bardsley says:

“in certain circumstances this effect could be misinterpreted as discovery of a weak but real association between the variables concerned”.

If you calculate the correlation between two independent variables, it is likely to be small – so you won’t be misled. The problem comes from comparing correlations across many pairs of variables. For skewed data, each correlation coefficient is likely to be small but is also more likely to be negative than positive. A large number of small negative results begins to look meaningful.

As Bardsley suggests, it could happen that some climatic variable exhibits apparent negative correlation with seasonal rainfall in the sense of a preponderance of negative correlation coefficients over a large number of sites. This could be reported as the climatic variable having a ‘weak but strongly significant regional association with rainfall’.

We can explore the effect with simulation.

If we plot independent data with zero skew, we don’t expect to see correlation, except by chance, and any correlation that does appear is unlikely to be statistically significant.

set.seed(2015)
x <- rnorm(100)
y <- rnorm(100)
plot(x, y)
abline(lm(y ~ x))
cor(x,y)
## [1] -0.1099988
Scatterplot with independent normally distributed data

Note that the slope of a regression line and the correlation coefficient are not the same thing, but they are related and do have the same sign.
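The relationship can be checked numerically. The following is a minimal sketch using a fresh pair of simulated normal variables (the seed is arbitrary): in simple linear regression the fitted slope equals the correlation coefficient scaled by the ratio of the standard deviations, so the two must share a sign. The last line uses cor.test to get a p-value for the correlation.

```r
# Sketch: slope of lm(y ~ x) versus the Pearson correlation
set.seed(1)  # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)
y <- rnorm(100)

r <- cor(x, y)
slope <- unname(coef(lm(y ~ x))[2])

# slope = r * sd(y) / sd(x), so slope and r always have the same sign
scaled.r <- r * sd(y) / sd(x)
all.equal(slope, scaled.r)
sign(slope) == sign(r)

# p-value for the null hypothesis of zero correlation
cor.test(x, y)$p.value
```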

For many pairs of independent random variables with zero skew, the average correlation will be close to zero and the proportion less than zero will be about 50%.

my.cor <- replicate(1000, cor(rnorm(100), rnorm(100)))
mean(my.cor)
## [1] -0.0007247909
mean(my.cor < 0) # proportion less than zero
## [1] 0.505

So far, everything is as we would expect.  However, things get interesting with skewed data.

We can generate skewed variables from the Pearson III distribution using functions from the lmomco package, wrapped in the rpe3 function below.

# Pearson III random variates with mean 0 and standard deviation 1
# n = number of variates
# g = skew

library(lmomco)
rpe3 <- function(n, g) {
  # pe3 parameters are (mean, standard deviation, skew)
  my.para <- vec2par(c(0, 1, g), type = "pe3")
  rlmomco(n, my.para)
}

Plotting skewed independent variables produces the classic type of plot that you often see when you were hoping your data were actually related.

set.seed(2016)
x <- rpe3(100, 4)
y <- rpe3(100, 4)
plot(x, y, las = 1)
Scatterplot with skewed independent data

Any correlation will on average be small, because the variables are independent but – and this is the surprising thing – correlation is more likely to be less than zero. The larger the skew, the greater the proportion of correlations that are negative.

my.cor <- replicate(1000, cor(rpe3(100, 4), rpe3(100, 4)))
mean(my.cor)
## [1] 0.0009872852  # very small, as expected
mean(my.cor < 0)
## [1] 0.58 # was expecting 0.5?
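One way to judge whether 58% is more than simulation noise is an exact binomial test. This sketch takes the 580 negatives out of 1000 replicates reported above as given, and asks how surprising that count would be if positive and negative correlations were equally likely.

```r
# 0.58 * 1000 replicates = 580 negative correlations (from the run above)
n.neg <- 580
n.rep <- 1000

bt <- binom.test(n.neg, n.rep, p = 0.5)
bt$p.value   # probability of a count this extreme if the true proportion is 0.5
bt$conf.int  # 95% confidence interval for the true proportion
```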

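Doing these calculations for a range of skews shows the effect.

# Proportion of negative correlations among 1000 pairs of
# independent Pearson III samples (n = 100) with skew g
CorG <- function(g) {
  x <- replicate(1000, cor(rpe3(100, g), rpe3(100, g)))
  mean(x < 0) # proportion less than zero
}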

my.g <- -10:10
my.corr <- sapply(my.g, CorG)

par(oma=c(1,2,0,0))
plot(my.g, my.corr, 
     xlab = 'skew', 
     ylab = '',
     las = 1)
mtext(side = 2, line = 3.5, text = 'proportion of calculated \n correlations less than zero' )
Proportion of correlations less than zero as a function of skew for independent variates

The plot shows that as the skew becomes more extreme, the proportion of correlations less than zero increases far above the expected value of 50%.

Sure, there are some large values of skewness here, but if we naively calculate the skewness of daily rainfall in an arid area, the value can be around 7 or more, so these values are not unrealistic.
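For reference, the moment-based sample skewness can be computed in a few lines of base R. This is only a sketch: the helper name sample.skew and the toy rainfall vector are made up for illustration, and other skewness estimators (for example, those with small-sample bias corrections) will give slightly different values.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2
sample.skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical record: mostly dry days with a few wet ones (mm)
rain <- c(rep(0, 95), 2, 5, 8, 20, 60)
sample.skew(rain)

# Symmetric data have zero skew
sample.skew(c(1, 2, 3))
```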

The upshot is, take care when interpreting correlations when data are skewed.

Note that alternative measures of correlation, e.g. Spearman and Kendall, do seem to help, although I only did a small amount of exploration. The CorG function can be modified to explore the effect. Alternatively, data can be transformed to reduce skew before calculating correlations.
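Here is a sketch of one such modification, adding a method argument that is passed through to cor. To keep the example free of package dependencies, it draws Pearson III variates from a shifted and scaled gamma distribution rather than using lmomco; the names rpe3.gamma and CorG2 are hypothetical. Because rank correlations depend only on the ordering of the data, skew in the margins should not tilt them toward negative values.

```r
# Pearson III variates (mean 0, sd 1, skew g) from a standardised gamma.
# A gamma with shape a has skew 2 / sqrt(a), so a = 4 / g^2.
# Stand-in for the lmomco-based rpe3 in the post, not a replacement for it.
rpe3.gamma <- function(n, g) {
  if (g == 0) return(rnorm(n))
  a <- 4 / g^2
  sign(g) * (rgamma(n, shape = a) - a) / sqrt(a)
}

# Proportion of negative correlations for a given skew and method
CorG2 <- function(g, method = "pearson", n.rep = 1000) {
  x <- replicate(n.rep,
                 cor(rpe3.gamma(100, g), rpe3.gamma(100, g), method = method))
  mean(x < 0)
}

set.seed(2017)  # arbitrary seed
methods <- c(pearson = "pearson", spearman = "spearman", kendall = "kendall")
p.neg <- sapply(methods, function(m) CorG2(4, method = m))
p.neg
```

Replacing the fixed skew of 4 with a loop over my.g gives the rank-based equivalent of the plot above.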

Issues associated with calculating correlations with non-normal data have been known for some time. Kowalski (1972) concluded that the distribution of r (the Pearson sample correlation coefficient) may be sensitive to non-normality. Specific issues associated with rainfall data are discussed by Habib et al. (2001), who showed that a transformation-based estimator of correlation has improved characteristics. Stedinger (1981) looked at correlation between streamflow records. There is also an interesting discussion at CrossValidated.
